Time delay neural network

For the classification of a temporal pattern (such as speech), the TDNN avoids having to determine the beginning and end points of sounds before classifying them.
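A minimal sketch of the idea, assuming a TDNN layer is implemented as a shared weight matrix slid over windows of consecutive input frames (the dimensions and variable names are illustrative, not taken from any cited system):

# Minimal sketch of a single TDNN layer: the same weights are applied to
# every window of consecutive frames, so features are detected wherever
# they occur in time and no prior segmentation is required.
import numpy as np

def tdnn_layer(frames, weights, bias):
    """frames: (T, d_in) sequence of feature vectors (e.g. spectral frames).
    weights: (context, d_in, d_out) shared weights over a window of `context` frames.
    Returns (T - context + 1, d_out) activations, one per window position."""
    T, d_in = frames.shape
    context, _, d_out = weights.shape
    out = np.empty((T - context + 1, d_out))
    for t in range(T - context + 1):
        window = frames[t:t + context]                 # consecutive frames
        out[t] = np.tanh(np.einsum('cd,cdo->o', window, weights) + bias)
    return out

# Example: 100 frames of 16 spectral coefficients, 3-frame context, 8 hidden units.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 16))
W = rng.normal(scale=0.1, size=(3, 16, 8))
h = tdnn_layer(x, W, np.zeros(8))                      # shape (98, 8)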

Because the TDNN recognized phonemes and their underlying acoustic/phonetic features independently of position in time, it improved performance over static classification.[7]

In 1990, Yamaguchi et al. used max pooling in TDNNs to realize a speaker-independent isolated word recognition system.
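A hedged sketch of how max pooling over time can be used on top of per-frame TDNN activations; the classifier and vocabulary size here are assumptions for illustration only:

# Max pooling over the time axis collapses per-frame TDNN activations into
# one value per feature, so a word-level decision no longer depends on where
# in the utterance each sound occurred.
import numpy as np

def max_pool_over_time(activations):
    """activations: (T, d) per-frame TDNN outputs -> (d,) time-invariant vector."""
    return activations.max(axis=0)

def word_scores(activations, W_word, b_word):
    """Linear word classifier applied to the pooled representation."""
    return max_pool_over_time(activations) @ W_word + b_word

rng = np.random.default_rng(1)
acts = rng.normal(size=(98, 8))                 # e.g. output of a TDNN layer
W_word = rng.normal(scale=0.1, size=(8, 10))    # hypothetical 10-word vocabulary
print(word_scores(acts, W_word, np.zeros(10)).shape)   # (10,)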

In speech, time-shift-invariant training was shown to learn weight matrices that are independent of the precise positioning of the input.[9]
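The following sketch illustrates the underlying property: because one weight matrix is shared across all time positions, the pooled response to a pattern is the same wherever the pattern occurs. The pattern, weights, and sequence length are synthetic, chosen only to make the point:

# Shared weights over time plus pooling give outputs that do not depend on
# where in the input a pattern is placed.
import numpy as np

rng = np.random.default_rng(2)
d_in, context, d_out = 4, 3, 5
W = rng.normal(size=(context, d_in, d_out))
pattern = rng.normal(size=(6, d_in))            # a short synthetic acoustic event

def tdnn_then_pool(frames):
    T = frames.shape[0]
    acts = np.stack([
        np.einsum('cd,cdo->o', frames[t:t + context], W)
        for t in range(T - context + 1)])
    return acts.max(axis=0)                     # max pooling over time

def embed(pattern, start, length=50):
    x = np.zeros((length, d_in))
    x[start:start + pattern.shape[0]] = pattern
    return x

early = tdnn_then_pool(embed(pattern, start=5))
late = tdnn_then_pool(embed(pattern, start=30))
print(np.allclose(early, late))                 # True: output independent of position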

The precise architecture of TDNNs (time delays, number of layers) is mostly determined by the designer, depending on the classification problem and the most useful context sizes.
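As a sketch of what these design choices look like in practice, the per-layer context windows and unit counts below are illustrative placeholders, not values from any published system; stacking layers enlarges the overall temporal context seen by the top layer:

# The designer's main architectural choices: per-layer time delays (context
# windows) and the number of layers.  Stacked windows widen the total span
# of input frames visible to one top-layer unit.
layers = [
    {"context": 5, "units": 64},   # layer 1 sees 5 consecutive input frames
    {"context": 3, "units": 64},   # layer 2 sees 3 consecutive layer-1 frames
    {"context": 3, "units": 32},   # layer 3
]

def total_input_context(layers):
    """Number of input frames visible to a single top-layer activation."""
    span = 1
    for layer in layers:
        span += layer["context"] - 1
    return span

print(total_input_context(layers))   # 9 input frames per top-layer activation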

In early comparisons, TDNN-based phoneme recognizers compared favourably with HMM-based phone models.

When longer-distance relationships and pattern sequences have to be processed, learning states and state sequences becomes important, and TDNNs can be combined with other modelling techniques.

Speech lends itself nicely to TDNNs as spoken sounds are rarely of uniform length and precise segmentation is difficult or impossible.

Integration of TDNNs into large-vocabulary speech recognizers is possible by introducing state transitions and search between the phonemes that make up a word.[14]
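A hedged sketch of that combination, assuming frame-level TDNN phoneme scores are aligned to a left-to-right phoneme sequence by dynamic programming; the phoneme inventory, scores, and word spelling are invented for illustration:

# Frame-level TDNN phoneme scores are combined with the state sequence of a
# word (its phonemes, traversed left to right) by Viterbi-style search, which
# finds the best alignment and the resulting word score.
import numpy as np

def viterbi_word_score(frame_log_probs, phoneme_ids):
    """frame_log_probs: (T, n_phonemes) frame-wise log scores from a TDNN.
    phoneme_ids: the phoneme sequence of the word (left-to-right states).
    Returns the best log score of aligning all frames to that sequence."""
    T = frame_log_probs.shape[0]
    S = len(phoneme_ids)
    NEG = -np.inf
    dp = np.full((T, S), NEG)
    dp[0, 0] = frame_log_probs[0, phoneme_ids[0]]
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]                            # remain in same phoneme
            advance = dp[t - 1, s - 1] if s > 0 else NEG   # move to next phoneme
            dp[t, s] = max(stay, advance) + frame_log_probs[t, phoneme_ids[s]]
    return dp[T - 1, S - 1]   # must end in the word's final phoneme

rng = np.random.default_rng(3)
log_probs = np.log(rng.dirichlet(np.ones(40), size=60))   # 60 frames, 40 phonemes
word = [7, 12, 3]                                         # hypothetical phoneme ids
print(viterbi_word_score(log_probs, word))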

In audio-visual speech recognition, TDNN-based recognizers used visual and acoustic features jointly to achieve improved recognition accuracy, particularly in the presence of noise, where complementary information from an alternate modality could be fused effectively in a neural network.[4]
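One simple fusion scheme, sketched under the assumption that acoustic and visual features for the same time frame are concatenated before a shared TDNN layer; the feature dimensions are illustrative:

# Acoustic and visual features for the same frame are concatenated and fed to
# a joint TDNN layer, so the network can rely on whichever stream is more
# reliable (e.g. lip features when the audio is noisy).
import numpy as np

def fuse_streams(audio_frames, visual_frames):
    """audio_frames: (T, d_a), visual_frames: (T, d_v) -> (T, d_a + d_v)."""
    return np.concatenate([audio_frames, visual_frames], axis=1)

def tdnn_layer(frames, weights, bias, context=3):
    T = frames.shape[0]
    return np.stack([
        np.tanh(np.einsum('cd,cdo->o', frames[t:t + context], weights) + bias)
        for t in range(T - context + 1)])

rng = np.random.default_rng(4)
audio = rng.normal(size=(100, 16))        # e.g. spectral coefficients
video = rng.normal(size=(100, 10))        # e.g. lip-shape features
fused = fuse_streams(audio, video)        # (100, 26)
W = rng.normal(scale=0.1, size=(3, 26, 32))
h = tdnn_layer(fused, W, np.zeros(32))    # (98, 32) joint audio-visual features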

Video has a temporal dimension that makes a TDNN an ideal solution for analysing motion patterns.

[Figure: TDNN diagram]