Visual temporal attention

Just as a visual spatial attention mechanism allows human or computer vision systems to focus on the semantically more important regions in space, visual temporal attention modules enable machine learning algorithms to place more emphasis on critical video frames in video analytics tasks such as human action recognition.

In convolutional neural network-based systems, the prioritization introduced by the attention mechanism is typically implemented as a linear weighting layer whose parameters are learned from labeled training data.
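As a rough illustration of such a linear weighting layer, the sketch below pools per-frame feature vectors with one scalar weight per frame. The function names, the toy feature values, and the softmax normalization of the weights are assumptions made for this example and are not taken from any particular published system.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention_pool(frame_features, temporal_logits):
    """Return the attention-weighted sum of per-frame feature vectors.

    frame_features: list of T feature vectors, each a list of D floats.
    temporal_logits: list of T scalars, one per frame. These are hypothetical
    parameters; in a trained system they would be fit to labeled data.
    """
    if len(frame_features) != len(temporal_logits):
        raise ValueError("need exactly one logit per frame")
    # Assumption: weights are normalized with a softmax so they sum to 1.
    weights = softmax(temporal_logits)
    dim = len(frame_features[0])
    pooled = [0.0] * dim
    for w, feat in zip(weights, frame_features):
        for d in range(dim):
            pooled[d] += w * feat[d]
    return pooled, weights

# Toy input: three frames with two-dimensional features (made-up values).
frame_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
temporal_logits = [2.0, 0.0, 0.0]  # the first frame is emphasized
pooled, weights = temporal_attention_pool(frame_features, temporal_logits)
```

With equal logits the layer reduces to plain average pooling over frames, so the learned weights can be read as a departure from treating every frame alike.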

[2][4] Research in human action recognition has accelerated significantly since the introduction of powerful tools such as Convolutional Neural Networks (CNNs).

In addition, each stream in the ATW CNN framework can be trained end to end, with both the network parameters and the temporal weights optimized by stochastic gradient descent (SGD) with back-propagation.
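The idea of learning temporal weights by gradient descent can be sketched on a toy problem. The example below is a simplified illustration, not the ATW CNN training procedure: it holds the per-snippet class scores fixed (in a real system the network producing them would be trained too), assumes softmax-normalized temporal weights and a cross-entropy loss, and uses full-batch gradient steps on a single video. All names and numbers are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(snippet_scores, temporal_logits, label):
    """Cross-entropy loss of the temporally weighted prediction and its
    gradient with respect to the temporal logits."""
    weights = softmax(temporal_logits)
    num_classes = len(snippet_scores[0])
    # Fused class scores: attention-weighted sum over snippets.
    fused = [sum(w * s[c] for w, s in zip(weights, snippet_scores))
             for c in range(num_classes)]
    probs = softmax(fused)
    loss = -math.log(probs[label])
    # dL/d(fused) = probs - one_hot(label)
    d_fused = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
    # dL/d(weight_t) = d_fused . snippet_scores[t]
    d_weights = [sum(g * v for g, v in zip(d_fused, s)) for s in snippet_scores]
    # Back-propagate through the softmax over the temporal logits.
    mean_grad = sum(w * g for w, g in zip(weights, d_weights))
    d_logits = [w * (g - mean_grad) for w, g in zip(weights, d_weights)]
    return loss, d_logits

def train_temporal_weights(snippet_scores, label, steps=200, lr=0.5):
    """Plain gradient descent on the temporal logits, starting from uniform."""
    logits = [0.0] * len(snippet_scores)
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(snippet_scores, logits, label)
        history.append(loss)
        logits = [a - lr * g for a, g in zip(logits, grad)]
    return logits, history

# Toy video: 3 snippets, 2 classes; only snippet 1 strongly supports class 0.
snippet_scores = [[0.1, 0.2], [3.0, -1.0], [0.0, 0.5]]
logits, history = train_temporal_weights(snippet_scores, label=0)
final_weights = softmax(logits)
```

In this toy setting the weight of the snippet that best supports the true class grows during training, which mirrors the intended effect of emphasizing discriminative segments.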

Experimental results show that the ATW CNN attention mechanism contributes substantially to the performance gains by focusing on the more discriminative, more relevant video snippets.

Video frames of the Parallel Bars action category in the UCF-101 dataset.[1] (a) The four highest-ranking frames by temporal attention weight, in which the athlete is performing on the parallel bars; (b) the four lowest-ranking frames by temporal attention weight, in which the athlete is standing on the ground. All weights are predicted by the ATW CNN algorithm.[2] The highly weighted video frames generally capture the most distinctive movements relevant to the action category.

ATW CNN architecture.[4] Three CNN streams are used to process spatial RGB images, temporal optical flow images, and temporal warped optical flow images, respectively. An attention model is employed to assign temporal weights across snippets for each stream/modality. A weighted sum is used to fuse predictions from the three streams/modalities.
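
The two-level weighting described in the architecture caption can be sketched as follows: within each stream, per-snippet class scores are combined using temporal weights, and the resulting stream predictions are then fused by a weighted sum. This is a minimal sketch under stated assumptions; the stream names, scores, temporal weights, and stream-level fusion weights are all made-up illustrative values, not figures from the ATW CNN work.

```python
def stream_prediction(snippet_scores, temporal_weights):
    """Attention-weighted sum of per-snippet class scores for one stream."""
    num_classes = len(snippet_scores[0])
    return [sum(w * s[c] for w, s in zip(temporal_weights, snippet_scores))
            for c in range(num_classes)]

def fuse_streams(stream_scores, stream_weights):
    """Weighted sum of per-stream class scores, keyed by stream name."""
    names = list(stream_scores)
    num_classes = len(stream_scores[names[0]])
    return [sum(stream_weights[n] * stream_scores[n][c] for n in names)
            for c in range(num_classes)]

# Hypothetical per-snippet class scores (3 snippets, 2 classes) per stream.
snippets = {
    "rgb": [[0.6, 0.4], [0.9, 0.1], [0.5, 0.5]],
    "flow": [[0.3, 0.7], [0.8, 0.2], [0.4, 0.6]],
    "warped_flow": [[0.5, 0.5], [0.7, 0.3], [0.2, 0.8]],
}
# Hypothetical temporal weights per stream; each list sums to 1.
temporal = {
    "rgb": [0.2, 0.6, 0.2],
    "flow": [0.1, 0.8, 0.1],
    "warped_flow": [0.25, 0.5, 0.25],
}
# Hypothetical stream-level fusion weights.
fusion = {"rgb": 1.0, "flow": 1.0, "warped_flow": 0.5}

per_stream = {name: stream_prediction(snippets[name], temporal[name])
              for name in snippets}
fused = fuse_streams(per_stream, fusion)
predicted_class = max(range(len(fused)), key=lambda c: fused[c])
```

Keeping the temporal weights separate for each stream lets each modality emphasize different snippets before the stream predictions are combined.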