Multimodal human-computer interaction involves natural communication with virtual and physical environments.
Multimodal output systems present information through visual and auditory cues, and may also make use of touch and olfaction.
A multimodal system then has to recognize the inputs from the different modalities, combining them according to temporal and contextual constraints[3] in order to allow their interpretation (fusion).[13][14][15][16][17][18]
Finally, the system returns outputs to the user through the various modal channels (disaggregated), arranged according to a consistent feedback (fission).[19]
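As a rough illustration of this fusion/fission pipeline, a minimal Python sketch follows; the modality names, the `fuse` and `fission` functions, and the toy rule that pairs a spoken command with a pointing gesture inside a temporal window are hypothetical placeholders, not taken from any cited system.

```python
from dataclasses import dataclass

@dataclass
class ModalEvent:
    modality: str    # e.g. "speech" or "gesture" (hypothetical labels)
    content: str     # output of the unimodal recognizer
    timestamp: float

def fuse(events, window=2.0):
    """Combine events from different modalities that fall within a
    temporal window into a single interpretation (toy rule)."""
    events = sorted(events, key=lambda e: e.timestamp)
    speech = [e for e in events if e.modality == "speech"]
    gesture = [e for e in events if e.modality == "gesture"]
    if speech and gesture and abs(speech[0].timestamp - gesture[0].timestamp) <= window:
        return f"{speech[0].content} -> target: {gesture[0].content}"
    return " / ".join(e.content for e in events)

def fission(interpretation):
    """Distribute a single system response over several output channels."""
    return {
        "visual": f"[screen] {interpretation}",
        "auditory": f"[speech synthesis] {interpretation}",
    }

if __name__ == "__main__":
    inputs = [
        ModalEvent("speech", "delete this file", 0.4),
        ModalEvent("gesture", "icon #3", 0.9),
    ]
    command = fuse(inputs)                            # fusion of the input modalities
    for channel, message in fission(command).items():  # fission over output channels
        print(channel, ":", message)
```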
The pervasive use of mobile devices, sensors, and web technologies can offer adequate computational resources to manage the complexity implied by multimodal interaction.
"Using cloud for involving shared computational resources in managing the complexity of multimodal interaction represents an opportunity.
[20] Two major groups of multimodal interfaces have merged, one concerned in alternate input methods and the other in combined input/output.
The first group of interfaces combines various user input modes beyond the traditional keyboard and mouse input/output, such as speech, pen, touch, manual gestures,[21] gaze, and head and body movements.
On a mobile device with a small visual interface and keypad, a word may be quite difficult to type but very easy to say (e.g. Poughkeepsie).
Other users will be "situationally impaired" (e.g. wearing gloves in a very noisy environment, driving, or needing to enter a credit card number in a public place) and will simply use the appropriate modalities as desired.
The most common form of input multimodality in the market makes use of the XHTML+Voice (aka X+V) Web markup language, an open specification developed by IBM, Motorola, and Opera Software.
Multimodal biometric systems can obtain sets of information from the same marker (e.g., multiple images of an iris, or scans of the same finger) or information from different biometrics (e.g., requiring both a fingerprint scan and a spoken passcode verified by voice recognition).
Matching-score level fusion consolidates the scores generated by multiple classifiers pertaining to different modalities.[51]
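A minimal sketch of matching-score level fusion is given below; min-max normalization followed by a weighted sum is one common textbook choice, and the score ranges, weights, and acceptance threshold used here are hypothetical.

```python
def min_max_normalize(score, lo, hi):
    """Map a raw matcher score into [0, 1] given its observed range."""
    return (score - lo) / (hi - lo)

def fuse_scores(scores, ranges, weights):
    """Weighted-sum fusion of normalized matching scores produced by
    several unimodal biometric matchers (e.g. fingerprint, iris, voice)."""
    total = 0.0
    for modality, raw in scores.items():
        lo, hi = ranges[modality]
        total += weights[modality] * min_max_normalize(raw, lo, hi)
    return total

# Hypothetical raw scores from three unimodal matchers.
scores = {"fingerprint": 74.0, "iris": 0.62, "voice": -1.3}
ranges = {"fingerprint": (0.0, 100.0), "iris": (0.0, 1.0), "voice": (-5.0, 5.0)}
weights = {"fingerprint": 0.5, "iris": 0.3, "voice": 0.2}

fused = fuse_scores(scores, ranges, weights)
print("fused score:", round(fused, 3), "accept:", fused > 0.6)
```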
An important step in multimodal interface design is the creation of natural mappings between modalities and the information and tasks.
Examples of auditory feedback include auditory icons in computer operating systems indicating users' actions (e.g. deleting a file, opening a folder, an error), speech output for presenting navigational guidance in vehicles, and speech output for warning pilots in modern aircraft cockpits.
Examples of tactile signals include vibration of the turn-signal lever to warn drivers of a car in their blind spot, the vibration of a car seat as a warning to drivers, and the stick shaker on modern aircraft alerting pilots to an impending stall.
The process of integrating information from various input modalities and combining them into a complete command is referred to as multimodal fusion.[4][6][59][60][61][62][63][64]
Recognition-based fusion (also known as early fusion) consists of merging the outcomes of each modal recognizer by using integration mechanisms such as statistical integration techniques, agent theory, hidden Markov models, and artificial neural networks.
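A highly simplified sketch of recognition-based (early) fusion follows: the per-frame outputs of two hypothetical recognizers are concatenated into one feature vector and scored by a single linear model standing in for the statistical integration step; the feature values, command labels, and weights are invented for illustration.

```python
# Early fusion: merge recognizer-level features from two modalities into one
# vector, then apply a single integration model to the combined representation.
speech_frame  = [0.8, 0.1, 0.1]   # hypothetical per-class likelihoods from a speech recognizer
gesture_frame = [0.2, 0.7, 0.1]   # hypothetical per-class likelihoods from a gesture recognizer

fused_features = speech_frame + gesture_frame   # feature-level concatenation

# One set of weights plays the role of the joint statistical model
# (a hidden Markov model or neural network would be used in a real system).
weights = {
    "select": [0.9, 0.0, 0.0, 0.8, 0.1, 0.0],
    "move":   [0.1, 0.9, 0.0, 0.1, 0.9, 0.0],
    "delete": [0.0, 0.1, 0.9, 0.1, 0.0, 0.9],
}

scores = {cmd: sum(w * f for w, f in zip(ws, fused_features)) for cmd, ws in weights.items()}
print(max(scores, key=scores.get), scores)
```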
Decision-based fusion (also known as late fusion) merges the semantic information extracted from the individual modalities, using dialogue-driven fusion procedures to yield the complete interpretation. Examples of decision-based fusion strategies are typed feature structures,[55][60] melting pots,[57][58] semantic frames,[7][11] and time-stamped lattices.[8]
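A minimal sketch of decision-based (late) fusion in the spirit of semantic frames over time-stamped inputs is shown below; the slot names, the partial frames, and the temporal window are hypothetical.

```python
# Late fusion: each recognizer delivers a partial semantic frame; the fusion
# step merges compatible, temporally close frames into one command frame.
speech_frame  = {"action": "move", "object": None, "location": None, "t": 1.2}
gesture_frame = {"action": None, "object": "file_42", "location": "trash", "t": 1.6}

def merge_frames(a, b, window=2.0):
    """Merge two partial frames if their timestamps are close enough."""
    if abs(a["t"] - b["t"]) > window:
        return None
    return {slot: a[slot] if a[slot] is not None else b[slot]
            for slot in ("action", "object", "location")}

print(merge_frames(speech_frame, gesture_frame))
# {'action': 'move', 'object': 'file_42', 'location': 'trash'}
```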
The potential applications for multimodal fusion include learning environments, consumer relations, security/surveillance, computer animation, etc.[65]
In hybrid multi-level fusion, the integration of input modalities is distributed among the recognition and decision levels.
Hybrid multi-level fusion includes the following three methodologies: finite-state transducers,[60] multimodal grammars,[6][59][61][62][63][64][66] and dialogue moves.[70]
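As a rough illustration of the finite-state approach, the sketch below walks a tiny hand-written transducer over an interleaved stream of speech and gesture tokens in the classic "put that there" pattern; the states, tokens, and resulting command are hypothetical and do not reproduce any cited transducer.

```python
# A toy finite-state transducer: states advance as speech and gesture tokens
# arrive, and the accepting state yields a fused command.
TRANSITIONS = {
    ("start",       ("speech",  "put")):   "need_object",
    ("need_object", ("gesture", "point")): "need_place",
    ("need_place",  ("speech",  "there")): "need_target",
    ("need_target", ("gesture", "point")): "done",
}

def run_fst(tokens):
    state, args = "start", []
    for modality, value, payload in tokens:
        nxt = TRANSITIONS.get((state, (modality, value)))
        if nxt is None:
            return None            # input not covered by the grammar
        if modality == "gesture":
            args.append(payload)   # remember what was pointed at
        state = nxt
    return ("move", *args) if state == "done" else None

stream = [
    ("speech",  "put",   None),
    ("gesture", "point", "cup"),
    ("speech",  "there", None),
    ("gesture", "point", "table"),
]
print(run_fst(stream))   # ('move', 'cup', 'table')
```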
The natural mapping between the multimodal input, which is provided by several interaction modalities (the visual and auditory channels and the sense of touch), and information and tasks implies managing the typical problems of human-human communication, such as ambiguity.