Subjective video quality tests are psychophysical experiments in which a number of viewers rate a given set of stimuli.[1] The main idea of measuring subjective video quality is similar to the mean opinion score (MOS) evaluation for audio.
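As a minimal illustration, the MOS of a stimulus is simply the average of all collected ratings for it; the 1–5 rating scale and the normal-approximation confidence interval below are common conventions assumed for this sketch, not requirements stated in the text:

```python
from statistics import mean, stdev
from math import sqrt

def mos(ratings):
    """Mean opinion score: the average of all viewers' ratings
    for one stimulus, typically given on a 1-5 scale."""
    return mean(ratings)

def mos_ci95(ratings):
    """Approximate 95% confidence interval half-width for the MOS,
    assuming roughly normally distributed ratings (z = 1.96)."""
    return 1.96 * stdev(ratings) / sqrt(len(ratings))

# Hypothetical ratings from eight viewers for one stimulus:
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
print(mos(ratings))  # → 4.0
```

Reporting the confidence interval alongside the MOS makes it visible how much the score would be expected to vary with a different viewer panel.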
Evaluating the subjective video quality of a video processing system requires a carefully controlled test setup: many parameters of the viewing conditions may influence the results, such as room illumination, display type, brightness, contrast, resolution, viewing distance, and the age and educational level of the viewers.
For example, one may select excerpts from content of different genres, such as action movies, news shows, and cartoons.[3] However, most recommendations for the number of subjects have been designed for measuring the video quality encountered by a home television or PC user, where the range and diversity of distortions tend to be limited (e.g., to encoding artifacts only).
Given the large range and diversity of impairments that may occur in videos captured with mobile devices and/or transmitted over wireless networks, a larger number of human subjects may generally be required.[4] The authors of the cited study claim that, in order to ensure statistically significant differences when comparing ratings, a larger number of subjects than usually recommended may be needed.[2] There is an ongoing discussion in the QoE community as to whether a viewer's cultural, social, or economic background has a significant impact on the obtained subjective video quality results.
However, due to possible influence factors from heterogeneous contexts, it is typically advised to perform tests in a neutral environment, such as a dedicated laboratory room.
An alternative is crowd-based quality assessment (crowdsourcing), in which viewers give ratings using their own computers, at home, rather than taking part in a subjective quality test in a laboratory room.[8]
While this method allows for obtaining more results than traditional subjective tests at a lower cost, the validity and reliability of the gathered responses must be carefully checked.[2][7] For example, the correlation between a person's individual scores and the overall MOS, evaluated over all sequences, is a good indicator of that person's reliability in comparison with the remaining test participants.
Uncontrolled influence factors may lead to different and inaccurate scoring behavior and consequently result in MOS values that are not representative of the “true quality” of a stimulus.
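The correlation-based reliability check described above can be sketched as follows; the 0.75 rejection threshold and the subject labels are illustrative assumptions, not standard values:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two rating vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    varx = sum((a - mx) ** 2 for a in x)
    vary = sum((b - my) ** 2 for b in y)
    return cov / (varx * vary) ** 0.5

def screen_subjects(scores, threshold=0.75):
    """scores: dict mapping subject -> list of ratings, one per sequence.
    Flags each subject by whether their ratings correlate sufficiently
    with the per-sequence MOS computed over all participants.
    The threshold is an assumed illustrative value."""
    n_seq = len(next(iter(scores.values())))
    mos_per_seq = [mean(s[i] for s in scores.values()) for i in range(n_seq)]
    return {subj: pearson(r, mos_per_seq) >= threshold
            for subj, r in scores.items()}

# Hypothetical panel: "s3" rates in the opposite direction of the others.
result = screen_subjects({"s1": [1, 2, 3, 4, 5],
                          "s2": [2, 2, 3, 5, 5],
                          "s3": [5, 4, 3, 2, 1]})
print(result)  # s1 and s2 pass; s3 is flagged as unreliable
```

In practice, standardized screening procedures (e.g., from ITU recommendations) define more elaborate rejection criteria, but the underlying idea is the same agreement check.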
Some test methods are less prone to context effects (i.e., unwanted biases in which the order of the presented stimuli influences the results).
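A common mitigation for such order effects, regardless of the chosen method, is to present the stimuli in an independently randomized order to each viewer, so that any residual order bias averages out across the panel. A minimal sketch (the function and parameter names are illustrative):

```python
import random

def presentation_orders(stimuli, subjects, seed=0):
    """Return one independently shuffled stimulus order per viewer,
    so order-dependent (context) effects do not bias every rating
    in the same way. A fixed seed keeps the session reproducible."""
    rng = random.Random(seed)
    return {subj: rng.sample(stimuli, len(stimuli)) for subj in subjects}

orders = presentation_orders(["clip_a", "clip_b", "clip_c", "clip_d"],
                             ["viewer1", "viewer2"])
print(orders)  # each viewer sees all clips, in their own shuffled order
```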
A number of subjective picture and video quality databases based on such studies have been made publicly available by research institutes.