MUSHRA stands for Multiple Stimuli with Hidden Reference and Anchor and is a methodology for conducting a codec listening test to evaluate the perceived quality of the output from lossy audio compression algorithms.
MUSHRA can be used to test audio codecs across a broad spectrum of use cases: music and film consumption, speech content such as podcasts and radio, online streaming (in which trade-offs between quality and efficiency of size and computation are paramount), modern digital telephony, and VoIP applications (which require quasi-real-time, low-bitrate encoding that remains intelligible).
Professional, "audiophile", and "prosumer" uses are typically better served by alternative tests, such as the aforementioned ABC/HR, which assume high-quality, high-resolution audio in which differences between the reference material and the codec output are barely detectable.
The main advantage over the mean opinion score (MOS) methodology (which serves a similar purpose) is that MUSHRA requires fewer participants to obtain statistically significant results.
The easiest and most common approach is to disqualify, post hoc, all listeners who rate the hidden repeat of the reference below 90 MUSHRA points for more than 15% of all test items.
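The post-hoc screening rule described above can be illustrated with a minimal Python sketch. The data layout (a mapping from listener ID to that listener's hidden-reference ratings, one per test item) and the function names are assumptions made for illustration only; the threshold of 90 points and the 15% limit are taken from the rule as stated.

```python
from typing import Dict, List

REFERENCE_THRESHOLD = 90.0    # MUSHRA points; ratings below this count as a failure
MAX_FAILURE_FRACTION = 0.15   # more than 15% of items failed => disqualify


def failure_fraction(reference_scores: List[float]) -> float:
    """Return the fraction of test items on which the listener rated
    the hidden reference below REFERENCE_THRESHOLD."""
    if not reference_scores:
        raise ValueError("listener has no hidden-reference ratings")
    failures = sum(1 for score in reference_scores if score < REFERENCE_THRESHOLD)
    return failures / len(reference_scores)


def disqualified_listeners(ratings: Dict[str, List[float]]) -> List[str]:
    """Return the sorted IDs of listeners whose failure fraction
    exceeds MAX_FAILURE_FRACTION.

    `ratings` is a hypothetical layout: listener ID -> list of
    hidden-reference ratings, one entry per test item.
    """
    return sorted(
        listener
        for listener, scores in ratings.items()
        if failure_fraction(scores) > MAX_FAILURE_FRACTION
    )


# Illustrative data only: ten test items per listener.
example_ratings = {
    "A": [100, 95, 100, 92, 100, 98, 100, 100, 90, 100],  # no rating below 90
    "B": [100, 80, 100, 70, 100, 100, 95, 100, 100, 100],  # 2 of 10 below 90
    "C": [100, 85, 100, 100, 100, 100, 95, 100, 100, 100],  # 1 of 10 below 90
}

print(disqualified_listeners(example_ratings))
```

In this made-up data set only listener B is excluded: two of ten items (20%) fall below 90 points, whereas listener C's single failure (10%) stays within the 15% limit. A rating of exactly 90 is not counted as a failure, because the rule refers to ratings below 90.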
Discriminability assesses a form of inter-test reliability by checking whether listeners can distinguish between test signals of different conditions.
It negates the small chance of a complete redo in the rare case in which a sample's results lack sufficient statistical power due to an excessive failure rate discovered after the fact.
At the same time, the test items should be ecologically valid: they should be representative of broadcast material and not mere synthetic signals designed to be difficult to encode at the expense of realism.
[8] However, even when stationary items are deliberately chosen, ecologically valid stimuli (i.e. audio that is likely to appear, or is similar to audio likely to appear, in real-world situations such as on radio) will very often have sections that are slightly more critical than the rest of the signal. Which sections these are depends on the stimulus type; examples include keywords in speech and major phrases in music.
A MUSHRA study with Mandarin Chinese and German listeners found no significant difference between rating foreign and native language test items.
Although the end results did not differ, listeners needed more time and more comparison opportunities (repetitions) to evaluate the foreign-language items accurately.