Audio-visual models should understand both what they hear and what they see, but standard benchmarks like VGGSound [1] can't verify either.
Without per-modality ground truth, a model that ignores audio entirely scores the same as one with true audio-visual understanding.
Most clips contain several classes, many of them only audible or only visible, yet each clip carries a single label. Predicting a class that label misses counts as a mistake, and ~48% of samples are modality-misaligned.
VGGSounder re-annotates the VGGSound test set to fix both problems. Here is what changes.
Same videos, richer ground truth: every co-occurring label, tagged as audible, visible, or both.
This unlocks: modality profiling; modality confusion; controlled, confounder-free evaluation.
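To make the per-modality ground truth concrete, here is a minimal Python sketch of how such annotations could be represented and filtered by input modality. The `LabelAnnotation` class, the `ground_truth_for` helper, the modality names, and the example labels are all hypothetical illustrations, not the released VGGSounder format.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class LabelAnnotation:
    """One label on one clip, tagged with the modalities it appears in (hypothetical schema)."""
    label: str
    audible: bool
    visible: bool


def ground_truth_for(annotations: Iterable[LabelAnnotation], modality: str) -> Set[str]:
    """Return the labels that count as correct when the model receives `modality`."""
    annotations = list(annotations)
    if modality == "audio":
        return {a.label for a in annotations if a.audible}
    if modality == "visual":
        return {a.label for a in annotations if a.visible}
    if modality == "audio-visual":
        return {a.label for a in annotations if a.audible or a.visible}
    raise ValueError(f"unknown modality: {modality!r}")


# Made-up clip: one label in both modalities, one audio-only, one visual-only.
clip = [
    LabelAnnotation("playing guitar", audible=True, visible=True),
    LabelAnnotation("male speech", audible=True, visible=False),
    LabelAnnotation("dog", audible=False, visible=True),
]

audio_truth = ground_truth_for(clip, "audio")
visual_truth = ground_truth_for(clip, "visual")
```

With ground truth split this way, an audio-only model is not penalised for missing the silent dog, and a video-only model is not penalised for missing off-screen speech.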
This is the task our Mechanical Turk annotators faced. Watch the clip and decide, for each proposed label, whether it is audible, visible, both, or neither — then check your answers against VGGSounder's ground-truth modality annotations.
A model can answer correctly from audio alone — then lose that answer once it also sees the video. We call this modality confusion.
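One plausible way to quantify modality confusion is sketched below: count the samples where a label predicted correctly from a single modality disappears from the predictions once the second modality is added. This is an assumption-laden illustration; the function name `modality_confusion_rate`, the set-based prediction format, and the toy data are hypothetical, and the paper's exact definition may differ.

```python
from typing import Sequence, Set


def modality_confusion_rate(
    unimodal_preds: Sequence[Set[str]],
    multimodal_preds: Sequence[Set[str]],
    ground_truth: Sequence[Set[str]],
) -> float:
    """Fraction of samples that lose a correct unimodal prediction under multimodal input.

    Assumed definition (not taken from the paper): a sample is "confused" if at
    least one ground-truth label was predicted from the single modality but is
    missing from the prediction made with both modalities.
    """
    if not (len(unimodal_preds) == len(multimodal_preds) == len(ground_truth)):
        raise ValueError("all inputs must have one entry per sample")
    if not ground_truth:
        return 0.0

    confused = 0
    for uni, multi, gt in zip(unimodal_preds, multimodal_preds, ground_truth):
        correct_alone = uni & gt      # labels the model got right from one modality
        lost = correct_alone - multi  # ...that vanish once the other modality is added
        if lost:
            confused += 1
    return confused / len(ground_truth)


# Toy data: four samples with made-up labels.
gt = [{"dog"}, {"guitar"}, {"speech"}, {"rain"}]
audio_only = [{"dog"}, {"guitar"}, {"music"}, {"rain"}]
audio_visual = [{"cat"}, {"guitar"}, {"speech"}, {"rain", "wind"}]

rate = modality_confusion_rate(audio_only, audio_visual, gt)
```

In the toy data only the first sample is confused: "dog" is correct from audio alone and lost with video added. The third sample does not count, because the audio-only prediction was already wrong.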
VGGSounder's per-modality annotations let us profile every model — and even surface biases in the data they were trained on.
The VGGSound dataset is commonly used to benchmark audio-visual classification, but it suffers from incomplete labelling, partially overlapping classes, and misaligned modalities, which distort evaluations of auditory and visual capabilities. We introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is designed to evaluate audio-visual foundation models. VGGSounder adds detailed per-label modality annotations, enabling precise modality-specific analyses, and we reveal model limitations by measuring the performance degradation that occurs when a second input modality is added — our new modality confusion metric.
@inproceedings{zverevwiedemer2025vggsounder,
author = {Daniil Zverev and Thaddäus Wiedemer and Ameya Prabhu and Matthias Bethge and Wieland Brendel and A. Sophia Koepke},
title = {VGGSounder: Audio-Visual Evaluations for Foundation Models},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025}
}