1 Technical University of Munich, MCML; 2 University of Tübingen; 3 Tübingen AI Center; 4 MPI for Intelligent Systems, ELLIS Institute Tübingen
*Equal contributors

Can we tell if a model is truly multimodal?

Audio-visual models should understand both what they hear and what they see, but standard benchmarks like VGGSound [1] can't verify either.

1. Joint scores can't tell listening from looking

Without per-modality ground truth, a model that ignores audio entirely scores the same as one with true audio-visual understanding.

2. One label per clip marks correct predictions as errors

Most clips contain several classes — many only audible or only visible. Predicting a class the single label misses counts as a mistake, and ~48% of samples are modality-misaligned.
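To make the scoring problem concrete, here is a minimal sketch of how the same prediction is judged under a single label versus a full label set. The class names and helper functions are illustrative assumptions, not taken from the dataset or its evaluation code.

```python
# Illustrative only: class names and helpers are hypothetical.

def single_label_hit(prediction, label):
    """Single-label scoring: only the one annotated class counts."""
    return prediction == label

def multi_label_hit(prediction, labels):
    """Multi-label scoring: any class present in the clip counts."""
    return prediction in labels

# A clip where someone sings over a guitar, annotated two ways.
single_label = "playing acoustic guitar"
all_labels = {"playing acoustic guitar", "male singing"}

prediction = "male singing"  # genuinely present, but not the single label

print(single_label_hit(prediction, single_label))  # False
print(multi_label_hit(prediction, all_labels))     # True
```

The prediction is identical in both cases; only the completeness of the ground truth changes the verdict.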

VGGSounder re-annotates VGGSound's test set to fix both. Here is what changes.

VGGSounder vs VGGSound

Same videos, richer ground truth: every co-occurring label, tagged as audible, visible, or both.

VGGSound
  • A single label per clip
  • No modality information
  • ~48% of samples are modality-misaligned
  • No meta-labels for confounders
  • Co-occurring classes counted as errors
  • Has both train and test splits
VGGSounder
  • Multi-label: every co-occurring class
  • Per-label audible / visible / both
  • Modality-aligned subsets you can filter to
  • Meta-labels: background music, voice-over, static image
  • Synonym & superclass merging
  • Test split only (re-annotates VGGSound's test set)
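One way to picture the richer ground truth is as a per-clip record holding labels with modality tags plus meta-labels. The sketch below assumes a hypothetical layout (`Label`, `Clip`, `targets`, `drop_confounded` are made-up names, and the released annotation files may be organised differently); it only illustrates how per-modality targets and confounder-free subsets could be derived.

```python
from dataclasses import dataclass, field

# Hypothetical record layout, for illustration only.
@dataclass
class Label:
    name: str
    modality: str  # one of "audible", "visible", "both"

@dataclass
class Clip:
    clip_id: str
    labels: list
    meta: set = field(default_factory=set)  # e.g. {"background music"}

def targets(clip, modality):
    """Class names that count as correct for input of the given modality."""
    return {l.name for l in clip.labels if l.modality in (modality, "both")}

def drop_confounded(clips, confounders):
    """Keep only clips that carry none of the listed meta-labels."""
    return [c for c in clips if not (c.meta & set(confounders))]

clips = [
    Clip("clip_a",
         [Label("dog barking", "both"), Label("people talking", "audible")],
         {"voice-over"}),
    Clip("clip_b", [Label("car passing by", "visible")]),
]

print(targets(clips[0], "audible"))  # both labels of clip_a
print(targets(clips[0], "visible"))  # only the label tagged "both"
print([c.clip_id for c in drop_confounded(clips, {"voice-over"})])
```

Treating a label tagged "both" as valid for either single-modality input is an assumption of this sketch.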

This unlocks: modality profiling; modality confusion; controlled, confounder-free evaluation

🎬 Can you guess every label in this clip? Try the annotation task

This is the task our Mechanical Turk annotators faced. Watch the clip and decide, for each proposed label, whether it is audible, visible, both, or neither — then check your answers against VGGSounder's ground-truth modality annotations.

Dataset Preview

Adding a modality can make models worse

A model can answer correctly from audio alone, then lose that answer once it also sees the video. We call this modality confusion.
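A rough sketch of how such a quantity could be computed from per-sample results follows. It assumes one boolean per sample for the single-modality run and one for the audio-visual run, and it normalises by the total number of samples; both the function name and the normalisation are assumptions of this sketch, so consult the paper for the exact definition.

```python
def modality_confusion(correct_unimodal, correct_multimodal):
    """Percentage of samples solved with one modality but lost once the
    second modality is added. Inputs are per-sample booleans."""
    if len(correct_unimodal) != len(correct_multimodal):
        raise ValueError("both runs must cover the same samples")
    if not correct_unimodal:
        return 0.0
    lost = sum(u and not m for u, m in zip(correct_unimodal, correct_multimodal))
    return 100.0 * lost / len(correct_unimodal)

# Toy results for four samples.
audio_only   = [True, True,  False, True]
audio_visual = [True, False, False, False]

print(modality_confusion(audio_only, audio_visual))  # 50.0
```

Samples that were already wrong with audio alone do not count: the measure isolates answers that the added modality took away.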

Modality confusion across models

VGGSounder's per-modality annotations let us profile every model — and even surface biases in the data they were trained on.

Many models are vision-centric

Some models overfit to speech

Models see better when things move

Video

Abstract

The VGGSound dataset is commonly used to benchmark audio-visual classification, but it suffers from incomplete labelling, partially overlapping classes, and misaligned modalities, which distort evaluations of auditory and visual capabilities. We introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is designed to evaluate audio-visual foundation models. VGGSounder adds detailed per-label modality annotations, enabling precise modality-specific analyses, and we reveal model limitations by measuring the performance degradation that occurs when a second input modality is added — our new modality confusion metric.

BibTeX

@inproceedings{zverevwiedemer2025vggsounder,
  author    = {Daniil Zverev and Thaddäus Wiedemer and Ameya Prabhu and Matthias Bethge and Wieland Brendel and A. Sophia Koepke},
  title     = {VGGSounder: Audio-Visual Evaluations for Foundation Models},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}