🐗 VGGSounder: Audio-Visual Evaluations for Foundation Models

Daniil Zverev*,¹, Thaddäus Wiedemer*,²,³,⁴, Ameya Prabhu²,³, Matthias Bethge²,³, Wieland Brendel³,⁴, A. Sophia Koepke¹,²,³

¹Technical University of Munich, MCML; ²University of Tübingen; ³Tübingen AI Center; ⁴MPI for Intelligent Systems, ELLIS Institute Tübingen

*Equal contributors

Does your model actually listen or just look?

A good audio-visual model should understand what it hears and what it sees equally well. Often it doesn't: many models quietly favour one sense over the other. Standard benchmarks like VGGSound [1] can't catch this, because they never record whether a label can be heard or seen.

VGGSounder tags every label in a clip as heard, seen, or both, so you can check exactly what a model hears and sees, and quantify which modality it prefers.

What we found

👁️ Vision over audio. Many foundation models lean on what they see and forget what they hear once video is added.
🎞️ Motion over stills. Many models do noticeably worse on clips that are just a still image than on real video.
🗣️ Speech bias. Several models lean on spoken narration as a shortcut, and stumble when it's gone.

VGGSounder vs VGGSound

Same videos, richer ground truth: every co-occurring label, tagged as audible, visible, or both.

VGGSound

✗ One single label per clip
✗ No modality information
✗ ~48% of samples are modality-misaligned
✗ No meta-labels for confounders
✗ Co-occurring classes counted as errors
✓ Has both train and test splits

VGGSounder

✓ Multi-label: every co-occurring class
✓ Per-label audible / visible / both
✓ Modality-aligned subsets you can filter to
✓ Meta-labels: background music, voice-over, static image
✓ Synonym & superclass merging
✗ Test split only (re-annotates VGGSound's test set)

What this makes measurable: modality profiling, modality confusion, and controlled, confounder-free evaluation.
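As a minimal sketch of how per-label modality tags and meta-labels could drive this kind of evaluation, the snippet below builds one annotated clip and derives modality-specific target sets from it. The names `Label`, `Clip`, `targets`, and `without_confounders` are hypothetical and are not taken from the VGGSounder code base. Treating the audio-visual target set as the union of audible and visible labels, and storing meta-labels in a separate set, are also assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Label:
    """One class annotated on a clip, with its modality tags (hypothetical schema)."""
    name: str
    audible: bool
    visible: bool


@dataclass
class Clip:
    """A test clip with all co-occurring labels and its meta-labels."""
    clip_id: str
    labels: list[Label]
    meta: set[str] = field(default_factory=set)


def targets(clip: Clip, modality: str) -> set[str]:
    """Return the label names a model should predict for the given input modality."""
    if modality == "audio":
        return {l.name for l in clip.labels if l.audible}
    if modality == "visual":
        return {l.name for l in clip.labels if l.visible}
    if modality == "audio-visual":
        # Assumption: with both inputs, any audible or visible label counts.
        return {l.name for l in clip.labels if l.audible or l.visible}
    raise ValueError(f"unknown modality: {modality!r}")


def without_confounders(clips: list[Clip], confounders: set[str]) -> list[Clip]:
    """Keep only clips that carry none of the given meta-labels."""
    return [c for c in clips if not (c.meta & confounders)]


# A reduced, illustrative clip in the spirit of the dataset preview.
example = Clip(
    clip_id="example",
    labels=[
        Label("male speech, man speaking", audible=True, visible=False),
        Label("sea waves", audible=False, visible=True),
        Label("splashing water", audible=True, visible=True),
    ],
    meta={"voice over"},
)

audio_targets = targets(example, "audio")
visual_targets = targets(example, "visual")
clean_clips = without_confounders([example], {"voice over", "background music"})
```

Here `audio_targets` excludes "sea waves", which is only visible, and `clean_clips` is empty because the example clip carries a voice-over meta-label.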

🎬 Can you guess every label in this clip? Try the annotation task

This is the task our Mechanical Turk annotators faced. Watch the clip and decide, for each proposed label, whether it is audible, visible, both, or neither, then check your answers against VGGSounder's ground-truth modality annotations.

Dataset Preview

Audible only: male speech, man speaking; playing bass guitar; playing drum kit; playing electric guitar; background music (Meta); voice over (Meta)
Visible only: sea waves
Audible & Visible: sloshing water; splashing water; spraying water; motorboat, speedboat acceleration (Original)

Audible only: male speech, man speaking; people sniggering
Audible & Visible: fireworks banging (Original); footsteps on snow

Audible only: people belly laughing; people sniggering
Visible only: splashing water
Audible & Visible: dog barking (Original); dog bow-wow; goat bleating; sea lion barking

Audible only: air horn; male speech, man speaking; voice over (Meta)
Audible & Visible: people cheering; people crowd; playing hockey; playing lacrosse

Audible only: people belly laughing; voice over (Meta)
Audible & Visible: child speech, kid speaking; children shouting; male speech, man speaking; people cheering; people crowd; people screaming; sloshing water (Original); swimming; people sniggering

Audible only: orchestra; playing drum kit; playing flute; playing trombone; playing trumpet; playing violin, fiddle; background music (Meta)
Audible & Visible: dog barking; dog bow-wow (Original); dog baying; dog growling; dog whimpering

Audible only: cricket chirping; female speech, woman speaking; owl hooting (Original); wind noise; voice over (Meta)
Audible & Visible: fire crackling

Audible only: people screaming; voice over (Meta)
Audible & Visible: engine accelerating, revving, vroom; motorboat, speedboat acceleration (Original); splashing water

Adding a modality can make models worse

A model can answer correctly from audio alone, then lose that answer once it also sees the video. We call this modality confusion.
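One way to turn this description into a number is sketched below: count the cases a model gets right with a single modality but wrong once the second modality is added, and divide by the total number of cases. The function name `modality_confusion` and the per-case boolean inputs are hypothetical, and the normalisation is an assumption. The metric as defined in the paper may differ in detail, for example in whether it is computed per label or per clip.

```python
from typing import Sequence


def modality_confusion(
    unimodal_correct: Sequence[bool],
    multimodal_correct: Sequence[bool],
) -> float:
    """Fraction of cases solved with one modality but lost with both.

    Each position refers to the same evaluation case (for example one
    clip-label pair) under the two input settings.
    """
    if len(unimodal_correct) != len(multimodal_correct):
        raise ValueError("both inputs must cover the same cases")
    if not unimodal_correct:
        return 0.0
    # A case is "confused" if it was correct before and wrong after.
    lost = sum(
        1 for uni, multi in zip(unimodal_correct, multimodal_correct)
        if uni and not multi
    )
    return lost / len(unimodal_correct)


# Illustrative values only, not results from the paper.
audio_only = [True, True, False, True]
audio_and_video = [True, False, False, False]
confusion = modality_confusion(audio_only, audio_and_video)  # 2 of 4 cases lost
```

Swapping in video-only correctness as the first argument gives the complementary view: how often adding audio costs a model an answer it had from vision alone.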

Modality confusion across models

VGGSounder's per-modality annotations let us profile every model, and even surface biases in the data they were trained on.

Video

Abstract

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multimodal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.

BibTeX

@inproceedings{zverevwiedemer2025vggsounder,
  author    = {Daniil Zverev and Thaddäus Wiedemer and Ameya Prabhu and Matthias Bethge and Wieland Brendel and A. Sophia Koepke},
  title     = {VGGSounder: Audio-Visual Evaluations for Foundation Models},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}
