¹Technical University of Munich, MCML · ²University of Tübingen · ³Tübingen AI Center · ⁴MPI for Intelligent Systems, ELLIS Institute Tübingen
*Equal contributors

Abstract

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, using our new modality confusion metric, we reveal model limitations by analysing the performance degradation that occurs when an additional input modality is added.
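As a rough illustration of the idea behind a modality confusion metric (not necessarily the paper's exact formulation), one could measure the accuracy drop between uni-modal and audio-visual predictions on a multi-label test set. The function and argument names below (`accuracy`, `preds_single`, `preds_both`) are hypothetical and only sketch the computation under these assumptions.

```python
import numpy as np


def accuracy(preds: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of samples whose predicted class is among the ground-truth labels.

    preds:  (N,) array of predicted class indices
    labels: (N, C) binary multi-label ground-truth matrix
    """
    return float(np.mean(labels[np.arange(len(preds)), preds] > 0))


def modality_confusion(preds_single: np.ndarray,
                       preds_both: np.ndarray,
                       labels: np.ndarray) -> float:
    """Illustrative confusion score: performance degradation when a second
    input modality is added.

    A positive value means the model does worse with both modalities than
    with the single modality alone, i.e. the extra modality 'confuses' it.
    """
    return accuracy(preds_single, labels) - accuracy(preds_both, labels)
```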

Samples

Video

BibTeX

@inproceedings{zverevwiedemer2025vggsounder,
  author    = {Daniil Zverev and Thaddäus Wiedemer and Ameya Prabhu and Matthias Bethge and Wieland Brendel and A. Sophia Koepke},
  title     = {VGGSounder: Audio-Visual Evaluations for Foundation Models},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}