Please confirm that you will use this dataset only for research or educational evaluation and will not redistribute the video clips as standalone media.

JointAVBench contains short, low-resolution clips derived from public online videos for audio-visual reasoning research. By requesting access, you agree to use the dataset for research or educational purposes, not to redistribute the clips as standalone media, and to respect valid takedown requests.

JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

Overview

JointAVBench is a benchmark for evaluating omni-modal large language models on joint audio-visual reasoning tasks. Each multiple-choice question is designed to require both visual and auditory information.

This repository contains the audited release of JointAVBench under the roverx12345 namespace. The benchmark keeps the original 2,853-question split while refining answer labels through post-audit majority voting to reduce residual annotation inconsistencies.

Key Features

2,853 MCQs across 15 task types.
5 cognitive dimensions: Temporal, Spatial, Long-form, Emotion, and Plot understanding.
4 audio information types: Speech, Sound events, Music, and Vocal traits.
3 scene spans: Single-scene, Multi-scene, and Full-scene reasoning.
Audited answer labels: labels are refined through an additional independent LLM audit and a majority-vote procedure.
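Because the questions span 15 task types, results are usually easier to read when broken down per task. The following is a minimal sketch, not an official evaluation script. It assumes each benchmark item carries the qid, task, and answer_label fields found in jointavbench.json, and that model predictions are supplied as a mapping from qid to a predicted option letter. The function name accuracy_by_task is hypothetical.

```python
from collections import defaultdict


def accuracy_by_task(items, predictions):
    """Compute accuracy per task type.

    items: list of benchmark dicts with "qid", "task", and "answer_label".
    predictions: dict mapping qid -> predicted option letter (assumed format).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        task = item["task"]
        total[task] += 1
        # A missing prediction is counted as wrong.
        if predictions.get(item["qid"]) == item["answer_label"]:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}
```

Counting missing predictions as wrong keeps the denominator equal to the number of questions in each task, so scores stay comparable across models that skip items.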

Dataset Structure

JointAVBench/
├── jointavbench.json           # Audited benchmark questions
├── video_annotations.json      # Video/audio caption annotations
├── subtitle.zip                # Subtitle/transcription files
├── videos/
│   └── <qid>.mp4               # Low-resolution benchmark clips
└── README.md
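After downloading, a quick check that the local copy matches the layout above can catch incomplete downloads early. This is a small sketch under the assumption that the repository has been placed in a local directory; the helper name missing_entries is hypothetical, and only the top-level entries shown in the tree are checked.

```python
from pathlib import Path

# Top-level entries taken from the directory tree above.
EXPECTED_ENTRIES = [
    "jointavbench.json",
    "video_annotations.json",
    "subtitle.zip",
    "videos",
    "README.md",
]


def missing_entries(root):
    """Return the expected top-level entries that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

An empty list means every expected entry is present; it does not verify that each individual clip under videos/ exists.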

Data Format

Each item in jointavbench.json follows this format:

{
  "qid": "-CEDoGn0w1s_task1_0",
  "video_name": "-CEDoGn0w1s",
  "task": "STL",
  "question": "Which objects are mentioned only in the dialogue but not clearly shown in the video, and when does the first object appear in the dialogue?",
  "correct_answer": "The broom, mentioned at around 6.34s",
  "explanation": "The object \"broom\" is mentioned in the dialogue but does not appear in the video description. It is the first object mentioned in the dialogue, appearing at around 6.34s.",
  "options": [
    "The shovel, mentioned at around 6.34s",
    "The keys, mentioned at around 3.36s",
    "The hat, mentioned at around 12.76s",
    "The broom, mentioned at around 6.34s"
  ],
  "video_url": "https://www.youtube.com/watch?v=-CEDoGn0w1s",
  "segment_timestamp": [653.444, 699.657],
  "answer_label": "D",
  "clip_path": "videos/-CEDoGn0w1s_task1_0.mp4"
}

The answer_label field is included in this audited release to make option-based evaluation unambiguous. The video_url field gives the original YouTube source URL, while clip_path points to the corresponding benchmark clip in this repository.
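The relationship between answer_label, options, and correct_answer can be checked with a short script. The sketch below assumes that jointavbench.json is a top-level JSON list, that labels A, B, C, ... index the options in order, and that clip_path follows the videos/<qid>.mp4 pattern from the directory tree. These assumptions match the example item but are not confirmed for every entry. Since the audit may have changed some labels, a reported mismatch with correct_answer is a prompt for inspection, not necessarily an error. The function names are hypothetical.

```python
import json
import string


def load_items(path):
    """Load benchmark items, assuming the file holds a JSON list of dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def check_item(item):
    """Return a list of consistency problems found in one benchmark item."""
    problems = []
    options = item["options"]
    label = item["answer_label"]
    letters = string.ascii_uppercase[: len(options)]

    # The label must be a single letter that indexes an existing option.
    if len(label) != 1 or label not in letters:
        problems.append(f"label {label!r} is not one of {letters}")
        return problems

    # Assumed convention: "A" is options[0], "B" is options[1], and so on.
    if options[letters.index(label)] != item["correct_answer"]:
        problems.append("answer_label does not point at correct_answer")

    # Assumed convention: clips are stored as videos/<qid>.mp4.
    if item["clip_path"] != f"videos/{item['qid']}.mp4":
        problems.append("clip_path does not match videos/<qid>.mp4")
    return problems
```

Running check_item over every entry returned by load_items gives a quick sanity pass before evaluation.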

Video Clips and Source Links

The repository provides short benchmark clips under videos/ for reproducing audio-visual reasoning evaluation. The released clips are low-resolution, so their resolution may differ from that of the source videos. The original source link for each clip is retained in the video_url field; if you need the original 1080p high-resolution videos, please refer to those links.
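To locate a clip inside its source video, the video_url and segment_timestamp fields can be combined. The sketch below assumes that segment_timestamp holds [start, end] in seconds relative to the original video, which the example item suggests but the card does not state explicitly. It relies on the t query parameter that YouTube watch URLs accept for a start offset. The function name source_link_at_segment is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def source_link_at_segment(item):
    """Return (url starting at the segment, segment duration in seconds).

    Assumes item["segment_timestamp"] is [start, end] in seconds.
    """
    start, end = item["segment_timestamp"]
    parts = urlparse(item["video_url"])
    query = parse_qs(parts.query)
    # Whole seconds only; the fractional part of the start time is dropped.
    query["t"] = [f"{int(start)}s"]
    url = urlunparse(parts._replace(query=urlencode(query, doseq=True)))
    return url, round(end - start, 3)
```

The returned link only opens the source at the approximate start of the segment; it does not download or cut the video.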

Please do not redistribute the clips as standalone media or use them as a substitute for the original videos. If you are a rights holder and have concerns about a clip, contact the maintainers and we will review removal requests promptly.

Audit Note

After manual verification, we perform an additional answer-label audit. For each retained MCQ, an independent strong LLM auditor selects one answer from the question and its options, without using outputs from evaluated models. The auditor's answer is combined with the original annotation via majority voting. If no majority is reached, the original human-verified label is retained. This process refines labels without removing samples.
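The voting rule can be illustrated with a small sketch. It is not the authors' audit code: the number of votes per question and the exact tie-breaking details are not specified here, so the sketch assumes the original label counts as one vote alongside any number of auditor votes, and that a label wins only with a strict majority. The function name resolve_label is hypothetical.

```python
from collections import Counter


def resolve_label(original_label, audit_votes):
    """Return the strict-majority label, or the original label if none exists.

    Assumption: the original annotation counts as one vote among the others.
    """
    votes = [original_label, *audit_votes]
    label, count = Counter(votes).most_common(1)[0]
    # Strict majority: more than half of all votes must agree.
    if count > len(votes) / 2:
        return label
    return original_label
```

For example, resolve_label("D", ["A", "A"]) returns "A", while resolve_label("D", ["A", "B"]) keeps "D" because no label reaches a majority.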

License

This dataset is released under the CC BY-SA 4.0 license.

Total file size: 17.9 GB

Paper for roverx12345/jointavbench

JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

Paper • 2512.12772 • Published Dec 14, 2025