We evaluate a range of models, including both LLMs and LMMs, and for each type we consider both closed- and open-source models. Our evaluation is conducted in a zero-shot setting to assess each model's ability to generate accurate answers without fine-tuning or few-shot demonstrations on our benchmark. For all models, we use the default prompt provided by each model for multiple-choice or open QA, if available. If a model does not provide prompts for the task types in MMMU, we perform prompt engineering on the validation set and use the most effective prompt in the subsequent zero-shot experiments.
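As a concrete illustration of this zero-shot setup, the sketch below shows one way to format a multiple-choice question into a text prompt and to parse the predicted option letter from a model's response. The template wording, option labels, and the `parse_choice` heuristic are illustrative assumptions for this sketch, not the exact prompts used for each model in our experiments.

```python
# Minimal sketch of a zero-shot multiple-choice prompt and answer parser.
# The prompt wording and parsing heuristic are assumptions, not MMMU's
# official per-model prompts.
import re
from typing import List, Optional

OPTION_LABELS = "ABCDEFGHIJ"

def build_mc_prompt(question: str, options: List[str]) -> str:
    """Format one multiple-choice question as a single text prompt."""
    lines = [question, ""]
    for label, option in zip(OPTION_LABELS, options):
        lines.append(f"({label}) {option}")
    lines.append("")
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)

def parse_choice(response: str, num_options: int) -> Optional[str]:
    """Extract the predicted option letter from a model's free-form response."""
    valid = OPTION_LABELS[:num_options]
    match = re.search(rf"\b([{valid}])\b", response.strip())
    return match.group(1) if match else None

if __name__ == "__main__":
    prompt = build_mc_prompt(
        "Which imaging modality is shown in <image 1>?",
        ["CT", "MRI", "Ultrasound", "X-ray"],
    )
    print(prompt)
    print(parse_choice("The answer is (B) MRI.", 4))  # -> "B"
```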
Model | Overall | Art & Design | Business | Science | Health & Medicine | Human. & Social Sci. | Tech & Eng. |
--- | --- | --- | --- | --- | --- | --- | --- |
Human Expert (Best) | 88.6 | 89.2 | 90.7 | 90.0 | 87.3 | 89.2 | 86.2 |
Human Expert (Medium) | 82.6 | 84.2 | 86.0 | 84.7 | 78.8 | 85.0 | 79.1 |
Human Expert (Worst) | 76.2 | 80.8 | 78.0 | 78.0 | 73.3 | 74.2 | 74.3 |
Gemini Ultra* | 59.4 | 70.0 | 56.7 | 48.0 | 67.3 | 78.3 | 47.1 |
Claude 3 Opus* | 59.4 | 67.5 | 67.2 | 48.9 | 61.1 | 70.0 | 50.6 |
GPT-4V(ision) (Playground) | 56.8 | 65.8 | 59.3 | 54.7 | 64.7 | 72.5 | 36.7 |
Reka Core* | 56.3 | 75.9 | 47.3 | 49.3 | 58.0 | 75.0 | 44.2 |
SenseChat-Vision-0423-Preview* | 54.6 | 66.7 | 54.0 | 45.3 | 53.3 | 75.0 | 43.8 |
Reka Flash* | 53.3 | 61.7 | 42.7 | 47.3 | 59.3 | 74.2 | 44.3 |
Claude 3 Sonnet* | 53.1 | 61.7 | 58.2 | 37.1 | 57.1 | 68.7 | 45.0 |
HPT Pro* | 52.0 | 66.7 | 43.3 | 42.7 | 50.7 | 72.5 | 43.8 |
VILA1.5* | 51.9 | 60.8 | 43.3 | 36.0 | 57.3 | 73.3 | 48.1 |
InternVL-Chat-V1.2* | 51.6 | 62.5 | 40.7 | 39.3 | 58.7 | 70.0 | 46.2 |
Qwen-VL-MAX* | 51.4 | 72.5 | 43.3 | 40.0 | 58.0 | 69.2 | 38.6 |
LLaVA-1.6-34B* | 51.1 | 67.5 | 46.0 | 39.3 | 52.0 | 67.5 | 43.8 |
Claude 3 Haiku* | 50.2 | 60.8 | 52.5 | 37.1 | 52.3 | 66.0 | 41.5 |
Adept Fuyu-Heavy* | 48.3 | 53.4 | 46.3 | 33.7 | 51.3 | 72.2 | 44.0 |
Gemini Pro* | 47.9 | - | - | - | - | - | - |
Marco-VL-Plus* | 46.2 | 60.8 | 37.3 | 35.3 | 48.7 | 69.2 | 37.1 |
Yi-VL-34B* | 45.9 | 59.2 | 36.0 | 33.3 | 51.3 | 62.5 | 41.0 |
Qwen-VL-PLUS* | 45.2 | 60.0 | 35.3 | 37.3 | 46.7 | 65.8 | 36.7 |
HPT Air* | 44.0 | 63.3 | 31.3 | 34.7 | 45.3 | 59.2 | 42.9 |
InternLM-XComposer2-VL* | 43.0 | 60.0 | 34.0 | 34.7 | 46.0 | 62.5 | 32.4 |
Reka Edge* | 42.8 | 52.5 | 36.0 | 42.7 | 41.3 | 59.2 | 33.8 |
Marco-VL* | 41.2 | 57.5 | 30.0 | 28.0 | 45.3 | 65.8 | 32.4 |
OmniLMM-12B* | 41.1 | 58.3 | 34.0 | 27.3 | 44.0 | 62.5 | 31.9 |
InfiMM-Zephyr-7B* | 39.4 | 55.8 | 28.0 | 33.3 | 42.7 | 59.2 | 29.0 |
Yi-VL-6B* | 39.1 | 52.5 | 30.7 | 31.3 | 38.0 | 53.3 | 35.7 |
InternVL-Chat-V1.1* | 39.1 | 56.7 | 34.7 | 31.3 | 39.3 | 57.5 | 27.1 |
Bunny-3B* | 38.2 | 49.2 | 30.7 | 30.7 | 40.7 | 45.0 | 37.1 |
SVIT* | 38.0 | 52.5 | 27.3 | 28.0 | 42.0 | 51.7 | 33.8 |
MiniCPM-V* | 37.2 | 55.8 | 33.3 | 28.0 | 32.7 | 58.3 | 27.1 |
MiniCPM-V-2* | 37.1 | 63.3 | 28.7 | 30.0 | 30.0 | 56.7 | 27.1 |
LLaVA-1.5-13B | 36.4 | 51.7 | 22.7 | 29.3 | 38.7 | 53.3 | 31.4 |
Emu2-Chat* | 36.3 | 55.0 | 30.0 | 28.7 | 28.7 | 46.7 | 35.2 |
Qwen-VL-7B-Chat | 35.9 | 51.7 | 29.3 | 29.3 | 33.3 | 45.0 | 32.9 |
InstructBLIP-T5-XXL | 35.7 | 44.2 | 24.0 | 30.7 | 35.3 | 49.2 | 35.2 |
BLIP-2 FLAN-T5-XXL | 35.4 | 41.7 | 30.0 | 34.7 | 32.0 | 50.8 | 30.0 |
BLIP-2 FLAN-T5-XL | 34.4 | 44.2 | 26.7 | 30.7 | 35.3 | 50.0 | 27.6 |
InstructBLIP-T5-XL | 32.9 | 40.0 | 28.0 | 32.7 | 28.7 | 47.5 | 27.1 |
SPHINX* | 32.9 | 48.3 | 24.7 | 26.7 | 30.7 | 50.0 | 26.2 |
mPLUG-OWL2* | 32.7 | 45.8 | 24.7 | 22.7 | 32.0 | 45.8 | 31.0 |
Gemini Nano2* | 32.6 | - | - | - | - | - | - |
Otter | 32.2 | 37.5 | 24.0 | 34.7 | 30.7 | 41.7 | 29.0 |
CogVLM | 32.1 | 40.8 | 25.3 | 28.0 | 32.0 | 45.0 | 27.6 |
LLaMA-Adapter2-7B | 29.8 | 29.2 | 25.3 | 30.7 | 30.7 | 33.3 | 30.0 |
OpenFlamingo2-9B | 28.7 | 40.0 | 28.0 | 23.3 | 27.3 | 30.8 | 26.2 |
Adept Fuyu-8B | 27.9 | 36.7 | 32.0 | 22.0 | 28.0 | 32.5 | 21.4 |
MiniGPT4-Vicuna-13B | 26.8 | 29.2 | 21.3 | 28.7 | 30.7 | 29.2 | 23.8 |
Frequent Choice | 26.8 | 23.3 | 29.3 | 27.3 | 30.0 | 25.8 | 24.8 |
Kosmos2 | 24.4 | 25.0 | 18.0 | 19.3 | 28.0 | 30.0 | 26.7 |
Random Choice | 22.1 | 29.2 | 24.7 | 18.0 | 20.7 | 20.0 | 21.4 |
Overall results of different models on the MMMU test set. The best-performing model in each category is in bold, and the second best is underlined. *: results provided by the authors.