

GRADE: Grounded Reasoning Assessment for Discipline-informed Editing

Mingxin Liu1,*, Ziqian Fan2,*, Zhaokai Wang1,*,†, Leyao Gu1,*, Zirun Zhu1,*, Yiguo He1, Yuchen Yang3, Changyao Tian4, Xiangyu Zhao1, Ning Liao1, Shaofeng Zhang5, Qibing Ren1, Zhihang Zhong1, Xuanhe Zhou1, Junchi Yan1, Xue Yang1,†

1 Shanghai Jiao Tong University    2 South China University of Technology    3 Fudan University
4 The Chinese University of Hong Kong    5 University of Science and Technology of China

* Equal Contribution    † Corresponding Author


Overview of GRADE: 520 discipline-informed image editing samples across 10 academic domains.

🧠 Introduction

GRADE is the first benchmark for evaluating discipline-informed knowledge and reasoning in image editing. It comprises 520 carefully curated samples across 10 academic domains, from the natural sciences to the social sciences, and provides a multi-dimensional automated evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability.

Evaluation pipeline: (A) Discipline Reasoning via question-guided VQA, (B) Visual Consistency with task-specific prompts, (C) Logical Readability for clarity and correctness.


🔥 Leaderboard

| Model | Reasoning | Consistency | Readability | Accuracy |
|---|---|---|---|---|
| **Closed Source Models** | | | | |
| Nano Banana Pro | 77.5 | 89.5 | 95.8 | 46.2 |
| Nano Banana 2 | 72.6 | 86.4 | 95.9 | 39.6 |
| Seedream 5.0 | 64.1 | 87.5 | 90.6 | 24.7 |
| GPT-Image-1.5 | 54.5 | 82.3 | 90.7 | 16.0 |
| FLUX.2 Max | 47.8 | 67.2 | 68.6 | 11.9 |
| Nano Banana | 42.2 | 75.1 | 82.0 | 9.0 |
| Seedream 4.5 | 41.3 | 55.6 | 82.1 | 6.9 |
| GPT-Image-1.0 | 44.0 | 65.2 | 82.3 | 6.0 |
| FLUX.2 Pro | 38.9 | 55.5 | 70.3 | 4.4 |
| Seedream 4.0 | 32.4 | 53.2 | 77.0 | 3.1 |
| **Open Source Models** | | | | |
| Qwen-Edit-2511 | 18.6 | 45.2 | 52.1 | 2.7 |
| Step-1x (think+reflect) | 19.2 | 57.2 | 66.9 | 2.3 |
| Step-1x (think) | 17.6 | 56.3 | 68.2 | 1.4 |
| DreamOmni | 17.4 | 83.2 | 89.1 | 1.0 |
| Step-1x | 17.3 | 52.8 | 63.7 | 1.0 |
| Bagel | 15.2 | 58.6 | 69.8 | 0.6 |
| Bagel (think) | 15.6 | 54.8 | 67.8 | 0.2 |
| ICEdit | 9.8 | 33.2 | 56.5 | 0.2 |
| FLUX.2 dev | 11.3 | 17.6 | 49.6 | 0.2 |
| OmniGen | 9.7 | 33.6 | 51.6 | 0.0 |

🎯 Quick Start

1. Install dependencies

pip install openai simplejson tqdm

2. Prepare the dataset

Download from Hugging Face

data.json is the core file that stores all metadata for evaluation.
Before running evaluation, organize your model's outputs into a result.json file in the following format (the inline // comments are for illustration only; the file itself must be plain JSON):

[
  {
    "image_path":   "path/to/original.png",             // input image
    "editing_path": "path/to/edited.png",               // model result
    "gt":           "path/to/ground_truth.png",         // ground truth
    "text":         "Shift the AD curve to the right",  // editing prompt
    "task_id":      "eco_macro_001",
    "consistency":  "overall",                          // "overall" | "style" | "none"
    "sub_task":     "Macroeconomics",
    "questions": [
      { "id": "Q1", "question": "Is the AD curve shifted right?", "score": 0.5 },
      { "id": "Q2", "question": "Is the new equilibrium labeled?", "score": 0.5 }
    ]
  }
]
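A malformed result.json can waste a long evaluation run, so it may be worth sanity-checking the file first. A minimal sketch (the field set is taken from the sample above; `validate_result_json` is a hypothetical helper, not part of this repo):

```python
import json

# Fields every entry in result.json must carry (per the sample above).
REQUIRED_KEYS = {
    "image_path", "editing_path", "gt", "text",
    "task_id", "consistency", "sub_task", "questions",
}

def validate_result_json(path):
    """Sanity-check a result.json file; returns the number of entries."""
    with open(path) as f:
        entries = json.load(f)
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        assert not missing, f"entry {i}: missing keys {missing}"
        assert entry["consistency"] in ("overall", "style", "none"), \
            f"entry {i}: bad consistency value {entry['consistency']!r}"
        for q in entry["questions"]:
            assert {"id", "question", "score"} <= q.keys(), \
                f"entry {i}: malformed question {q}"
    return len(entries)
```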

3. Configure & Run

Please configure eval.py as follows:

data_json = "/path/to/your/result.json"   # path to the model's result.json
BASE_URL  = "https://your-api-endpoint"    # OpenAI-compatible API endpoint
API_KEY   = "your-api-key"
WORKERS   = 20

Then run the following command to start evaluation:

python eval.py

4. Obtain the Final Score

All outputs are written to the same directory as result.json:

your_model_dir/
├── result.json                      # Input data
├── gemini_flash_eval_1.json         # Merged Discipline Reasoning results
├── gemini_flash_consis_4.json       # Merged Visual Consistency results
├── gemini_flash_read_4.json         # Merged Logical Readability results
├── full_result_gemini_flash.json    # Complete per-task breakdown
└── domain_score.json                # Final Accuracy & Relax Score by domain
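To aggregate the final numbers, domain_score.json can be read directly. A small sketch, assuming a `{domain: {metric: value, ...}}` layout with an "Accuracy" metric (check your own domain_score.json for the actual field names):

```python
import json

def load_domain_scores(path):
    """Return (mean accuracy across domains, per-domain accuracy dict).

    Assumes each domain entry exposes an "Accuracy" field; adjust the
    key if the file uses a different metric name.
    """
    with open(path) as f:
        scores = json.load(f)
    accs = {domain: metrics["Accuracy"] for domain, metrics in scores.items()}
    overall = sum(accs.values()) / len(accs)
    return overall, accs
```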

5. Resumability

Each evaluation stage checks for existing result files before processing. If a run is interrupted, simply re-run python eval.py — completed tasks will be skipped automatically.
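The resume behaviour described above boils down to a skip-if-done pattern. A minimal sketch of the idea (`run_stage` and its arguments are illustrative; the actual logic lives inside eval.py and may differ):

```python
import json
import os

def run_stage(out_path, tasks, eval_fn):
    """Run one evaluation stage, skipping tasks already present in out_path.

    Results are keyed by task_id, so re-running after an interruption
    only evaluates the tasks that have no stored result yet.
    """
    done = {}
    if os.path.exists(out_path):
        with open(out_path) as f:
            done = {r["task_id"]: r for r in json.load(f)}
    for task in tasks:
        if task["task_id"] in done:
            continue  # already evaluated in a previous run
        done[task["task_id"]] = eval_fn(task)
    with open(out_path, "w") as f:
        json.dump(list(done.values()), f, indent=2)
    return done
```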


Citation

@misc{liu2026gradebenchmarkingdisciplineinformedreasoning,
      title={GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing}, 
      author={Mingxin Liu and Ziqian Fan and Zhaokai Wang and Leyao Gu and Zirun Zhu and Yiguo He and Yuchen Yang and Changyao Tian and Xiangyu Zhao and Ning Liao and Shaofeng Zhang and Qibing Ren and Zhihang Zhong and Xuanhe Zhou and Junchi Yan and Xue Yang},
      year={2026},
      eprint={2603.12264},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.12264}, 
}
