Mingxin Liu1,*, Ziqian Fan2,*, Zhaokai Wang1,*,†, Leyao Gu1,*, Zirun Zhu1,*, Yiguo He1, Yuchen Yang3, Changyao Tian4, Xiangyu Zhao1, Ning Liao1, Shaofeng Zhang5, Qibing Ren1, Zhihang Zhong1, Xuanhe Zhou1, Junchi Yan1, Xue Yang1,†
1 Shanghai Jiao Tong University
2 South China University of Technology
3 Fudan University
4 The Chinese University of Hong Kong
5 University of Science and Technology of China
* Equal Contribution † Corresponding Author
GRADE is the first benchmark for evaluating discipline-informed knowledge and reasoning in image editing. It comprises 520 carefully curated samples across 10 academic domains — from natural science to social science — and provides a multi-dimensional automated evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability.
1. Install dependencies
pip install openai simplejson tqdm
2. Prepare the dataset
Download from Hugging Face
data.json is the core file that stores all metadata for evaluation.
Before running evaluation, organize your model's outputs into a result.json that follows the format of the example entry given after the citation at the end of this document.
3. Configure & Run
Please configure eval.py as follows:
data_json = "/path/to/your/result.json" # path to the model's result.json
BASE_URL = "https://your-api-endpoint" # OpenAI-compatible API endpoint
API_KEY = "your-api-key"
WORKERS = 20
Then run the following command to start evaluation:
python eval.py
4. Obtain the Final Score
All outputs are written to the same directory as result.json:
your_model_dir/
├── result.json # Input data
├── gemini_flash_eval_1.json # Merged Discipline Reasoning results
├── gemini_flash_consis_4.json # Merged Visual Consistency results
├── gemini_flash_read_4.json # Merged Logical Readability results
├── full_result_gemini_flash.json # Complete per-task breakdown
└── domain_score.json # Final Accuracy & Relax Score by domain
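To confirm that a run produced every file in the tree above, a small check along these lines may help. It is a minimal sketch: the file names are copied from the listing, while the helper name `missing_outputs` is hypothetical and not part of eval.py.

```python
from pathlib import Path

# Output file names as shown in the directory listing above.
EXPECTED_OUTPUTS = [
    "gemini_flash_eval_1.json",
    "gemini_flash_consis_4.json",
    "gemini_flash_read_4.json",
    "full_result_gemini_flash.json",
    "domain_score.json",
]


def missing_outputs(model_dir):
    """Return the expected output files that are not yet present in model_dir."""
    base = Path(model_dir)
    return [name for name in EXPECTED_OUTPUTS if not (base / name).is_file()]
```

An empty list from `missing_outputs("your_model_dir")` suggests that all stages have written their results.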
5. Resumability
Each evaluation stage checks for existing result files before processing. If a run is interrupted, simply re-run python eval.py — completed tasks will be skipped automatically.
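The skip-if-done behaviour described above can be sketched as follows. This illustrates the general pattern only and is not the actual code in eval.py: the function names, the on-disk layout (one JSON object keyed by task_id), and the `evaluate` callback are all assumptions.

```python
import json
from pathlib import Path


def load_done(result_path):
    """Load previously saved results keyed by task_id, or {} if none exist."""
    path = Path(result_path)
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def run_stage(tasks, result_path, evaluate):
    """Evaluate only tasks without a saved result, saving after each one."""
    done = load_done(result_path)
    for task in tasks:
        task_id = task["task_id"]
        if task_id in done:
            continue  # finished in an earlier run, so skip it
        done[task_id] = evaluate(task)
        # Write to a temporary file and rename it, so that an interruption
        # is unlikely to leave a half-written results file behind.
        tmp = Path(str(result_path) + ".tmp")
        tmp.write_text(json.dumps(done, ensure_ascii=False, indent=2), encoding="utf-8")
        tmp.replace(result_path)
    return done
```

Saving after every task trades some extra I/O for the guarantee that at most one task's work is lost when a run is interrupted.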
@misc{liu2026gradebenchmarkingdisciplineinformedreasoning,
title={GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing},
author={Mingxin Liu and Ziqian Fan and Zhaokai Wang and Leyao Gu and Zirun Zhu and Yiguo He and Yuchen Yang and Changyao Tian and Xiangyu Zhao and Ning Liao and Shaofeng Zhang and Qibing Ren and Zhihang Zhong and Xuanhe Zhou and Junchi Yan and Xue Yang},
year={2026},
eprint={2603.12264},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.12264},
}

Example result.json:
[
  {
    "image_path": "path/to/original.png",      // Input image
    "editing_path": "path/to/edited.png",      // Model result
    "gt": "path/to/ground_truth.png",          // GT
    "text": "Shift the AD curve to the right", // Editing prompt
    "task_id": "eco_macro_001",
    "consistency": "overall",                  // "overall" | "style" | "none"
    "sub_task": "Macroeconomics",
    "questions": [
      { "id": "Q1", "question": "Is the AD curve shifted right?", "score": 0.5 },
      { "id": "Q2", "question": "Is the new equilibrium labeled?", "score": 0.5 }
    ]
  }
]
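Before starting an evaluation run, it can be useful to check that result.json matches the example entry above. The sketch below is not part of the GRADE tooling. The field names and the allowed "consistency" values come from the example, but treating every field as required is an assumption, and the function names are hypothetical. Note that the `//` annotations in the example are explanatory only: Python's `json` module rejects comments, so a real file must omit them.

```python
import json

# Field names taken from the example entry; requiring all of them is an assumption.
REQUIRED_KEYS = {
    "image_path", "editing_path", "gt", "text",
    "task_id", "consistency", "sub_task", "questions",
}
ALLOWED_CONSISTENCY = {"overall", "style", "none"}
QUESTION_KEYS = {"id", "question", "score"}


def validate_entries(entries):
    """Return a list of human-readable problems found in result.json entries."""
    if not isinstance(entries, list):
        return ["top-level value must be a list of entries"]
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: not an object")
            continue
        label = entry.get("task_id", f"entry {i}")
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
        if "consistency" in entry and entry["consistency"] not in ALLOWED_CONSISTENCY:
            problems.append(f"{label}: unexpected consistency {entry['consistency']!r}")
        for q in entry.get("questions", []):
            q_missing = QUESTION_KEYS - q.keys()
            if q_missing:
                problems.append(f"{label}: question missing keys {sorted(q_missing)}")
    return problems


def validate_file(path):
    """Parse a result.json file and validate its entries."""
    with open(path, encoding="utf-8") as f:
        return validate_entries(json.load(f))
```

Calling `validate_file("/path/to/your/result.json")` returns an empty list when no problems are found.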