GRADE: Grounded Reasoning Assessment for Discipline-informed Editing

Mingxin Liu¹,*, Ziqian Fan²,*, Zhaokai Wang¹,*,†, Leyao Gu¹,*, Zirun Zhu¹,*, Yiguo He¹, Yuchen Yang³, Changyao Tian⁴, Xiangyu Zhao¹, Ning Liao¹, Shaofeng Zhang⁵, Qibing Ren¹, Zhihang Zhong¹, Xuanhe Zhou¹, Junchi Yan¹, Xue Yang¹,†

¹ Shanghai Jiao Tong University ² South China University of Technology ³ Fudan University
⁴ The Chinese University of Hong Kong ⁵ University of Science and Technology of China

* Equal Contribution † Corresponding Author

🧠 Introduction

GRADE is the first benchmark for evaluating discipline-informed knowledge and reasoning in image editing. It comprises 520 carefully curated samples across 10 academic domains, ranging from natural science to social science, and provides a multi-dimensional automated evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability.

🔥 Leaderboard

Model	Reasoning	Consistency	Readability	Accuracy
Closed Source Models
Nano Banana Pro	77.5	89.5	95.8	46.2
Nano Banana 2	72.6	86.4	95.9	39.6
Seedream 5.0	64.1	87.5	90.6	24.7
GPT-Image-1.5	54.5	82.3	90.7	16.0
FLUX.2 Max	47.8	67.2	68.6	11.9
Nano Banana	42.2	75.1	82.0	9.0
Seedream 4.5	41.3	55.6	82.1	6.9
GPT-Image-1.0	44.0	65.2	82.3	6.0
FLUX.2 Pro	38.9	55.5	70.3	4.4
Seedream 4.0	32.4	53.2	77.0	3.1
Open Source Models
Qwen-Edit-2511	18.6	45.2	52.1	2.7
Step-1x (think+reflect)	19.2	57.2	66.9	2.3
Step-1x (think)	17.6	56.3	68.2	1.4
DreamOmni	17.4	83.2	89.1	1.0
Step-1x	17.3	52.8	63.7	1.0
Bagel	15.2	58.6	69.8	0.6
Bagel (think)	15.6	54.8	67.8	0.2
ICEdit	9.8	33.2	56.5	0.2
FLUX.2 dev	11.3	17.6	49.6	0.2
OmniGen	9.7	33.6	51.6	0.0

🎯 Quick Start

1. Install dependencies

pip install openai simplejson tqdm

2. Prepare the dataset

Download from Hugging Face

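data.json is the core file that stores all metadata for evaluation.
Before running evaluation, organize your model's results into a result.json file in the following format. The // comments are annotations only and must be omitted from the actual file, since JSON does not allow comments: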

[
  {
    "image_path":   "path/to/original.png",      // Input image
    "editing_path": "path/to/edited.png",        // Model result
    "gt":           "path/to/ground_truth.png",  // Ground truth (GT)
    "text":         "Shift the AD curve to the right",  // Editing prompt
    "task_id":      "eco_macro_001",
    "consistency":  "overall",          // "overall" | "style" | "none"
    "sub_task":     "Macroeconomics",
    "questions": [
      { "id": "Q1", "question": "Is the AD curve shifted right?", "score": 0.5 },
      { "id": "Q2", "question": "Is the new equilibrium labeled?",  "score": 0.5 }
    ]
  }
]
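
A quick sanity check before launching evaluation can catch malformed entries early. The sketch below is not part of the GRADE repository: the helper names `validate_entries` and `validate_file` are hypothetical, and treating every key from the example above as required is an assumption rather than something eval.py is known to enforce.

```python
import json

# Keys taken from the example entry above; treating all of them as
# required is an assumption, not a documented rule of eval.py.
REQUIRED_KEYS = {
    "image_path", "editing_path", "gt", "text",
    "task_id", "consistency", "sub_task", "questions",
}
# Allowed values as listed in the comment on the "consistency" field.
CONSISTENCY_MODES = {"overall", "style", "none"}


def validate_entries(entries):
    """Return a list of human-readable problems found in result.json entries."""
    problems = []
    for index, entry in enumerate(entries):
        label = entry.get("task_id", f"entry #{index}")
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
        if "consistency" in entry and entry["consistency"] not in CONSISTENCY_MODES:
            problems.append(
                f"{label}: unexpected consistency value {entry['consistency']!r}"
            )
        for question in entry.get("questions", []):
            if not {"id", "question", "score"} <= question.keys():
                problems.append(f"{label}: question missing id/question/score")
    return problems


def validate_file(path):
    """Load a result.json file and validate every entry in it."""
    with open(path, encoding="utf-8") as handle:
        return validate_entries(json.load(handle))
```

Calling `validate_file("/path/to/your/result.json")` returns an empty list when no problems are found; otherwise each string names the offending task.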

3. Configure & Run

Please configure eval.py as follows:

data_json = "/path/to/your/result.json"   # path to the model's result.json
BASE_URL  = "https://your-api-endpoint"    # OpenAI-compatible API endpoint
API_KEY   = "your-api-key"
WORKERS   = 20

Then run the following command to start evaluation:

python eval.py

4. Obtain the Final Score

All outputs are written to the same directory as result.json:

your_model_dir/
├── result.json                      # Input data
├── gemini_flash_eval_1.json         # Merged Discipline Reasoning results
├── gemini_flash_consis_4.json       # Merged Visual Consistency results
├── gemini_flash_read_4.json         # Merged Logical Readability results
├── full_result_gemini_flash.json    # Complete per-task breakdown
└── domain_score.json                # Final Accuracy & Relax Score by domain
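
To confirm that a run has finished, one option is to check that every file in the listing above exists. This is a minimal sketch under stated assumptions: the function name `missing_outputs` is hypothetical, and the file names are copied verbatim from the listing, so they may differ if the evaluator model or settings in eval.py change.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_OUTPUTS = [
    "gemini_flash_eval_1.json",
    "gemini_flash_consis_4.json",
    "gemini_flash_read_4.json",
    "full_result_gemini_flash.json",
    "domain_score.json",
]


def missing_outputs(model_dir):
    """Return the expected output files that are not yet present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]
```

An empty return value from `missing_outputs("your_model_dir")` suggests all stages have written their results.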

5. Resumability

Each evaluation stage checks for existing result files before processing. If a run is interrupted, re-run python eval.py; completed tasks are skipped automatically.
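
The skip-completed pattern described above can be sketched as follows. This is an illustration of the general idea, not the code in eval.py: the function name `pending_tasks` is hypothetical, and it assumes a stage's result file is a JSON list of objects that each carry a `task_id`.

```python
import json
from pathlib import Path


def pending_tasks(entries, results_path):
    """Return the entries whose task_id is not yet recorded in results_path.

    Assumes (hypothetically) that results_path holds a JSON list of
    objects, each with a "task_id" key.
    """
    path = Path(results_path)
    if not path.is_file():
        return list(entries)  # nothing saved yet, so every task is pending
    with path.open(encoding="utf-8") as handle:
        finished = {item["task_id"] for item in json.load(handle)}
    return [entry for entry in entries if entry["task_id"] not in finished]
```

Filtering by `task_id` means an interrupted run only pays for the tasks that had not been scored yet.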

Citation

@misc{liu2026gradebenchmarkingdisciplineinformedreasoning,
      title={GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing}, 
      author={Mingxin Liu and Ziqian Fan and Zhaokai Wang and Leyao Gu and Zirun Zhu and Yiguo He and Yuchen Yang and Changyao Tian and Xiangyu Zhao and Ning Liao and Shaofeng Zhang and Qibing Ren and Zhihang Zhong and Xuanhe Zhou and Junchi Yan and Xue Yang},
      year={2026},
      eprint={2603.12264},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.12264}, 
}
