From a94c0343a6a8d6783c1441eab80e22729efb6aba Mon Sep 17 00:00:00 2001
From: Cursor Agent <cursoragent@cursor.com>
Date: Thu, 11 Sep 2025 22:25:45 +0000
Subject: [PATCH] feat: Add documentation for Codebuff evaluation system

Co-authored-by: emmanuelm <emmanuelm@gmail.com>
---
 evals/docs/README.md                       | 147 +++++
 evals/docs/agents-relationship-analysis.md | 238 ++++++++
 evals/docs/codebuff-evals-overview.md      | 361 +++++++++++
 evals/docs/dependencies-analysis.md        | 269 +++++++++
 evals/docs/json-file-analysis.md           | 335 +++++++++++
 evals/docs/replication-guide.md            | 664 +++++++++++++++++++++
 6 files changed, 2014 insertions(+)
 create mode 100644 evals/docs/README.md
 create mode 100644 evals/docs/agents-relationship-analysis.md
 create mode 100644 evals/docs/codebuff-evals-overview.md
 create mode 100644 evals/docs/dependencies-analysis.md
 create mode 100644 evals/docs/json-file-analysis.md
 create mode 100644 evals/docs/replication-guide.md

diff --git a/evals/docs/README.md b/evals/docs/README.md
new file mode 100644
index 0000000000..f000c8a3f9
--- /dev/null
+++ b/evals/docs/README.md
@@ -0,0 +1,147 @@
+# Codebuff Evaluation System Documentation
+
+This directory contains comprehensive documentation for the Codebuff Evaluation Framework - a novel system for evaluating AI coding agents through **Git Commit Reimplementation**.
+
+## 📚 Documentation Index
+
+### Core Documentation
+
+1. **[Codebuff Evals Overview](./codebuff-evals-overview.md)**
+   - Complete system architecture and components
+   - Data sources and process flows
+   - Function diagrams and usage instructions
+
+2. **[JSON File Analysis](./json-file-analysis.md)**
+   - Detailed structure analysis of evaluation data files
+   - Complete TypeScript/Zod schemas for all JSON formats
+   - Sample data insights and file size analysis
+
+3. **[Dependencies Analysis](./dependencies-analysis.md)**
+   - Complete dependency mapping on Codebuff codebase
+   - Integration points and coupling analysis
+   - External library requirements
+
+### Implementation Guides
+
+4. **[Replication Guide](./replication-guide.md)**
+   - Step-by-step roadmap for building a standalone evaluation system
+   - Complete implementation phases and timelines
+   - Alternative architecture without Codebuff dependencies
+
+5. **[Agents Relationship Analysis](./agents-relationship-analysis.md)**
+   - Deep analysis of how .agents folder relates to evaluation research
+   - Feedback loop between evaluation results and agent development
+   - Evolution of agent architectures based on eval insights
+
+## 🎯 Key Insights
+
+### Revolutionary Evaluation Methodology
+
+The Codebuff evaluation system introduces a **Commit Reconstruction Methodology** that:
+
+- **Tests real-world scenarios** using actual git commits from production codebases
+- **Enables interactive evaluation** through multi-turn conversations with a prompting agent
+- **Provides comprehensive scoring** across completion, efficiency, and code quality dimensions
+- **Scales to enterprise codebases** with sophisticated token management and parallel execution
+
+### Data Scale and Scope
+
+The evaluation dataset includes:
+
+| Repository | File Size | Commits | Focus |
+|------------|-----------|---------|-------|
+| **Saleor** | 37MB | Large set | E-commerce enterprise scenarios |
+| **Manifold** | 8.1MB | Medium set | Prediction market business logic |
+| **Codebuff** | 7.6MB | 13+ commits | Internal development patterns |
+| **Plane** | 4.5MB | Medium set | Project management workflows |
+
+### Architecture Excellence
+
+The system demonstrates sophisticated engineering:
+
+- **Multi-agent orchestration** with prompting agents guiding coding agents
+- **Robust judging system** using multiple AI judges with median selection
+- **Process isolation** with proper cleanup and error handling
+- **Token-aware processing** with intelligent context truncation
+- **Comprehensive metrics** tracking cost, performance, and quality
+
+## 🔄 Research Feedback Loop
+
+The documentation reveals a sophisticated feedback loop:
+
+```mermaid
+graph TB
+    A[Agent Definitions] --> B[Evaluation System]
+    B --> C[Performance Analysis]
+    C --> D[Agent Improvements]
+    D --> A
+    
+    style A fill:#e1f5fe
+    style B fill:#f3e5f5
+    style C fill:#e8f5e8
+    style D fill:#fff3e0
+```
+
+This creates continuous improvement where:
+- Evaluation failures inform prompt engineering
+- Performance patterns guide architecture decisions
+- Real-world scenarios ensure practical relevance
+
+## 🛠️ Technical Implementation
+
+### Core Technologies
+- **AI Models**: Claude, GPT-4, Gemini for different components
+- **Schema Validation**: Zod for runtime type safety
+- **Concurrency**: p-limit for controlled parallel execution
+- **Git Operations**: Direct git command-line interface
+- **Token Management**: tiktoken for context length control
+
+### Integration Points
+- **Backend**: LLM APIs, token counting, user input management
+- **Common**: Utilities, model configurations, agent constants
+- **SDK**: Codebuff client for agent execution
+- **NPM App**: Agent loading, credential management
+
+## 🚀 Replication Roadmap
+
+For those looking to replicate this system:
+
+| Phase | Duration | Components |
+|-------|----------|------------|
+| **Infrastructure** | 2-3 weeks | AI model integration, token counting |
+| **Agent System** | 3-4 weeks | Tool framework, agent runners |
+| **Repository Management** | 1-2 weeks | Git operations, isolation |
+| **Orchestration** | 2-3 weeks | Evaluation pipeline, judging |
+| **Testing & Refinement** | 2-3 weeks | Integration, optimization |
+
+**Total Estimated Effort**: 11-17 weeks for a complete standalone system
+
+## 📊 Research Value
+
+This evaluation framework represents significant value for AI research:
+
+1. **Novel Methodology**: First comprehensive commit reconstruction evaluation system
+2. **Real-world Relevance**: Tests on actual production code scenarios  
+3. **Comprehensive Metrics**: Multi-dimensional scoring with AI judge analysis
+4. **Scalable Architecture**: Handles enterprise-scale codebases
+5. **Research Insights**: Direct feedback loop for agent improvement
+
+## 🔍 Key Files Reference
+
+- **Orchestration**: `run-git-evals.ts`, `run-eval-set.ts`
+- **Agent Integration**: `runners/codebuff.ts`, `runners/claude.ts`
+- **Judging**: `judge-git-eval.ts` with multi-judge robustness
+- **Repository Management**: `setup-test-repo.ts` with authentication
+- **Data Generation**: `pick-commits.ts`, `gen-evals.ts`
+- **Analysis**: `post-eval-analysis.ts` for aggregate insights
+
+## 📈 Future Directions
+
+The evaluation system enables research into:
+- **Agent architecture optimization** through systematic testing
+- **Tool usage pattern analysis** via comprehensive logging
+- **Error pattern identification** for targeted improvements
+- **Cost-performance optimization** through detailed metrics
+- **Scaling behavior analysis** across different codebase sizes
+
+This documentation provides a complete picture of a sophisticated AI evaluation system that bridges the gap between research and practical AI coding assistance.
\ No newline at end of file
diff --git a/evals/docs/agents-relationship-analysis.md b/evals/docs/agents-relationship-analysis.md
new file mode 100644
index 0000000000..70df160b81
--- /dev/null
+++ b/evals/docs/agents-relationship-analysis.md
@@ -0,0 +1,238 @@
+# Relationship Between .agents Folder and Evals Research
+
+This document analyzes the relationship between the prompts and agents defined in the `.agents` folder and the research conducted through the evaluation system.
+
+## Overview
+
+The `.agents` folder contains the actual agent definitions that are evaluated by the evals system, creating a direct feedback loop between agent development and evaluation research.
+
+## Agent Architecture in Context
+
+### Agent Types Being Evaluated
+
+Based on the evals system analysis and agent definitions:
+
+1. **Base Agents** (`base.ts`, `base2.ts`)
+   - Primary general-purpose coding agents
+   - Core subjects of evaluation research
+   - Use Claude 4 Sonnet as the underlying model
+
+2. **Specialized Agents** (`git-committer.ts`, `reviewer.ts`, etc.)
+   - Task-specific agents for specialized workflows
+   - Secondary evaluation targets
+   - Test specific capabilities and behaviors
+
+### Direct Evaluation Relationships
+
+#### 1. **Agent Selection in Evals**
+
+From the evaluation runners, we can see the evals system directly references agents by ID:
+
+```typescript
+// From runners/codebuff.ts
+export class CodebuffRunner implements Runner {
+  constructor(runState: RunState, agent?: string) {
+    this.agent = agent ?? 'base'  // Default to 'base' agent
+  }
+}
+```
+
+The evaluation system tests these specific agent types:
+- `base` - Primary general coding agent
+- `base2` - Enhanced version with improved architecture  
+- `base-lite` - Lighter weight version
+- Custom agents from `.agents` directory
+
+#### 2. **Agent Loading Integration**
+
+```typescript
+// From runners/codebuff.ts  
+const agentsPath = path.join(__dirname, '../../../.agents')
+const localAgentDefinitions = Object.values(
+  await loadLocalAgents({ agentsPath })
+)
+```
+
+The evals system dynamically loads agent definitions from the `.agents` folder, making it easy to:
+- Test new agent iterations
+- Compare different agent approaches
+- Evaluate specialized vs. general-purpose agents
+
+## Research-Informed Agent Development
+
+### 1. **Prompting Strategies Influenced by Evals**
+
+The evaluation results directly inform agent prompt engineering. For example, the base prompts include specific guidance likely derived from eval insights:
+
+```typescript
+// From base-prompts.ts - Testing guidance
+'**Testing:** If you create a unit test, you should run it using `run_terminal_command` to see if it passes, and fix it if it doesn\'t.'
+
+// Package management best practices  
+'**Package Management:** When adding new packages, use the run_terminal_command tool to install the package rather than editing the package.json file...'
+```
+
+These specific instructions likely emerged from evaluation findings showing agents making common mistakes.
+
+### 2. **Tool Usage Patterns**
+
+The evaluation system tests how agents use tools, and this feedback influences tool selection and usage patterns in agent definitions:
+
+```typescript
+// From git-committer.ts
+toolNames: ['read_files', 'run_terminal_command', 'add_message', 'end_turn']
+```
+
+The careful selection of tools and their usage patterns in the `handleSteps` function reflects lessons learned from evaluation research.
+
+### 3. **Multi-Agent Architecture Evolution**
+
+The evolution from `base` to `base2` demonstrates research-driven agent improvement:
+
+```typescript
+// base2 uses a factory pattern with specialized sub-agents
+import { base2 } from './base2-factory'
+
+// base2-factory.ts includes guidance like:
+'Don\'t mastermind the task. Rely on your agents\' judgement to plan, implement, and review the code.'
+```
+
+This architectural change likely resulted from evaluation findings about task decomposition and agent coordination.
+
+## Evaluation-Driven Design Patterns
+
+### 1. **Structured Agent Workflows**
+
+The `git-committer` agent demonstrates a structured workflow that was likely refined through evaluation:
+
+```typescript
+handleSteps: function* ({ agentState, prompt, params }: AgentStepContext) {
+  // Step 1: Run git diff and git log to analyze changes
+  yield { toolName: 'run_terminal_command', input: { command: 'git diff' } }
+  
+  // Step 2: Read relevant files for context
+  yield { toolName: 'add_message', ... }
+  
+  // Step 3: Let AI generate next step
+  yield 'STEP'
+  
+  // Step 4: Create commit
+  yield 'STEP_ALL'
+}
+```
+
+This systematic approach to breaking down tasks reflects insights from evaluation research about agent decision-making patterns.
+
+### 2. **Error Prevention Strategies**
+
+Agent prompts include specific error prevention guidance derived from evaluation findings:
+
+```typescript
+// From base prompts - addressing common eval failures
+'You must base your future write_file/str_replace edits off of the latest changes. You must try to accommodate the changes that the user has made...'
+
+'Always run hooks for TypeScript/JavaScript changes, test file changes, or when the changes could affect compilation/tests'
+```
+
+### 3. **Quality Assurance Integration**
+
+The emphasis on testing and verification in agent prompts reflects evaluation insights:
+
+```typescript
+// From base prompts
+'Check the knowledge files to see if the user has specified a further protocol for what terminal commands should be run to verify edits. For example, a `knowledge.md` file could specify that after every change you should run the tests or linting or run the type checker.'
+```
+
+## Research Feedback Loop
+
+```mermaid
+graph TB
+    A[Agent Definitions in .agents/] --> B[Evaluation System]
+    B --> C[Performance Metrics]
+    C --> D[Analysis of Failures]
+    D --> E[Prompt Engineering Insights]
+    E --> F[Updated Agent Definitions]
+    F --> A
+
+    B --> G[Conversation Traces]
+    G --> H[Tool Usage Patterns]
+    H --> I[Workflow Optimization]
+    I --> F
+    
+    C --> J[Scoring Dimensions]
+    J --> K[Quality Metrics]
+    K --> L[Agent Architecture Changes]
+    L --> F
+```
+
+## Specific Evaluation Insights Reflected in Agents
+
+### 1. **File Handling Patterns**
+
+Agent prompts include detailed guidance on file operations, likely informed by evaluation failures:
+
+```typescript
+// Emphasis on reading before editing
+'Analyze surrounding code, tests, and configuration first'
+
+// Careful handling of user modifications  
+'You must base your future write_file/str_replace edits off of the latest changes'
+```
+
+### 2. **Terminal Command Best Practices**
+
+Specific terminal usage patterns reflect evaluation learnings:
+
+```typescript
+// Package installation best practices
+'use the run_terminal_command tool to install the package rather than editing the package.json file with a guess at the version number'
+
+// Command chaining for verification
+'you should run them all using \'&&\' to concatenate them into one commands, e.g. `npm run lint && npm run test`'
+```
+
+### 3. **Context Management**
+
+The evolution of context handling in agents reflects evaluation insights about information management:
+
+```typescript
+// From context-pruner.test.ts - sophisticated context management
+'removes old terminal command results while keeping recent 5'
+'removes large tool results'
+'performs message-level pruning when other passes are insufficient'
+```
+
+## Evaluation Impact on Agent Evolution
+
+### Base → Base2 Evolution
+
+The progression from `base` to `base2` demonstrates evaluation-driven improvement:
+
+1. **Architecture**: Moved to factory pattern with specialized sub-agents
+2. **Tool Usage**: More sophisticated tool selection and usage patterns
+3. **Workflow**: Better task decomposition and coordination
+4. **Error Handling**: Improved error prevention and recovery
+
+### Specialized Agent Development
+
+Specialized agents like `git-committer` reflect evaluation insights about:
+- When to use structured workflows vs. free-form responses
+- How to break complex tasks into manageable steps
+- The importance of context gathering before action
+
+## Conclusion
+
+The relationship between the `.agents` folder and the evaluation research is a **direct, iterative feedback loop**:
+
+1. **Agents are evaluated** using the commit reconstruction methodology
+2. **Performance data and failure modes** inform agent improvements
+3. **Updated prompts and architectures** are implemented in the `.agents` folder
+4. **New agent versions** are evaluated to measure improvement
+
+This creates a continuous improvement cycle where:
+- **Evaluation research drives agent development**
+- **Agent performance informs evaluation methodology refinements**  
+- **Real-world coding scenarios** (from git commits) ensure practical relevance
+- **Systematic measurement** enables evidence-based agent improvement
+
+The evaluation system serves as both a research tool for understanding AI coding capabilities and a development tool for improving Codebuff's agent implementations. This tight integration between evaluation and development represents a sophisticated approach to AI agent research and improvement.
\ No newline at end of file
diff --git a/evals/docs/codebuff-evals-overview.md b/evals/docs/codebuff-evals-overview.md
new file mode 100644
index 0000000000..ab9cd961c4
--- /dev/null
+++ b/evals/docs/codebuff-evals-overview.md
@@ -0,0 +1,361 @@
+# Codebuff Evals System Documentation
+
+This document provides a comprehensive overview of the Codebuff Evaluation Framework, a novel system for testing AI coding agents through **Git Commit Reimplementation Evaluation**.
+
+## Table of Contents
+
+1. [Overview](#overview)
+2. [Architecture](#architecture)
+3. [Data Sources](#data-sources)
+4. [Process Flow](#process-flow)
+5. [Components](#components)
+6. [Dependencies](#dependencies)
+7. [JSON File Structure](#json-file-structure)
+8. [Usage](#usage)
+
+## Overview
+
+The Codebuff Evals system takes a fundamentally different approach from traditional coding benchmarks like SWE Bench or Terminal Bench. Instead of passing predefined tests, the evaluations challenge coding agents to **reimplement real git commits** from open source projects through interactive, multi-turn conversations.
+
+### Core Innovation: Commit Reconstruction Methodology
+
+The evaluation framework centers around having coding agents reconstruct actual git commits from open source repositories through an interactive process:
+
+- **Real-world relevance**: Uses actual commits from production codebases
+- **Multi-turn interaction**: Up to 5 conversational rounds guided by a prompting agent
+- **Comprehensive scoring**: AI judge provides nuanced evaluation across multiple dimensions
+- **Diverse scenarios**: Tests on different types of changes, project types, and complexity levels
+
+## Architecture
+
+```mermaid
+graph TB
+    subgraph "Data Sources"
+        A[Open Source Repos]
+        B[Picked Commits]
+        C[Generated Specs]
+    end
+
+    subgraph "Evaluation Pipeline"
+        D[Orchestrator]
+        E[Test Repo Setup]
+        F[Prompting Agent]
+        G[Coding Agent]
+        H[AI Judge]
+    end
+
+    subgraph "Results"
+        I[Evaluation Logs]
+        J[Performance Metrics]
+        K[Analysis Reports]
+    end
+
+    A --> B
+    B --> C
+    C --> D
+    D --> E
+    E --> F
+    F --> G
+    G --> H
+    H --> I
+    I --> J
+    I --> K
+```
+
+### System Components
+
+#### 1. **Evaluation Orchestration** (`run-git-evals.ts`, `run-eval-set.ts`)
+- Manages the complete evaluation pipeline
+- Handles concurrency and process management
+- Coordinates between all system components
+- Provides timeout and error handling
+
+#### 2. **Agent Runners** (`runners/`)
+- **Codebuff Runner**: Integrates with local Codebuff installation
+- **Claude Runner**: Integrates with Anthropic's Claude Code
+- **Runner Interface**: Common abstraction for all coding agents
+
+#### 3. **Prompting Agent** (`prompting-agent.ts`)
+- Acts as the "human developer" in the loop
+- Analyzes conversation history and decides next actions
+- Generates follow-up prompts to guide the coding agent
+- Makes decisions: `continue`, `complete`, or `halt`
+
+#### 4. **Judging System** (`judge-git-eval.ts`)
+- Uses AI (Gemini 2.5 Pro) to score implementations
+- Compares agent output against ground truth git diffs
+- Provides detailed scoring across multiple dimensions
+- Runs 3 judges in parallel and takes median for robustness
+
+#### 5. **Test Repository Management** (`setup-test-repo.ts`)
+- Clones and manages git repositories for testing
+- Handles commit checkout and environment setup
+- Provides isolated testing environments
+- Supports both public and private repositories
+
+## Data Sources
+
+The system operates on several types of data:
+
+### Primary Data Sources
+1. **Open Source Git Repositories**
+   - Real production codebases from GitHub
+   - Diverse programming languages and project types
+   - Examples: Codebuff, Manifold, Saleor, Plane
+
+2. **Selected Commits**
+   - Commits are intelligently picked using AI selection
+   - Focus on substantial, self-contained changes
+   - Filtered for quality and implementability
+
+3. **Generated Specifications**
+   - Natural language descriptions of what needs to be implemented
+   - Derived from commit diffs and messages
+   - Written to be implementable without seeing the original code
+
+### Data Flow
+
+```mermaid
+sequenceDiagram
+    participant Repo as Git Repository
+    participant Picker as Commit Picker
+    participant Generator as Spec Generator
+    participant Evaluator as Evaluation System
+
+    Repo->>Picker: Raw commit history
+    Picker->>Picker: AI-powered commit analysis
+    Picker->>Generator: Selected commits
+    Generator->>Generator: Generate specifications
+    Generator->>Evaluator: Evaluation data file
+```
+
+## Process Flow
+
+### Complete Evaluation Workflow
+
+```mermaid
+sequenceDiagram
+    participant Orchestrator as Eval Orchestrator
+    participant PromptAgent as Prompting Agent
+    participant CodingAgent as Coding Agent (Codebuff/Claude)
+    participant Judge as AI Judge
+    participant Repo as Test Repository
+
+    Orchestrator->>Repo: Setup repo at commit^ (before target)
+    Orchestrator->>PromptAgent: Start with spec
+
+    loop Up to 5 attempts
+        PromptAgent->>PromptAgent: Analyze conversation history
+        PromptAgent->>CodingAgent: Send implementation prompt
+        CodingAgent->>Repo: Make code changes via tools
+        CodingAgent->>PromptAgent: Return conversation trace
+        PromptAgent->>PromptAgent: Decide: continue/complete/halt
+    end
+
+    Orchestrator->>Judge: Compare output vs ground truth
+    Judge->>Orchestrator: Return detailed scores & analysis
+```
+
+### Key Process Steps
+
+1. **Repository Setup**
+   - Clone target repository
+   - Checkout to commit parent (before target changes)
+   - Apply any initialization commands
+
+2. **Interactive Implementation**
+   - Prompting agent analyzes specification
+   - Sends focused prompt to coding agent
+   - Coding agent makes changes using available tools
+   - Process repeats up to 5 times based on progress
+
+3. **Evaluation and Judging**
+   - Generate git diff of agent's changes
+   - Compare against ground truth diff
+   - AI judge provides comprehensive scoring
+   - Results include metrics, strengths, and weaknesses
+
+## Components
+
+### Core Files and Their Roles
+
+| Component | File | Purpose |
+|-----------|------|---------|
+| **Orchestration** | `run-git-evals.ts` | Main evaluation pipeline |
+| | `run-eval-set.ts` | Batch evaluation runner |
+| | `run-single-eval.ts` | CLI for individual evals |
+| **Agent Integration** | `runners/runner.ts` | Common runner interface |
+| | `runners/codebuff.ts` | Codebuff agent integration |
+| | `runners/claude.ts` | Claude agent integration |
+| **Guidance** | `prompting-agent.ts` | Interactive prompting logic |
+| **Assessment** | `judge-git-eval.ts` | AI-powered evaluation |
+| **Infrastructure** | `setup-test-repo.ts` | Repository management |
+| | `scaffolding.ts` | Test environment utilities |
+| **Data Generation** | `pick-commits.ts` | Intelligent commit selection |
+| | `gen-evals.ts` | Specification generation |
+| | `gen-repo-eval.ts` | End-to-end eval creation |
+| **Analysis** | `post-eval-analysis.ts` | Aggregate analysis |
+
+## Dependencies
+
+### Codebuff Codebase Dependencies
+
+The evals system has several dependencies on the broader Codebuff codebase:
+
+#### Backend Dependencies
+- `@codebuff/backend/live-user-inputs` - User input management
+- `@codebuff/backend/llm-apis/vercel-ai-sdk/ai-sdk` - AI model integration
+- `@codebuff/backend/util/token-counter` - Token counting utilities
+
+#### Common Utilities
+- `@codebuff/common/old-constants` - Model definitions and constants
+- `@codebuff/common/constants/agents` - Agent configuration
+- `@codebuff/common/util/promise` - Promise utilities
+- `@codebuff/common/util/string` - String manipulation utilities
+
+#### NPM App Integration
+- `@codebuff/npm-app/agents/load-agents` - Local agent loading
+- `@codebuff/npm-app/credentials` - Authentication management
+
+#### SDK Integration
+- Uses the Codebuff SDK (`../../../sdk/src/index`) for running agents
+
+### External Dependencies
+- `zod/v4` - Schema validation
+- `lodash` - Utility functions
+- `p-limit` - Concurrency control
+- `diff` - Diff generation for judging
+
+## JSON File Structure
+
+The evaluation system uses several JSON file formats. Here are the Zod schemas that define their structure:
+
+### Evaluation Data File Schema
+
+```typescript
+import { z } from 'zod';
+
+const FileStateSchema = z.object({
+  path: z.string(),
+  preContent: z.string(), // Content before the commit
+  postContent: z.string() // Content after the commit
+});
+
+const EvalCommitSchema = z.object({
+  sha: z.string(),
+  spec: z.string(), // Natural language specification
+  fileStates: z.array(FileStateSchema),
+  // Additional metadata (found in JSON files)
+  author: z.string().optional(),
+  date: z.string().optional(),
+  message: z.string().optional(),
+  selectionReason: z.string().optional(),
+  stats: z.object({
+    deletions: z.number(),
+    filesChanged: z.number(),
+    insertions: z.number()
+  }).optional()
+});
+
+const EvalDataSchema = z.object({
+  repoUrl: z.string(),
+  testRepoName: z.string().optional(),
+  generationDate: z.string(),
+  initCommand: z.string().optional(),
+  evalCommits: z.array(EvalCommitSchema)
+});
+```
+
+### Evaluation Results Schema
+
+```typescript
+const AgentStepSchema = z.object({
+  response: z.string(),
+  toolCalls: z.array(z.any()),
+  toolResults: z.array(z.any())
+});
+
+const CodebuffTraceSchema = z.object({
+  prompt: z.string(),
+  steps: z.array(AgentStepSchema)
+});
+
+const JudgingAnalysisSchema = z.object({
+  analysis: z.string(),
+  strengths: z.array(z.string()),
+  weaknesses: z.array(z.string()),
+  metrics: z.object({
+    completionScore: z.number().min(0).max(10),
+    codeQualityScore: z.number().min(0).max(10),
+    overallScore: z.number().min(0).max(10)
+  })
+});
+
+const EvalRunJudgedSchema = z.object({
+  eval_commit: EvalCommitSchema,
+  trace: z.array(CodebuffTraceSchema),
+  error: z.string().optional(),
+  gitDiff: z.string(),
+  durationMs: z.number(),
+  costUsd: z.number(),
+  judging_results: JudgingAnalysisSchema,
+  computed_metrics: z.object({
+    runtime_sec: z.number(),
+    cost_usd: z.number()
+  })
+});
+
+const FullEvalLogSchema = z.object({
+  test_repo_name: z.string(),
+  generation_date: z.string(),
+  eval_runs: z.array(EvalRunJudgedSchema),
+  overall_metrics: z.object({
+    average_runtime_sec: z.number(),
+    average_cost_usd: z.number(),
+    average_completion: z.number(),
+    average_code_quality: z.number(),
+    average_overall: z.number(),
+    average_duration_ms: z.number(),
+    total_runs: z.number(),
+    successful_runs: z.number(),
+    failed_runs: z.number()
+  })
+});
+```
+
+## Usage
+
+### Running Evaluations
+
+#### Single Evaluation
+```bash
+bun run evals/git-evals/run-single-eval.ts \
+  --eval-file eval-codebuff.json \
+  --commit-index 0 \
+  --agent base2
+```
+
+#### Batch Evaluations
+```bash
+bun run evals/git-evals/run-eval-set.ts
+```
+
+### Creating New Evaluations
+
+#### 1. Pick Commits from Repository
+```bash
+bun run evals/git-evals/pick-commits.ts \
+  https://github.com/user/repo \
+  ./picked-commits.json \
+  300
+```
+
+#### 2. Generate Evaluation File
+```bash
+bun run evals/git-evals/gen-repo-eval.ts \
+  https://github.com/user/repo \
+  ./picked-commits.json \
+  ./eval-output.json
+```
+
+This comprehensive framework provides a robust foundation for evaluating AI coding agents through realistic, interactive scenarios that mirror real-world software development challenges.
\ No newline at end of file
diff --git a/evals/docs/dependencies-analysis.md b/evals/docs/dependencies-analysis.md
new file mode 100644
index 0000000000..684cb8cbc4
--- /dev/null
+++ b/evals/docs/dependencies-analysis.md
@@ -0,0 +1,269 @@
+# Codebuff Evals Dependencies Analysis
+
+This document provides a comprehensive analysis of how the evals system depends on the broader Codebuff codebase and external packages.
+
+## Overview
+
+The evals system is tightly integrated with the Codebuff ecosystem, importing functionality from multiple packages within the monorepo. This analysis identifies all dependencies and their purposes.
+
+## Dependency Categories
+
+### 1. Backend Dependencies
+
+#### Core LLM Integration
+```typescript
+// From: judge-git-eval.ts, run-git-evals.ts
+import { promptAiSdkStructured } from '@codebuff/backend/llm-apis/vercel-ai-sdk/ai-sdk'
+```
+- **Purpose**: Interface with AI models for judging and prompting agents
+- **Usage**: Critical for both evaluation judging and prompting agent decisions
+- **Dependency Level**: High - Core functionality
+
+#### User Input Management  
+```typescript
+// From: run-git-evals.ts
+import { disableLiveUserInputCheck } from '@codebuff/backend/live-user-inputs'
+```
+- **Purpose**: Disable interactive prompts during automated evaluation runs
+- **Usage**: Ensures evals run without human intervention
+- **Dependency Level**: Medium - Operational
+
+#### Token Management
+```typescript
+// From: judge-git-eval.ts
+import { countTokens } from '@codebuff/backend/util/token-counter'
+```
+- **Purpose**: Track and limit token usage for AI models
+- **Usage**: Critical for managing context length in judging prompts
+- **Dependency Level**: High - Performance critical
+
+### 2. Common Utilities
+
+#### Model Definitions
+```typescript
+// From: multiple files
+import { models } from '@codebuff/common/old-constants'
+import { API_KEY_ENV_VAR } from '@codebuff/common/old-constants'
+```
+- **Purpose**: Centralized model configurations and API key management
+- **Usage**: Model selection for judging, prompting, and agent running
+- **Dependency Level**: High - Configuration
+
+#### Agent Constants
+```typescript
+// From: runners/codebuff.ts
+import { MAX_AGENT_STEPS_DEFAULT } from '@codebuff/common/constants/agents'
+```
+- **Purpose**: Default configuration for agent execution
+- **Usage**: Controls agent step limits during evaluation runs
+- **Dependency Level**: Medium - Configuration
+
+#### Utility Functions
+```typescript
+// From: multiple files
+import { withTimeout } from '@codebuff/common/util/promise'
+import { generateCompactId } from '@codebuff/common/util/string'
+```
+- **Purpose**: Promise utilities and ID generation
+- **Usage**: Timeout management and unique identifier creation
+- **Dependency Level**: Medium - Utility
+
+### 3. NPM App Integration
+
+#### Agent Loading
+```typescript
+// From: runners/codebuff.ts
+import { loadLocalAgents } from '@codebuff/npm-app/agents/load-agents'
+```
+- **Purpose**: Load local agent definitions from .agents directory
+- **Usage**: Enables custom agents to be used in evaluations
+- **Dependency Level**: High - Core functionality
+
+#### Authentication
+```typescript
+// From: runners/codebuff.ts
+import { getUserCredentials } from '@codebuff/npm-app/credentials'
+```
+- **Purpose**: Retrieve user authentication tokens
+- **Usage**: Authenticate with Codebuff API during agent runs
+- **Dependency Level**: High - Security
+
+### 4. SDK Integration
+
+#### Codebuff Client
+```typescript
+// From: runners/codebuff.ts, prompting-agent.ts
+import { CodebuffClient } from '../../../sdk/src/index'
+```
+- **Purpose**: Primary interface to Codebuff's agent execution system
+- **Usage**: Execute agent runs with full tool access
+- **Dependency Level**: Critical - Core functionality
+
+### 5. External Dependencies
+
+#### Schema Validation
+```typescript
+// From: types.ts
+import { z } from 'zod/v4'
+```
+- **Purpose**: Runtime type validation and schema definition
+- **Usage**: Validate AI responses and data structures
+- **Dependency Level**: High - Data integrity
+
+#### Utility Libraries
+```typescript
+// From: various files
+import { cloneDeep } from 'lodash'
+import pLimit from 'p-limit'
+import { createPatch } from 'diff'
+```
+- **Purpose**: Data manipulation, concurrency control, diff generation
+- **Usage**: Object cloning, parallel execution limits, code diff creation
+- **Dependency Level**: Medium - Operational
+
+## Dependency Graph
+
+```mermaid
+graph TD
+    A[Evals System] --> B[Backend]
+    A --> C[Common]
+    A --> D[NPM App]  
+    A --> E[SDK]
+    A --> F[External]
+
+    B --> B1[LLM APIs]
+    B --> B2[Live User Inputs]
+    B --> B3[Token Counter]
+
+    C --> C1[Old Constants]
+    C --> C2[Agent Constants]
+    C --> C3[Utilities]
+
+    D --> D1[Agent Loading]
+    D --> D2[Credentials]
+
+    E --> E1[Codebuff Client]
+
+    F --> F1[Zod]
+    F --> F2[Lodash]
+    F --> F3[p-limit]
+    F --> F4[diff]
+
+    style A fill:#f9f,stroke:#333,stroke-width:4px
+    style E1 fill:#bbf,stroke:#333,stroke-width:2px
+    style B1 fill:#bbf,stroke:#333,stroke-width:2px
+```
+
+## Critical Dependencies Analysis
+
+### 1. **Essential for Core Functionality** 
+- **Codebuff SDK**: Cannot run agents without this
+- **LLM APIs**: Required for AI judging and prompting  
+- **Agent Loading**: Needed to access custom agents
+- **Model Constants**: Required for model selection
+
+### 2. **Important for Operations**
+- **Token Counter**: Prevents context overflow
+- **Credentials**: Enables authenticated API calls
+- **Promise Utilities**: Timeout and error handling
+- **Concurrency Control**: Manages parallel execution
+
+### 3. **Configuration and Utilities**
+- **Agent Constants**: Default configuration values  
+- **String Utilities**: ID generation and formatting
+- **Data Manipulation**: Object cloning and processing
+- **Schema Validation**: Data integrity and type safety
+
+## Replication Requirements
+
+To replicate this evaluation system outside the Codebuff codebase, you would need to:
+
+### 1. **Replace Core Dependencies**
+```typescript
+// Instead of @codebuff/backend/llm-apis/vercel-ai-sdk/ai-sdk
+// Implement direct AI model integration:
+import { openai } from '@ai-sdk/openai'
+import { generateObject } from 'ai'
+
+// Instead of @codebuff/backend/util/token-counter  
+// Use tiktoken or similar:
+import { encode } from 'tiktoken'
+```
+
+### 2. **Replace Agent System**
+```typescript
+// Instead of CodebuffClient
+// Implement custom agent runner that can:
+// - Execute prompts with tool access
+// - Track conversation history  
+// - Manage file operations
+// - Handle timeouts and errors
+```
+
+### 3. **Replace Configuration System**
+```typescript
+// Instead of @codebuff/common/old-constants
+// Define model configurations directly:
+const models = {
+  'gpt-4': { provider: 'openai', model: 'gpt-4' },
+  'claude-3': { provider: 'anthropic', model: 'claude-3-sonnet' }
+}
+```
+
+### 4. **Implement Missing Utilities**
+```typescript
+// Recreate utility functions:
+export const generateCompactId = () => Math.random().toString(36).substr(2, 9)
+export const withTimeout = <T>(promise: Promise<T>, ms: number) => 
+  Promise.race([promise, new Promise<never>((_, reject) => 
+    setTimeout(() => reject(new Error('Timeout')), ms))])
+```
+
+## Coupling Analysis
+
+### **High Coupling Areas**
+1. **Agent Execution**: Deeply integrated with Codebuff's agent system
+2. **Authentication**: Relies on Codebuff's credential management
+3. **Model Integration**: Uses Codebuff's LLM abstraction layer
+
+### **Medium Coupling Areas**  
+1. **Utility Functions**: Could be easily replaced with equivalents
+2. **Configuration**: Centralized but replaceable constants
+3. **Schema Validation**: Standard Zod usage, not Codebuff-specific
+
+### **Low Coupling Areas**
+1. **External Libraries**: Standard npm packages
+2. **File System Operations**: Standard Node.js APIs
+3. **Git Operations**: Standard git command-line interface
+
+## Recommendations for Decoupling
+
+If the goal is to make the evals system more standalone:
+
+### 1. **Create Abstraction Layers**
+```typescript
+interface AIModelClient {
+  generateStructured<T>(prompt: string, schema: Schema<T>): Promise<T>
+  generateText(prompt: string): Promise<string>
+}
+
+interface AgentRunner {
+  run(prompt: string): Promise<AgentRunResult>
+}
+```
+
+### 2. **Extract Configuration**
+```typescript
+interface EvalConfig {
+  models: ModelConfig[]
+  timeouts: TimeoutConfig
+  concurrency: ConcurrencyConfig
+}
+```
+
+### 3. **Minimize Backend Dependencies**
+- Replace token counter with tiktoken
+- Replace LLM APIs with direct AI SDK usage
+- Replace credential management with environment variables
+
+This analysis shows that while the evals system has significant dependencies on the Codebuff ecosystem, the core evaluation logic could be extracted and adapted for standalone use with moderate effort.
\ No newline at end of file
diff --git a/evals/docs/json-file-analysis.md b/evals/docs/json-file-analysis.md
new file mode 100644
index 0000000000..9de1b29df1
--- /dev/null
+++ b/evals/docs/json-file-analysis.md
@@ -0,0 +1,335 @@
+# JSON File Structure Analysis
+
+This document provides a detailed analysis of the large JSON files in the `evals/git-evals/` directory, their structure, purpose, and schema definitions.
+
+## File Overview
+
+Based on analysis using command-line tools, here are the evaluation JSON files and their characteristics:
+
+| File | Size | Lines | Purpose |
+|------|------|-------|---------|
+| `eval-codebuff.json` | 7.6MB | 841 | Codebuff project evaluations |
+| `eval-codebuff2.json` | 7.4MB | 2,429 | Extended Codebuff evaluations |
+| `eval-manifold.json` | 3.1MB | 941 | Manifold prediction market evals |
+| `eval-manifold2.json` | 8.1MB | 1,293 | Extended Manifold evaluations |
+| `eval-plane.json` | 4.5MB | 1,667 | Plane project management evals |
+| `eval-saleor.json` | 37MB | 1,476 | Saleor e-commerce platform evals |
+| `eval-result-codebuff-mock.json` | 2.1MB | 842 | Sample evaluation results |
+
+## File Type Analysis
+
+### 1. Evaluation Data Files (`eval-*.json`)
+
+These files contain the test cases for evaluation. They follow the `EvalData` schema:
+
+#### Structure Analysis (using `eval-codebuff.json` as example):
+
+```bash
+# Top-level structure
+$ jq 'keys' eval-codebuff.json
+[
+  "evalCommits",
+  "generationDate", 
+  "repoUrl"
+]
+
+# Basic metadata
+$ node -e "const data=JSON.parse(require('fs').readFileSync('eval-codebuff.json')); console.log(JSON.stringify({repoUrl: data.repoUrl, generationDate: data.generationDate, evalCommitsCount: data.evalCommits.length}, null, 2))"
+{
+  "repoUrl": "https://github.com/CodebuffAI/codebuff",
+  "generationDate": "2025-05-19T02:52:35.503Z",
+  "evalCommitsCount": 13
+}
+```
+
+#### Individual Commit Structure:
+
+```bash
+# Commit structure
+$ jq '.evalCommits[0] | keys' eval-codebuff.json
+[
+  "author",
+  "date", 
+  "fileStates",
+  "message",
+  "selectionReason",
+  "sha",
+  "spec",
+  "stats"
+]
+
+# File states structure
+$ jq '.evalCommits[0].fileStates[0] | keys' eval-codebuff.json
+[
+  "path",
+  "postContent",
+  "preContent"
+]
+
+# Stats structure  
+$ jq '.evalCommits[0].stats | keys' eval-codebuff.json
+[
+  "deletions",
+  "filesChanged", 
+  "insertions"
+]
+```
+
+### 2. Evaluation Results Files (`eval-result-*.json`)
+
+These files contain the actual evaluation run results and judging:
+
+```bash
+# Results file structure
+$ jq 'keys' eval-result-codebuff-mock.json
+[
+  "eval_runs",
+  "generation_date",
+  "overall_metrics", 
+  "test_repo_name"
+]
+
+# Individual run structure
+$ jq '.eval_runs[0] | keys' eval-result-codebuff-mock.json
+[
+  "durationMs",
+  "eval_commit",
+  "fileStates", 
+  "judging_results",
+  "trace"
+]
+
+# Judging results structure
+$ jq '.eval_runs[0].judging_results | keys' eval-result-codebuff-mock.json
+[
+  "analysis",
+  "metrics",
+  "strengths", 
+  "weaknesses"
+]
+
+# Trace structure
+$ jq '.eval_runs[0].trace[0] | keys' eval-result-codebuff-mock.json
+[
+  "prompt",
+  "steps"
+]
+```
+
+## Complete TypeScript Schema Definitions
+
+### Evaluation Data Schema
+
+```typescript
+import { z } from 'zod';
+
+// File state represents before/after content for a single file
+export const FileStateSchema = z.object({
+  path: z.string(),
+  preContent: z.string(),  // Content before the commit
+  postContent: z.string()  // Content after the commit  
+});
+
+// Statistics about the commit changes
+export const CommitStatsSchema = z.object({
+  deletions: z.number(),
+  filesChanged: z.number(),
+  insertions: z.number()
+});
+
+// Individual evaluation commit with all metadata
+export const EvalCommitSchema = z.object({
+  sha: z.string(),           // Git commit SHA
+  spec: z.string(),          // Natural language specification
+  fileStates: z.array(FileStateSchema),
+  
+  // Additional metadata from commit selection
+  author: z.string().optional(),
+  date: z.string().optional(),
+  message: z.string().optional(),
+  selectionReason: z.string().optional(),
+  stats: CommitStatsSchema.optional()
+});
+
+// Complete evaluation data file
+export const EvalDataSchema = z.object({
+  repoUrl: z.string(),              // Source repository URL
+  generationDate: z.string(),       // When evaluation was created
+  testRepoName: z.string().optional(), // Optional repo name override
+  initCommand: z.string().optional(),  // Optional setup command
+  evalCommits: z.array(EvalCommitSchema)
+});
+
+// Sample usage:
+export type EvalData = z.infer<typeof EvalDataSchema>;
+export type EvalCommit = z.infer<typeof EvalCommitSchema>;
+export type FileState = z.infer<typeof FileStateSchema>;
+```
+
+### Evaluation Results Schema
+
+```typescript
+// Agent interaction step
+export const AgentStepSchema = z.object({
+  response: z.string(),
+  toolCalls: z.array(z.any()),   // Tool calls made by agent
+  toolResults: z.array(z.any())  // Results returned from tools
+});
+
+// Conversation trace between prompting agent and coding agent  
+export const CodebuffTraceSchema = z.object({
+  prompt: z.string(),                    // Prompt sent to coding agent
+  steps: z.array(AgentStepSchema)        // Agent's response steps
+});
+
+// AI judge scoring metrics
+export const JudgingMetricsSchema = z.object({
+  completionScore: z.number().min(0).max(10),    // How complete vs ground truth
+  codeQualityScore: z.number().min(0).max(10),   // Code structure and quality
+  overallScore: z.number().min(0).max(10)        // Combined assessment
+});
+
+// Complete judging analysis
+export const JudgingAnalysisSchema = z.object({
+  analysis: z.string(),                    // Detailed analysis text
+  strengths: z.array(z.string()),         // List of implementation strengths
+  weaknesses: z.array(z.string()),        // List of implementation weaknesses
+  metrics: JudgingMetricsSchema
+});
+
+// Individual evaluation run with judging
+export const EvalRunJudgedSchema = z.object({
+  eval_commit: EvalCommitSchema,           // Original evaluation task
+  trace: z.array(CodebuffTraceSchema),     // Conversation history
+  error: z.string().optional(),            // Any execution errors
+  gitDiff: z.string(),                     // Agent's actual changes as git diff
+  durationMs: z.number(),                  // Execution time
+  costUsd: z.number(),                     // API costs incurred
+  judging_results: JudgingAnalysisSchema,
+  fileStates: z.string().optional(),       // May include final file states
+  computed_metrics: z.object({
+    runtime_sec: z.number(),
+    cost_usd: z.number()
+  }).optional()
+});
+
+// Overall metrics across all runs
+export const OverallMetricsSchema = z.object({
+  average_runtime_sec: z.number(),
+  average_cost_usd: z.number(),
+  average_completion: z.number(),
+  average_code_quality: z.number(),
+  average_overall: z.number(),
+  average_duration_ms: z.number(),
+  total_runs: z.number(),
+  successful_runs: z.number(),
+  failed_runs: z.number()
+});
+
+// Complete evaluation results log
+export const FullEvalLogSchema = z.object({
+  test_repo_name: z.string(),
+  generation_date: z.string(),
+  eval_runs: z.array(EvalRunJudgedSchema),
+  overall_metrics: OverallMetricsSchema
+});
+
+// Type exports
+export type EvalRunJudged = z.infer<typeof EvalRunJudgedSchema>;
+export type FullEvalLog = z.infer<typeof FullEvalLogSchema>;
+export type JudgingAnalysis = z.infer<typeof JudgingAnalysisSchema>;
+export type CodebuffTrace = z.infer<typeof CodebuffTraceSchema>;
+```
+
+## Data Insights from Analysis
+
+### Repository Distribution
+
+Based on the JSON files, the evaluation system covers:
+
+1. **Codebuff** (`eval-codebuff.json`, `eval-codebuff2.json`)
+   - Internal project evaluations
+   - 13+ evaluation commits
+   - Focus on Codebuff's own development patterns
+
+2. **Manifold** (`eval-manifold.json`, `eval-manifold2.json`)  
+   - Prediction market platform
+   - TypeScript/React codebase
+   - Complex business logic scenarios
+
+3. **Plane** (`eval-plane.json`)
+   - Project management tool
+   - Modern web application
+   - UI and backend integration
+
+4. **Saleor** (`eval-saleor.json`)
+   - Large e-commerce platform  
+   - 37MB evaluation file (largest dataset)
+   - Comprehensive enterprise-level scenarios
+
+### Sample Evaluation Commit
+
+From the analysis, here's what a typical evaluation looks like:
+
+```json
+{
+  "sha": "ce2badebbee89b6016ae30c3c507fb130da0bb7e",
+  "spec": "Update the `run_terminal_command` tool to accurately reflect and report the current working directory (CWD). First, modify the tool's description in `backend/src/tools.ts` to inform the LLM that commands execute in the user's CWD, which persists after `cd` commands, rather than always resetting to the project root. Second, adjust the terminal command execution logic in `npm-app/src/utils/terminal.ts`: the `handleChangeDirectory` function must return the new CWD path as a string upon a successful user `cd` command, the current CWD if `cd` is attempted outside the project root, or null otherwise...",
+  "author": "Charles Lien", 
+  "message": "notify llm of cwd after each command",
+  "fileStatesCount": 2
+}
+```
+
+### Judging Metrics Example
+
+Sample judging results from the mock data:
+
+```json
+{
+  "metrics": {
+    "completionScore": 1,
+    "codeQualityScore": 2.5, 
+    "overallScore": 1.5
+  }
+}
+```
+
+### Overall Performance Metrics
+
+Sample aggregate metrics:
+
+```json
+{
+  "average_completion": 3,
+  "average_code_quality": 3.5,
+  "average_overall": 3.67,
+  "average_duration_ms": 165672.67,
+  "total_runs": 3,
+  "successful_runs": 3, 
+  "failed_runs": 0
+}
+```
+
+## Usage Notes
+
+### File Size Considerations
+
+- **Saleor** (37MB): Contains extensive file content and many evaluation scenarios
+- **Manifold** (8.1MB): Extended test cases with complex business logic
+- **Codebuff** (7.6MB): Internal evaluation scenarios
+
+### Performance Implications
+
+- Large files require careful memory management during processing
+- Token counting is crucial for AI judge input (1M token limit)
+- Trace truncation may be necessary for large conversation histories
+
+### Schema Evolution
+
+The schema includes optional fields to handle:
+- Legacy data formats
+- Extended metadata added over time
+- Different repository types and structures
+
+This analysis shows the evaluation system handles substantial, real-world codebases with comprehensive metadata and scoring across multiple dimensions.
\ No newline at end of file
diff --git a/evals/docs/replication-guide.md b/evals/docs/replication-guide.md
new file mode 100644
index 0000000000..5b560745e7
--- /dev/null
+++ b/evals/docs/replication-guide.md
@@ -0,0 +1,664 @@
+# How to Replicate the Codebuff Evaluation System
+
+This guide explains how to replicate the Codebuff evaluation framework without depending on the Codebuff codebase. It provides a complete roadmap for building a standalone commit reconstruction evaluation system.
+
+## Overview
+
+The Codebuff evaluation system can be replicated by implementing several key components:
+
+1. **AI Model Integration Layer**
+2. **Agent Execution System** 
+3. **Repository Management**
+4. **Evaluation Orchestration**
+5. **Judging and Analysis**
+
+## Architecture for Standalone System
+
+```mermaid
+graph TB
+    subgraph "Standalone Eval System"
+        A[Eval Orchestrator]
+        B[Model Client]
+        C[Agent Runner]
+        D[Repo Manager]
+        E[Judge System]
+        F[Results Store]
+    end
+
+    subgraph "External Services"
+        G[OpenAI API]
+        H[Anthropic API]
+        I[Git Repositories]
+        J[File System]
+    end
+
+    A --> B
+    A --> C
+    A --> D
+    A --> E
+    B --> G
+    B --> H
+    C --> B
+    D --> I
+    E --> B
+    F --> J
+```
+
+## Implementation Roadmap
+
+### Phase 1: Core Infrastructure
+
+#### 1. AI Model Integration
+
+Replace Codebuff's LLM APIs with direct AI SDK integration:
+
+```typescript
+// models/client.ts
+import { openai } from '@ai-sdk/openai'
+import { anthropic } from '@ai-sdk/anthropic'
+import { generateObject, generateText } from 'ai'
+
+export interface ModelConfig {
+  provider: 'openai' | 'anthropic'
+  model: string
+  temperature?: number
+  maxTokens?: number
+}
+
+export class ModelClient {
+  async generateStructured<T>(
+    prompt: string, 
+    schema: z.ZodSchema<T>,
+    config: ModelConfig
+  ): Promise<T> {
+    const provider = config.provider === 'openai' ? openai : anthropic
+    
+    const result = await generateObject({
+      model: provider(config.model),
+      prompt,
+      schema,
+      temperature: config.temperature,
+      maxTokens: config.maxTokens
+    })
+    
+    return result.object
+  }
+
+  async generateText(prompt: string, config: ModelConfig): Promise<string> {
+    const provider = config.provider === 'openai' ? openai : anthropic
+    
+    const result = await generateText({
+      model: provider(config.model),
+      prompt,
+      temperature: config.temperature,
+      maxTokens: config.maxTokens
+    })
+    
+    return result.text
+  }
+}
+```
+
+#### 2. Token Counting
+
+Replace Codebuff's token counter:
+
+```typescript
+// utils/tokens.ts
+import { encoding_for_model } from 'tiktoken'
+
+export function countTokens(text: string, model: string = 'gpt-4'): number {
+  try {
+    const encoder = encoding_for_model(model as any)
+    const tokens = encoder.encode(text)
+    encoder.free()
+    return tokens.length
+  } catch {
+    // Fallback estimation: ~4 chars per token
+    return Math.ceil(text.length / 4)
+  }
+}
+```
+
+#### 3. Configuration Management
+
+Create centralized configuration:
+
+```typescript
+// config/index.ts
+export interface EvalConfig {
+  models: {
+    judge: ModelConfig
+    promptingAgent: ModelConfig
+  }
+  timeouts: {
+    singleEval: number
+    judging: number
+    agentStep: number
+  }
+  concurrency: {
+    maxEvals: number
+  }
+  paths: {
+    testRepos: string
+    results: string
+  }
+}
+
+export const defaultConfig: EvalConfig = {
+  models: {
+    judge: { provider: 'anthropic', model: 'claude-3-5-sonnet-20241022' },
+    promptingAgent: { provider: 'openai', model: 'gpt-4' }
+  },
+  timeouts: {
+    singleEval: 30 * 60 * 1000, // 30 minutes
+    judging: 10 * 60 * 1000,    // 10 minutes  
+    agentStep: 5 * 60 * 1000    // 5 minutes
+  },
+  concurrency: {
+    maxEvals: 5
+  },
+  paths: {
+    testRepos: './test-repos',
+    results: './results'
+  }
+}
+```
+
+### Phase 2: Agent System
+
+#### 1. Agent Runner Interface
+
+Create a generic agent runner interface:
+
+```typescript
+// agents/runner.ts
+export interface AgentStep {
+  response: string
+  toolCalls: ToolCall[]
+  toolResults: ToolResult[]
+}
+
+export interface AgentRunResult {
+  steps: AgentStep[]
+  totalCostUsd: number
+  error?: string
+}
+
+export interface AgentRunner {
+  run(prompt: string): Promise<AgentRunResult>
+}
+```
+
+#### 2. Tool System
+
+Implement basic tools needed for code editing:
+
+```typescript
+// tools/index.ts
+export interface ToolCall {
+  id: string
+  type: string
+  function: {
+    name: string
+    arguments: string
+  }
+}
+
+export interface ToolResult {
+  tool_call_id: string
+  output: string
+}
+
+export class ToolRegistry {
+  private tools = new Map<string, Tool>()
+
+  register(tool: Tool) {
+    this.tools.set(tool.name, tool)
+  }
+
+  async execute(toolCall: ToolCall): Promise<ToolResult> {
+    const tool = this.tools.get(toolCall.function.name)
+    if (!tool) {
+      throw new Error(`Unknown tool: ${toolCall.function.name}`)
+    }
+
+    const args = JSON.parse(toolCall.function.arguments)
+    const output = await tool.execute(args)
+
+    return {
+      tool_call_id: toolCall.id,
+      output: JSON.stringify(output)
+    }
+  }
+}
+
+// Essential tools
+export const readFileTool: Tool = {
+  name: 'read_file',
+  description: 'Read a file from the filesystem',
+  parameters: {
+    type: 'object',
+    properties: {
+      file_path: { type: 'string' }
+    }
+  },
+  execute: async ({ file_path }) => {
+    return fs.readFileSync(file_path, 'utf-8')
+  }
+}
+
+export const writeFileTool: Tool = {
+  name: 'write_file', 
+  description: 'Write content to a file',
+  parameters: {
+    type: 'object',
+    properties: {
+      file_path: { type: 'string' },
+      content: { type: 'string' }
+    }
+  },
+  execute: async ({ file_path, content }) => {
+    fs.writeFileSync(file_path, content, 'utf-8')
+    return { success: true }
+  }
+}
+```
+
+#### 3. Basic Agent Implementation
+
+Create a simple agent that can use tools:
+
+```typescript
+// agents/basic-agent.ts
+export class BasicAgent implements AgentRunner {
+  constructor(
+    private modelClient: ModelClient,
+    private toolRegistry: ToolRegistry,
+    private config: ModelConfig
+  ) {}
+
+  async run(prompt: string): Promise<AgentRunResult> {
+    const steps: AgentStep[] = []
+    const maxSteps = 10
+    let currentPrompt = prompt
+
+    for (let i = 0; i < maxSteps; i++) {
+      const response = await this.modelClient.generateText(
+        this.buildPrompt(currentPrompt, steps),
+        this.config
+      )
+
+      const toolCalls = this.extractToolCalls(response)
+      const toolResults: ToolResult[] = []
+
+      // Execute tool calls
+      for (const toolCall of toolCalls) {
+        const result = await this.toolRegistry.execute(toolCall)
+        toolResults.push(result)
+      }
+
+      steps.push({
+        response,
+        toolCalls,
+        toolResults
+      })
+
+      // Check if agent wants to continue
+      if (this.shouldStop(response, toolCalls)) {
+        break
+      }
+
+      currentPrompt = this.buildContinuationPrompt(toolResults)
+    }
+
+    return {
+      steps,
+      totalCostUsd: this.estimateCost(steps)
+    }
+  }
+
+  private buildPrompt(userPrompt: string, previousSteps: AgentStep[]): string {
+    let prompt = `You are a coding assistant. Help implement the following request:\n\n${userPrompt}\n\n`
+    
+    // Add conversation history
+    for (const step of previousSteps) {
+      prompt += `Previous response: ${step.response}\n`
+      if (step.toolResults.length > 0) {
+        prompt += `Tool results: ${JSON.stringify(step.toolResults)}\n`
+      }
+    }
+
+    prompt += `Available tools: ${Array.from(this.toolRegistry.tools.keys()).join(', ')}\n\n`
+    prompt += `Respond with your analysis and any tool calls needed.`
+
+    return prompt
+  }
+
+  // ... implementation details
+}
+```
+
+### Phase 3: Repository Management
+
+#### 1. Git Repository Handler
+
+Replace Codebuff's repository setup:
+
+```typescript
+// repos/manager.ts
+export class RepoManager {
+  constructor(private config: EvalConfig) {}
+
+  async setupTestRepo(
+    repoUrl: string,
+    commitSha: string,
+    customName?: string
+  ): Promise<string> {
+    const repoName = customName || this.extractRepoName(repoUrl)
+    const repoDir = path.join(this.config.paths.testRepos, `${repoName}-${commitSha}`)
+
+    // Clean up existing
+    if (fs.existsSync(repoDir)) {
+      fs.rmSync(repoDir, { recursive: true, force: true })
+    }
+
+    // Clone repository
+    execSync(`git clone --no-checkout "${repoUrl}" "${repoDir}"`, {
+      stdio: 'inherit',
+      timeout: 120_000
+    })
+
+    // Checkout to parent commit
+    execSync(`git checkout "${commitSha}^"`, {
+      cwd: repoDir,
+      stdio: 'inherit'
+    })
+
+    return repoDir
+  }
+
+  async getChanges(repoDir: string): Promise<string> {
+    // Stage all changes
+    execSync('git add .', { cwd: repoDir })
+    
+    // Get diff
+    return execSync('git diff --staged', {
+      cwd: repoDir,
+      encoding: 'utf-8'
+    })
+  }
+
+  private extractRepoName(url: string): string {
+    return url.split('/').pop()?.replace('.git', '') || 'unknown'
+  }
+}
+```
+
+### Phase 4: Evaluation Orchestration
+
+#### 1. Main Orchestrator
+
+Create the main evaluation pipeline:
+
+```typescript
+// orchestrator/index.ts
+export class EvalOrchestrator {
+  constructor(
+    private config: EvalConfig,
+    private modelClient: ModelClient,
+    private repoManager: RepoManager
+  ) {}
+
+  async runEvaluation(evalData: EvalData): Promise<FullEvalLog> {
+    const results: EvalRunJudged[] = []
+    
+    // Process commits with concurrency limit
+    const limiter = pLimit(this.config.concurrency.maxEvals)
+    
+    const evalPromises = evalData.evalCommits.map(evalCommit => 
+      limiter(() => this.runSingleEval(evalCommit, evalData.repoUrl))
+    )
+
+    const evalResults = await Promise.allSettled(evalPromises)
+
+    // Process results
+    for (const result of evalResults) {
+      if (result.status === 'fulfilled') {
+        results.push(result.value)
+      } else {
+        console.error('Eval failed:', result.reason)
+      }
+    }
+
+    return {
+      test_repo_name: this.extractRepoName(evalData.repoUrl),
+      generation_date: new Date().toISOString(),
+      eval_runs: results,
+      overall_metrics: this.calculateMetrics(results)
+    }
+  }
+
+  private async runSingleEval(
+    evalCommit: EvalCommit,
+    repoUrl: string
+  ): Promise<EvalRunJudged> {
+    const startTime = Date.now()
+
+    // Setup repository
+    const repoDir = await this.repoManager.setupTestRepo(
+      repoUrl,
+      evalCommit.sha
+    )
+
+    try {
+      // Run prompting agent + coding agent conversation
+      const trace = await this.runAgentConversation(evalCommit.spec, repoDir)
+
+      // Get final changes
+      const gitDiff = await this.repoManager.getChanges(repoDir)
+
+      // Judge the results
+      const judgingResults = await this.judgeResults({
+        eval_commit: evalCommit,
+        trace,
+        gitDiff,
+        durationMs: Date.now() - startTime,
+        costUsd: this.calculateCost(trace)
+      })
+
+      return {
+        eval_commit: evalCommit,
+        trace,
+        gitDiff,
+        durationMs: Date.now() - startTime,
+        costUsd: this.calculateCost(trace),
+        judging_results: judgingResults,
+        computed_metrics: {
+          runtime_sec: (Date.now() - startTime) / 1000,
+          cost_usd: this.calculateCost(trace)
+        }
+      }
+    } finally {
+      // Cleanup
+      if (fs.existsSync(repoDir)) {
+        fs.rmSync(repoDir, { recursive: true, force: true })
+      }
+    }
+  }
+
+  // ... rest of implementation
+}
+```
+
+#### 2. Prompting Agent
+
+Implement the prompting agent logic:
+
+```typescript
+// agents/prompting-agent.ts
+export class PromptingAgent {
+  constructor(private modelClient: ModelClient) {}
+
+  async getNextAction(
+    spec: string,
+    conversationHistory: string,
+    attemptsRemaining: number
+  ): Promise<{ decision: 'continue' | 'complete' | 'halt', nextPrompt?: string }> {
+    const prompt = this.buildDecisionPrompt(spec, conversationHistory, attemptsRemaining)
+    
+    const decision = await this.modelClient.generateStructured(
+      prompt,
+      AgentDecisionSchema,
+      { provider: 'openai', model: 'gpt-4' }
+    )
+
+    return decision
+  }
+
+  private buildDecisionPrompt(
+    spec: string,
+    history: string,
+    remaining: number
+  ): string {
+    return `You are managing a coding agent to implement this specification:
+
+<spec>${spec}</spec>
+
+Conversation so far:
+<history>${history}</history>
+
+You have ${remaining} attempts remaining. Decide whether to:
+1. 'continue' - Send another prompt to the coding agent
+2. 'complete' - The implementation is done
+3. 'halt' - Stop due to being off track
+
+If continuing, provide the next prompt to send.`
+  }
+}
+```
+
+### Phase 5: Judging System
+
+#### 1. AI Judge
+
+Implement the judging system:
+
+```typescript
+// judging/judge.ts
+export class EvalJudge {
+  constructor(private modelClient: ModelClient) {}
+
+  async judgeEvalRun(evalRun: EvalRunLog): Promise<JudgingAnalysis> {
+    const prompt = this.buildJudgingPrompt(evalRun)
+    
+    // Run multiple judges for robustness
+    const judgePromises = Array.from({ length: 3 }, () =>
+      this.modelClient.generateStructured(
+        prompt,
+        JudgingAnalysisSchema,
+        { provider: 'anthropic', model: 'claude-3-5-sonnet-20241022' }
+      )
+    )
+
+    const results = await Promise.allSettled(judgePromises)
+    const validResults = results
+      .filter((r): r is PromiseFulfilledResult<JudgingAnalysis> => r.status === 'fulfilled')
+      .map(r => r.value)
+
+    if (validResults.length === 0) {
+      throw new Error('All judges failed')
+    }
+
+    // Return median result
+    const sorted = validResults.sort((a, b) => a.metrics.overallScore - b.metrics.overallScore)
+    return sorted[Math.floor(sorted.length / 2)]
+  }
+
+  private buildJudgingPrompt(evalRun: EvalRunLog): string {
+    const groundTruthChanges = evalRun.eval_commit.fileStates
+      .map(state => {
+        const diff = createPatch(state.path, state.preContent, state.postContent)
+        return `File: ${state.path}\n${diff}`
+      })
+      .join('\n\n')
+
+    return `Analyze this coding agent implementation:
+
+SPECIFICATION:
+${evalRun.eval_commit.spec}
+
+GROUND TRUTH CHANGES:
+${groundTruthChanges}
+
+AGENT'S CHANGES:
+${evalRun.gitDiff}
+
+ERROR (if any):
+${evalRun.error || 'None'}
+
+Provide detailed analysis and scores (0-10) for:
+- Completion: How well does it match the ground truth?
+- Code Quality: How well-structured is the code?
+- Overall: Combined assessment
+
+Include strengths, weaknesses, and detailed analysis.`
+  }
+}
+```
+
+## Required External Dependencies
+
+```json
+{
+  "dependencies": {
+    "@ai-sdk/openai": "^0.0.x",
+    "@ai-sdk/anthropic": "^0.0.x", 
+    "ai": "^3.x",
+    "zod": "^3.x",
+    "tiktoken": "^1.x",
+    "diff": "^5.x",
+    "p-limit": "^5.x",
+    "lodash": "^4.x"
+  }
+}
+```
+
+## Estimated Development Timeline
+
+| Phase | Effort | Description |
+|-------|--------|-------------|
+| **Phase 1** | 2-3 weeks | Core infrastructure and AI integration |
+| **Phase 2** | 3-4 weeks | Agent system and tool framework |
+| **Phase 3** | 1-2 weeks | Repository management |
+| **Phase 4** | 2-3 weeks | Evaluation orchestration |
+| **Phase 5** | 1-2 weeks | Judging system |
+| **Testing** | 2-3 weeks | Integration testing and refinement |
+| **Total** | 11-17 weeks | Complete standalone system |
+
+## Key Challenges
+
+### 1. **Agent Tool Integration**
+- Need to implement comprehensive tool system
+- File operations, terminal commands, code analysis
+- Error handling and timeout management
+
+### 2. **Cost Management**
+- Token counting and budget controls
+- Model selection optimization
+- Parallel execution limits
+
+### 3. **Robustness**
+- Error recovery and retry logic
+- Process isolation and cleanup
+- Concurrent execution safety
+
+## Advantages of Standalone System
+
+1. **Independence**: No dependency on Codebuff infrastructure
+2. **Flexibility**: Can integrate with any AI models or coding agents
+3. **Portability**: Can run in any environment
+4. **Customization**: Full control over evaluation logic
+5. **Cost Control**: Direct management of AI API costs
+
+This roadmap provides a complete path to replicating the Codebuff evaluation system while maintaining its core innovation of commit reconstruction methodology.
\ No newline at end of file