
**File:** `evals/docs/README.md` (new file, 147 additions)
# Codebuff Evaluation System Documentation

This directory contains comprehensive documentation for the Codebuff Evaluation Framework - a novel system for evaluating AI coding agents through **Git Commit Reimplementation**.

## 📚 Documentation Index

### Core Documentation

1. **[Codebuff Evals Overview](./codebuff-evals-overview.md)**
- Complete system architecture and components
- Data sources and process flows
- Function diagrams and usage instructions

2. **[JSON File Analysis](./json-file-analysis.md)**
- Detailed structure analysis of evaluation data files
- Complete TypeScript/Zod schemas for all JSON formats
- Sample data insights and file size analysis

3. **[Dependencies Analysis](./dependencies-analysis.md)**
   - Complete dependency map of the Codebuff codebase
- Integration points and coupling analysis
- External library requirements

### Implementation Guides

4. **[Replication Guide](./replication-guide.md)**
- Step-by-step roadmap for building a standalone evaluation system
- Complete implementation phases and timelines
- Alternative architecture without Codebuff dependencies

5. **[Agents Relationship Analysis](./agents-relationship-analysis.md)**
   - Deep analysis of how the `.agents` folder relates to evaluation research
- Feedback loop between evaluation results and agent development
- Evolution of agent architectures based on eval insights

## 🎯 Key Insights

### Revolutionary Evaluation Methodology

The Codebuff evaluation system introduces a **Commit Reconstruction Methodology** that:

- **Tests real-world scenarios** using actual git commits from production codebases
- **Enables interactive evaluation** through multi-turn conversations with a prompting agent
- **Provides comprehensive scoring** across completion, efficiency, and code quality dimensions
- **Scales to enterprise codebases** with sophisticated token management and parallel execution
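The scoring dimensions listed above can be combined into a single aggregate. The sketch below is illustrative only: the dimension names come from this document, but the weights and the linear-combination shape are assumptions, not the system's actual formula.

```typescript
// Illustrative aggregate over the scoring dimensions mentioned above.
// The dimension names match the text; the weights are assumptions.
interface EvalScores {
  completion: number   // how much of the commit was reimplemented (0..1)
  efficiency: number   // token/cost efficiency relative to a baseline (0..1)
  codeQuality: number  // judge-assessed quality of the resulting code (0..1)
}

function overallScore(
  s: EvalScores,
  weights = { completion: 0.5, efficiency: 0.2, codeQuality: 0.3 },
): number {
  return (
    s.completion * weights.completion +
    s.efficiency * weights.efficiency +
    s.codeQuality * weights.codeQuality
  )
}
```

A weighted sum like this makes the trade-off explicit: a fast but incomplete run cannot outscore a complete one unless the weights say it should.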

### Data Scale and Scope

The evaluation dataset includes:

| Repository | File Size | Commits | Focus |
|------------|-----------|---------|-------|
| **Saleor** | 37MB | Large set | E-commerce enterprise scenarios |
| **Manifold** | 8.1MB | Medium set | Prediction market business logic |
| **Codebuff** | 7.6MB | 13+ commits | Internal development patterns |
| **Plane** | 4.5MB | Medium set | Project management workflows |

### Architecture Excellence

The system demonstrates sophisticated engineering:

- **Multi-agent orchestration** with prompting agents guiding coding agents
- **Robust judging system** using multiple AI judges with median selection
- **Process isolation** with proper cleanup and error handling
- **Token-aware processing** with intelligent context truncation
- **Comprehensive metrics** tracking cost, performance, and quality
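The "multiple AI judges with median selection" point above can be sketched as a plain median over per-judge scores. This is a generic median function, not the system's actual judging code:

```typescript
// Select the median of the judges' scores. The median is robust to a
// single outlier judge in a way that the mean is not.
function medianJudgeScore(scores: number[]): number {
  if (scores.length === 0) throw new Error('no judge scores')
  const sorted = [...scores].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2
}
```

With three judges, one wildly high or low score cannot move the result, which is presumably the point of median selection over averaging.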

## 🔄 Research Feedback Loop

The documentation reveals a sophisticated feedback loop:

```mermaid
graph TB
A[Agent Definitions] --> B[Evaluation System]
B --> C[Performance Analysis]
C --> D[Agent Improvements]
D --> A

style A fill:#e1f5fe
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
```

This creates continuous improvement where:
- Evaluation failures inform prompt engineering
- Performance patterns guide architecture decisions
- Real-world scenarios ensure practical relevance

## 🛠️ Technical Implementation

### Core Technologies
- **AI Models**: Claude, GPT-4, and Gemini, used for different components
- **Schema Validation**: Zod for runtime type safety
- **Concurrency**: p-limit for controlled parallel execution
- **Git Operations**: Direct git command-line interface
- **Token Management**: tiktoken for context length control
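p-limit caps how many evaluations run concurrently. A minimal limiter with the same call shape can be written in a few lines; this is a stand-in sketch for the real package, not its source:

```typescript
// Minimal p-limit-style concurrency limiter (sketch, not the p-limit source).
// At most `max` wrapped functions run at once; the rest wait in a FIFO queue.
function createLimiter(max: number) {
  let active = 0
  const queue: (() => void)[] = []
  const release = () => {
    active--
    queue.shift()?.() // wake the next waiter, if any
  }
  return async function limit<T>(fn: () => Promise<T>): Promise<T> {
    if (active >= max) await new Promise<void>((resolve) => queue.push(resolve))
    active++
    try {
      return await fn()
    } finally {
      release()
    }
  }
}
```

Usage mirrors p-limit: `const limit = createLimiter(4)` and then `Promise.all(evals.map((e) => limit(() => runEval(e))))` keeps at most four evaluations in flight.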

### Integration Points
- **Backend**: LLM APIs, token counting, user input management
- **Common**: Utilities, model configurations, agent constants
- **SDK**: Codebuff client for agent execution
- **NPM App**: Agent loading, credential management

## 🚀 Replication Roadmap

For those looking to replicate this system:

| Phase | Duration | Components |
|-------|----------|------------|
| **Infrastructure** | 2-3 weeks | AI model integration, token counting |
| **Agent System** | 3-4 weeks | Tool framework, agent runners |
| **Repository Management** | 1-2 weeks | Git operations, isolation |
| **Orchestration** | 2-3 weeks | Evaluation pipeline, judging |
| **Testing & Refinement** | 2-3 weeks | Integration, optimization |

**Total Estimated Effort**: 10-15 weeks for a complete standalone system

## 📊 Research Value

This evaluation framework represents significant value for AI research:

1. **Novel Methodology**: First comprehensive commit reconstruction evaluation system
2. **Real-world Relevance**: Tests on actual production code scenarios
3. **Comprehensive Metrics**: Multi-dimensional scoring with AI judge analysis
4. **Scalable Architecture**: Handles enterprise-scale codebases
5. **Research Insights**: Direct feedback loop for agent improvement

## 🔍 Key Files Reference

- **Orchestration**: `run-git-evals.ts`, `run-eval-set.ts`
- **Agent Integration**: `runners/codebuff.ts`, `runners/claude.ts`
- **Judging**: `judge-git-eval.ts` with multi-judge robustness
- **Repository Management**: `setup-test-repo.ts` with authentication
- **Data Generation**: `pick-commits.ts`, `gen-evals.ts`
- **Analysis**: `post-eval-analysis.ts` for aggregate insights
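Taken together, the files above suggest a pipeline of roughly the following shape. Every type and function name here is hypothetical, inferred from the file names listed, not from the actual exports:

```typescript
// Hypothetical end-to-end pipeline inferred from the file names above.
// All types and function signatures are illustrative, not the real exports.
interface EvalCommit { sha: string; spec: string }
interface EvalResult { sha: string; diff: string; costUsd: number }
interface JudgedResult extends EvalResult { score: number }

interface PipelineDeps {
  pickCommits: (repo: string) => Promise<EvalCommit[]>   // cf. pick-commits.ts
  runEval: (c: EvalCommit) => Promise<EvalResult>        // cf. run-git-evals.ts
  judge: (r: EvalResult) => Promise<JudgedResult>        // cf. judge-git-eval.ts
  analyze: (rs: JudgedResult[]) => { meanScore: number } // cf. post-eval-analysis.ts
}

async function runEvalSet(repo: string, deps: PipelineDeps) {
  const commits = await deps.pickCommits(repo)
  const judged: JudgedResult[] = []
  for (const commit of commits) {
    judged.push(await deps.judge(await deps.runEval(commit)))
  }
  return deps.analyze(judged)
}
```

Injecting the stages as dependencies keeps the orchestration testable with stubs, which matches the clean file-per-stage split visible in the file list.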

## 📈 Future Directions

The evaluation system enables research into:
- **Agent architecture optimization** through systematic testing
- **Tool usage pattern analysis** via comprehensive logging
- **Error pattern identification** for targeted improvements
- **Cost-performance optimization** through detailed metrics
- **Scaling behavior analysis** across different codebase sizes

This documentation provides a complete picture of a sophisticated AI evaluation system that bridges the gap between research and practical AI coding assistance.

---

**File:** `evals/docs/agents-relationship-analysis.md` (new file, 238 additions)
# Relationship Between .agents Folder and Evals Research

This document analyzes the relationship between the prompts and agents defined in the `.agents` folder and the research conducted through the evaluation system.

## Overview

The `.agents` folder contains the actual agent definitions that are evaluated by the evals system, creating a direct feedback loop between agent development and evaluation research.

## Agent Architecture in Context

### Agent Types Being Evaluated

Based on the evals system analysis and agent definitions:

1. **Base Agents** (`base.ts`, `base2.ts`)
- Primary general-purpose coding agents
- Core subjects of evaluation research
- Use Claude 4 Sonnet as the underlying model

2. **Specialized Agents** (`git-committer.ts`, `reviewer.ts`, etc.)
- Task-specific agents for specialized workflows
- Secondary evaluation targets
- Test specific capabilities and behaviors

### Direct Evaluation Relationships

#### 1. **Agent Selection in Evals**

From the evaluation runners, we can see the evals system directly references agents by ID:

```typescript
// From runners/codebuff.ts
export class CodebuffRunner implements Runner {
  constructor(runState: RunState, agent?: string) {
    this.agent = agent ?? 'base' // Default to 'base' agent
  }
}
```

The evaluation system tests these specific agent types:
- `base` - Primary general coding agent
- `base2` - Enhanced version with improved architecture
- `base-lite` - Lighter weight version
- Custom agents from `.agents` directory

#### 2. **Agent Loading Integration**

```typescript
// From runners/codebuff.ts
const agentsPath = path.join(__dirname, '../../../.agents')
const localAgentDefinitions = Object.values(
  await loadLocalAgents({ agentsPath })
)
```

The evals system dynamically loads agent definitions from the `.agents` folder, making it easy to:
- Test new agent iterations
- Compare different agent approaches
- Evaluate specialized vs. general-purpose agents

## Research-Informed Agent Development

### 1. **Prompting Strategies Influenced by Evals**

The evaluation results directly inform agent prompt engineering. For example, the base prompts include specific guidance likely derived from eval insights:

```typescript
// From base-prompts.ts - Testing guidance
'**Testing:** If you create a unit test, you should run it using `run_terminal_command` to see if it passes, and fix it if it doesn\'t.'

// Package management best practices
'**Package Management:** When adding new packages, use the run_terminal_command tool to install the package rather than editing the package.json file...'
```

These specific instructions likely emerged from evaluation findings showing agents making common mistakes.

### 2. **Tool Usage Patterns**

The evaluation system tests how agents use tools, and this feedback influences tool selection and usage patterns in agent definitions:

```typescript
// From git-committer.ts
toolNames: ['read_files', 'run_terminal_command', 'add_message', 'end_turn']
```

The careful selection of tools and their usage patterns in the `handleSteps` function reflects lessons learned from evaluation research.

### 3. **Multi-Agent Architecture Evolution**

The evolution from `base` to `base2` demonstrates research-driven agent improvement:

```typescript
// base2 uses a factory pattern with specialized sub-agents
import { base2 } from './base2-factory'

// base2-factory.ts includes guidance like:
'Don\'t mastermind the task. Rely on your agents\' judgement to plan, implement, and review the code.'
```

This architectural change likely resulted from evaluation findings about task decomposition and agent coordination.

## Evaluation-Driven Design Patterns

### 1. **Structured Agent Workflows**

The `git-committer` agent demonstrates a structured workflow that was likely refined through evaluation:

```typescript
handleSteps: function* ({ agentState, prompt, params }: AgentStepContext) {
  // Step 1: Run git diff and git log to analyze changes
  yield { toolName: 'run_terminal_command', input: { command: 'git diff' } }

  // Step 2: Read relevant files for context
  yield { toolName: 'add_message', ... }

  // Step 3: Let AI generate next step
  yield 'STEP'

  // Step 4: Create commit
  yield 'STEP_ALL'
}
```

This systematic approach to breaking down tasks reflects insights from evaluation research about agent decision-making patterns.

### 2. **Error Prevention Strategies**

Agent prompts include specific error prevention guidance derived from evaluation findings:

```typescript
// From base prompts - addressing common eval failures
'You must base your future write_file/str_replace edits off of the latest changes. You must try to accommodate the changes that the user has made...'

'Always run hooks for TypeScript/JavaScript changes, test file changes, or when the changes could affect compilation/tests'
```

### 3. **Quality Assurance Integration**

The emphasis on testing and verification in agent prompts reflects evaluation insights:

```typescript
// From base prompts
'Check the knowledge files to see if the user has specified a further protocol for what terminal commands should be run to verify edits. For example, a `knowledge.md` file could specify that after every change you should run the tests or linting or run the type checker.'
```

## Research Feedback Loop

```mermaid
graph TB
A[Agent Definitions in .agents/] --> B[Evaluation System]
B --> C[Performance Metrics]
C --> D[Analysis of Failures]
D --> E[Prompt Engineering Insights]
E --> F[Updated Agent Definitions]
F --> A

B --> G[Conversation Traces]
G --> H[Tool Usage Patterns]
H --> I[Workflow Optimization]
I --> F

C --> J[Scoring Dimensions]
J --> K[Quality Metrics]
K --> L[Agent Architecture Changes]
L --> F
```

## Specific Evaluation Insights Reflected in Agents

### 1. **File Handling Patterns**

Agent prompts include detailed guidance on file operations, likely informed by evaluation failures:

```typescript
// Emphasis on reading before editing
'Analyze surrounding code, tests, and configuration first'

// Careful handling of user modifications
'You must base your future write_file/str_replace edits off of the latest changes'
```

### 2. **Terminal Command Best Practices**

Specific terminal usage patterns reflect evaluation learnings:

```typescript
// Package installation best practices
'use the run_terminal_command tool to install the package rather than editing the package.json file with a guess at the version number'

// Command chaining for verification
'you should run them all using \'&&\' to concatenate them into one commands, e.g. `npm run lint && npm run test`'
```

### 3. **Context Management**

The evolution of context handling in agents reflects evaluation insights about information management:

```typescript
// From context-pruner.test.ts - sophisticated context management
'removes old terminal command results while keeping recent 5'
'removes large tool results'
'performs message-level pruning when other passes are insufficient'
```
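The behavior described in those test names can be sketched as a pruning pass that keeps only the most recent N terminal-command results. The message shape and function name below are illustrative, not the actual `context-pruner` implementation:

```typescript
// Illustrative pruning pass: drop all but the most recent `keep`
// terminal-command results while leaving every other message untouched.
interface Message {
  role: 'user' | 'assistant' | 'tool'
  toolName?: string
  content: string
}

function pruneTerminalResults(messages: Message[], keep = 5): Message[] {
  let seen = 0
  const kept: Message[] = []
  // Walk newest-to-oldest so the most recent results survive.
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i]
    if (m.toolName === 'run_terminal_command' && ++seen > keep) continue
    kept.unshift(m)
  }
  return kept
}
```

Pruning tool results before whole messages matches the layered strategy the test names describe: cheap, targeted passes first, message-level pruning only when those are insufficient.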

## Evaluation Impact on Agent Evolution

### Base → Base2 Evolution

The progression from `base` to `base2` demonstrates evaluation-driven improvement:

1. **Architecture**: Moved to factory pattern with specialized sub-agents
2. **Tool Usage**: More sophisticated tool selection and usage patterns
3. **Workflow**: Better task decomposition and coordination
4. **Error Handling**: Improved error prevention and recovery

### Specialized Agent Development

Specialized agents like `git-committer` reflect evaluation insights about:
- When to use structured workflows vs. free-form responses
- How to break complex tasks into manageable steps
- The importance of context gathering before action

## Conclusion

The relationship between the `.agents` folder and the evaluation research is a **direct, iterative feedback loop**:

1. **Agents are evaluated** using the commit reconstruction methodology
2. **Performance data and failure modes** inform agent improvements
3. **Updated prompts and architectures** are implemented in the `.agents` folder
4. **New agent versions** are evaluated to measure improvement

This creates a continuous improvement cycle where:
- **Evaluation research drives agent development**
- **Agent performance informs evaluation methodology refinements**
- **Real-world coding scenarios** (from git commits) ensure practical relevance
- **Systematic measurement** enables evidence-based agent improvement

The evaluation system serves as both a research tool for understanding AI coding capabilities and a development tool for improving Codebuff's agent implementations. This tight integration between evaluation and development represents a sophisticated approach to AI agent research and improvement.