
**File:** `evals/docs/README.md` (new file, 147 additions)
# Codebuff Evaluation System Documentation

This directory contains comprehensive documentation for the Codebuff Evaluation Framework - a novel system for evaluating AI coding agents through **Git Commit Reimplementation**.

## 📚 Documentation Index

### Core Documentation

1. **[Codebuff Evals Overview](./codebuff-evals-overview.md)**
- Complete system architecture and components
- Data sources and process flows
- Function diagrams and usage instructions

2. **[JSON File Analysis](./json-file-analysis.md)**
- Detailed structure analysis of evaluation data files
- Complete TypeScript/Zod schemas for all JSON formats
- Sample data insights and file size analysis

3. **[Dependencies Analysis](./dependencies-analysis.md)**
   - Complete dependency map of the Codebuff codebase
- Integration points and coupling analysis
- External library requirements

### Implementation Guides

4. **[Replication Guide](./replication-guide.md)**
- Step-by-step roadmap for building a standalone evaluation system
- Complete implementation phases and timelines
- Alternative architecture without Codebuff dependencies

5. **[Agents Relationship Analysis](./agents-relationship-analysis.md)**
   - Deep analysis of how the `.agents` folder relates to evaluation research
- Feedback loop between evaluation results and agent development
- Evolution of agent architectures based on eval insights

## 🎯 Key Insights

### Revolutionary Evaluation Methodology

The Codebuff evaluation system introduces a **Commit Reconstruction Methodology** that:

- **Tests real-world scenarios** using actual git commits from production codebases
- **Enables interactive evaluation** through multi-turn conversations with a prompting agent
- **Provides comprehensive scoring** across completion, efficiency, and code quality dimensions
- **Scales to enterprise codebases** with sophisticated token management and parallel execution
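The scoring dimensions listed above can be combined into a single aggregate. The sketch below is illustrative only: the dimension names come from this document, but the weights and the linear-combination shape are assumptions, not the system's actual formula.

```typescript
// Illustrative aggregate over the scoring dimensions mentioned above.
// The dimension names match the text; the weights are assumptions.
interface EvalScores {
  completion: number   // how much of the commit was reimplemented (0..1)
  efficiency: number   // token/cost efficiency relative to a baseline (0..1)
  codeQuality: number  // judge-assessed quality of the resulting code (0..1)
}

function overallScore(
  s: EvalScores,
  weights = { completion: 0.5, efficiency: 0.2, codeQuality: 0.3 },
): number {
  return (
    s.completion * weights.completion +
    s.efficiency * weights.efficiency +
    s.codeQuality * weights.codeQuality
  )
}
```

A weighted sum like this makes the trade-off explicit: a fast but incomplete run cannot outscore a complete one unless the weights say it should.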

### Data Scale and Scope

The evaluation dataset includes:

| Repository | File Size | Commits | Focus |
|------------|-----------|---------|-------|
| **Saleor** | 37MB | Large set | E-commerce enterprise scenarios |
| **Manifold** | 8.1MB | Medium set | Prediction market business logic |
| **Codebuff** | 7.6MB | 13+ commits | Internal development patterns |
| **Plane** | 4.5MB | Medium set | Project management workflows |

### Architecture Excellence

The system demonstrates sophisticated engineering:

- **Multi-agent orchestration** with prompting agents guiding coding agents
- **Robust judging system** using multiple AI judges with median selection
- **Process isolation** with proper cleanup and error handling
- **Token-aware processing** with intelligent context truncation
- **Comprehensive metrics** tracking cost, performance, and quality
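The "multiple AI judges with median selection" point above can be sketched as a plain median over per-judge scores. This is a generic median function, not the system's actual judging code:

```typescript
// Select the median of the judges' scores. The median is robust to a
// single outlier judge in a way that the mean is not.
function medianJudgeScore(scores: number[]): number {
  if (scores.length === 0) throw new Error('no judge scores')
  const sorted = [...scores].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2
}
```

With three judges, one wildly high or low score cannot move the result, which is presumably the point of median selection over averaging.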

## 🔄 Research Feedback Loop

The documentation reveals a sophisticated feedback loop:

```mermaid
graph TB
A[Agent Definitions] --> B[Evaluation System]
B --> C[Performance Analysis]
C --> D[Agent Improvements]
D --> A

style A fill:#e1f5fe
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
```

This creates continuous improvement where:
- Evaluation failures inform prompt engineering
- Performance patterns guide architecture decisions
- Real-world scenarios ensure practical relevance

## 🛠️ Technical Implementation

### Core Technologies
- **AI Models**: Claude, GPT-4, and Gemini, used for different components
- **Schema Validation**: Zod for runtime type safety
- **Concurrency**: p-limit for controlled parallel execution
- **Git Operations**: Direct git command-line interface
- **Token Management**: tiktoken for context length control
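p-limit caps how many evaluations run concurrently. A minimal limiter with the same call shape can be written in a few lines; this is a stand-in sketch for the real package, not its source:

```typescript
// Minimal p-limit-style concurrency limiter (sketch, not the p-limit source).
// At most `max` wrapped functions run at once; the rest wait in a FIFO queue.
function createLimiter(max: number) {
  let active = 0
  const queue: (() => void)[] = []
  const release = () => {
    active--
    queue.shift()?.() // wake the next waiter, if any
  }
  return async function limit<T>(fn: () => Promise<T>): Promise<T> {
    if (active >= max) await new Promise<void>((resolve) => queue.push(resolve))
    active++
    try {
      return await fn()
    } finally {
      release()
    }
  }
}
```

Usage mirrors p-limit: `const limit = createLimiter(4)` and then `Promise.all(evals.map((e) => limit(() => runEval(e))))` keeps at most four evaluations in flight.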

### Integration Points
- **Backend**: LLM APIs, token counting, user input management
- **Common**: Utilities, model configurations, agent constants
- **SDK**: Codebuff client for agent execution
- **NPM App**: Agent loading, credential management

## 🚀 Replication Roadmap

For those looking to replicate this system:

| Phase | Duration | Components |
|-------|----------|------------|
| **Infrastructure** | 2-3 weeks | AI model integration, token counting |
| **Agent System** | 3-4 weeks | Tool framework, agent runners |
| **Repository Management** | 1-2 weeks | Git operations, isolation |
| **Orchestration** | 2-3 weeks | Evaluation pipeline, judging |
| **Testing & Refinement** | 2-3 weeks | Integration, optimization |

**Total Estimated Effort**: 10-15 weeks for a complete standalone system

## 📊 Research Value

This evaluation framework represents significant value for AI research:

1. **Novel Methodology**: First comprehensive commit reconstruction evaluation system
2. **Real-world Relevance**: Tests on actual production code scenarios
3. **Comprehensive Metrics**: Multi-dimensional scoring with AI judge analysis
4. **Scalable Architecture**: Handles enterprise-scale codebases
5. **Research Insights**: Direct feedback loop for agent improvement

## 🔍 Key Files Reference

- **Orchestration**: `run-git-evals.ts`, `run-eval-set.ts`
- **Agent Integration**: `runners/codebuff.ts`, `runners/claude.ts`
- **Judging**: `judge-git-eval.ts` with multi-judge robustness
- **Repository Management**: `setup-test-repo.ts` with authentication
- **Data Generation**: `pick-commits.ts`, `gen-evals.ts`
- **Analysis**: `post-eval-analysis.ts` for aggregate insights
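Taken together, the files above suggest a pipeline of roughly the following shape. Every type and function name here is hypothetical, inferred from the file names listed, not from the actual exports:

```typescript
// Hypothetical end-to-end pipeline inferred from the file names above.
// All types and function signatures are illustrative, not the real exports.
interface EvalCommit { sha: string; spec: string }
interface EvalResult { sha: string; diff: string; costUsd: number }
interface JudgedResult extends EvalResult { score: number }

interface PipelineDeps {
  pickCommits: (repo: string) => Promise<EvalCommit[]>   // cf. pick-commits.ts
  runEval: (c: EvalCommit) => Promise<EvalResult>        // cf. run-git-evals.ts
  judge: (r: EvalResult) => Promise<JudgedResult>        // cf. judge-git-eval.ts
  analyze: (rs: JudgedResult[]) => { meanScore: number } // cf. post-eval-analysis.ts
}

async function runEvalSet(repo: string, deps: PipelineDeps) {
  const commits = await deps.pickCommits(repo)
  const judged: JudgedResult[] = []
  for (const commit of commits) {
    judged.push(await deps.judge(await deps.runEval(commit)))
  }
  return deps.analyze(judged)
}
```

Injecting the stages as dependencies keeps the orchestration testable with stubs, which matches the clean file-per-stage split visible in the file list.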

## 📈 Future Directions

The evaluation system enables research into:
- **Agent architecture optimization** through systematic testing
- **Tool usage pattern analysis** via comprehensive logging
- **Error pattern identification** for targeted improvements
- **Cost-performance optimization** through detailed metrics
- **Scaling behavior analysis** across different codebase sizes

This documentation provides a complete picture of a sophisticated AI evaluation system that bridges the gap between research and practical AI coding assistance.

---

**File:** `evals/docs/agents-relationship-analysis.md` (new file, 238 additions)
# Relationship Between .agents Folder and Evals Research

This document analyzes the relationship between the prompts and agents defined in the `.agents` folder and the research conducted through the evaluation system.

## Overview

The `.agents` folder contains the actual agent definitions that are evaluated by the evals system, creating a direct feedback loop between agent development and evaluation research.

## Agent Architecture in Context

### Agent Types Being Evaluated

Based on the evals system analysis and agent definitions:

1. **Base Agents** (`base.ts`, `base2.ts`)
- Primary general-purpose coding agents
- Core subjects of evaluation research
- Use Claude 4 Sonnet as the underlying model

2. **Specialized Agents** (`git-committer.ts`, `reviewer.ts`, etc.)
- Task-specific agents for specialized workflows
- Secondary evaluation targets
- Test specific capabilities and behaviors

### Direct Evaluation Relationships

#### 1. **Agent Selection in Evals**

From the evaluation runners, we can see the evals system directly references agents by ID:

```typescript
// From runners/codebuff.ts
export class CodebuffRunner implements Runner {
  constructor(runState: RunState, agent?: string) {
    this.agent = agent ?? 'base' // Default to 'base' agent
  }
}
```

The evaluation system tests these specific agent types:
- `base` - Primary general coding agent
- `base2` - Enhanced version with improved architecture
- `base-lite` - Lighter weight version
- Custom agents from `.agents` directory

#### 2. **Agent Loading Integration**

```typescript
// From runners/codebuff.ts
const agentsPath = path.join(__dirname, '../../../.agents')
const localAgentDefinitions = Object.values(
  await loadLocalAgents({ agentsPath })
)
```

The evals system dynamically loads agent definitions from the `.agents` folder, making it easy to:
- Test new agent iterations
- Compare different agent approaches
- Evaluate specialized vs. general-purpose agents

## Research-Informed Agent Development

### 1. **Prompting Strategies Influenced by Evals**

The evaluation results directly inform agent prompt engineering. For example, the base prompts include specific guidance likely derived from eval insights:

```typescript
// From base-prompts.ts - Testing guidance
'**Testing:** If you create a unit test, you should run it using `run_terminal_command` to see if it passes, and fix it if it doesn\'t.'

// Package management best practices
'**Package Management:** When adding new packages, use the run_terminal_command tool to install the package rather than editing the package.json file...'
```

These specific instructions likely emerged from evaluation findings showing agents making common mistakes.

### 2. **Tool Usage Patterns**

The evaluation system tests how agents use tools, and this feedback influences tool selection and usage patterns in agent definitions:

```typescript
// From git-committer.ts
toolNames: ['read_files', 'run_terminal_command', 'add_message', 'end_turn']
```

The careful selection of tools and their usage patterns in the `handleSteps` function reflects lessons learned from evaluation research.

### 3. **Multi-Agent Architecture Evolution**

The evolution from `base` to `base2` demonstrates research-driven agent improvement:

```typescript
// base2 uses a factory pattern with specialized sub-agents
import { base2 } from './base2-factory'

// base2-factory.ts includes guidance like:
'Don\'t mastermind the task. Rely on your agents\' judgement to plan, implement, and review the code.'
```

This architectural change likely resulted from evaluation findings about task decomposition and agent coordination.

## Evaluation-Driven Design Patterns

### 1. **Structured Agent Workflows**

The `git-committer` agent demonstrates a structured workflow that was likely refined through evaluation:

```typescript
handleSteps: function* ({ agentState, prompt, params }: AgentStepContext) {
  // Step 1: Run git diff and git log to analyze changes
  yield { toolName: 'run_terminal_command', input: { command: 'git diff' } }

  // Step 2: Read relevant files for context
  yield { toolName: 'add_message', ... }

  // Step 3: Let AI generate next step
  yield 'STEP'

  // Step 4: Create commit
  yield 'STEP_ALL'
}
```

This systematic approach to breaking down tasks reflects insights from evaluation research about agent decision-making patterns.

### 2. **Error Prevention Strategies**

Agent prompts include specific error prevention guidance derived from evaluation findings:

```typescript
// From base prompts - addressing common eval failures
'You must base your future write_file/str_replace edits off of the latest changes. You must try to accommodate the changes that the user has made...'

'Always run hooks for TypeScript/JavaScript changes, test file changes, or when the changes could affect compilation/tests'
```

### 3. **Quality Assurance Integration**

The emphasis on testing and verification in agent prompts reflects evaluation insights:

```typescript
// From base prompts
'Check the knowledge files to see if the user has specified a further protocol for what terminal commands should be run to verify edits. For example, a `knowledge.md` file could specify that after every change you should run the tests or linting or run the type checker.'
```

## Research Feedback Loop

```mermaid
graph TB
A[Agent Definitions in .agents/] --> B[Evaluation System]
B --> C[Performance Metrics]
C --> D[Analysis of Failures]
D --> E[Prompt Engineering Insights]
E --> F[Updated Agent Definitions]
F --> A

B --> G[Conversation Traces]
G --> H[Tool Usage Patterns]
H --> I[Workflow Optimization]
I --> F

C --> J[Scoring Dimensions]
J --> K[Quality Metrics]
K --> L[Agent Architecture Changes]
L --> F
```

## Specific Evaluation Insights Reflected in Agents

### 1. **File Handling Patterns**

Agent prompts include detailed guidance on file operations, likely informed by evaluation failures:

```typescript
// Emphasis on reading before editing
'Analyze surrounding code, tests, and configuration first'

// Careful handling of user modifications
'You must base your future write_file/str_replace edits off of the latest changes'
```

### 2. **Terminal Command Best Practices**

Specific terminal usage patterns reflect evaluation learnings:

```typescript
// Package installation best practices
'use the run_terminal_command tool to install the package rather than editing the package.json file with a guess at the version number'

// Command chaining for verification
'you should run them all using \'&&\' to concatenate them into one commands, e.g. `npm run lint && npm run test`'
```

### 3. **Context Management**

The evolution of context handling in agents reflects evaluation insights about information management:

```typescript
// From context-pruner.test.ts - sophisticated context management
'removes old terminal command results while keeping recent 5'
'removes large tool results'
'performs message-level pruning when other passes are insufficient'
```
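The behavior described in those test names can be sketched as a pruning pass that keeps only the most recent N terminal-command results. The message shape and function name below are illustrative, not the actual `context-pruner` implementation:

```typescript
// Illustrative pruning pass: drop all but the most recent `keep`
// terminal-command results while leaving every other message untouched.
interface Message {
  role: 'user' | 'assistant' | 'tool'
  toolName?: string
  content: string
}

function pruneTerminalResults(messages: Message[], keep = 5): Message[] {
  let seen = 0
  const kept: Message[] = []
  // Walk newest-to-oldest so the most recent results survive.
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i]
    if (m.toolName === 'run_terminal_command' && ++seen > keep) continue
    kept.unshift(m)
  }
  return kept
}
```

Pruning tool results before whole messages matches the layered strategy the test names describe: cheap, targeted passes first, message-level pruning only when those are insufficient.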

## Evaluation Impact on Agent Evolution

### Base → Base2 Evolution

The progression from `base` to `base2` demonstrates evaluation-driven improvement:

1. **Architecture**: Moved to factory pattern with specialized sub-agents
2. **Tool Usage**: More sophisticated tool selection and usage patterns
3. **Workflow**: Better task decomposition and coordination
4. **Error Handling**: Improved error prevention and recovery

### Specialized Agent Development

Specialized agents like `git-committer` reflect evaluation insights about:
- When to use structured workflows vs. free-form responses
- How to break complex tasks into manageable steps
- The importance of context gathering before action

## Conclusion

The relationship between the `.agents` folder and the evaluation research is a **direct, iterative feedback loop**:

1. **Agents are evaluated** using the commit reconstruction methodology
2. **Performance data and failure modes** inform agent improvements
3. **Updated prompts and architectures** are implemented in the `.agents` folder
4. **New agent versions** are evaluated to measure improvement

This creates a continuous improvement cycle where:
- **Evaluation research drives agent development**
- **Agent performance informs evaluation methodology refinements**
- **Real-world coding scenarios** (from git commits) ensure practical relevance
- **Systematic measurement** enables evidence-based agent improvement

The evaluation system serves as both a research tool for understanding AI coding capabilities and a development tool for improving Codebuff's agent implementations. This tight integration between evaluation and development represents a sophisticated approach to AI agent research and improvement.