From f80531c35672319326541b53135ea4ef75d94d50 Mon Sep 17 00:00:00 2001 From: Karakatiza666 Date: Wed, 25 Feb 2026 09:39:48 +0000 Subject: [PATCH 1/3] Introduce per-directory CLAUDE.md LLM files in a separate branch Signed-off-by: Karakatiza666 --- .github/workflows/CLAUDE.md | 504 +++++++++++++++++++ CLAUDE.md | 311 +++++++++++- benchmark/CLAUDE.md | 218 ++++++++ crates/CLAUDE.md | 118 +++++ crates/adapterlib/CLAUDE.md | 261 ++++++++++ crates/adapters/CLAUDE.md | 284 +++++++++++ crates/adapters/src/adhoc/CLAUDE.md | 388 ++++++++++++++ crates/adapters/src/controller/CLAUDE.md | 243 +++++++++ crates/adapters/src/format/CLAUDE.md | 527 +++++++++++++++++++ crates/adapters/src/integrated/CLAUDE.md | 453 +++++++++++++++++ crates/adapters/src/transport/CLAUDE.md | 467 +++++++++++++++++ crates/datagen/CLAUDE.md | 253 ++++++++++ crates/dbsp/CLAUDE.md | 218 ++++++++ crates/dbsp/src/CLAUDE.md | 427 ++++++++++++++++ crates/fda/CLAUDE.md | 303 +++++++++++ crates/feldera-types/CLAUDE.md | 282 +++++++++++ crates/fxp/CLAUDE.md | 293 +++++++++++ crates/iceberg/CLAUDE.md | 279 ++++++++++ crates/ir/CLAUDE.md | 343 +++++++++++++ crates/nexmark/CLAUDE.md | 263 ++++++++++ crates/pipeline-manager/CLAUDE.md | 309 ++++++++++++ crates/rest-api/CLAUDE.md | 234 +++++++++ crates/sqllib/CLAUDE.md | 309 ++++++++++++ crates/storage/CLAUDE.md | 274 ++++++++++ deploy/CLAUDE.md | 307 +++++++++++ docs.feldera.com/CLAUDE.md | 165 ++++++ js-packages/CLAUDE.md | 241 +++++++++ js-packages/profiler-app/CLAUDE.md | 218 ++++++++ js-packages/profiler-layout/CLAUDE.md | 153 ++++++ js-packages/profiler-lib/CLAUDE.md | 532 ++++++++++++++++++++ js-packages/web-console/CLAUDE.md | 126 +++++ js-packages/web-console/src/lib/CLAUDE.md | 1 + python/CLAUDE.md | 207 ++++++++ python/tests/CLAUDE.md | 121 +++++ scripts/CLAUDE.md | 43 ++ sql-to-dbsp-compiler/CLAUDE.md | 188 +++++++ sql-to-dbsp-compiler/SQL-compiler/CLAUDE.md | 307 +++++++++++ 37 files changed, 10158 insertions(+), 12 deletions(-) create mode 100644 
.github/workflows/CLAUDE.md create mode 100644 benchmark/CLAUDE.md create mode 100644 crates/CLAUDE.md create mode 100644 crates/adapterlib/CLAUDE.md create mode 100644 crates/adapters/CLAUDE.md create mode 100644 crates/adapters/src/adhoc/CLAUDE.md create mode 100644 crates/adapters/src/controller/CLAUDE.md create mode 100644 crates/adapters/src/format/CLAUDE.md create mode 100644 crates/adapters/src/integrated/CLAUDE.md create mode 100644 crates/adapters/src/transport/CLAUDE.md create mode 100644 crates/datagen/CLAUDE.md create mode 100644 crates/dbsp/CLAUDE.md create mode 100644 crates/dbsp/src/CLAUDE.md create mode 100644 crates/fda/CLAUDE.md create mode 100644 crates/feldera-types/CLAUDE.md create mode 100644 crates/fxp/CLAUDE.md create mode 100644 crates/iceberg/CLAUDE.md create mode 100644 crates/ir/CLAUDE.md create mode 100644 crates/nexmark/CLAUDE.md create mode 100644 crates/pipeline-manager/CLAUDE.md create mode 100644 crates/rest-api/CLAUDE.md create mode 100644 crates/sqllib/CLAUDE.md create mode 100644 crates/storage/CLAUDE.md create mode 100644 deploy/CLAUDE.md create mode 100644 docs.feldera.com/CLAUDE.md create mode 100644 js-packages/CLAUDE.md create mode 100644 js-packages/profiler-app/CLAUDE.md create mode 100644 js-packages/profiler-layout/CLAUDE.md create mode 100644 js-packages/profiler-lib/CLAUDE.md create mode 100644 js-packages/web-console/CLAUDE.md create mode 100644 js-packages/web-console/src/lib/CLAUDE.md create mode 100644 python/CLAUDE.md create mode 100644 python/tests/CLAUDE.md create mode 100644 scripts/CLAUDE.md create mode 100644 sql-to-dbsp-compiler/CLAUDE.md create mode 100644 sql-to-dbsp-compiler/SQL-compiler/CLAUDE.md diff --git a/.github/workflows/CLAUDE.md b/.github/workflows/CLAUDE.md new file mode 100644 index 00000000000..9fcabd6dec8 --- /dev/null +++ b/.github/workflows/CLAUDE.md @@ -0,0 +1,504 @@ +## Overview + +The `.github/workflows/` directory contains GitHub Actions workflows that form a comprehensive CI/CD 
ecosystem for the Feldera platform. These workflows orchestrate a sophisticated development pipeline that spans multiple languages, platforms, and deployment targets. + +### Workflow Ecosystem Architecture + +The Feldera CI/CD pipeline is designed around a **hierarchical workflow architecture** with clear separation of concerns: + +#### **Primary Orchestration Layer** +- **`ci.yml`** serves as the main orchestrator for pull requests and main branch changes +- **`ci-pre-mergequeue.yml`** provides fast feedback for pull request validation +- **`ci-release.yml`** coordinates the complete release process +- **`ci-post-release.yml`** handles post-release tasks like package publishing + +#### **Build Foundation Layer** +- **`build-rust.yml`** - Multi-platform Rust compilation (AMD64/ARM64) +- **`build-java.yml`** - Java-based SQL compiler and Calcite integration +- **`build-docs.yml`** - Documentation site generation with Docusaurus +- **`build-docker.yml`** - Production Docker images for deployment +- **`build-docker-dev.yml`** - Development environment images for CI + +#### **Testing Validation Layer** +- **`test-unit.yml`** - Comprehensive unit testing across all components +- **`test-integration.yml`** - Docker functionality and network isolation testing +- **`test-adapters.yml`** - Multi-architecture I/O adapter validation +- **`test-java.yml`** / **`test-java-nightly.yml`** - SQL compiler testing with extended nightly runs + +#### **Quality Assurance Layer** +- **`docs-linkcheck.yml`** - Daily documentation link validation +- **`check-failures.yml`** - Slack notifications for critical workflow failures + +#### **Publication Layer** +- **`publish-crates.yml`** - Rust crate publishing to crates.io +- **`publish-python.yml`** - Python SDK publishing to PyPI + +### Key Architectural Principles + +#### **Multi-Platform First** +- Native AMD64 and ARM64 support across the entire pipeline +- Uses Kubernetes-based runners (`k8s-runners-amd64`, `k8s-runners-arm64`) for 
scalable execution +- Ensures Feldera runs consistently across different hardware architectures + +#### **Containerized Execution** +- Most workflows run in the standardized `feldera-dev` container +- Provides consistent build environments across different runner types +- Eliminates "works on my machine" issues through environment standardization + +#### **Performance Optimization** +- **Caching Strategy**: Extensive use of sccache with S3-compatible storage for Rust builds +- **Parallel Execution**: Independent jobs run simultaneously to minimize total pipeline time +- **Artifact Sharing**: Build artifacts are shared between workflows to avoid redundant compilation +- **Documentation-Only Optimization**: Workflows skip expensive CI jobs when only documentation files are changed (see [Documentation-Only Change Optimization](#documentation-only-change-optimization)) +- **Resource Selection**: Use appropriate runner types for different workloads + +#### **Comprehensive Language Support** +- **Rust**: Core DBSP engine, adapters, and runtime components +- **Java**: SQL-to-DBSP compiler with Apache Calcite integration +- **Python**: SDK and client libraries with comprehensive testing +- **TypeScript/JavaScript**: Documentation site and web console components + +#### **Release Automation** +- **Semantic Versioning**: Automated version management across all components +- **Multi-Registry Publishing**: Simultaneous publishing to crates.io, PyPI, and container registries +- **Documentation Deployment**: Automatic documentation updates for each release + +#### **Quality Gates** +- **Multi-Stage Validation**: Code must pass unit tests, integration tests, and adapter tests +- **Cross-Platform Testing**: All components validated on both AMD64 and ARM64 +- **Documentation Validation**: Links and content verified before release +- **External Service Testing**: Real integration testing with Kafka, PostgreSQL, Redis, and cloud services + +### Workflow Interdependencies + +The 
workflows form a **directed acyclic graph (DAG)** of dependencies: + +``` +ci.yml (orchestrator) +├── build-rust.yml ──────┬──> test-unit.yml +├── build-java.yml ──────┤ test-adapters.yml +├── build-docs.yml │ test-integration.yml +├── ... +└── Dependencies ────────┴──> build-docker.yml + └──> publish-* workflows +``` + +### Development Lifecycle Integration + +#### **Pull Request Flow** +1. **`ci-pre-mergequeue.yml`** provides immediate feedback +2. **`ci.yml`** runs comprehensive validation +3. Quality gates ensure code meets standards before merge + +#### **Release Flow** +1. **`ci-release.yml`** creates release from specified commit +2. **`ci-post-release.yml`** publishes packages to registries +3. **`build-docker.yml`** creates versioned container images +4. **`build-docs.yml`** updates documentation site + +#### **Maintenance Flow** +1. **`docs-linkcheck.yml`** runs daily to catch documentation issues +2. **`test-java-nightly.yml`** provides extended testing coverage +3. **`check-failures.yml`** monitors critical workflows and alerts team + +This ecosystem provides **fast feedback loops** for developers while ensuring **comprehensive validation** and **reliable automation** for the entire Feldera platform development lifecycle. + +## Workflows + +### `build-docker.yml` - Docker Image Build and Push + +**Purpose**: Builds and publishes Docker images for Feldera components to container registries. 
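The multi-architecture build-and-push pattern described in this section can be sketched as follows. This is an illustrative fragment, not the actual workflow: the image name, tag scheme, and action versions are assumptions.

```yaml
jobs:
  docker:
    runs-on: ubuntu-latest            # the real workflow uses k8s-runners-amd64
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3     # emulation for cross-arch builds
      - uses: docker/setup-buildx-action@v3   # BuildKit builder
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          platforms: linux/amd64,linux/arm64  # single multi-arch manifest
          push: true
          tags: |
            feldera/pipeline-manager:latest
            feldera/pipeline-manager:${{ github.ref_name }}
          cache-from: type=gha                # BuildKit layer caching
          cache-to: type=gha,mode=max
```

The `platforms` line is what produces a single image manifest covering both AMD64 and ARM64.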
+ +**Triggers**: +- Manual workflow dispatch +- Push to main branch (for latest tags) +- Release events (for version tags) + +**Key Features**: +- Multi-architecture builds (AMD64, ARM64) +- Registry support for Docker Hub, GitHub Container Registry, and AWS ECR +- Builds multiple Docker images: + - `pipeline-manager` - Main API service + - `sql-to-dbsp-compiler` - SQL compilation service + - Additional utility images +- Uses BuildKit for advanced build features +- Implements layer caching for faster builds +- Tags images with version numbers and latest tags +- Supports both development and release builds + +**Runners**: Uses `k8s-runners-amd64` for containerized builds with Docker support. + +**Security**: +- Registry authentication via GitHub secrets +- SBOM (Software Bill of Materials) generation +- Image scanning integration + +**Outputs**: Published Docker images ready for deployment in various environments. + +--- + +### `ci.yml` - Main Continuous Integration Pipeline + +**Purpose**: Orchestrates the complete CI pipeline for pull requests and main branch changes. 
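An orchestrator of this kind is typically a thin workflow that fans out to reusable workflows via `workflow_call` and wires their dependencies with `needs`. A hedged sketch (the job wiring below is assumed, not copied from the real file):

```yaml
on:
  pull_request:
  push:
    branches: [main]

jobs:
  build-rust:
    uses: ./.github/workflows/build-rust.yml
    secrets: inherit
  build-java:
    uses: ./.github/workflows/build-java.yml
    secrets: inherit
  test-unit:
    needs: build-rust                  # consumes the Rust build artifacts
    uses: ./.github/workflows/test-unit.yml
    secrets: inherit
  build-docker:
    needs: [build-rust, build-java]    # runs only after all builds succeed
    uses: ./.github/workflows/build-docker.yml
    secrets: inherit
```

Jobs with no `needs` edge between them run in parallel, which is what keeps total pipeline time down.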
+ +**Triggers**: +- Pull request events (opened, synchronized, reopened) +- Push to main branch +- Manual workflow dispatch +- Skipped if only code documentation was changed + +**Key Features**: +- **Dependency Matrix**: Builds dependency graph of all CI jobs +- **Parallel Execution**: Runs multiple test suites simultaneously +- **Multi-platform Testing**: Tests across different architectures (AMD64, ARM64) +- **Comprehensive Testing**: Includes unit tests, integration tests, Java tests, and adapter tests +- **Build Validation**: Validates Rust, Java, and documentation builds +- **Artifact Management**: Collects and stores build artifacts for downstream jobs +- **Status Reporting**: Provides consolidated CI status for pull requests + +**Workflow Structure**: +- **Build Phase**: Compiles Rust binaries, Java components, and documentation +- **Test Phase**: Executes comprehensive test suite in parallel +- **Integration Phase**: Runs integration tests with real services +- **Reporting Phase**: Aggregates results and reports status + +**Runners**: Uses mix of `k8s-runners-amd64` and `k8s-runners-arm64` for comprehensive platform coverage. + +**Dependencies**: Coordinates execution of multiple reusable workflows including build-rust, test-unit, test-integration, and others. + +--- + +### `ci-pre-mergequeue.yml` - Pre-Merge Queue Validation + +**Purpose**: Provides fast feedback for pull requests with essential validation before merge queue entry. 
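A minimal sketch of the cached, containerized validation job this section describes. The container image, bucket name, and secret names are assumptions:

```yaml
jobs:
  pre-merge:
    runs-on: k8s-runners-amd64
    container: ghcr.io/feldera/feldera-dev:latest   # image name is an assumption
    env:
      RUSTC_WRAPPER: sccache                  # route rustc invocations through sccache
      SCCACHE_BUCKET: ci-cache                # S3-compatible cache backend (name assumed)
      SCCACHE_ENDPOINT: ${{ secrets.CACHE_ENDPOINT }}
      AWS_ACCESS_KEY_ID: ${{ secrets.CACHE_ACCESS_KEY }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.CACHE_SECRET_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: cargo check --workspace
      - run: sccache --show-stats             # inspect cache hit rate in the logs
```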
+ +**Triggers**: +- Pull request events (opened, synchronized) +- Skipped if only code documentation was changed + +**Key Features**: +- **Fast Feedback**: Lightweight validation for quick pull request feedback +- **Caching Optimization**: Uses sccache with S3-compatible storage for build caching +- **GitHub App Integration**: Generates tokens for secure repository access +- **Single Job Execution**: Streamlined validation in a single containerized job + +**Performance Optimizations**: +- **Build Caching**: RUSTC_WRAPPER with sccache for faster Rust compilation +- **S3 Cache Backend**: Distributed caching across CI runners +- **Container Environment**: Runs in feldera-dev container with pre-installed dependencies + +**Security**: +- GitHub App token generation for secure API access +- S3 credentials for cache access via GitHub secrets +- Containerized execution for isolation + +**Runners**: Uses `k8s-runners-amd64` with containerized execution environment. + +--- + +### `ci-release.yml` - Release Creation + +**Purpose**: Creates new releases by dispatching the release process for specified commits. + +**Triggers**: +- Repository dispatch events (trigger-oss-release) +- Manual workflow dispatch with SHA and version inputs + +**Key Features**: +- **Release Coordination**: Initiates the complete release process +- **Version Management**: Handles version specification and validation +- **GitHub App Integration**: Uses app tokens for secure repository operations +- **Flexible Triggering**: Supports both automated and manual release triggers + +**Input Parameters**: +- `sha_to_release`: Specific commit SHA to create release from +- `version`: Version string for the release + +**Security**: Uses GitHub App tokens for repository write permissions and secure release creation. + +--- + +### `build-rust.yml` - Rust Build Workflow + +**Purpose**: Compiles Rust components across multiple platforms and architectures. 
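The two build targets map naturally onto a job matrix; a hedged sketch of the shape such a reusable workflow might take (step details are assumptions):

```yaml
on:
  workflow_call:        # invoked only from other workflows

jobs:
  build:
    strategy:
      matrix:
        include:
          - target: x86_64-unknown-linux-gnu
            runner: k8s-runners-amd64
          - target: aarch64-unknown-linux-gnu
            runner: k8s-runners-arm64
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - run: cargo build --release --target ${{ matrix.target }}
      - uses: actions/upload-artifact@v4      # binaries feed Docker builds and tests
        with:
          name: binaries-${{ matrix.target }}
          path: target/${{ matrix.target }}/release/
```

Building natively on an ARM64 runner, rather than cross-compiling, is what the matrix `runner` column expresses.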
+ +**Triggers**: +- Called by other workflows (workflow_call) + +**Key Features**: +- **Multi-platform Compilation**: Builds for AMD64 and ARM64 architectures +- **Optimized Builds**: Uses release mode with optimizations +- **Artifact Management**: Stores compiled binaries for downstream workflows +- **Caching**: Leverages Rust build caching for faster compilation +- **Cross-compilation**: Handles cross-platform build requirements + +**Build Targets**: +- `x86_64-unknown-linux-gnu` (AMD64) +- `aarch64-unknown-linux-gnu` (ARM64) + +**Outputs**: Compiled Rust binaries uploaded as GitHub Actions artifacts for use in Docker builds and testing. + +--- + +### `test-integration.yml` - Integration Testing + +**Purpose**: Runs comprehensive integration tests including network isolation, Docker functionality, and end-to-end scenarios. + +**Triggers**: +- Called by other workflows (workflow_call) +- Manual workflow dispatch with run_id input + +**Key Features**: +- **Network Isolation Testing**: Validates pipeline-manager works without network access +- **Docker Integration**: Tests Docker container functionality and health checks +- **Service Validation**: Ensures services start correctly and respond to health checks +- **Multi-Job Testing**: Runs various integration scenarios in parallel + +**Test Scenarios**: +- `manager-no-network`: Tests pipeline-manager in isolated network environment +- Container health check validation +- Service startup and readiness verification +- Docker image functionality testing + +**Infrastructure**: Uses Docker containers with custom networks for isolation testing. + +--- + +### `build-docs.yml` - Documentation Build + +**Purpose**: Builds and validates the Feldera documentation website. 
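A plausible shape for such a job, assuming a standard Docusaurus setup under `docs.feldera.com/` (step details and Node version are assumptions):

```yaml
jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
        working-directory: docs.feldera.com
      - run: npm run build        # Docusaurus fails the build on broken internal links
        working-directory: docs.feldera.com
      - uses: actions/upload-artifact@v4
        with:
          name: docs-site
          path: docs.feldera.com/build/
```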
+ +**Triggers**: +- Called by other workflows (workflow_call) + +**Key Features**: +- **Docusaurus Build**: Compiles the static documentation site +- **Link Validation**: Ensures all internal links are valid +- **OpenAPI Integration**: Includes API documentation generation +- **Multi-format Support**: Handles MDX, images, and interactive content + +**Outputs**: Static documentation site ready for deployment to docs.feldera.com. + +--- + +### `test-unit.yml` - Unit Testing + +**Purpose**: Executes comprehensive unit test suites for all Feldera components. + +**Triggers**: +- Called by other workflows (workflow_call) + +**Key Features**: +- **Comprehensive Coverage**: Tests Rust, Python, and Java components +- **Parallel Execution**: Runs test suites in parallel for faster feedback +- **Test Reporting**: Aggregates and reports test results +- **Failure Analysis**: Provides detailed failure information + +**Test Suites**: +- Rust unit tests for core DBSP functionality +- Python SDK tests +- SQL compiler tests +- Integration library tests + +--- + +### `publish-crates.yml` & `publish-python.yml` - Package Publishing + +**Purpose**: Publishes Rust crates to crates.io and Python packages to PyPI during releases. + +**Triggers**: +- Called by release workflows + +**Key Features**: +- **Automated Publishing**: Publishes packages with proper versioning +- **Dependency Management**: Handles cross-package dependencies correctly +- **Registry Authentication**: Securely authenticates with package registries +- **Publication Validation**: Verifies successful package publication + +--- + +### `test-adapters.yml` - Adapter Testing + +**Purpose**: Tests I/O adapters across multiple architectures with real external services. 
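The PostgreSQL-backed, multi-architecture matrix described here might be wired up as follows. The crate name, container image, and credentials are assumptions; the `services` health check is the standard GitHub Actions mechanism for waiting until the database accepts connections:

```yaml
jobs:
  adapters:
    strategy:
      matrix:
        runner: [k8s-runners-amd64, k8s-runners-arm64]
    runs-on: ${{ matrix.runner }}
    container: ghcr.io/feldera/feldera-dev:latest   # image name is an assumption
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - run: cargo test -p adapters             # package name is an assumption
        env:
          # service label "postgres" resolves as a hostname inside the job container
          POSTGRES_URL: postgres://postgres:postgres@postgres:5432
```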
+ +**Triggers**: +- Called by other workflows (workflow_call) +- Manual workflow dispatch with run_id input + +**Key Features**: +- **Multi-architecture Testing**: Runs on both AMD64 and ARM64 platforms +- **Service Integration**: Tests with PostgreSQL and other external services +- **Container Environment**: Runs in feldera-dev container with full service stack +- **Matrix Strategy**: Tests adapter compatibility across different architectures + +**Test Infrastructure**: +- PostgreSQL 15 service for database testing +- Containerized test execution environment +- Multi-platform validation (x86_64 and aarch64) +- Service health verification and connectivity testing + +**Architecture Coverage**: Ensures adapter functionality works consistently across AMD64 and ARM64 platforms. + +--- + +### `docs-linkcheck.yml` - Documentation Link Validation + +**Purpose**: Validates all links in the documentation to ensure they remain accessible and correct. + +**Triggers**: +- Scheduled runs (daily at 16:00 UTC) +- Manual workflow dispatch + +**Key Features**: +- **External Link Checking**: Uses linkchecker to validate docs.feldera.com +- **Selective Validation**: Ignores problematic domains (localhost, crates.io, LinkedIn, etc.) +- **Python Environment**: Uses uv for dependency management +- **No Robot Restrictions**: Bypasses robots.txt restrictions for thorough checking + +**Ignored URLs**: Excludes localhost, crates.io, social media sites, and IEEE domains that commonly have access issues. + +--- + +### `test-java.yml` & `test-java-nightly.yml` - Java Testing + +**Purpose**: Executes Java-based SQL compiler tests with comprehensive validation. 
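A minimal sketch of a Maven-driven compiler test job (the Java version is an assumption; the module path follows the repository layout):

```yaml
jobs:
  sql-compiler-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: 21
          cache: maven               # cache ~/.m2 between runs
      - run: mvn --batch-mode test   # --batch-mode keeps CI logs clean
        working-directory: sql-to-dbsp-compiler/SQL-compiler
```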
+ +**Triggers**: +- `test-java.yml`: Called by other workflows (workflow_call) +- `test-java-nightly.yml`: Scheduled nightly runs for extended testing + +**Key Features**: +- **SQL Compiler Testing**: Validates SQL-to-DBSP compilation pipeline +- **Extended Test Suite**: Nightly version runs more comprehensive tests +- **Maven Integration**: Uses Maven for Java build and test execution +- **Cross-platform Testing**: Tests Java components across different environments + +**Test Coverage**: +- SQL parsing and validation +- DBSP code generation +- Compiler optimization passes +- Integration with Calcite framework + +--- + +### `ci-post-release.yml` - Post-Release Tasks + +**Purpose**: Executes post-release tasks after a release is published, including package publishing. + +**Triggers**: +- Release events (published) + +**Key Features**: +- **Package Publishing**: Coordinates publishing of Python and Rust packages +- **Post-release Automation**: Handles tasks that occur after release creation +- **Build Caching**: Uses sccache for optimized compilation during publishing +- **Parallel Publishing**: Publishes Python packages and Rust crates simultaneously + +**Release Tasks**: +- Python package publishing to PyPI +- Rust crate publishing to crates.io +- Post-release validation and verification + +--- + +### `check-failures.yml` - Failure Notification + +**Purpose**: Monitors specific workflows and sends Slack notifications when they fail. 
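The `workflow_run`-based monitoring pattern can be sketched as follows. The watched workflow names come from this section; the Slack notification step is a plain-webhook assumption, not the actual implementation:

```yaml
on:
  workflow_run:
    workflows: ["Java SLT Nightly", "Link Checker docs.feldera.com"]
    types: [completed]

jobs:
  notify:
    if: >-
      github.event.workflow_run.head_branch == 'main' &&
      (github.event.workflow_run.conclusion == 'failure' ||
       github.event.workflow_run.conclusion == 'timed_out')
    runs-on: ubuntu-latest
    steps:
      - name: Post to Slack
        run: |
          curl -X POST -H 'Content-type: application/json' \
            --data "{\"text\": \"Workflow failed: ${{ github.event.workflow_run.html_url }}\"}" \
            "$SLACK_WEBHOOK_URL"
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```

Note that `workflow_run` fires on completion regardless of outcome, so the `if` condition does the actual failure filtering.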
+ +**Triggers**: +- Workflow run completion events for monitored workflows +- Specifically watches "Java SLT Nightly" and "Link Checker docs.feldera.com" + +**Key Features**: +- **Selective Monitoring**: Only monitors critical workflows on the main branch +- **Slack Integration**: Sends formatted notifications to Slack channels +- **Failure Detection**: Triggers on both failure and timeout conditions +- **Rich Notifications**: Includes links to failed workflow runs and repository + +**Monitored Conditions**: +- Workflow failure or timeout +- Main branch executions only +- Configurable via CI_DRY_RUN variable + +--- + +### `build-java.yml` - Java Build Workflow + +**Purpose**: Compiles Java components including the SQL-to-DBSP compiler. + +**Triggers**: +- Called by other workflows (workflow_call) + +**Key Features**: +- **SQL Compiler Build**: Compiles the Java-based SQL-to-DBSP compiler +- **Gradle Caching**: Caches Gradle dependencies and Calcite builds +- **Apache Calcite Integration**: Handles Calcite framework dependency +- **Artifact Generation**: Produces JAR files for downstream workflows + +**Build Components**: +- SQL-to-DBSP compiler +- Calcite integration components +- Maven-based Java artifacts + +--- + +### `build-docker-dev.yml` - Development Docker Images + +**Purpose**: Builds the development Docker image (feldera-dev) used by CI workflows. + +**Triggers**: +- Manual workflow dispatch only + +**Key Features**: +- **CI Environment**: Creates the base container image used by other workflows +- **Multi-architecture**: Builds for both AMD64 and ARM64 platforms +- **Development Tools**: Includes all tools needed for CI/CD operations +- **Registry Publishing**: Publishes to GitHub Container Registry + +**Note**: This image contains pre-installed development dependencies and tools used across all other CI workflows, providing a consistent build environment. 
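A hedged sketch of a manually dispatched, multi-architecture dev-image build publishing to GitHub Container Registry (the Dockerfile path and image name are assumptions):

```yaml
on:
  workflow_dispatch:        # manual trigger only

jobs:
  dev-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}   # built-in token suffices for GHCR
      - uses: docker/build-push-action@v5
        with:
          file: Dockerfile.dev                    # path is an assumption
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ghcr.io/feldera/feldera-dev:latest
```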
+ +--- + +## Development Workflow Integration + +### Local Testing +- Test workflow changes in feature branches before merging +- Use the `act` tool for local workflow testing where possible +- Validate Docker builds locally before pushing + +### Secrets Management +- All sensitive data stored in GitHub secrets +- Environment-specific secrets for different deployment targets +- Regular rotation of authentication tokens + +### Monitoring and Debugging +- Workflow run logs available in GitHub Actions tab +- Failed builds trigger notifications +- Performance metrics tracked for build times and success rates + +## Documentation-Only Change Optimization + +The Feldera CI/CD pipeline has been optimized to skip expensive CI jobs when only code documentation files (`CLAUDE.md` and `README.md`) are changed. This optimization significantly reduces resource consumption and provides faster feedback for code documentation updates. + +### Implementation Strategy + +- **`ci-pre-mergequeue.yml`** uses GitHub's native `paths-ignore` filtering on its pull request triggers +- **`ci.yml`** uses custom conditional execution, since GitHub Actions doesn't support `paths-ignore` for `merge_group` events + +## Best Practices + +### Workflow Development +- **Modular Design**: Break complex workflows into reusable actions +- **Conditional Logic**: Use appropriate conditions to skip unnecessary steps +- **Resource Efficiency**: Optimize for faster execution times +- **Error Handling**: Include proper error handling and cleanup + +### Security Considerations +- **Least Privilege**: Use minimal required permissions +- **Secret Protection**: Never log or expose secrets in output +- **Dependency Scanning**: Regularly update action dependencies +- **Supply Chain Security**: Pin action versions to specific SHAs + +### Performance Optimization + +For caching, parallelism, and artifact-sharing guidance, see the [Performance Optimization](#performance-optimization) section above. + +This documentation will be expanded with additional workflow descriptions as they are analyzed and documented. 
\ No newline at end of file diff --git a/CLAUDE.md b/CLAUDE.md index 7a91641978c..4b358f7a168 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,12 +1,299 @@ -When navigating the codebase, look at relevant README.md for more project context. - -- If on a branch maybe check the last 2-3 to add commits for more context -- Look at the outstanding changes in the tree -- Write production quality code -- Make sure the code compiles -- When adding code always ensure that tests cover the newly added code: - - Unit tests that validate for regular and exceptional inputs - - Use property based testing/model based testing/fuzzing when appropriate - - Integration tests for big platform-level features (in @python/tests) - -At the start of every conversation, offer the user to run `scripts/claude.sh` to pull in shared LLM context files as unstaged changes. These should not be committed outside the `claude-context` branch. \ No newline at end of file +# CLAUDE.md + +## Repository Overview + +Feldera is an **incremental view maintenance (IVM) system** built around **DBSP (Database Stream Processor)**, a computational model that enables true incremental computation. The repository implements a complete platform where SQL queries are compiled into streaming circuits that process only changes rather than recomputing entire datasets, enabling high-performance analytics that scale with change volume rather than data size. 
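As a toy illustration of the incremental model, and not Feldera's actual API, consider maintaining a SUM-per-key materialized view from a stream of weighted changes. Each batch touches only the affected keys, so the work done is proportional to the size of the delta, not the size of the dataset:

```python
from collections import defaultdict

def apply_delta(view, delta):
    """Update a SUM-per-key view in place from a batch of
    (key, value, weight) changes; weight +1 inserts a row, -1 deletes one."""
    for key, value, weight in delta:
        view[key] += weight * value
    return view

# Materialized view: SUM(amount) GROUP BY account.
view = defaultdict(int)
apply_delta(view, [("a", 10, +1), ("b", 5, +1), ("a", 7, +1)])
assert view["a"] == 17 and view["b"] == 5

# Retract one row and insert another: only two keys are touched,
# no matter how many rows the full dataset contains.
apply_delta(view, [("a", 10, -1), ("c", 3, +1)])
assert view["a"] == 7 and view["c"] == 3
```

DBSP generalizes this idea from a single aggregate to arbitrary compositions of relational operators, each propagating weighted changes to the next.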
+ +## Core Technical Architecture + +### **DBSP: The Computational Foundation** + +At the heart of Feldera lies DBSP, a computational model for incremental stream processing: + +- **Change Propagation**: Processes streams of changes (insertions, deletions, updates) rather than full datasets +- **Algebraic Operators**: Implements relational algebra operators (join, aggregate, filter) that work incrementally +- **Nested Relational Model**: Supports complex data types and nested queries through compositional operators +- **Circuit Model**: SQL queries compile to circuits of interconnected DBSP operators that maintain incremental state + +### **SQL-to-Circuit Compilation Pipeline** + +The transformation from declarative SQL to executable circuits: + +1. **SQL Parsing & Validation** (Apache Calcite in Java) +2. **Incremental Conversion** (Automatic IVM transformation) +3. **Circuit Generation** (DBSP operator graph construction) +4. **Code Generation** (Rust implementation of circuits) +5. **Runtime Integration** (Circuit deployment and execution) + +### **Multi-Language Implementation Strategy** + +Each language serves a specific role in the IVM implementation: + +- **Rust**: High-performance DBSP runtime with zero-cost abstractions and memory safety +- **Java**: SQL compilation leveraging Apache Calcite for parsing, optimization, and incremental query planning +- **Python**: User-facing APIs and data science integration with type safety and ergonomic interfaces +- **TypeScript**: Web interface for visual pipeline development and real-time monitoring + +## Repository Architecture: IVM Implementation + +This repository implements the complete Feldera IVM system as interconnected components that transform SQL into high-performance incremental computation: + +### **SQL-to-DBSP Transformation Pipeline** + +1. **Declarative Input** → SQL queries with complex joins, aggregates, and window functions +2. 
**Incremental Planning** → Apache Calcite-based transformation to change-oriented operations +3. **Circuit Construction** → DBSP operator graphs with dataflow connections and state management +4. **Code Generation** → Rust implementations optimized for incremental execution +5. **Runtime Deployment** → Multi-threaded circuit execution with fault tolerance + +### **IVM-Centered Component Integration** + +Each repository component serves the incremental computation model: + +- **`crates/dbsp/`**: Core IVM engine implementing 50+ incremental operators +- **`sql-to-dbsp-compiler/`**: SQL→DBSP circuit transformation with IVM optimizations +- **`crates/pipeline-manager/`**: Circuit lifecycle management and deployment orchestration +- **`crates/adapters/`**: Change stream I/O for incremental data ingestion/emission +- **`python/`**: IVM-aware APIs for programmatic circuit management + +## Component Ecosystem + +### **DBSP Runtime Engine (`crates/`)** +The incremental computation runtime implemented as a Rust workspace with 15+ specialized crates: + +#### **DBSP Computational Core** +- **`dbsp/`**: The core IVM engine implementing the DBSP computational model with incremental operators (map, filter, join, aggregate), circuit execution runtime, multi-threaded scheduling with work-stealing, and persistent state management for large-scale incremental computation +- **`sqllib/`**: Runtime library providing SQL function implementations optimized for incremental execution, including incremental aggregates (SUM, COUNT, AVG with +/- updates), scalar functions with null propagation, and DBSP-compatible type conversions + +#### **Circuit Management** +- **`pipeline-manager/`**: HTTP service orchestrating the complete IVM workflow: SQL compilation requests, DBSP circuit deployment, runtime circuit lifecycle management, and integration with storage/networking for production incremental computation systems +- **`rest-api/`**: OpenAPI specifications for IVM system management ensuring 
type-safe client-server communication for circuit operations, pipeline configuration, and incremental computation monitoring + +#### **Incremental I/O Framework** +- **`adapters/`**: Change-oriented I/O system enabling incremental computation across external systems. Implements change data capture (CDC) for databases, streaming connectors (Kafka) with exactly-once semantics, and delta-oriented file processing (Parquet, Delta Lake) optimized for IVM workflows +- **`adapterlib/`**: Core abstractions for building IVM-compatible adapters including change stream protocols, incremental batch processing, circuit integration APIs, and fault tolerance mechanisms for long-running incremental computations + +#### **IVM Infrastructure** +- **`feldera-types/`**: Shared type system for the IVM platform including circuit configuration types, incremental data structures, change stream representations, and error handling across the SQL-to-DBSP compilation and runtime execution pipeline +- **`fxp/`**: High-precision decimal arithmetic with exact incremental computation properties, supporting financial applications where floating-point errors would compound across incremental updates +- **`storage/`**: Pluggable storage backend supporting incremental computation persistence, including memory-optimized storage for development and disk-based storage for production IVM deployments with efficient delta management +- **`nexmark/`**: Streaming analytics benchmark suite validating IVM performance against industry standards, with 22 complex queries testing incremental joins, aggregations, and window operations + +### **SQL-to-DBSP Compiler (`sql-to-dbsp-compiler/`)** +Java-based compilation system implementing the core IVM transformation from declarative SQL to incremental DBSP circuits: + +#### **IVM Compilation Stages** +1. **SQL Parsing**: Apache Calcite-based parsing with comprehensive SQL dialect support +2. 
**Incremental Planning**: Automatic transformation of SQL operations into change-propagating equivalents (e.g., JOIN → incremental join with state management) +3. **DBSP Lowering**: Conversion to DBSP operator graphs with explicit state management and change propagation semantics +4. **Circuit Optimization**: DBSP-specific optimizations including operator fusion, state minimization, and parallelization opportunities +5. **Rust Code Generation**: Production of type-safe Rust implementations with zero-cost abstractions + +#### **IVM-Specific Features** +- **Automatic Delta Computation**: Transforms standard SQL semantics into change-oriented operations maintaining mathematical correctness +- **Nested Query Support**: Handles complex subqueries through DBSP's nested relational model with incremental correlation +- **Recursive Query Compilation**: Implements SQL recursion using DBSP's fixed-point iteration with incremental convergence detection +- **Type System Integration**: Maps SQL types to Rust representations optimized for incremental computation and serialization + +### **IVM Management SDK (`python/`)** +Python SDK providing programmatic control over the complete IVM system lifecycle: + +#### **Circuit Management Capabilities** +- **Pipeline Orchestration**: Create, deploy, and monitor DBSP circuits through high-level Python APIs with automatic SQL-to-circuit compilation +- **Incremental Data Integration**: Pandas DataFrame integration optimized for IVM workflows, supporting efficient change detection and delta-oriented data loading +- **Circuit Testing Framework**: Shared test patterns enabling rapid iteration on IVM logic with optimized compilation cycles and deterministic change stream generation +- **Enterprise IVM Features**: Advanced incremental computation features (circuit optimization, distributed execution, persistent state management) with feature flag controls + +#### **IVM-Aware Developer Experience** +- **Circuit Type Safety**: Full type 
annotations reflecting the compiled DBSP circuit structure with incremental data type validation +- **Incremental Error Handling**: Error recovery mechanisms aware of circuit state and incremental computation semantics +- **Change Stream APIs**: Native support for change-oriented data manipulation optimized for incremental computation patterns + +### **Frontend Ecosystem (`js-packages/`)** +TypeScript/JavaScript packages providing user-facing interfaces and visualization tools for the Feldera platform. See `js-packages/CLAUDE.md` for comprehensive documentation of the frontend monorepo. + +#### **IVM Development Console (`js-packages/web-console/`)** +Production web application for visual IVM system development and monitoring: +- **SvelteKit 2.x**: Reactive framework enabling real-time circuit state visualization with Svelte 5 runes for efficient incremental UI updates +- **Circuit Monitoring**: WebSocket integration providing live DBSP circuit execution metrics, operator throughput, and incremental computation performance +- **Secure Circuit Access**: OIDC/OAuth2 authentication protecting access to production IVM deployments +- **Visual Circuit Builder**: Interactive SQL-to-DBSP compilation with real-time circuit graph visualization showing operator connections and data flow +- **IVM-Aware SQL Editor**: Syntax highlighting with incremental computation hints, circuit compilation previews, and optimization suggestions +- **Circuit Performance Dashboard**: Real-time visualization of incremental computation metrics including change propagation latency, operator memory usage, and throughput analysis + +#### **Profiler Visualization Stack** +- **`profiler-lib`**: Core visualization engine using Cytoscape.js for DBSP circuit graph rendering with callback-based API for UI integration +- **`profiler-layout`**: Reusable Svelte component library implementing common UI layout elements around the profiler diagram - ProfilerLayout, ProfilerTooltip, ProfileTimestampSelector and 
shared bundle processing utilities +- **`profiler-app`**: Standalone single-page application for offline profile analysis; it builds to a self-contained HTML file that ships with the support bundle + +#### **Profile Processing** +- Unzip-once bundle processing enabling quick timestamp switching across profile snapshots without repeated decompression + +### **Documentation (`docs.feldera.com/`)** +Comprehensive documentation ecosystem built with Docusaurus: + +#### **Content Organization** +- **Getting Started**: Installation and quickstart guides +- **SQL Reference**: Complete SQL function and operator documentation +- **Connectors**: Data source and sink connector documentation +- **Tutorials**: Step-by-step learning materials +- **API Documentation**: Auto-generated from OpenAPI specifications + +#### **Interactive Features** +- **Sandbox Integration**: Live SQL examples executable in browser +- **Multi-format Support**: MDX, diagrams, videos, and interactive components + +### **Benchmarking (`benchmark/`)** +Comprehensive performance evaluation framework: + +#### **Multi-System Comparison** +- **Feldera**: Both Rust-native and SQL implementations +- **Apache Flink**: Standalone and Kafka-integrated configurations +- **Apache Beam**: Multiple runners (Direct, Flink, Spark, Dataflow) + +#### **Industry Benchmarks** +- **NEXMark**: 22-query streaming benchmark for auction data +- **TPC-H**: Traditional analytical processing benchmark +- **TikTok**: Custom social media analytics workload + +### **CI/CD Infrastructure (`.github/workflows/`)** +Sophisticated automation ecosystem with 16+ specialized workflows: + +#### **Hierarchical Architecture** +- **Orchestration Layer**: Main CI coordination and release management +- **Build Foundation**: Multi-platform compilation (Rust, Java, docs, Docker) +- **Testing Validation**: Comprehensive testing across unit, integration, and adapter levels +- **Quality Assurance**: Link validation, failure monitoring, and maintenance +- 
**Publication**: Automated package publishing to multiple registries + +#### **Key Characteristics** +- **Multi-Platform**: Native AMD64 and ARM64 support throughout +- **Performance Optimized**: Extensive caching with S3-compatible storage +- **Containerized**: Consistent build environments with feldera-dev container + +## Technical Relationships and Data Flow + +### **Development to Production Pipeline** + +1. **SQL Development**: Developers write SQL programs using web console or Python SDK +2. **Compilation**: SQL-to-DBSP compiler generates optimized Rust circuits +3. **Runtime Execution**: DBSP engine executes circuits with incremental computation +4. **Data Integration**: Adapters handle input/output with external systems +5. **Monitoring**: Real-time metrics and observability through web console + +### **Cross-Component Integration** + +#### **Type System Consistency** +- `feldera-types` provides shared type definitions across all components +- OpenAPI specifications ensure client-server consistency +- SQL type system maps to Rust types with null safety guarantees + +#### **Configuration Management** +- Unified configuration system across pipeline-manager, adapters, and clients +- Environment-based configuration with validation and defaults +- Secret management and secure credential handling + +#### **Error Propagation** +- Structured error types with detailed context information +- Consistent error handling patterns across Rust, Java, and Python components +- User-friendly error messages with actionable guidance + +### **DBSP Performance Architecture** + +#### **Incremental Computation Model** +- **Change-Oriented Processing**: DBSP circuits process only deltas (insertions, deletions, modifications) rather than full datasets +- **Incremental State Management**: Materialized views and operator state updated incrementally with mathematical guarantees of correctness +- **Bounded Memory Usage**: Memory consumption scales with active state and change rate, not 
total dataset size, enabling processing of arbitrarily large datasets + +#### **Multi-Threaded DBSP Execution** +- **Work-Stealing Scheduler**: DBSP runtime automatically parallelizes circuit execution across available cores with dynamic load balancing +- **Operator Parallelization**: Individual DBSP operators (joins, aggregates) utilize multiple threads with lock-free data structures +- **Pipeline Concurrency**: Multiple independent DBSP circuits execute simultaneously with isolated state management + +#### **IVM-Optimized Storage** +- **Incremental Persistence**: Storage backends optimized for change-oriented workloads with efficient delta serialization +- **Circuit State Management**: Persistent storage of DBSP operator state enabling recovery and scaling of long-running incremental computations +- **Build-Time Optimization**: Development workflow acceleration through sccache and incremental compilation of DBSP circuits + +## Deployment and Operations + +### **Container Strategy** +- Multi-architecture Docker images (AMD64/ARM64) for all components +- Development containers with pre-installed dependencies +- Production containers optimized for size and security + +### **Cloud Integration** +- Kubernetes-native deployment with Helm charts +- Multi-cloud support (AWS, GCP, Azure) through cloud-agnostic abstractions +- Auto-scaling based on workload demands + +### **Observability** +- Structured logging throughout the platform +- Prometheus metrics integration +- Distributed tracing for complex pipeline debugging +- Health checks and readiness probes + +## Development Ecosystem + +### **Multi-Language Coordination** +The repository seamlessly integrates four major programming languages, each serving specific purposes while maintaining consistency through shared specifications and automated code generation. 
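
The change-oriented processing model described under "DBSP Performance Architecture" can be illustrated with a minimal sketch. This is a conceptual example only, not the actual DBSP API: an incrementally maintained `COUNT(*) GROUP BY key`, represented as a Z-set-style map of keys to weights, where each update costs time proportional to the delta rather than to the full dataset.

```rust
use std::collections::HashMap;

/// A Z-set-style delta: (key, weight). Weight +1 models an insertion,
/// -1 a deletion. Names and types are invented for illustration.
type Delta = Vec<(String, i64)>;

/// Incrementally maintain COUNT(*) GROUP BY key. Only the delta is
/// processed; the existing view is never rescanned.
fn apply_delta(state: &mut HashMap<String, i64>, delta: &Delta) {
    for (key, weight) in delta {
        let count = state.entry(key.clone()).or_insert(0);
        *count += weight;
        if *count == 0 {
            // Groups whose weight reaches zero disappear from the view.
            state.remove(key);
        }
    }
}

fn main() {
    let mut view = HashMap::new();
    apply_delta(&mut view, &vec![("a".into(), 1), ("a".into(), 1), ("b".into(), 1)]);
    assert_eq!(view.get("a"), Some(&2));
    // A later delta retracts one "a" and the only "b".
    apply_delta(&mut view, &vec![("a".into(), -1), ("b".into(), -1)]);
    assert_eq!(view.get("a"), Some(&1));
    assert_eq!(view.get("b"), None);
    println!("incremental view: {:?}", view);
}
```

Real DBSP circuits generalize this idea to arbitrary operators (joins, aggregates, recursion) with mathematical correctness guarantees, but the cost model is the same: work scales with the change, not the data.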
+ +### **Testing Philosophy** +- **Unit Testing**: Component-level testing with comprehensive coverage +- **Integration Testing**: Cross-component testing with real external services +- **Performance Testing**: Continuous benchmarking against industry standards +- **End-to-End Testing**: Complete pipeline testing from SQL to results + +### **Quality Assurance** +- **Automated Validation**: Pre-merge validation with fast feedback loops +- **Multi-Platform Testing**: Validation across AMD64 and ARM64 architectures +- **Documentation Testing**: Link validation and content verification +- **Security Scanning**: Dependency and vulnerability scanning + +### **Developer Experience** +- **Comprehensive Documentation**: Each component has detailed CLAUDE.md guidance +- **Consistent Tooling**: Standardized build and test commands across components +- **Interactive Development**: Web console for visual pipeline development +- **Performance Monitoring**: Built-in benchmarking and profiling tools + +This repository represents a complete platform for incremental stream processing, providing everything needed to develop, deploy, and operate high-performance streaming analytics at scale. The modular architecture enables both ease of development and flexibility in deployment while maintaining the performance characteristics essential for real-time data processing. + +## CLAUDE.md Authoring Guide + +Follow these governing principles when documenting the repository in CLAUDE.md files. + +### Purpose and placement + +* Write only what’s needed to work effectively in that directory's context; avoid project history; concise underlying reasoning is OK. + +### Layering and scope + +* Top level provides a high-level overview: the project’s purpose, key components, core workflows, and dependencies between major areas. +* In the top level, include exactly one short paragraph per current subdirectory describing why it exists and what lives there. Keep it concrete but concise. 
+* Subdirectory docs add progressively more detail the deeper you go. Each level narrows to responsibilities, interfaces, commands, schemas, and caveats specific to that scope. + +### DRY information flow + +* Do not repeat what a parent `CLAUDE.md` already states about a subdirectory. Instead, link up to the relevant section. +* Put cross-cutting concepts at the highest level that owns them, and reference from below. +* Keep a single source of truth for contracts, schemas, and commands; everything else links to it. + +### Clarity for Claude Code + +* Prefer crisp headings, short paragraphs, and tight bullets over prose. +* Name files, entry points, public interfaces, and primary commands explicitly where they belong. +* Call out constraints, feature flags, performance notes, and “gotchas” near the workflows they affect. + +### Maintenance rules + +* Update the highest applicable level first; ensure lower levels still defer to it. +* Remove stale sections rather than letting them drift; shorter and correct beats exhaustive and outdated. +* When documenting a new directory, add its paragraph to the top level and create its own `CLAUDE.md` that deepens—never duplicates—the parent’s description. + +### Quality checklist + +* Top level gives a true overview and one concise paragraph per important subdirectory. +* Every subdirectory doc increases detail appropriate to its scope. +* No duplication across levels; links replace repetition. +* Commands, interfaces, and data shapes are precise and current. It is OK to document different arguments for the same command for different use-cases. +* Formatting is skim-friendly and consistent across the repo. 
diff --git a/benchmark/CLAUDE.md b/benchmark/CLAUDE.md new file mode 100644 index 00000000000..b16a2033ac9 --- /dev/null +++ b/benchmark/CLAUDE.md @@ -0,0 +1,218 @@ +## Overview + +The `benchmark/` directory contains comprehensive benchmarking infrastructure for comparing Feldera's performance against other stream processing systems. It implements industry-standard benchmarks (NEXMark, TPC-H, TikTok) across multiple platforms to provide objective performance comparisons. + +## Benchmark Ecosystem + +### Supported Systems + +The benchmarking framework supports comparative analysis across: + +- **Feldera** - Both Rust-native and SQL implementations +- **Apache Flink** - Standalone and Kafka-integrated configurations +- **Apache Beam** - Multiple runners: + - Direct runner (development/testing) + - Flink runner + - Spark runner + - Google Cloud Dataflow runner + +### Benchmark Suites + +#### **NEXMark Benchmark** +- **Industry Standard**: Streaming benchmark for auction data processing +- **22 Queries**: Complete suite of streaming analytics queries (q0-q22) +- **Realistic Data**: Auction, bidder, and seller event generation +- **Multiple Modes**: Streaming and batch processing modes + +#### **TPC-H Benchmark** +- **OLAP Standard**: Traditional analytical processing benchmark +- **22 Queries**: Complex analytical queries for business intelligence +- **Batch Processing**: Focus on analytical query performance + +#### **TikTok Benchmark** +- **Custom Workload**: Social media analytics patterns +- **Streaming Focus**: Real-time social media data processing + +## Key Development Commands + +### Running Individual Benchmarks + +```bash +# Basic Feldera benchmark +./run-nexmark.sh --runner=feldera --events=100M + +# Compare Feldera vs Flink +./run-nexmark.sh --runner=flink --events=100M + +# SQL implementation on Feldera +./run-nexmark.sh --runner=feldera --language=sql + +# Batch processing mode +./run-nexmark.sh --batch --events=100M + +# Specific query testing 
+./run-nexmark.sh --query=q3 --runner=feldera + +# Core count specification +./run-nexmark.sh --cores=8 --runner=feldera +``` + +### Running Benchmark Suites + +```bash +# Full benchmark suite using Makefile +make -f suite.mk + +# Limited runners and modes +make -f suite.mk runners='feldera flink' modes=batch events=1M + +# Specific configuration +make -f suite.mk runners=feldera events=100M cores=16 +``` + +### Analysis and Results + +```bash +# Generate analysis (requires PSPP/SPSS) +pspp analysis.sps + +# View results in CSV format +cat nexmark.csv +``` + +## Project Structure + +### Core Scripts +- `run-nexmark.sh` - Main benchmark execution script +- `suite.mk` - Makefile for running comprehensive benchmark suites +- `analysis.sps` - SPSS/PSPP script for statistical analysis + +### Implementation Directories + +#### `feldera-sql/` +- **SQL Benchmarks**: Pure SQL implementations of benchmark queries +- **Pipeline Management**: Integration with Feldera's pipeline manager +- **Query Definitions**: Standard benchmark queries in SQL format +- **Table Schemas**: Database schema definitions for benchmarks + +#### `flink/` & `flink-kafka/` +- **Flink Integration**: Standalone and Kafka-integrated Flink setups +- **Docker Containers**: Containerized Flink environments +- **Configuration**: Flink-specific performance tuning configurations +- **NEXMark Implementation**: Java-based NEXMark implementation + +#### `beam/` +- **Apache Beam**: Multi-runner Beam implementations +- **Language Support**: Java, SQL (Calcite), and ZetaSQL implementations +- **Cloud Integration**: Google Cloud Dataflow configuration +- **Setup Scripts**: Environment preparation and dependency management + +## Important Implementation Details + +### Performance Optimization + +#### **Feldera Optimizations** +- **Storage Configuration**: Uses `/tmp` by default, configure `TMPDIR` for real filesystem +- **Multi-threading**: Automatic core detection with 16-core maximum default +- **Memory 
Management**: Efficient incremental computation with minimal memory overhead + +#### **System-Specific Tuning** +- **Flink**: RocksDB and HashMap state backends available +- **Beam**: Multiple language implementations (Java, SQL, ZetaSQL) +- **Cloud**: Optimized configurations for cloud deployments + +### Benchmark Modes + +#### **Streaming Mode (Default)** +- **Real-time Processing**: Continuous data processing simulation +- **Incremental Results**: Measure throughput and latency +- **Event Generation**: Configurable event rates and patterns + +#### **Batch Mode** +- **Analytical Processing**: Traditional batch analytics +- **Complete Data**: Process entire datasets at once +- **Throughput Focus**: Optimized for maximum data processing rates + +### Data Generation + +- **Configurable Scale**: From 100K to 100M+ events +- **Realistic Patterns**: Auction data with realistic distributions +- **Reproducible**: Deterministic data generation for consistent comparisons + +## Development Workflow + +### For New Benchmarks + +1. Add query definitions to appropriate `benchmarks/*/queries/` directory +2. Update table schemas in `table.sql` files +3. Implement runner-specific logic in system directories +4. Add query to `run-nexmark.sh` query list +5. Test across multiple systems for consistency + +### For System Integration + +1. Create system-specific directory (e.g., `newsystem/`) +2. Implement setup and configuration scripts +3. Add runner option to `run-nexmark.sh` +4. Update `suite.mk` runner list +5. 
Document setup requirements + +### Testing Strategy + +#### **Correctness Validation** +- **Cross-System Consistency**: Results should match across systems +- **Query Verification**: Validate SQL semantics and outputs +- **Edge Case Testing**: Test with various data sizes and patterns + +#### **Performance Analysis** +- **Throughput Measurement**: Events processed per second +- **Latency Analysis**: End-to-end processing delays +- **Resource Usage**: CPU, memory, and I/O utilization +- **Scalability Testing**: Performance across different core counts + +### Configuration Management + +#### **Environment Variables** +- `TMPDIR` - Storage location for temporary files +- `FELDERA_API_URL` - Pipeline manager endpoint (default: localhost:8080) +- Cloud credentials for Dataflow benchmarks + +#### **System Requirements** +- **Java 11+** - Required for Beam and Flink +- **Docker** - For containerized system testing +- **Python 3** - For analysis scripts +- **Cloud SDK** - For Google Cloud Dataflow testing + +### Results Analysis + +#### **Statistical Analysis** +- **PSPP Integration**: Statistical analysis with `analysis.sps` +- **Performance Tables**: Formatted comparison tables +- **Trend Analysis**: Performance trends across system configurations + +#### **Output Formats** +- **CSV Results**: Machine-readable performance data +- **Formatted Tables**: Human-readable comparison tables +- **Statistical Reports**: Detailed statistical analysis + +## Best Practices + +### Benchmark Execution +- **Warm-up Runs**: Allow systems to reach steady state +- **Multiple Iterations**: Run benchmarks multiple times for statistical significance +- **Resource Isolation**: Ensure consistent resource availability +- **Environment Control**: Use consistent hardware and software configurations + +### Performance Comparison +- **Fair Comparison**: Use equivalent configurations across systems +- **Resource Limits**: Apply consistent memory and CPU limits +- **Data Consistency**: Use identical 
datasets across systems +- **Metric Standardization**: Use consistent performance metrics + +### System Setup +- **Documentation**: Follow setup instructions for each system +- **Version Control**: Pin specific versions for reproducible results +- **Configuration**: Use optimized configurations for each system +- **Monitoring**: Monitor resource usage during benchmarks + +This benchmarking infrastructure provides comprehensive tools for validating Feldera's performance advantages and identifying optimization opportunities across different workloads and system configurations. \ No newline at end of file diff --git a/crates/CLAUDE.md b/crates/CLAUDE.md new file mode 100644 index 00000000000..5228a5108d9 --- /dev/null +++ b/crates/CLAUDE.md @@ -0,0 +1,118 @@ +## Overview + +The `crates/` directory contains all Rust crates that make up the Feldera platform. This is a workspace-based project where each crate serves a specific purpose in the overall architecture, from core computational engines to I/O adapters and development tools. + +## Crate Descriptions + +### Core Engine Crates + +**`dbsp/`** - The Database Stream Processor is the heart of Feldera's computational engine. It provides incremental computation capabilities where changes propagate in time proportional to the size of the change rather than the dataset. Contains operators for filtering, mapping, joining, aggregating, and advanced streaming operations with support for multi-threaded execution and persistent storage. + +**`sqllib/`** - Runtime support library providing SQL function implementations for the compiled circuits. Includes aggregate functions (SUM, COUNT, AVG), scalar functions (string manipulation, date/time operations), type conversions, and null handling following SQL standard semantics. Essential for SQL compatibility in the DBSP computational model. + +**`feldera-types/`** - Shared type definitions and configuration structures used across the entire platform. 
Provides pipeline configuration types, transport configurations, data format schemas, error types, and validation frameworks. Serves as the foundational type system ensuring consistency across all Feldera components. + +### Service and Management Crates + +**`pipeline-manager/`** - HTTP API service for managing the lifecycle of data pipelines. Provides RESTful endpoints for creating, configuring, starting, stopping, and monitoring pipelines. Integrates with PostgreSQL for persistence, handles authentication, and orchestrates the compilation and deployment of SQL programs to DBSP circuits. + +**`rest-api/`** - Type definitions and OpenAPI specification generation for the Feldera REST API. Automatically generates machine-readable API specifications from Rust types, ensuring consistency between server implementation and client SDKs. Includes comprehensive request/response schemas and validation rules. + +### I/O and Integration Crates + +**`adapters/`** - Comprehensive I/O framework providing input and output adapters for DBSP circuits. Supports multiple transport protocols (Kafka, HTTP, File, Redis, S3) and data formats (JSON, CSV, Avro, Parquet). Includes integrated connectors for databases like PostgreSQL and data lake formats like Delta Lake with fault-tolerant processing and automatic retry logic. + +**`adapterlib/`** - Foundational abstractions and utilities for building I/O adapters. Provides generic transport traits, format processing abstractions, circuit catalog interfaces for runtime introspection, and comprehensive error handling frameworks. Enables consistent adapter implementation across different data sources and sinks. + +**`iceberg/`** - Apache Iceberg table format integration enabling Feldera to work with modern data lake architectures. Supports schema evolution, time travel queries, and efficient data partitioning. Includes S3 and cloud storage integration with optimized reading patterns for large analytic datasets. 
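
The fault-tolerant processing and automatic retry logic mentioned for the `adapters` crate above follow a standard retry-with-exponential-backoff pattern. The sketch below is a hypothetical illustration of that pattern; the function name and signature are invented for this example and are not the `adapters` crate API.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry a fallible operation up to `max_attempts` times, doubling the
/// delay between attempts (base * 2^attempt). Illustrative only.
fn with_retries<T, E, F: FnMut() -> Result<T, E>>(
    mut op: F,
    max_attempts: u32,
    base_delay: Duration,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                sleep(base_delay * 2u32.pow(attempt));
                attempt += 1;
            }
        }
    }
}

fn main() {
    // Simulate a transient failure that succeeds on the third attempt.
    let mut failures_left = 2;
    let result = with_retries(
        || {
            if failures_left > 0 {
                failures_left -= 1;
                Err("transient failure")
            } else {
                Ok(42)
            }
        },
        5,
        Duration::from_millis(1),
    );
    assert_eq!(result, Ok(42));
}
```

Production adapters layer concerns like jitter, retry budgets, and dead-letter handling on top of this basic loop.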
+ +### Storage and Persistence Crates + +**`storage/`** - Pluggable storage abstraction layer supporting multiple backends including memory-based storage for testing and POSIX I/O for production deployments. Provides async file operations, block-level caching, buffer management, and error recovery mechanisms optimized for DBSP's access patterns. + +### Mathematical and Type System Crates + +**`fxp/`** - High-precision fixed-point arithmetic library for financial and decimal computations. Provides exact decimal arithmetic without floating-point errors, SQL DECIMAL type compatibility, and DBSP integration with zero-copy serialization. Supports both compile-time fixed scales and dynamic precision for flexible numeric processing. + +### Benchmarking and Testing Crates + +**`nexmark/`** - Industry-standard streaming benchmark suite implementing the NEXMark benchmark queries. Generates realistic auction data with configurable rates and distributions, implements all 22 standard benchmark queries, and provides performance measurement tools for validating DBSP's streaming capabilities. + +**`datagen/`** - Test data generation utilities providing realistic datasets for testing and benchmarking. Supports configurable data distributions, correlated data generation, high-throughput batch generation, and deterministic seeded generation for reproducible testing scenarios. + +### Development and Tooling Crates + +**`fda/`** - Feldera Development Assistant providing CLI tools and interactive development environment. Includes command-line utilities for common development tasks, interactive shell for exploratory development, benchmarking tools, and API specification validation. Serves as the primary development companion tool. + +**`ir/`** - Multi-level intermediate representation system for SQL compilation pipeline. Provides High-level IR (HIR) close to SQL structure, Mid-level IR (MIR) for optimization, and Low-level IR (LIR) for code generation. 
Includes comprehensive analysis frameworks and transformation passes for SQL program compilation. + +## Development Workflow + +### Building All Crates + +```bash +# Build all crates +cargo build --workspace + +# Test all crates +cargo test --workspace + +# Test documentation (limit threads to prevent OOM) +cargo test --doc -- --test-threads 12 + +# Check all crates +cargo check --workspace + +# Build specific crate +cargo build -p <crate-name> +``` + +### Workspace Management + +```bash +# List all workspace members +cargo metadata --format-version=1 | jq '.workspace_members' + +# Run command on all workspace members +cargo workspaces exec cargo check + +# Update dependencies across workspace +cargo update +``` + +### Cross-Crate Dependencies + +The crates form a dependency graph where: +- **Core crates** (`dbsp`, `feldera-types`) are dependencies for most other crates +- **Service crates** (`pipeline-manager`, `adapters`) depend on core and utility crates +- **Tool crates** (`fda`, `nexmark`) typically depend on multiple core crates +- **Library crates** (`sqllib`, `adapterlib`) provide functionality to higher-level crates + +### Testing Strategy + +- **Unit Tests**: Each crate contains comprehensive unit tests +- **Integration Tests**: Cross-crate integration testing +- **Workspace Tests**: Full workspace testing for compatibility +- **Benchmark Tests**: Performance validation across crates + +## Best Practices + +### Crate Organization +- Each crate has a single, well-defined responsibility +- Dependencies flow from higher-level to lower-level crates +- Shared types and utilities are extracted to common crates +- Feature flags control optional functionality and integrations + +### Development Guidelines +- Follow consistent coding patterns across crates +- Use workspace-level dependency management +- Maintain comprehensive documentation for each crate +- Write tests that work both in isolation and as part of the workspace + +### Performance Considerations +- Core 
computational crates (`dbsp`, `sqllib`) are highly optimized +- I/O crates (`adapters`, `storage`) focus on throughput and efficiency +- Tool crates prioritize developer experience over raw performance +- Benchmark crates provide performance validation and regression detection + +This workspace architecture enables modular development while maintaining consistency and performance across the entire Feldera platform. \ No newline at end of file diff --git a/crates/adapterlib/CLAUDE.md b/crates/adapterlib/CLAUDE.md new file mode 100644 index 00000000000..04b0ef72b75 --- /dev/null +++ b/crates/adapterlib/CLAUDE.md @@ -0,0 +1,261 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build the adapterlib crate +cargo build -p adapterlib + +# Run tests +cargo test -p adapterlib + +# Check documentation +cargo doc -p adapterlib --open +``` + +## Architecture Overview + +### Technology Stack + +- **Transport Abstraction**: Generic transport layer interface +- **Format Processing**: Data format abstraction and utilities +- **Catalog System**: Runtime circuit introspection +- **Error Handling**: Comprehensive error types for I/O operations + +### Core Purpose + +Adapter Library provides **foundational abstractions** for building I/O adapters: + +- **Transport Traits**: Generic interfaces for input/output transports +- **Format Abstractions**: Data serialization/deserialization framework +- **Catalog Interface**: Runtime circuit inspection and control +- **Utility Functions**: Common functionality for adapter implementations + +### Project Structure + +#### Core Modules + +- `src/transport.rs` - Transport layer abstractions and traits +- `src/format.rs` - Data format processing abstractions +- `src/catalog.rs` - Circuit catalog and introspection +- `src/errors/` - Error types for adapter operations +- `src/utils/` - Utility functions and helpers + +## Important Implementation Details + +### Transport Abstraction + +#### Core Transport Traits 
+```rust +pub trait InputTransport { + type Error; + + async fn start(&mut self) -> Result<(), Self::Error>; + async fn read_batch(&mut self) -> Result<Vec<InputEvent>, Self::Error>; + async fn stop(&mut self) -> Result<(), Self::Error>; +} + +pub trait OutputTransport { + type Error; + + async fn start(&mut self) -> Result<(), Self::Error>; + async fn write_batch(&mut self, data: &[OutputEvent]) -> Result<(), Self::Error>; + async fn stop(&mut self) -> Result<(), Self::Error>; +} +``` + +#### Transport Features +- **Async Interface**: Non-blocking I/O operations +- **Batch Processing**: Efficient bulk data transfer +- **Error Handling**: Transport-specific error types +- **Lifecycle Management**: Start/stop semantics + +### Format Processing + +#### Format Traits +```rust +pub trait Deserializer { + type Error; + type Output; + + fn deserialize(&mut self, data: &[u8]) -> Result<Self::Output, Self::Error>; +} + +pub trait Serializer { + type Error; + type Input; + + fn serialize(&mut self, data: &Self::Input) -> Result<Vec<u8>, Self::Error>; +} +``` + +#### Format Features +- **Type Safety**: Strongly typed serialization +- **Error Recovery**: Graceful handling of malformed data +- **Schema Evolution**: Handle schema changes +- **Performance**: Zero-copy where possible + +### Catalog System + +#### Circuit Introspection +```rust +pub trait Catalog { + fn input_handles(&self) -> Vec<InputHandle>; + fn output_handles(&self) -> Vec<OutputHandle>; + fn get_input_handle(&self, name: &str) -> Option<InputHandle>; + fn get_output_handle(&self, name: &str) -> Option<OutputHandle>; +} +``` + +#### Runtime Control +- **Handle Discovery**: Find inputs and outputs by name +- **Schema Inspection**: Query table and view schemas +- **Type Information**: Runtime type metadata +- **Statistics**: Performance and throughput metrics + +### Error Handling Framework + +#### Error Categories +```rust +pub enum AdapterError { + TransportError(TransportError), + FormatError(FormatError), + ConfigurationError(String), + RuntimeError(String), +} +``` + +#### Error Features +- **Structured 
Errors**: Rich error information with context +- **Error Chaining**: Preserve error causation chains +- **Recovery Hints**: Actionable error messages +- **Logging Integration**: Structured logging support + +### Utility Functions + +#### Common Patterns +- **Retry Logic**: Configurable retry mechanisms +- **Rate Limiting**: Throughput control utilities +- **Buffer Management**: Efficient buffer handling +- **Connection Pooling**: Resource management helpers + +#### Data Processing Utilities +- **Type Conversion**: Between different data representations +- **Validation**: Data integrity checking +- **Transformation**: Common data transformations +- **Batching**: Batch size optimization + +## Development Workflow + +### For Transport Implementation + +1. Implement `InputTransport` or `OutputTransport` trait +2. Define transport-specific error types +3. Add configuration types for transport parameters +4. Implement lifecycle methods (start/stop) +5. Add comprehensive error handling +6. Test with various failure scenarios + +### For Format Support + +1. Implement `Deserializer` or `Serializer` trait +2. Handle schema validation and evolution +3. Add performance optimizations +4. Test with malformed data +5. Document format-specific behavior +6. 
Add benchmarks for performance validation + +### Testing Strategy + +#### Unit Tests +- **Interface Compliance**: Trait implementation correctness +- **Error Handling**: Comprehensive error scenario testing +- **Edge Cases**: Boundary conditions and limits +- **Resource Management**: Proper cleanup and lifecycle + +#### Integration Tests +- **End-to-End**: Complete adapter pipeline testing +- **Performance**: Throughput and latency validation +- **Reliability**: Failure recovery and resilience +- **Compatibility**: Cross-format and cross-transport testing + +### Design Patterns + +#### Async Patterns +```rust +// Async transport implementation +impl InputTransport for MyTransport { + async fn read_batch(&mut self) -> Result<Vec<InputEvent>, Self::Error> { + // Non-blocking batch read + let data = self.connection.read_batch().await?; + Ok(self.format.deserialize_batch(&data)?) + } +} +``` + +#### Error Handling Patterns +```rust +// Comprehensive error handling +match transport.read_batch().await { + Ok(events) => process_events(events), + Err(TransportError::ConnectionLost) => reconnect_and_retry(), + Err(TransportError::Timeout) => continue_with_empty_batch(), + Err(e) => return Err(e.into()), +} +``` + +### Performance Considerations + +#### Batch Processing +- **Optimal Batch Size**: Balance latency vs throughput +- **Memory Usage**: Avoid excessive buffering +- **CPU Utilization**: Efficient serialization/deserialization +- **Network I/O**: Minimize network round trips + +#### Resource Management +- **Connection Pooling**: Reuse expensive resources +- **Buffer Reuse**: Minimize allocations +- **Async Efficiency**: Non-blocking operations +- **Cleanup**: Proper resource disposal + +### Configuration Files + +- `Cargo.toml` - Core dependencies for transport and format abstractions +- Minimal dependencies to avoid bloat +- Feature flags for optional functionality + +### Dependencies + +#### Core Dependencies +- `tokio` - Async runtime +- `serde` - Serialization framework +- 
`anyhow` - Error handling +- `tracing` - Structured logging + +#### Optional Dependencies +- `datafusion` - SQL query engine integration +- Various format-specific libraries + +### Best Practices + +#### API Design +- **Trait-Based**: Use traits for extensibility +- **Async First**: Design for async from the start +- **Error Rich**: Provide detailed error information +- **Performance Aware**: Consider performance implications + +#### Implementation Patterns +- **Resource Safety**: Proper RAII patterns +- **Error Propagation**: Use `?` operator consistently +- **Logging**: Comprehensive structured logging +- **Testing**: Test both success and failure paths + +#### Documentation +- **Trait Documentation**: Clear trait contract specification +- **Examples**: Comprehensive usage examples +- **Error Scenarios**: Document error conditions +- **Performance Notes**: Performance characteristics + +This crate provides the foundational abstractions that make it easy to implement new I/O adapters while maintaining consistency and performance across the Feldera platform. \ No newline at end of file diff --git a/crates/adapters/CLAUDE.md b/crates/adapters/CLAUDE.md new file mode 100644 index 00000000000..ef831b2e019 --- /dev/null +++ b/crates/adapters/CLAUDE.md @@ -0,0 +1,284 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build the adapters crate +cargo build -p adapters + +# Run tests +cargo test -p adapters + +# Run tests with required dependencies (Kafka, Redis, etc.) 
+cargo test -p adapters --features=with-kafka + +# Build with all features +cargo build -p adapters --all-features +``` + +### Running Examples + +```bash +# Run server demo with Kafka support +cargo run --example server --features="with-kafka server" + +# Test specific transport integrations +cargo test -p adapters test_kafka +cargo test -p adapters test_http +``` + +### Development Environment + +```bash +# Start required services for testing +docker run -p 9092:9092 --rm -itd docker.redpanda.com/vectorized/redpanda:v24.2.4 redpanda start --smp 2 + +# Install system dependencies +sudo apt install cmake +``` + +## Architecture Overview + +### Technology Stack + +- **I/O Framework**: Async I/O with tokio runtime +- **Serialization**: Multiple format support (JSON, CSV, Avro, Parquet) +- **Transport Layer**: Kafka, HTTP, File, Redis, S3 integrations +- **Database Integration**: PostgreSQL and Delta Lake connectors +- **Testing**: Comprehensive integration testing with external services + +### Core Concepts + +The Adapters crate provides a **unified I/O framework** for DBSP circuits: + +- **Input Adapters**: Ingest data from external sources into DBSP circuits +- **Output Adapters**: Stream circuit outputs to external consumers +- **Transport Layer**: Pluggable transport mechanisms +- **Format Layer**: Serialization/deserialization for different data formats +- **Controller**: Orchestrates circuit execution with I/O adapters + +### Project Structure + +#### Core Directories + +- `src/adhoc/` - Implementation for the ad-hoc queries that uses Apache DataFusion batch query engine +- `src/format/` - Data format handlers (JSON, CSV, Avro, Parquet) +- `src/transport/` - Transport implementations (Kafka, HTTP, File, etc.) 
+- `src/integrated/` - Integrated connectors (Postgres, Delta Lake) +- `src/controller/` - Circuit controller and lifecycle management +- `src/static_compile/` - Compile-time adapter generation +- `src/test/` - Testing utilities and mock implementations + +#### Key Components + +- **Transport Adapters**: Protocol-specific I/O implementations +- **Format Processors**: Data serialization/deserialization +- **Controller**: Circuit execution coordinator +- **Catalog**: Runtime circuit introspection and control + +## Important Implementation Details + +### Transport Layer + +#### Supported Transports + +**Kafka Integration** +```rust +// Fault-tolerant Kafka producer +use adapters::transport::kafka::ft::KafkaOutputTransport; + +// Non-fault-tolerant (higher performance) +use adapters::transport::kafka::nonft::KafkaOutputTransport; +``` + +**HTTP REST API** +```rust +use adapters::transport::http::{HttpInputTransport, HttpOutputTransport}; +``` + +**File I/O** +```rust +use adapters::transport::file::{FileInputTransport, FileOutputTransport}; +``` + +**Redis Streams** +```rust +use adapters::transport::redis::RedisOutputTransport; +``` + +### Format Layer + +#### Supported Formats + +- **CSV**: Delimited text with schema inference +- **JSON**: Nested data structures with type coercion +- **Avro**: Schema-based binary serialization +- **Parquet**: Columnar data format +- **Raw**: Unprocessed byte streams + +#### Format Pipeline + +1. **Deserialization**: Convert external format to internal representation +2. **Type Coercion**: Map external types to DBSP types +3. **Validation**: Ensure data integrity and constraints +4. 
**Batch Processing**: Optimize throughput with batching + +### Controller Architecture + +```rust +use adapters::controller::Controller; + +// Create controller with circuit +let controller = Controller::new(circuit, adapters)?; + +// Start data processing +controller.start().await?; + +// Monitor and control +controller.step().await?; +controller.pause().await?; +``` + +#### Controller Features + +- **Lifecycle Management**: Start, pause, stop, checkpoint +- **Error Handling**: Graceful degradation and recovery +- **Statistics**: Performance monitoring and metrics +- **Flow Control**: Backpressure and rate limiting + +### Integrated Connectors + +#### PostgreSQL CDC +```rust +use adapters::integrated::postgres::PostgresInputAdapter; + +// Real-time change data capture +let adapter = PostgresInputAdapter::new(config).await?; +``` + +#### Delta Lake +```rust +use adapters::integrated::delta_table::DeltaTableOutputAdapter; + +// Write to Delta Lake format +let adapter = DeltaTableOutputAdapter::new(config).await?; +``` + +## Development Workflow + +### For New Transport Implementation + +1. Create module in `src/transport/` +2. Implement `InputTransport` and/or `OutputTransport` traits +3. Add configuration types in `feldera-types` +4. Implement error handling and reconnection logic +5. Add comprehensive tests with real service +6. Update feature flags in `Cargo.toml` + +### For New Format Support + +1. Add format module in `src/format/` +2. Implement `Deserializer` and/or `Serializer` traits +3. Add schema inference and validation +4. Handle type coercion edge cases +5. Add performance benchmarks +6. 
Test with various data patterns + +### Testing Strategy + +#### Unit Tests +- Mock implementations for isolated testing +- Format conversion correctness +- Error handling scenarios + +#### Integration Tests +- Real external service dependencies +- End-to-end data flow validation +- Performance and reliability testing + +#### Test Dependencies + +The test suite requires external services: + +- **Kafka/Redpanda**: Message queue testing +- **PostgreSQL**: Database connector testing +- **Redis**: Stream testing +- **S3-compatible**: Object storage testing + +### Performance Optimization + +#### Throughput Optimization +- **Batch Processing**: Amortize per-record overhead +- **Connection Pooling**: Reuse network connections +- **Parallel Processing**: Multi-threaded I/O operations +- **Zero-Copy**: Minimize memory allocations + +#### Memory Management +- **Streaming Processing**: Bounded memory usage +- **Backpressure**: Flow control mechanisms +- **Buffer Management**: Optimal buffer sizing +- **Garbage Collection**: Efficient cleanup + +### Configuration Files + +- `Cargo.toml` - Feature flags and dependencies +- `lsan.supp` - Leak sanitizer suppressions +- Test data files in various formats + +### Key Features + +- **Fault Tolerance**: Automatic reconnection and retry logic +- **Schema Evolution**: Handle schema changes gracefully +- **Monitoring**: Built-in metrics and observability +- **Multi-format**: Support for diverse data formats + +### Dependencies + +#### Core Dependencies +- `tokio` - Async runtime +- `serde` - Serialization framework +- `anyhow` - Error handling +- `tracing` - Structured logging + +#### Transport Dependencies +- `rdkafka` - Kafka client +- `reqwest` - HTTP client +- `redis` - Redis client +- `rusoto_s3` - AWS S3 client + +#### Format Dependencies +- `csv` - CSV processing +- `serde_json` - JSON processing +- `apache-avro` - Avro format support +- `parquet` - Columnar data format + +### Error Handling Patterns + +- **Transient Errors**: 
Retry with exponential backoff +- **Permanent Errors**: Fail fast with detailed diagnostics +- **Schema Errors**: Graceful degradation with data preservation +- **Network Errors**: Connection management and recovery + +### Security Considerations + +- **Authentication**: Support for various auth mechanisms +- **Encryption**: TLS/SSL for network transports +- **Access Control**: Fine-grained permissions +- **Data Privacy**: PII handling and sanitization + +### Monitoring and Observability + +- **Metrics**: Throughput, latency, error rates +- **Tracing**: Distributed request tracing +- **Logging**: Structured log output +- **Health Checks**: Service health monitoring + +### Best Practices + +- **Async Design**: Non-blocking I/O operations +- **Resource Management**: Proper cleanup and lifecycle +- **Error Propagation**: Structured error handling +- **Testing**: Real service integration tests +- **Documentation**: Comprehensive examples and guides \ No newline at end of file diff --git a/crates/adapters/src/adhoc/CLAUDE.md b/crates/adapters/src/adhoc/CLAUDE.md new file mode 100644 index 00000000000..bf1d78eaed8 --- /dev/null +++ b/crates/adapters/src/adhoc/CLAUDE.md @@ -0,0 +1,388 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build the adhoc module as part of adapters crate +cargo build -p adapters + +# Run tests for the adhoc module +cargo test -p adapters adhoc + +# Run tests with DataFusion features enabled +cargo test -p adapters --features="with-datafusion" + +# Check documentation +cargo doc -p adapters --open +``` + +### Running Ad-hoc Queries + +```bash +# Via CLI +fda exec pipeline-name "SELECT * FROM materialized_view" + +# Interactive shell +fda shell pipeline-name + +# Via HTTP API +curl "http://localhost:8080/v0/pipelines/my-pipeline/query?sql=SELECT%20*%20FROM%20users&format=json" +``` + +## Architecture Overview + +### Technology Stack + +- **Query Engine**: Apache DataFusion for SQL query execution +- **Output 
Formats**: Text tables, JSON, Parquet, Arrow IPC +- **Transport**: WebSocket and HTTP streaming +- **Runtime**: Multi-threaded execution using DBSP's Tokio runtime +- **Storage**: Direct access to DBSP's materialized state + +### Core Purpose + +The Ad-hoc Query module provides **batch SQL query capabilities** alongside Feldera's core incremental processing: + +- **Interactive Queries**: Real-time SQL queries against current pipeline state +- **Multiple Formats**: Support for various output formats optimized for different use cases +- **Streaming Results**: Memory-efficient streaming of large result sets +- **DataFusion Integration**: Full SQL compatibility through Apache DataFusion engine + +This module implements the backend functionality for ad-hoc queries. + +### Project Structure + +#### Core Files + +- `mod.rs` - Main module with WebSocket handling and session context creation +- `executor.rs` - Query execution and result streaming implementations +- `format.rs` - Text table formatting utilities +- `table.rs` - DataFusion table provider and INSERT operation support + +## Implementation Details + +### DataFusion Session Configuration + +#### Session Context Setup +```rust +// From mod.rs - actual DataFusion configuration +pub(crate) fn create_session_context(config: &PipelineConfig) -> Result<SessionContext, ControllerError> { + const SORT_IN_PLACE_THRESHOLD_BYTES: usize = 64 * 1024 * 1024; + const SORT_SPILL_RESERVATION_BYTES: usize = 64 * 1024 * 1024; + + let session_config = SessionConfig::new() + .with_target_partitions(config.global.workers as usize) + .with_sort_in_place_threshold_bytes(SORT_IN_PLACE_THRESHOLD_BYTES) + .with_sort_spill_reservation_bytes(SORT_SPILL_RESERVATION_BYTES); + + // Memory pool configuration for large queries + let mut runtime_env_builder = RuntimeEnvBuilder::new(); + if let Some(memory_mb_max) = config.global.resources.memory_mb_max { + let memory_bytes_max = memory_mb_max * 1024 * 1024; + runtime_env_builder = runtime_env_builder + 
.with_memory_pool(Arc::new(FairSpillPool::new(memory_bytes_max as usize))); + } + + // Spill-to-disk directory for temp files during large operations + if let Some(storage) = &config.storage_config { + let path = PathBuf::from(storage.path.clone()).join("adhoc-tmp"); + runtime_env_builder = runtime_env_builder.with_temp_file_path(path); + } +} +``` + +### Query Execution Pipeline + +#### Multi-Threaded Execution Strategy +```rust +// From executor.rs - execution in DBSP runtime for parallelization +fn execute_stream(df: DataFrame) -> Receiver<Result<SendableRecordBatchStream, DataFusionError>> { + let (tx, rx) = oneshot::channel(); + // Execute in DBSP's multi-threaded runtime, not actix-web's single-threaded runtime + dbsp::circuit::tokio::TOKIO.spawn(async move { + let _r = tx.send(df.execute_stream().await); + }); + rx +} +``` + +#### Result Streaming Implementation +```rust +// Four different streaming formats implemented: + +// 1. Text format - human-readable tables +pub(crate) fn stream_text_query(df: DataFrame) + -> impl Stream<Item = Result<Bytes, DataFusionError>> + +// 2. JSON format - line-delimited JSON records (deprecated) +pub(crate) fn stream_json_query(df: DataFrame) + -> impl Stream<Item = Result<Bytes, DataFusionError>> + +// 3. Arrow IPC format - high-performance binary streaming +pub(crate) fn stream_arrow_query(df: DataFrame) + -> impl Stream<Item = Result<Bytes, DataFusionError>> + +// 4. 
Parquet format - columnar data with compression +pub(crate) fn stream_parquet_query(df: DataFrame) + -> impl Stream<Item = Result<Bytes, DataFusionError>> +``` + +### WebSocket Handler Implementation + +#### Real-time Query Processing +```rust +// From mod.rs - WebSocket message handling +pub async fn adhoc_websocket( + df_session: SessionContext, + req: HttpRequest, + stream: Payload, +) -> Result<HttpResponse, PipelineError> { + let (res, mut ws_session, stream) = actix_ws::handle(&req, stream)?; + let mut stream = stream + .max_frame_size(MAX_WS_FRAME_SIZE) // 2MB frame limit + .aggregate_continuations() + .max_continuation_size(4 * MAX_WS_FRAME_SIZE); + + // Process WebSocket messages + while let Some(msg) = stream.next().await { + match msg { + Ok(AggregatedMessage::Text(text)) => { + let args = serde_json::from_str::<AdhocQueryArgs>(&text)?; + let df = df_session + .sql_with_options(&args.sql, SQLOptions::new().with_allow_ddl(false)) + .await?; + adhoc_query_handler(df, ws_session.clone(), args).await?; + } + // Handle other message types... + } + } +} +``` + +### Table Provider Implementation + +#### DBSP State Integration +```rust +// From table.rs - AdHocTable provides DataFusion access to DBSP materialized state +pub struct AdHocTable { + controller: Weak<ControllerInner>, // Weak ref to avoid cycles + input_handle: Option<Box<dyn DeCollectionHandle>>, // For INSERT operations + name: SqlIdentifier, + materialized: bool, // Only materialized tables are queryable + schema: Arc<Schema>, + snapshots: ConsistentSnapshots, // Current state snapshots +} + +#[async_trait] +impl TableProvider for AdHocTable { + async fn scan(&self, projection: Option<&Vec<usize>>, filters: &[Expr], limit: Option<usize>) + -> Result<Arc<dyn ExecutionPlan>> { + + // Validate materialization requirement + if !self.materialized { + return Err(DataFusionError::Execution( + format!("Make sure `{}` is configured as materialized: \ + use `with ('materialized' = 'true')` for tables, or `create materialized view` for views", + self.name) + )); + } + + // Create execution plan + Ok(Arc::new(AdHocQueryExecution::new(/* ... 
*/))) + } +} +``` + +#### INSERT Operation Support +```rust +// From table.rs - INSERT INTO materialized tables +async fn insert_into(&self, input: Arc<dyn ExecutionPlan>, overwrite: InsertOp) + -> Result<Arc<dyn ExecutionPlan>> { + + match &self.input_handle { + Some(ih) => { + // Create temporary adhoc input endpoint + let endpoint_name = format!("adhoc-ingress-{}-{}", self.name.name(), Uuid::new_v4()); + let sink = Arc::new(AdHocTableSink::new(/* ... */)); + Ok(Arc::new(DataSinkExec::new(input, sink, None))) + } + None => exec_err!("Called insert_into on a view") + } +} +``` + +### Query Execution Details + +#### Record Batch Processing +```rust +// From table.rs - efficient batch processing from DBSP cursors +let mut cursor = batch_reader.cursor(RecordFormat::Parquet( + output_adhoc_arrow_serde_config().clone() +))?; + +const MAX_BATCH_SIZE: usize = 256; // Optimized for latency vs throughput +let mut cur_batch_size = 0; + +while cursor.key_valid() { + if cursor.weight() > 0 { // Skip deleted records + cursor.serialize_key_to_arrow(&mut insert_builder.builder)?; + cur_batch_size += 1; + + if cur_batch_size >= MAX_BATCH_SIZE { + let batch = insert_builder.builder.to_record_batch()?; + send_batch(&tx, &projection, batch).await?; + cur_batch_size = 0; + } + } + cursor.step_key(); +} +``` + +### Output Format Implementations + +#### Text Table Formatting +```rust +// From format.rs - human-readable table output +pub(crate) fn create_table(results: &[RecordBatch]) -> Result<Table> { + let mut table = Table::new(); + table.load_preset("||--+-++| ++++++"); // comfy-table preset + + // Add headers from schema + for field in schema.fields() { + header.push(Cell::new(field.name())); + } + + // Process each row with cell length limiting + const CELL_MAX_LENGTH: usize = 64; + for formatter in &formatters { + let mut content = formatter.value(row).to_string(); + if content.len() > CELL_MAX_LENGTH { + content.truncate(CELL_MAX_LENGTH); + content.push_str("..."); + } + } +} +``` + +#### Parquet Streaming +```rust +// From 
executor.rs - efficient Parquet streaming with compression +const PARQUET_CHUNK_SIZE: usize = MAX_WS_FRAME_SIZE / 2; // 1MB chunks + +let mut writer = AsyncArrowWriter::try_new( + ChannelWriter::new(tx), + schema, + Some(WriterProperties::builder() + .set_compression(Compression::SNAPPY) + .build()), +)?; + +// Stream with controlled chunk sizes +while let Some(batch) = stream.next().await.transpose()? { + writer.write(&batch).await?; + if writer.in_progress_size() > PARQUET_CHUNK_SIZE { + writer.flush().await?; // Send chunk to client + } +} +``` + +## Key Features and Limitations + +### Supported Operations +- **SELECT queries** from materialized tables and views +- **Complex queries** with joins, aggregations, window functions +- **INSERT statements** into materialized tables +- **Multiple output formats** for different use cases +- **Streaming results** for memory efficiency + +### Critical Limitations +- **Materialization Required**: Only `WITH ('materialized' = 'true')` tables/views are queryable +- **No DDL Operations**: CREATE, ALTER, DROP are explicitly disabled +- **SQL Dialect Differences**: DataFusion SQL vs Feldera SQL (Calcite-based) differences +- **Resource Sharing**: Shares CPU/memory with main pipeline processing +- **WebSocket Frame Limits**: 2MB maximum frame size + +### Configuration Requirements + +#### Pipeline Configuration +```sql +-- Tables must be explicitly materialized +CREATE TABLE users (id INT, name STRING) WITH ('materialized' = 'true'); + +-- Or use materialized views +CREATE MATERIALIZED VIEW user_stats AS + SELECT COUNT(*) as count, AVG(age) as avg_age FROM users; +``` + +#### Memory and Threading +- Uses pipeline's configured worker thread count +- Memory limits configurable via pipeline resources +- Spill-to-disk support for large operations +- Automatic parallelization across available cores + +## Development Workflow + +### Adding New Output Formats + +1. Add format enum variant to `AdHocResultFormat` in feldera-types +2. 
Implement streaming function in `executor.rs` +3. Add format handling in `adhoc_query_handler()` in `mod.rs` +4. Add HTTP response handling in `stream_adhoc_result()` +5. Update client SDKs and documentation + +### Performance Optimization + +#### Memory Management +- Configure appropriate memory pools for large queries +- Use spill-to-disk for operations exceeding memory limits +- Optimize batch sizes for latency vs throughput trade-offs +- Monitor WebSocket frame sizes to prevent timeouts + +#### Query Performance +- Leverage DataFusion's query optimizer +- Use projection pushdown when possible +- Consider materialized view design for common queries +- Monitor resource usage impact on main pipeline + +### Error Handling Patterns + +#### Materialization Validation +```rust +if !self.materialized { + return Err(DataFusionError::Execution( + format!("Tried to SELECT from a non-materialized source. Make sure `{}` is configured as materialized", + self.name) + )); +} +``` + +#### Resource Management Errors +- Memory limit exceeded → automatic spill to disk +- WebSocket connection lost → graceful cleanup +- Query timeout → configurable limits with clear messages +- Invalid SQL → DataFusion parser error propagation + +## Dependencies and Integration + +### Core Dependencies +- **datafusion** - SQL query engine and execution +- **arrow** - Columnar data processing and formats +- **actix-web** - WebSocket and HTTP handling +- **tokio** - Async runtime and streaming +- **parquet** - Columnar file format support + +### Integration Points +- **Controller**: Access to circuit state and snapshots +- **Transport**: AdHoc input endpoints for INSERT operations +- **Storage**: Direct access to materialized table storage +- **Types**: Shared configuration and error types + +### Performance Characteristics +- **Throughput**: Optimized for interactive query latency +- **Memory Usage**: Bounded by configuration with spill support +- **Parallelization**: Automatic multi-threading via 
DataFusion +- **Format Efficiency**: Binary formats (Arrow, Parquet) for high throughput + +This module bridges the gap between Feldera's incremental streaming processing and traditional batch SQL analytics, providing essential inspection and analysis capabilities for developers and operators. \ No newline at end of file diff --git a/crates/adapters/src/controller/CLAUDE.md b/crates/adapters/src/controller/CLAUDE.md new file mode 100644 index 00000000000..226420e0ac7 --- /dev/null +++ b/crates/adapters/src/controller/CLAUDE.md @@ -0,0 +1,243 @@ +# Controller Architecture + +The controller module implements the centralized orchestration layer for Feldera's I/O adapter system. It coordinates circuit execution, manages data flow, implements fault tolerance, and provides runtime statistics. + +## Core Purpose + +The controller serves as the **central control plane** that: +- Coordinates DBSP circuit execution with I/O endpoints +- Implements runtime flow control and backpressure management +- Manages pipeline state transitions and lifecycle +- Provides fault tolerance through journaling and checkpointing +- Collects and reports performance statistics +- Orchestrates streaming data ingestion and output processing + +## Architecture Overview + +### Multi-Threaded Design + +The controller employs a **three-thread architecture** for optimal performance: + +1. **Circuit Thread** (`CircuitThread`) - The main orchestrator that: + - Owns the DBSP circuit handle and executes `circuit.step()` + - Coordinates with input/output endpoints + - Manages circuit state transitions (Running/Paused/Terminated) + - Handles commands (checkpointing, profiling, suspension) + - Implements fault-tolerant journaling and replay logic + +2. 
**Backpressure Thread** (`BackpressureThread`) - Flow control manager that: + - Monitors input buffer levels across all endpoints + - Pauses/resumes input endpoints based on buffer thresholds + - Prevents memory exhaustion during high-throughput scenarios + - Implements user-requested connector state changes + +3. **Statistics Thread** (`StatisticsThread`) - Performance monitoring that: + - Collects metrics every second (memory, storage, processed records) + - Maintains a 60-second rolling window of time series data + - Notifies subscribers of new metrics via broadcast channels + - Provides data for real-time monitoring and alerting + +### Key Components + +#### Controller (`Controller`) +- **Public API** for pipeline management and state control +- **Thread-safe wrapper** around `ControllerInner` with proper lifecycle management +- **Command interface** for asynchronous operations (checkpointing, profiling) + +#### ControllerInner (`ControllerInner`) +- **Shared state** accessible by all threads +- **Input/Output endpoint management** with dynamic reconfiguration +- **Circuit catalog integration** for schema-aware processing +- **Configuration management** and validation + +#### ControllerStatus (`ControllerStatus`) +- **Lock-free performance counters** using atomics for minimal overhead +- **Statistical aggregation** across all endpoints and circuit operations +- **Time series broadcasting** for real-time monitoring clients +- **Flow control metrics** for backpressure management + +## File Organization + +### Core Files + +- **`mod.rs`** - Main controller implementation with multi-threaded orchestration +- **`stats.rs`** - Performance monitoring, metrics collection, and statistics reporting +- **`error.rs`** - Error handling abstractions (re-exports from adapterlib) + +### Specialized Modules + +- **`checkpoint.rs`** - Fault tolerance through persistent checkpointing + - Circuit state serialization and recovery + - Input endpoint offset tracking + - Configuration 
preservation across restarts + +- **`journal.rs`** - Transaction log for exactly-once processing + - Step-by-step metadata recording + - Replay mechanism for fault recovery + - Storage-agnostic persistence layer + +- **`sync.rs`** - Enterprise checkpoint synchronization (S3/cloud storage) + - Distributed checkpoint management + - Cross-instance state coordination + - Activation markers for standby pipelines + +- **`validate.rs`** - Configuration validation and dependency analysis + - Connector dependency cycle detection + - Pipeline configuration consistency checks + - Input/output endpoint validation + +- **`test.rs`** - Comprehensive testing utilities and integration tests + +## Key Design Patterns + +### 1. Lock-Free Statistics +```rust +// Atomic counters prevent blocking the datapath +pub total_input_records: AtomicU64, +pub total_processed_records: AtomicU64, +// Broadcast channels for real-time streaming +pub time_series_notifier: broadcast::Sender<TimeSeries>, +``` + +**Benefits:** +- Zero-overhead metrics collection +- Non-blocking circuit execution +- Real-time monitoring without performance impact + +### 2. Thread Coordination +```rust +struct CircuitThread { + controller: Arc<ControllerInner>, // Shared state + circuit: DBSPHandle, // Circuit ownership + backpressure_thread: BackpressureThread, // Flow control + _statistics_thread: StatisticsThread, // Metrics +} +``` + +**Benefits:** +- Clear separation of concerns +- Independent thread lifecycles +- Coordinated shutdown without deadlocks + +### 3. Command Pattern for Async Operations +```rust +enum Command { + GraphProfile(GraphProfileCallbackFn), + Checkpoint(CheckpointCallbackFn), + Suspend(SuspendCallbackFn), +} +``` + +**Benefits:** +- Non-blocking public API +- Type-safe callback handling +- Graceful error propagation + +### 4. 
State Machine Pipeline Management +```rust +pub enum PipelineState { + Paused, // Endpoints paused, draining existing data + Running, // Active ingestion and processing + Terminated, // Permanent shutdown +} +``` + +**Benefits:** +- Predictable state transitions +- Clean resource management +- User-controlled pipeline lifecycle + +## Fault Tolerance Architecture + +### Journaling System +- **Step-by-step logging** of all input processing +- **Metadata preservation** for exact replay scenarios +- **Storage-agnostic** design supporting file systems and cloud storage + +### Checkpointing Integration +- **Circuit state snapshots** at configurable intervals +- **Input offset tracking** for consistent recovery points +- **Configuration preservation** across pipeline restarts + +### Error Handling Strategy +- **Transient vs Permanent errors** with different recovery strategies +- **Graceful degradation** rather than complete failure +- **Detailed error context** for debugging and monitoring + +## Performance Characteristics + +### Throughput Optimization +- **Lock-free statistics** prevent contention on hot paths +- **Batched circuit steps** amortize processing overhead +- **Parallel I/O processing** with independent endpoint threads + +### Memory Management +- **Bounded buffers** prevent unlimited memory growth +- **Backpressure coordination** maintains system stability +- **Efficient broadcast channels** for multiple monitoring clients + +### Latency Minimization +- **Direct circuit coupling** avoids unnecessary queuing +- **Atomic metric updates** enable real-time monitoring +- **Thread affinity** for CPU cache optimization + +## Integration Points + +### DBSP Circuit Integration +- Direct `DBSPHandle` ownership and step execution +- Schema-aware input/output catalog management +- Performance profile collection and analysis + +### Transport Layer Integration +- Dynamic endpoint creation and lifecycle management +- Pluggable transport implementations (Kafka, HTTP, 
Files) +- Format-agnostic data processing pipeline + +### Storage System Integration +- Backend-agnostic checkpoint and journal persistence +- Cloud storage synchronization for distributed deployments +- Efficient binary serialization for compact storage + +## Development Patterns + +### Adding New Metrics +1. Define atomic counters in `ControllerStatus` +2. Update collection logic in `StatisticsThread` +3. Include in time series broadcasting +4. Add appropriate accessor methods + +### Extending State Management +1. Modify `PipelineState` enum if needed +2. Update state transition logic in circuit thread +3. Add appropriate synchronization primitives +4. Update error handling and recovery paths + +### Implementing New Commands +1. Add command variant to `Command` enum +2. Implement callback-based processing +3. Add public API method with proper error handling +4. Ensure graceful shutdown behavior + +## Best Practices + +### Thread Safety +- Use `Arc<AtomicBool>` for cancellation flags +- Employ `crossbeam` primitives for lock-free data structures +- Minimize shared mutable state through message passing + +### Resource Management +- Implement proper `Drop` traits for cleanup +- Use `JoinHandle` for thread lifecycle management +- Employ timeouts for external operations + +### Error Propagation +- Distinguish between recoverable and permanent errors +- Provide detailed error context for debugging +- Use structured error types with clear categorization + +### Performance Monitoring +- Collect metrics without impacting datapath performance +- Use efficient serialization for network transmission +- Implement configurable aggregation windows + +This controller architecture enables Feldera to achieve high-performance stream processing while maintaining strong consistency guarantees and providing comprehensive observability into pipeline operations. 
\ No newline at end of file diff --git a/crates/adapters/src/format/CLAUDE.md b/crates/adapters/src/format/CLAUDE.md new file mode 100644 index 00000000000..49288121b48 --- /dev/null +++ b/crates/adapters/src/format/CLAUDE.md @@ -0,0 +1,527 @@ +## Overview + +The format module implements Feldera's unified data serialization/deserialization layer, providing extensible support for multiple data formats (JSON, CSV, Avro, Parquet, Raw) while integrating seamlessly with DBSP circuits. It enables high-performance data parsing and generation across diverse external systems while maintaining strong type safety and comprehensive error handling. + +## Architecture Overview + +### **Plugin-Based Factory Pattern** + +The format system uses a registry-based architecture with runtime format discovery: + +```rust +// Global format registries for input/output operations +static INPUT_FORMATS: LazyLock<IndexMap<String, Box<dyn InputFormat>>> = LazyLock::new(|| { + let mut formats = IndexMap::new(); + formats.insert("json".to_string(), Box::new(JsonInputFormat) as _); + formats.insert("csv".to_string(), Box::new(CsvInputFormat) as _); + formats.insert("avro".to_string(), Box::new(AvroInputFormat) as _); + formats.insert("parquet".to_string(), Box::new(ParquetInputFormat) as _); + formats.insert("raw".to_string(), Box::new(RawInputFormat) as _); + formats +}); +``` + +### **Three-Layer Architecture** + +#### **1. Format Factory Layer** +```rust +trait InputFormat: Send + Sync { + fn name(&self) -> Cow<'static, str>; + fn new_parser(&self, endpoint_name: &str, input_stream: &InputCollectionHandle, + config: &YamlValue) -> Result<Box<dyn Parser>, ControllerError>; +} + +trait OutputFormat: Send + Sync { + fn name(&self) -> Cow<'static, str>; + fn new_encoder(&self, endpoint_name: &str, config: &ConnectorConfig, + key_schema: &Option<Relation>, value_schema: &Relation, + consumer: Box<dyn OutputConsumer>) -> Result<Box<dyn Encoder>, ControllerError>; +} +``` + +#### **2.
Stream Processing Layer** +```rust +trait Parser: Send + Sync { + fn parse(&mut self, data: &[u8]) -> (Option<Box<dyn InputBuffer>>, Vec<ParseError>); + fn splitter(&self) -> Box<dyn Splitter>; + fn fork(&self) -> Box<dyn Parser>; +} + +trait Encoder: Send { + fn consumer(&mut self) -> &mut dyn OutputConsumer; + fn encode(&mut self, batch: &dyn SerBatch) -> AnyResult<()>; + fn fork(&self) -> Box<dyn Encoder>; +} +``` + +#### **3. Data Abstraction Layer** +```rust +trait InputBuffer: Send + Sync { + fn flush(&mut self, handle: &InputCollectionHandle) -> Vec<ParseError>; + fn take_some(&mut self, n: usize) -> Box<dyn InputBuffer>; + fn hash(&self) -> u64; + fn len(&self) -> usize; +} +``` + +## Format Implementations + +### **JSON Format** (`json/`) + +**Most Feature-Rich Implementation** supporting diverse update patterns: + +#### **Update Format Variants** +```rust +pub enum JsonUpdateFormat { + Raw, // Direct record insertion: {"field1": "value1", "field2": "value2"} + InsertDelete, // Explicit ops: {"insert": {record}, "delete": {record}} + Debezium, // CDC: {"payload": {"op": "c|u|d", "before": {..}, "after": {..}}} + Snowflake, // Streaming: {"action": "INSERT", "data": {record}} + Redis, // Key-value: {"key": "value", "op": "SET"} +} +``` + +#### **Advanced JSON Processing** +- **JSON Splitter**: Sophisticated state machine handling nested objects and arrays +- **Schema Generation**: Kafka Connect JSON schema generation for Debezium integration +- **Error Recovery**: Field-level error attribution with suggested fixes +- **Array Support**: Both single objects and object arrays + +**JSON Splitter State Machine**: +```rust +enum JsonSplitterState { + Start, // Looking for object/array start + InObject(u32), // Inside object, tracking nesting depth + InArray(u32), // Inside array, tracking nesting depth + InString, // Inside quoted string + Escape, // Handling escape sequences +} +``` + +### **CSV Format** (`csv/`) + +**High-Performance Text Processing** with custom deserializer: + +#### **Key Features** +- **Custom Deserializer**: Forked from `csv` crate to
expose low-level `Deserializer` +- **Streaming Processing**: Line-by-line processing with configurable delimiters +- **Quote Handling**: Proper CSV quote escaping and multi-line field support +- **Header Support**: Optional header row processing and field mapping + +#### **Configuration Options** +```rust +pub struct CsvParserConfig { + pub delimiter: u8, // Field delimiter (default: comma) + pub quote: u8, // Quote character (default: double quote) + pub escape: Option<u8>, // Escape character + pub has_headers: bool, // First row contains headers + pub flexible: bool, // Allow variable field counts + pub comment: Option<u8>, // Comment line prefix +} +``` + +### **Avro Format** (`avro/`) + +**Enterprise Integration Focus** with schema registry support: + +#### **Schema Registry Integration** +```rust +pub struct AvroConfig { + pub registry_url: Option<String>, // Confluent Schema Registry URL + pub authentication: SchemaRegistryAuth, // Authentication configuration + pub schema: Option<String>, // Inline schema definition + pub schema_id: Option<u32>, // Registry schema ID + pub proxy: Option<String>, // HTTP proxy configuration + pub timeout_ms: Option<u64>, // Request timeout +} + +pub enum SchemaRegistryAuth { + None, + Basic { username: String, password: String }, + Bearer { token: String }, +} +``` + +#### **Binary Processing** +- **Efficient Deserialization**: Direct Avro binary format processing +- **Schema Evolution**: Automatic handling of schema changes +- **Type Safety**: Strong typing with Rust's type system +- **Connection Pooling**: Reuse HTTP connections for schema registry access + +### **Parquet Format** (`parquet/`) + +**Analytics-Optimized Columnar Processing** with Arrow integration: + +#### **Arrow Integration** +```rust +// Feldera SQL types → Arrow schema conversion +fn sql_to_arrow_type(sql_type: &SqlType) -> ArrowResult<DataType> { + match sql_type { + SqlType::Boolean => Ok(DataType::Boolean), + SqlType::TinyInt => Ok(DataType::Int8), + SqlType::SmallInt => Ok(DataType::Int16),
SqlType::Integer => Ok(DataType::Int32), + SqlType::BigInt => Ok(DataType::Int64), + SqlType::Decimal(precision, scale) => Ok(DataType::Decimal128(*precision as u8, *scale as i8)), + // ... comprehensive type mapping + } +} +``` + +#### **Batch Processing Architecture** +- **Columnar Efficiency**: Leverages Arrow's vectorized operations +- **Large Dataset Support**: Streaming processing of multi-GB files +- **Delta Lake Integration**: Special handling for Databricks Delta Lake format +- **Type Preservation**: Maintains SQL type semantics through Arrow conversion + +### **Raw Format** (`raw.rs`) + +**Minimal Overhead Processing** for binary data: + +#### **Processing Modes** +```rust +pub enum RawReaderMode { + Lines, // Split input by newlines, each line becomes a record + Blob, // Entire input becomes a single record +} +``` + +#### **Use Cases** +- **Binary Data**: Direct processing of binary streams +- **Log Processing**: Line-based log file ingestion +- **Custom Protocols**: Raw byte stream handling for specialized formats +- **Zero Parsing**: Minimal CPU overhead for high-throughput scenarios + +## Stream Processing Architecture + +### **Data Flow Pipeline** + +``` +Raw Bytes → Splitter → Parser → InputBuffer → DBSP Circuit + ↑ ↑ ↑ ↑ ↑ + | | | | | +Transport Boundary Record Batched Database +Layer Detection Parsing Records Operations +``` + +### **Stream Splitting Strategies** + +#### **Format-Specific Splitters** +```rust +trait Splitter: Send { + fn input(&mut self, data: &[u8]) -> Option; // Returns boundary position + fn clear(&mut self); // Reset internal state +} + +// Different splitting strategies: +pub struct LineSplitter; // Newline-based splitting (CSV, Raw) +pub struct JsonSplitter; // JSON object/array boundary detection +pub struct Sponge; // No splitting - consume entire input (Parquet) +``` + +#### **Boundary Detection** +- **Incremental Processing**: Handle partial records across buffer boundaries +- **State Preservation**: Maintain parsing 
state between buffer chunks +- **Memory Efficiency**: Process large streams without loading entire content + +### **Buffer Management** + +#### **InputBuffer Trait Implementation** +```rust +trait InputBuffer: Send + Sync { + fn flush(&mut self, handle: &InputCollectionHandle) -> Vec<ParseError>; + fn take_some(&mut self, n: usize) -> Box<dyn InputBuffer>; + fn hash(&self) -> u64; // For replay consistency + fn len(&self) -> usize; +} +``` + +#### **Advanced Buffer Features** +- **Partial Consumption**: `take_some()` for controlled batch processing +- **Replay Verification**: `hash()` for deterministic fault tolerance +- **Memory Management**: Efficient buffer allocation and reuse +- **Backpressure Handling**: Buffer size limits with overflow management + +## Error Handling and Recovery + +### **Structured Error System** + +```rust +pub struct ParseError(Box<ParseErrorInner>); + +struct ParseErrorInner { + description: String, // Human-readable error description + event_number: Option<u64>, // Stream position for debugging + field: Option<String>, // Failed field name + invalid_bytes: Option<Vec<u8>>, // Binary data that failed parsing + invalid_text: Option<String>, // Text data that failed parsing + suggestion: Option<Cow<'static, str>>, // Suggested fix +} +``` + +### **Error Recovery Strategies** + +#### **Graceful Degradation** +```rust +impl Parser for JsonParser { + fn parse(&mut self, data: &[u8]) -> (Option<Box<dyn InputBuffer>>, Vec<ParseError>) { + let mut buffer = JsonInputBuffer::new(); + let mut errors = Vec::new(); + + // Continue processing after individual record failures + for record in self.split_records(data) { + match self.parse_record(record) { + Ok(parsed) => buffer.push(parsed), + Err(e) => { + errors.push(e); + // Continue with next record + } + } + } + + (Some(Box::new(buffer)), errors) + } +} +``` + +#### **Error Classification** +- **Field-Level Errors**: Specific field parsing failures with context +- **Record-Level Errors**: Entire record rejection with recovery suggestions +- **Batch-Level Errors**: Array parsing failures requiring batch rollback +-
**Format-Level Errors**: Configuration or schema errors + +## Type System Integration + +### **Schema Conversion Pipeline** + +#### **Unified Type Mapping** +```rust +// SQL types → Format-specific schema conversion +const JSON_SERDE_CONFIG: SqlSerdeConfig = SqlSerdeConfig { + timestamp_format: TimestampFormat::String("%Y-%m-%d %H:%M:%S%.f".to_string()), + date_format: DateFormat::String("%Y-%m-%d".to_string()), + time_format: TimeFormat::String("%H:%M:%S%.f".to_string()), + decimal_format: DecimalFormat::String, + json_flavor: JsonFlavor::Default, +}; +``` + +#### **Format-Specific Configurations** +- **JSON**: Flexible type coercion with null handling +- **CSV**: String-based parsing with configurable type inference +- **Avro**: Strong typing with schema evolution support +- **Parquet**: Arrow type system with SQL semantics preservation + +### **Null Safety and Three-Valued Logic** + +**Comprehensive Null Handling**: +```rust +// SQL NULL propagation through format layers +match field_value { + Some(value) => parse_typed_value(value, field_type)?, + None => Ok(SqlValue::Null), // Preserve SQL null semantics +} +``` + +## Performance Optimization Patterns + +### **Memory Management** + +#### **Zero-Copy Processing** +```rust +// Direct byte slice processing where possible +fn parse_string_field(input: &[u8]) -> Result<&str, ParseError> { + std::str::from_utf8(input).map_err(|e| ParseError::invalid_utf8(e)) +} +``` + +#### **Buffer Reuse Patterns** +```rust +// Efficient buffer recycling +impl InputBuffer for JsonInputBuffer { + fn take_some(&mut self, n: usize) -> Box { + let mut taken = Self::with_capacity(n); + taken.records = self.records.drain(..n.min(self.records.len())).collect(); + Box::new(taken) + } +} +``` + +### **Batch Processing Optimization** + +#### **Configurable Batch Sizes** +- **Memory Constraints**: Limit batch sizes based on available memory +- **Latency Requirements**: Smaller batches for low-latency scenarios +- **Throughput Optimization**: 
Larger batches for maximum throughput + +#### **Vectorized Operations** +- **Arrow Integration**: SIMD-optimized columnar operations for Parquet +- **Batch Type Conversion**: Vectorized SQL type conversions +- **Parallel Processing**: Multi-threaded parsing with `fork()` support + +### **Network and I/O Optimization** + +#### **Schema Registry Optimization** +```rust +// Connection pooling and caching +struct SchemaRegistryClient { + client: Arc, // Reuse HTTP connections + schema_cache: RwLock>, // Cache schemas + auth_token: Option>, // Cached authentication +} +``` + +## Testing Infrastructure + +### **Comprehensive Test Strategy** + +#### **Property-Based Testing** +```rust +use proptest::prelude::*; + +proptest! { + #[test] + fn json_parser_handles_arbitrary_input(input in ".*") { + let parser = JsonParser::new(config); + let (buffer, errors) = parser.parse(input.as_bytes()); + // Verify parsing never panics and errors are structured + assert!(buffer.is_some() || !errors.is_empty()); + } +} +``` + +#### **Mock Infrastructure** +```rust +// Isolated testing with mock consumers +struct MockOutputConsumer { + records: Vec, + errors: Vec, +} + +impl OutputConsumer for MockOutputConsumer { + fn batch_end(&mut self) -> AnyResult<()> { + // Capture batch boundaries for testing + } +} +``` + +### **Format Compatibility Testing** + +#### **Round-Trip Verification** +```rust +#[test] +fn test_json_roundtrip() { + let original_data = generate_test_records(); + + // Serialize with encoder + let encoded = JsonEncoder::encode(&original_data)?; + + // Parse with parser + let (parsed_buffer, errors) = JsonParser::parse(&encoded)?; + + // Verify data integrity + assert_eq!(original_data, parsed_buffer.records); + assert!(errors.is_empty()); +} +``` + +## Configuration and Setup Patterns + +### **Hierarchical Configuration System** + +#### **Multi-Source Configuration** +```rust +// HTTP Request → Format Configuration +fn config_from_http_request(&self, endpoint_name: &str, 
request: &HttpRequest) + -> Result, ControllerError> { + + // Extract format-specific configuration from HTTP headers/body + let config = self.extract_config_from_request(request)?; + self.validate_config(&config)?; + Ok(Box::new(config)) +} + +// YAML Configuration → Parser Instance +fn new_parser(&self, endpoint_name: &str, input_stream: &InputCollectionHandle, + config: &YamlValue) -> Result, ControllerError> { + + let parsed_config = self.parse_yaml_config(config)?; + self.validate_parser_config(&parsed_config)?; + Ok(Box::new(self.create_parser(parsed_config))) +} +``` + +#### **Configuration Validation** +- **Schema Validation**: Early validation of format configurations +- **Transport Compatibility**: Format-transport compatibility verification +- **Resource Constraints**: Memory and performance limit validation +- **Security Validation**: Authentication and authorization checks + +## Integration Patterns + +### **Transport Layer Integration** + +#### **Transport-Agnostic Design** +The format layer maintains complete independence from transport specifics: +- **Abstract Interfaces**: `InputConsumer`/`OutputConsumer` for transport integration +- **Configuration Extraction**: HTTP, YAML, and programmatic configuration sources +- **Buffer Negotiation**: Transport-specific buffer size and batching limits + +### **DBSP Circuit Integration** + +#### **Streaming Data Flow** +```rust +// Direct integration with DBSP circuits +impl InputBuffer for FormatInputBuffer { + fn flush(&mut self, handle: &InputCollectionHandle) -> Vec { + let mut errors = Vec::new(); + + // Stream records directly to DBSP circuit + for record in self.records.drain(..) { + match handle.insert(record) { + Ok(_) => {}, + Err(e) => errors.push(ParseError::circuit_error(e)), + } + } + + errors + } +} +``` + +## Development Best Practices + +### **Adding New Format Support** + +1. **Create Format Module**: Add new module under `format/` +2. 
**Implement Core Traits**: `InputFormat`, `OutputFormat`, `Parser`, `Encoder` +3. **Add Stream Splitter**: Implement boundary detection logic +4. **Type System Integration**: Map format types to SQL types +5. **Configuration Support**: Add YAML and HTTP configuration parsing +6. **Comprehensive Testing**: Unit tests, integration tests, property tests +7. **Documentation**: Update format documentation and examples +8. **Registry Integration**: Add to `INPUT_FORMATS`/`OUTPUT_FORMATS` registries + +### **Performance Tuning Guidelines** + +#### **Memory Optimization** +- **Profile Memory Usage**: Use `cargo bench` and memory profiling tools +- **Optimize Hot Paths**: Focus on frequently executed parsing code +- **Buffer Size Tuning**: Configure optimal buffer sizes for workload +- **Connection Pooling**: Reuse network connections where possible + +#### **Throughput Optimization** +- **Batch Size Tuning**: Balance latency vs throughput requirements +- **Parallel Processing**: Leverage multi-core processing with `fork()` +- **Zero-Copy Patterns**: Avoid unnecessary data copying +- **SIMD Integration**: Use vectorized operations for columnar formats + +### **Error Handling Best Practices** + +#### **User-Friendly Errors** +- **Precise Error Messages**: Include field names, positions, and suggested fixes +- **Error Attribution**: Link errors back to original data sources +- **Graceful Recovery**: Continue processing after recoverable errors +- **Rich Context**: Provide sufficient context for debugging + +This format module represents a sophisticated, production-ready data processing system that successfully abstracts the complexity of multiple data formats while providing high performance, reliability, and extensibility for the Feldera streaming analytics platform. 
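As a concrete illustration of the boundary-detection step described above, here is a minimal newline splitter in the spirit of the `Splitter` trait. The trait is re-declared locally in simplified form for the sketch; it is not the crate's exact definition.

```rust
/// Simplified stand-in for the format module's `Splitter` trait.
pub trait Splitter {
    /// Feed a chunk of input; return the offset one past the first record
    /// boundary found in `data`, or `None` if no complete record was seen.
    fn input(&mut self, data: &[u8]) -> Option<usize>;
    /// Reset internal state (a no-op for stateless line splitting).
    fn clear(&mut self);
}

/// Newline-delimited records, as used by line-oriented formats (CSV, Raw).
pub struct LineSplitter;

impl Splitter for LineSplitter {
    fn input(&mut self, data: &[u8]) -> Option<usize> {
        // The boundary sits one byte past the newline terminator.
        data.iter().position(|&b| b == b'\n').map(|i| i + 1)
    }
    fn clear(&mut self) {}
}
```

In the real pipeline, the transport layer would keep feeding chunks and buffer the unconsumed tail until the splitter reports a boundary, which is what lets partial records straddle buffer boundaries safely.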
\ No newline at end of file diff --git a/crates/adapters/src/integrated/CLAUDE.md b/crates/adapters/src/integrated/CLAUDE.md new file mode 100644 index 00000000000..2da57c32d2c --- /dev/null +++ b/crates/adapters/src/integrated/CLAUDE.md @@ -0,0 +1,453 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build with all integrated connector features +cargo build -p adapters --features="with-deltalake,with-iceberg" + +# Run PostgreSQL connector tests +cargo test -p adapters postgres + +# Run Delta Lake connector tests (requires AWS credentials) +cargo test -p adapters delta_table + +# Run Iceberg connector tests +cargo test -p adapters --features="iceberg-tests-fs" iceberg + +# Build specific integrated connector +cargo build -p adapters --features="with-deltalake" +``` + +### Feature Flags + +```toml +# Available integrated connector features +with-deltalake = ["deltalake"] # Delta Lake support +with-iceberg = ["feldera-iceberg"] # Apache Iceberg support +# PostgreSQL support is always enabled via tokio-postgres +``` + +## Architecture Overview + +### Technology Stack + +- **Database Integration**: PostgreSQL with tokio-postgres +- **Data Lake Integration**: Delta Lake with deltalake crate, Apache Iceberg +- **Query Engine**: DataFusion for Delta Lake filtering and projection +- **Storage Backends**: S3, Azure Blob, Google Cloud Storage +- **Data Formats**: Parquet for Delta Lake, custom serialization for PostgreSQL +- **Async Runtime**: Tokio for non-blocking I/O operations + +### Core Purpose + +The Integrated Connectors module provides **tightly-coupled transport and format implementations** for external data systems: + +- **Single-Purpose Connectors**: Transport protocol and data format are integrated into one component +- **Database CDC**: Real-time change data capture from PostgreSQL +- **Data Lake Integration**: Batch and streaming access to Delta Lake and Iceberg tables +- **Format Optimization**: Native format handling 
without separate format layer +- **Schema Evolution**: Support for evolving schemas in data lake formats + +## Project Structure + +### Core Components + +#### Main Module (`mod.rs`) +- Factory functions for creating integrated endpoints +- Trait definitions for `IntegratedInputEndpoint` and `IntegratedOutputEndpoint` +- Feature-gated module imports + +#### PostgreSQL Integration (`postgres/`) +- **Input**: SQL query execution and result streaming +- **Output**: INSERT, UPSERT, DELETE operations with prepared statements +- **Features**: Type mapping, connection pooling, error retry logic + +#### Delta Lake Integration (`delta_table/`) +- **Input**: Table scanning, incremental reads, change data feed (CDF) +- **Output**: Parquet file writing with Delta Lake transaction log +- **Features**: S3/Azure/GCS storage, schema evolution, time travel + +#### Apache Iceberg Integration +- **External crate**: `feldera-iceberg` (referenced but implementation in separate crate) +- **Features**: Schema evolution, partition management, catalog integration + +## Implementation Details + +### Integrated Endpoint Architecture + +#### Core Traits +```rust +// From mod.rs - integrated output endpoint combining transport and encoding +pub trait IntegratedOutputEndpoint: OutputEndpoint + Encoder { + fn into_encoder(self: Box) -> Box; + fn as_endpoint(&mut self) -> &mut dyn OutputEndpoint; +} + +// Factory function for creating integrated endpoints +pub fn create_integrated_output_endpoint( + endpoint_id: EndpointId, + endpoint_name: &str, + connector_config: &ConnectorConfig, + key_schema: &Option, + schema: &Relation, + controller: Weak, +) -> Result, ControllerError> +``` + +#### Format Integration Validation +```rust +// Integrated connectors don't allow separate format specification +if connector_config.format.is_some() { + return Err(ControllerError::invalid_parser_configuration( + endpoint_name, + &format!("{} transport does not allow 'format' specification", + 
connector_config.transport.name()) + )); +} +``` + +### PostgreSQL Integration + +#### Input Connector - Query Execution +```rust +// From postgres/input.rs - SQL query execution and streaming +pub struct PostgresInputEndpoint { + inner: Arc<PostgresInputEndpointInner>, +} + +impl PostgresInputEndpoint { + pub fn new( + endpoint_name: &str, + config: &PostgresReaderConfig, // Contains SQL query and connection info + consumer: Box<dyn InputConsumer>, + ) -> Self +} +``` + +#### Connection Management +```rust +// From postgres/input.rs - PostgreSQL connection handling +async fn connect_to_postgres(&self) + -> Result<(Client, Connection), ControllerError> { + + let (client, connection) = tokio_postgres::connect(self.config.uri.as_str(), NoTls).await?; + + // Spawn connection handler + tokio::spawn(async move { + if let Err(e) = connection.await { + eprintln!("connection error: {}", e); + } + }); + + Ok((client, connection)) +} +``` + +#### Type Mapping Implementation +```rust +// From postgres/input.rs - comprehensive PostgreSQL to JSON type mapping +let value = match *col_type { + Type::BOOL => { + let v: Option<bool> = row.get(col_idx); + json!(v) + } + Type::VARCHAR | Type::TEXT => { + let v: Option<String> = row.get(col_idx); + json!(v) + } + Type::INT4 => { + let v: Option<i32> = row.get(col_idx); + json!(v) + } + Type::TIMESTAMP => { + let v: Option<NaiveDateTime> = row.get(col_idx); + let vutc = v.as_ref().map(|v| Utc.from_utc_datetime(v).to_rfc3339()); + json!(vutc) + } + Type::NUMERIC => { + let v: Option<Decimal> = row.get(col_idx); + json!(v) + } + // Support for arrays, UUIDs, and other PostgreSQL types... +}; +``` + +#### Output Connector - Prepared Statements +```rust +// From postgres/output.rs - prepared statement management +struct PreparedStatements { + insert: Statement, + upsert: Statement, // INSERT ...
ON CONFLICT DO UPDATE + delete: Statement, +} + +// Error classification for retry logic +enum BackoffError { + Temporary(anyhow::Error), // Retry with backoff + Permanent(anyhow::Error), // Fail immediately +} + +impl From for BackoffError { + fn from(value: postgres::Error) -> Self { + // Classify connection errors as temporary, SQL errors as permanent + if value.is_closed() || matches!(value.code(), Some(SqlState::CONNECTION_FAILURE)) { + Self::Temporary(anyhow!("failed to connect to postgres: {value}")) + } else { + Self::Permanent(anyhow!("postgres error: permanent: {value}")) + } + } +} +``` + +### Delta Lake Integration + +#### Input Connector - Table Scanning +```rust +// From delta_table/input.rs - Delta Lake table scanning +pub struct DeltaTableInputEndpoint { + inner: Arc, +} + +impl DeltaTableInputEndpoint { + pub fn new( + endpoint_name: &str, + config: &DeltaTableReaderConfig, // Table URI, query filters, etc. + consumer: Box, + ) -> Self +} +``` + +#### Storage Handler Registration +```rust +// From delta_table/mod.rs - cloud storage integration +static REGISTER_STORAGE_HANDLERS: Once = Once::new(); + +pub fn register_storage_handlers() { + REGISTER_STORAGE_HANDLERS.call_once(|| { + deltalake::aws::register_handlers(None); // S3 support + deltalake::azure::register_handlers(None); // Azure Blob support + deltalake::gcp::register_handlers(None); // Google Cloud Storage + }); +} +``` + +#### Delta Lake-Specific Serialization +```rust +// From delta_table/mod.rs - Delta Lake serialization configuration +pub fn delta_input_serde_config() -> SqlSerdeConfig { + SqlSerdeConfig::default() + // Delta uses microsecond timestamps in Parquet + .with_timestamp_format(TimestampFormat::String("%Y-%m-%dT%H:%M:%S%.f%Z")) + .with_date_format(DateFormat::DaysSinceEpoch) + .with_decimal_format(DecimalFormat::String) + .with_uuid_format(UuidFormat::String) // UUIDs as strings in Delta +} +``` + +#### DataFusion Integration for Query Pushdown +```rust +// From 
delta_table/input.rs - DataFusion integration for filtering +use datafusion::prelude::{SessionContext, DataFrame}; +use deltalake::datafusion::logical_expr::Expr; + +// Create DataFusion context for Delta Lake +let session_ctx = SessionContext::new(); +let delta_table = DeltaTableBuilder::from_uri(&config.uri).build().await?; + +// Apply filtering at storage layer for performance +if let Some(filter_expr) = &config.filter { + let logical_plan = session_ctx + .read_table(Arc::new(delta_table))? + .filter(filter_expr.clone())? + .logical_plan().clone(); +} +``` + +### Multi-Format Support Patterns + +#### Feature-Gated Endpoint Creation +```rust +// From mod.rs - conditional compilation based on features +pub fn create_integrated_input_endpoint( + endpoint_name: &str, + config: &ConnectorConfig, + consumer: Box, +) -> Result, ControllerError> { + + let ep: Box = match &config.transport { + #[cfg(feature = "with-deltalake")] + DeltaTableInput(config) => Box::new(DeltaTableInputEndpoint::new( + endpoint_name, config, consumer + )), + + #[cfg(feature = "with-iceberg")] + IcebergInput(config) => Box::new(feldera_iceberg::IcebergInputEndpoint::new( + endpoint_name, config, consumer + )), + + PostgresInput(config) => Box::new(PostgresInputEndpoint::new( + endpoint_name, config, consumer + )), + + transport => Err(ControllerError::unknown_input_transport( + endpoint_name, &transport.name() + ))?, + }; + + Ok(ep) +} +``` + +## Key Features and Capabilities + +### PostgreSQL Integration +- **Query-based Input**: Execute arbitrary SQL queries and stream results +- **Comprehensive Type Support**: All PostgreSQL types including arrays and UUIDs +- **Change Streaming**: Real-time data ingestion from query results +- **Prepared Statements**: Optimized INSERT/UPSERT/DELETE operations +- **Connection Management**: Automatic reconnection and error recovery +- **Transaction Safety**: ACID compliance for output operations + +### Delta Lake Integration +- **Schema Evolution**: Handle 
schema changes across table versions +- **Time Travel**: Read data from specific table versions or timestamps +- **Change Data Feed**: Incremental processing of table changes +- **Multi-Cloud Storage**: S3, Azure Blob, Google Cloud Storage support +- **Partition Pruning**: Efficient reading through partition elimination +- **DataFusion Optimization**: Query pushdown and column projection + +### Apache Iceberg Integration +- **Catalog Integration**: Support for Hive, Glue, REST catalogs +- **Schema Evolution**: Advanced schema evolution capabilities +- **Partition Management**: Automatic partition lifecycle management +- **Multi-Engine Compatibility**: Compatible with Spark, Flink, Trino + +## Configuration and Usage + +### PostgreSQL Configuration +```yaml +# Input connector - execute query and stream results +transport: + name: postgres_input + config: + uri: "postgresql://user:pass@localhost:5432/database" + query: "SELECT id, name, created_at FROM users WHERE created_at > NOW() - INTERVAL '1 day'" + +# Output connector - write to table +transport: + name: postgres_output + config: + uri: "postgresql://user:pass@localhost:5432/database" + table: "processed_events" + mode: "upsert" # insert, upsert, or append +``` + +### Delta Lake Configuration +```yaml +# Input connector - read from Delta Lake table +transport: + name: delta_table_input + config: + uri: "s3://my-bucket/delta-tables/events/" + mode: "snapshot" # snapshot or follow + version: 42 # optional: read specific version + timestamp: "2024-01-15T10:00:00Z" # optional: read at timestamp + +# Output connector - write to Delta Lake +transport: + name: delta_table_output + config: + uri: "s3://my-bucket/delta-tables/processed/" + mode: "append" # append or overwrite +``` + +## Development Workflow + +### Adding New Integrated Connector + +1. **Create connector module** in `src/integrated/my_connector/` +2. 
**Implement required traits**: + - `IntegratedInputEndpoint` for input connectors + - `IntegratedOutputEndpoint` for output connectors +3. **Add feature flag** in `Cargo.toml` +4. **Update factory functions** in `mod.rs` +5. **Add configuration types** in `feldera-types` +6. **Implement comprehensive tests** with real external services + +### Testing Strategy + +#### Unit Tests +- Type conversion and serialization correctness +- Error handling and retry logic +- Configuration validation + +#### Integration Tests +- Real database/storage system connectivity +- End-to-end data flow validation +- Performance and reliability testing +- Schema evolution scenarios + +### Performance Optimization + +#### Connection Management +- **Connection Pooling**: Reuse database connections +- **Async I/O**: Non-blocking operations throughout +- **Batch Processing**: Optimize throughput with batching +- **Retry Logic**: Exponential backoff for transient failures + +#### Data Processing +- **Schema Caching**: Cache schema information to avoid repeated lookups +- **Columnar Processing**: Leverage Arrow/Parquet columnar formats +- **Query Pushdown**: Push filters and projections to storage layer +- **Parallel Processing**: Multi-threaded data processing where possible + +## Dependencies and Integration + +### Core Dependencies +- **tokio-postgres** - PostgreSQL async client +- **deltalake** - Delta Lake Rust implementation with DataFusion +- **feldera-iceberg** - Apache Iceberg integration (separate crate) +- **datafusion** - Query engine for Delta Lake optimization +- **arrow** - Columnar data processing + +### Cloud Storage Dependencies +- **AWS SDK** - S3 integration for Delta Lake +- **Azure SDK** - Azure Blob Storage support +- **Google Cloud SDK** - Google Cloud Storage support + +### Error Handling Patterns +- **Classified Errors**: Distinguish between temporary and permanent failures +- **Retry Logic**: Exponential backoff for transient errors +- **Circuit Breakers**: Prevent 
cascade failures +- **Graceful Degradation**: Continue operation when possible + +### Security Considerations +- **Connection Encryption**: TLS for database connections +- **Cloud Authentication**: IAM roles and service accounts +- **Credential Management**: Secure credential storage and rotation +- **Access Control**: Fine-grained permissions for data access + +## Best Practices + +### Configuration Design +- **Sensible Defaults**: Provide reasonable default values +- **Validation**: Validate configuration at startup +- **Documentation**: Clear documentation for all configuration options +- **Feature Flags**: Use feature flags for optional dependencies + +### Error Handling +- **Structured Errors**: Use detailed error types with context +- **User-Friendly Messages**: Provide actionable error messages +- **Logging**: Comprehensive logging for debugging +- **Metrics**: Export metrics for monitoring and alerting + +### Performance Monitoring +- **Throughput Metrics**: Track records/second processed +- **Latency Metrics**: Monitor end-to-end processing latency +- **Error Rates**: Track and alert on error rates +- **Resource Usage**: Monitor CPU, memory, and network usage + +This module enables Feldera to integrate seamlessly with external data systems while maintaining high performance and reliability through native format handling and optimized data processing pipelines. \ No newline at end of file diff --git a/crates/adapters/src/transport/CLAUDE.md b/crates/adapters/src/transport/CLAUDE.md new file mode 100644 index 00000000000..d73b0fdbdf0 --- /dev/null +++ b/crates/adapters/src/transport/CLAUDE.md @@ -0,0 +1,467 @@ +## Overview + +The transport module implements Feldera's I/O abstraction layer, providing unified access to diverse external systems for DBSP circuits. 
It enables high-performance, fault-tolerant data ingestion and emission across transports including Kafka, HTTP, files, databases, and cloud storage systems, while maintaining strong consistency guarantees for incremental computation. + +## Architecture Overview + +### **Dual Transport Abstraction Model** + +The transport system uses two complementary abstraction patterns: + +#### **Regular Transports** (Transport + Format Separation) +```rust +TransportInputEndpoint → InputReader → Parser → InputConsumer +``` +- **Transport**: Handles data movement and connection management +- **Format**: Separate parser handles data serialization/deserialization +- **Examples**: HTTP, Kafka, File, URL, S3, PubSub, Redis + +#### **Integrated Transports** (Transport + Format Combined) +```rust +IntegratedInputEndpoint → InputReader → InputConsumer +``` +- **Unified**: Transport and format tightly coupled for efficiency +- **Examples**: PostgreSQL, Delta Lake, Iceberg (database-specific optimizations) + +### **Core Abstractions and Traits** + +#### Primary Transport Traits +```rust +pub trait InputEndpoint { + fn endpoint_name(&self) -> &str; + fn supports_fault_tolerance(&self) -> bool; +} + +pub trait TransportInputEndpoint: InputEndpoint { + fn open(&self, consumer: Box<dyn InputConsumer>, + parser: Box<dyn Parser>, ...) -> Box<dyn InputReader>; +} + +pub trait IntegratedInputEndpoint: InputEndpoint { + fn open(self: Box<Self>, input_handle: &InputCollectionHandle, + ...)
-> Box<dyn InputReader>; +} +``` + +#### Input Reader Interface +```rust +pub trait InputReader: Send { + fn seek(&mut self, position: JsonValue) -> AnyResult<()>; + fn request(&mut self, command: InputReaderCommand); + fn is_closed(&self) -> bool; +} +``` + +**Command-Driven State Management**: +```rust +pub enum InputReaderCommand { + Queue, // Start queuing data + Extend, // Resume normal processing + Pause, // Pause data ingestion + Replay { seek, queue, hash }, // Replay from checkpoint + Disconnect, // Graceful shutdown +} +``` + +## Transport Implementations + +### **HTTP Transport** (`http/`) + +**Key Features**: +- **Real-time Streaming**: Chunked transfer encoding with configurable timeouts +- **Exactly-Once Support**: Fault tolerance with seek/replay capabilities +- **State Management**: Atomic operations for thread-safe state transitions +- **Flexible Endpoints**: GET, POST, and streaming endpoint support + +**Architecture Pattern**: +```rust +// Background worker thread pattern +thread::spawn(move || { + while let Ok(command) = command_receiver.recv() { + match command { + Extend => start_streaming_data(), + Pause => pause_and_queue(), + Disconnect => graceful_shutdown(), + } + } +}); +``` + +**Configuration**: +```rust +HttpInputConfig { + path: String, // HTTP endpoint URL + method: HttpMethod, // GET/POST request method + timeout_ms: Option<u64>, // Request timeout + headers: BTreeMap<String, String>, // Custom headers +} +``` + +### **Kafka Transport** (`kafka/`) + +**Dual Implementation Strategy**: + +#### **Fault-Tolerant Kafka** (`ft/`): +- **Exactly-Once Semantics**: Transaction-based processing with commit coordination +- **Complex State Management**: Offset tracking with replay capabilities +- **Memory Monitoring**: Real-time memory usage reporting for large datasets +- **Error Handling**: Sophisticated retry logic with exponential backoff + +#### **Non-Fault-Tolerant Kafka** (`nonft/`): +- **High Performance**: Simplified processing for maximum throughput +- **At-Least-Once**:
Basic reliability without exact replay guarantees +- **Reduced Overhead**: Minimal state tracking for performance-critical scenarios + +**Key Implementation Details**: +```rust +// Kafka-specific error refinement +fn refine_kafka_error(client: &KafkaClient, e: KafkaError) -> (bool, AnyError) { + // Converts librdkafka errors into actionable Anyhow errors + // Determines fatality for proper error handling strategy +} + +// Authentication integration +KafkaAuthConfig::Aws { region } => { + // AWS MSK IAM authentication with credential chain +} +``` + +### **File Transport** (`file.rs`) + +**File Processing Features**: +- **Follow Mode**: Tail-like behavior for continuously growing files +- **Seek Support**: Resume from specific file positions using byte offsets +- **Buffer Management**: Configurable read buffers for memory efficiency +- **Testing Integration**: Barrier synchronization for deterministic test execution + +**Seek Implementation**: +```rust +impl InputReader for FileInputReader { + fn seek(&mut self, position: JsonValue) -> AnyResult<()> { + if let JsonValue::Number(n) = position { + self.file.seek(SeekFrom::Start(n.as_u64().unwrap()))?; + } + Ok(()) + } +} +``` + +### **URL Transport** (`url.rs`) + +**Advanced HTTP Features**: +- **Range Requests**: HTTP Range header support for resumable downloads +- **Connection Management**: Automatic reconnection with configurable timeouts +- **Pause/Resume**: Sophisticated state management with timeout handling +- **Error Recovery**: Retry logic with exponential backoff and circuit breaking + +**Resume Capability**: +```rust +// HTTP Range request for resumption +let range_header = format!("bytes={}-", current_position); +request.headers.insert("Range", range_header); +``` + +### **Specialized Transports** + +#### **S3 Transport** (`s3.rs`): +- **AWS SDK Integration**: Native AWS authentication and region support +- **Object Streaming**: Efficient streaming of large S3 objects +- **Prefix Filtering**: Support for processing
multiple objects with key patterns + +#### **Redis Transport** (`redis.rs`): +- **Multiple Data Structures**: Support for streams, lists, and pub/sub patterns +- **Connection Pooling**: Efficient connection reuse and management +- **Auth Support**: Redis AUTH and ACL integration + +#### **Ad Hoc Transport** (`adhoc.rs`): +- **Direct Data Injection**: Programmatic data insertion for testing and development +- **Schema Flexibility**: Support for arbitrary data structures +- **Development Support**: Simplified data injection for rapid prototyping + +## Error Handling Architecture + +### **Hierarchical Error Classification** + +**Error Severity Levels**: +```rust +// Fatal errors require complete restart +if is_fatal_error(&error) { + consumer.error(true, error); // Signal fatal condition + break; // Terminate processing loop +} + +// Non-fatal errors allow recovery +consumer.error(false, error); // Continue processing +``` + +**Error Context Preservation**: +- **Rich Error Messages**: Actionable information with suggested fixes +- **Source Location**: Precise error location tracking for debugging +- **Error Chains**: Maintain full error context through transformation layers + +### **Async Error Handling** + +**Out-of-Band Error Reporting**: +```rust +// Background thread error callback +let error_callback = Arc::clone(&consumer); +tokio::spawn(async move { + if let Err(e) = async_operation().await { + error_callback.error(false, e.into()); + } +}); +``` + +## Concurrency and Threading Patterns + +### **Thread-per-Transport Architecture** + +Most transports follow a consistent threading pattern: + +```rust +let (command_sender, command_receiver) = unbounded(); +let worker_thread = thread::Builder::new() + .name(format!("{}-worker", transport_name)) + .spawn(move || { + transport_worker_loop(command_receiver, consumer) + })?; + +// InputReader implementation delegates to worker thread +struct TransportInputReader { + command_sender: UnboundedSender<InputReaderCommand>, + worker_handle:
JoinHandle<()>, +} +``` + +### **Channel-Based Communication** + +**Command Dispatch Pattern**: +- **Unbounded Channels**: `UnboundedSender<InputReaderCommand>` for command dispatch +- **Backpressure Handling**: Optional bounded channels for flow control +- **Error Propagation**: Separate error channels for out-of-band error reporting + +### **Mixed Async/Sync Design** + +**Tokio Integration Strategy**: +```rust +// Background thread with Tokio runtime for async operations +let rt = tokio::runtime::Builder::new_current_thread() + .enable_all() + .build()?; + +rt.block_on(async { + // Async HTTP client operations + let response = http_client.get(url).send().await?; + // Process streaming response +}); +``` + +## Fault Tolerance and State Management + +### **Three-Tier Fault Tolerance Model** + +#### **Level 1: None** +- No persistence or recovery capabilities +- Suitable for non-critical or easily reproducible data sources +- Example: Development/testing scenarios + +#### **Level 2: At-Least-Once** +- Can resume from checkpoints but may have duplicates +- Checkpoint-based recovery with seek capability +- Example: File reading with position tracking + +#### **Level 3: Exactly-Once** +- Can replay exact data with hash verification +- Strong consistency guarantees for critical data processing +- Example: Kafka transactions with offset management + +### **Resume/Replay Implementation** + +```rust +pub enum Resume { + Barrier, // Cannot resume - start from beginning + Seek { seek: JsonValue }, // Resume from checkpoint position + Replay { seek: JsonValue, queue: Vec<Vec<u8>>, hash: u64 }, // Exact replay +} + +impl InputReader for FaultTolerantReader { + fn request(&mut self, command: InputReaderCommand) { + match command { + InputReaderCommand::Replay { seek, queue, hash } => { + self.seek(seek)?; + self.replay_queue(queue, hash)?; // Exact data replay + self.resume_normal_processing(); + } + } + } +} +``` + +### **Journaling and Checkpointing** + +**Metadata Persistence**: +- Position tracking for
seekable transports +- Queue snapshots for exact replay scenarios +- Hash verification for data integrity validation + +## Configuration and Factory Patterns + +### **Unified Configuration System** + +All transport configurations are centralized in `feldera-types/src/transport/`: + +```rust +#[derive(Deserialize, Serialize, Clone)] +pub enum TransportConfig { + Kafka(KafkaInputConfig), + Http(HttpInputConfig), + File(FileInputConfig), + // ... other transport configs +} +``` + +### **Factory Pattern Implementation** + +```rust +pub fn input_transport_config_to_endpoint( + config: &TransportConfig, + endpoint_name: &str, + secrets_dir: &Path, +) -> AnyResult<Option<Box<dyn TransportInputEndpoint>>> { + + match config { + TransportConfig::Kafka(kafka_config) => { + let endpoint = KafkaInputEndpoint::new(kafka_config, secrets_dir)?; + Ok(Some(Box::new(endpoint))) + } + TransportConfig::Http(http_config) => { + let endpoint = HttpInputEndpoint::new(http_config)?; + Ok(Some(Box::new(endpoint))) + } + // ... other transport factories + } +} +``` + +### **Secret Management Integration** + +**Secure Credential Handling**: +```rust +// Resolve secrets from files or environment +let resolved_config = config.resolve_secrets(secrets_dir)?; +let credentials = extract_credentials(&resolved_config)?; +``` + +## Testing Infrastructure and Patterns + +### **Barrier-Based Test Synchronization** + +**Deterministic Test Execution**: +```rust +#[cfg(test)] +static BARRIERS: Mutex<BTreeMap<String, usize>> = Mutex::new(BTreeMap::new()); + +pub fn set_barrier(name: &str, value: usize) { + BARRIERS.lock().unwrap().insert(name.into(), value); +} + +pub fn barrier_wait(name: &str) { + // Synchronization point for deterministic test execution + while BARRIERS.lock().unwrap().get(name).copied().unwrap_or(0) > 0 { + thread::sleep(Duration::from_millis(10)); + } +} +``` + +### **Mock Transport Implementations** + +**Test Transport Pattern**: +```rust +struct MockInputReader { + data: Vec<Vec<u8>>, + position: usize, + consumer: Box<dyn InputConsumer>, +} + +impl InputReader for
MockInputReader { + fn request(&mut self, command: InputReaderCommand) { + match command { + InputReaderCommand::Extend => self.send_next_batch(), + InputReaderCommand::Pause => self.pause_processing(), + _ => {} + } + } +} +``` + +### **Integration Testing** + +**Real External Service Testing**: +- Docker-based test environments for Kafka, PostgreSQL, Redis +- Testcontainers integration for isolated testing +- Property-based testing for fault tolerance scenarios + +## Performance Optimization Patterns + +### **Batching and Buffering** + +**Efficient Data Processing**: +```rust +// Configurable batch sizes for optimal throughput +const DEFAULT_BATCH_SIZE: usize = 1000; + +// Buffer management for memory efficiency +struct BufferedReader { + buffer: Vec<u8>, + batch_size: usize, + max_buffer_size: usize, +} +``` + +### **Memory Management** + +**Resource Monitoring**: +```rust +// Memory usage tracking for large datasets +pub fn report_memory_usage(&self) -> MemoryUsage { + MemoryUsage { + buffered_bytes: self.buffer.len(), + queued_records: self.queue.len(), + peak_memory: self.peak_memory.load(Ordering::Relaxed), + } +} +``` + +### **Connection Pooling and Reuse** + +**Efficient Resource Utilization**: +- HTTP connection pooling with keep-alive +- Kafka connection reuse across multiple topics +- Database connection pooling for integrated transports + +## Development Best Practices + +### **Adding New Transport Implementation** + +1. **Define Configuration**: Add transport config to `TransportConfig` enum +2. **Implement Traits**: Create `TransportInputEndpoint` and `InputReader` implementations +3. **Error Handling**: Implement proper error classification and recovery +4. **Threading**: Follow thread-per-transport pattern with command channels +5. **Testing**: Add comprehensive unit and integration tests +6.
**Documentation**: Update transport documentation and examples + +### **Debugging Transport Issues** + +**Diagnostic Tools**: +- **Logging**: Comprehensive tracing at debug/trace levels +- **Metrics**: Built-in performance and error rate monitoring +- **State Inspection**: Runtime state examination capabilities +- **Error Analysis**: Rich error context with actionable information + +### **Performance Tuning** + +**Optimization Areas**: +- **Buffer Sizes**: Tune based on data characteristics and memory constraints +- **Thread Configuration**: Optimize worker thread count for I/O patterns +- **Connection Pooling**: Configure pool sizes based on workload patterns +- **Batching**: Optimize batch sizes for throughput vs. latency trade-offs + +This transport layer provides Feldera with a robust, high-performance, and extensible I/O foundation that enables reliable data processing at scale while maintaining the strong consistency guarantees required for incremental computation. \ No newline at end of file diff --git a/crates/datagen/CLAUDE.md b/crates/datagen/CLAUDE.md new file mode 100644 index 00000000000..ba5ca72d161 --- /dev/null +++ b/crates/datagen/CLAUDE.md @@ -0,0 +1,253 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build the datagen crate +cargo build -p datagen + +# Run tests +cargo test -p datagen + +# Use as library in other crates +cargo run --example generate_test_data -p datagen +``` + +## Architecture Overview + +### Technology Stack + +- **Random Generation**: Pseudo-random and deterministic data generation +- **Configurable**: Flexible data distribution and patterns +- **Performance**: High-throughput data generation +- **Testing**: Support for unit and integration testing + +### Core Purpose + +Data Generation provides **test data generation utilities**: + +- **Realistic Data**: Generate realistic test datasets +- **Configurable Distributions**: Control data patterns and distributions +- **Performance 
Testing**: High-volume data generation for benchmarks +- **Reproducible**: Deterministic generation for testing + +### Project Structure + +#### Core Module + +- `src/lib.rs` - Main data generation library and utilities + +## Important Implementation Details + +### Data Generation Framework + +#### Core Traits +```rust +pub trait DataGenerator<T> { + fn generate(&mut self) -> T; + fn generate_batch(&mut self, count: usize) -> Vec<T>; +} + +pub trait ConfigurableGenerator<C> { + fn new(config: C) -> Self; + fn with_seed(config: C, seed: u64) -> Self; +} +``` + +#### Generation Features +- **Seeded Random**: Deterministic generation with seeds +- **Batch Generation**: Efficient bulk data creation +- **Custom Distributions**: Normal, uniform, zipf, and custom distributions +- **Correlated Data**: Generate related data with realistic correlations + +### Data Types Support + +#### Supported Types +- **Numeric**: Integers, floats with various distributions +- **Strings**: Random strings, names, emails, addresses +- **Dates**: Timestamps with realistic patterns +- **Geographic**: Coordinates, addresses, regions +- **Business**: Customer IDs, transaction data, product information + +#### Realistic Patterns +```rust +// Generate realistic customer data +pub struct CustomerGenerator { + name_gen: NameGenerator, + email_gen: EmailGenerator, + age_dist: NormalDistribution, +} + +impl CustomerGenerator { + pub fn generate_customer(&mut self) -> Customer { + let name = self.name_gen.generate(); + let email = self.email_gen.generate_from_name(&name); + let age = self.age_dist.sample(); + + Customer { name, email, age } + } +} +``` + +### Performance Optimization + +#### High-Throughput Generation +- **Batch Processing**: Generate data in efficient batches +- **Memory Pooling**: Reuse memory allocations +- **SIMD**: Vector operations for numeric generation +- **Caching**: Cache expensive computations + +#### Memory Management +- **Zero-Copy**: Minimize data copying where possible +- 
**Streaming**: Generate data on-demand without storing +- **Bounded Memory**: Control memory usage for large datasets +- **Cleanup**: Automatic resource cleanup + +## Development Workflow + +### For New Data Types + +1. Define data structure and generation parameters +2. Implement `DataGenerator` trait +3. Add realistic distribution patterns +4. Add configuration options +5. Test with various parameters +6. Add performance benchmarks + +### For Custom Distributions + +1. Implement distribution algorithm +2. Add statistical validation +3. Test distribution properties +4. Document distribution parameters +5. Add examples and use cases +6. Benchmark performance characteristics + +### Testing Strategy + +#### Correctness Testing +- **Distribution Testing**: Validate statistical properties +- **Boundary Testing**: Test edge cases and limits +- **Correlation Testing**: Verify data relationships +- **Determinism**: Test reproducible generation + +#### Performance Testing +- **Throughput**: Measure generation rate +- **Memory Usage**: Monitor memory consumption +- **Scaling**: Test with various data sizes +- **Efficiency**: Compare with baseline implementations + +### Configuration System + +#### Generation Configuration +```rust +pub struct DataGenConfig { + pub seed: Option<u64>, + pub batch_size: usize, + pub distribution: DistributionType, + pub correlation_strength: f64, +} + +pub enum DistributionType { + Uniform { min: f64, max: f64 }, + Normal { mean: f64, std_dev: f64 }, + Zipf { exponent: f64 }, + Custom(Box<dyn Distribution>), +} +``` + +#### Flexible Configuration +- **Multiple Sources**: Code, files, environment variables +- **Validation**: Validate configuration parameters +- **Defaults**: Sensible default values +- **Documentation**: Document all configuration options + +### Usage Examples + +#### Basic Generation +```rust +use datagen::{DataGenerator, UniformIntGenerator}; + +let mut gen = UniformIntGenerator::new(1, 100); +let values: Vec<i32> = gen.generate_batch(1000); +``` + +#### 
Realistic Data +```rust +use datagen::{CustomerGenerator, TransactionGenerator}; + +let mut customer_gen = CustomerGenerator::with_seed(42); +let mut transaction_gen = TransactionGenerator::new(); + +for _ in 0..1000 { + let customer = customer_gen.generate(); + let transactions = transaction_gen.generate_for_customer(&customer, 1..10); +} +``` + +### Configuration Files + +- `Cargo.toml` - Dependencies for random generation and statistics +- Minimal external dependencies for broad compatibility + +### Dependencies + +#### Core Dependencies +- `rand` - Random number generation +- `rand_distr` - Statistical distributions +- `chrono` - Date/time generation +- `uuid` - UUID generation + +### Best Practices + +#### Generation Design +- **Realistic**: Generate data that resembles real-world patterns +- **Configurable**: Allow customization of generation parameters +- **Deterministic**: Support seeded generation for reproducible tests +- **Efficient**: Optimize for high-throughput generation + +#### API Design +- **Trait-Based**: Use traits for extensible generation +- **Batch-Friendly**: Support efficient batch generation +- **Memory-Aware**: Consider memory usage patterns +- **Error-Free**: Avoid panics in generation code + +#### Testing Integration +- **Test Utilities**: Provide utilities for common testing patterns +- **Fixtures**: Generate standard test fixtures +- **Scenarios**: Support various testing scenarios +- **Validation**: Include data validation utilities + +### Integration Patterns + +#### With Testing Frameworks +```rust +#[cfg(test)] +mod tests { + use datagen::TestDataGenerator; + + #[test] + fn test_with_generated_data() { + let data = TestDataGenerator::new() + .with_size(100) + .with_pattern(Pattern::Realistic) + .generate(); + + assert!(validate_data(&data)); + } +} +``` + +#### With Benchmarks +```rust +fn bench_processing(b: &mut Bencher) { + let data = generate_benchmark_data(10_000); + b.iter(|| { + process_data(&data) + }); +} +``` + +This 
crate provides essential data generation capabilities that support testing, benchmarking, and development across the entire Feldera platform. \ No newline at end of file diff --git a/crates/dbsp/CLAUDE.md b/crates/dbsp/CLAUDE.md new file mode 100644 index 00000000000..24d04e71560 --- /dev/null +++ b/crates/dbsp/CLAUDE.md @@ -0,0 +1,218 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build the DBSP crate +cargo build -p dbsp + +# Run tests +cargo test -p dbsp + +# Run specific test +cargo test -p dbsp test_name + +# Run benchmarks +cargo bench -p dbsp + +# Run examples +cargo run --example degrees +cargo run --example tutorial1 +``` + +### Development Tools + +```bash +# Check with clippy +cargo clippy -p dbsp + +# Format code +cargo fmt -p dbsp + +# Generate documentation +cargo doc -p dbsp --open + +# Run with memory sanitizer (requires nightly) +cargo +nightly test -p dbsp --target x86_64-unknown-linux-gnu -Zbuild-std --features=sanitizer +``` + +## Architecture Overview + +### Technology Stack + +- **Language**: Rust with advanced type system features +- **Concurrency**: Multi-threaded execution with worker pools +- **Memory Management**: Custom allocators with mimalloc integration +- **Serialization**: rkyv for zero-copy serialization +- **Testing**: Extensive property-based testing with proptest + +### Core Concepts + +DBSP is a computational engine for **incremental computation** on changing datasets: + +- **Incremental Processing**: Changes propagate in time proportional to change size, not dataset size +- **Stream Processing**: Continuous analysis of changing data +- **Differential Computation**: Maintains differences between consecutive states +- **Circuit Model**: Computation expressed as circuits of operators + +### Project Structure + +#### Core Directories + +- `src/circuit/` - Circuit infrastructure and runtime +- `src/operator/` - Computational operators (map, filter, join, aggregate, etc.) 
+- `src/trace/` - Data structures for storing and indexing collections +- `src/algebra/` - Mathematical abstractions (lattices, groups, etc.) +- `src/dynamic/` - Dynamic typing system for runtime flexibility +- `src/storage/` - Persistent storage backend +- `src/time/` - Timestamp and ordering abstractions + +#### Key Components + +- **Runtime**: Circuit execution engine with scheduling +- **Operators**: Computational primitives (50+ operators) +- **Traces**: Indexed data structures for efficient querying +- **Handles**: Input/output interfaces for circuits + +## Important Implementation Details + +### Circuit Model + +```rust +use dbsp::{Runtime, OutputHandle, IndexedZSet}; + +// Create circuit with 2 worker threads +let (mut circuit, handles) = Runtime::init_circuit(2, |circuit| { + let (input, handle) = circuit.add_input_zset::<(String, i32), isize>(); + let output = input.map(|(k, v)| (k.clone(), v * 2)); + Ok((handle, output.output())) +})?; + +// Execute one step +circuit.step()?; +``` + +### Operator Categories + +#### Core Operators +- **Map/Filter**: Element-wise transformations +- **Aggregation**: Group-by and windowing operations +- **Join**: Various join algorithms (hash, indexed, asof) +- **Time Series**: Windowing and temporal operations + +#### Advanced Operators +- **Recursive**: Fixed-point computations +- **Dynamic**: Runtime-configurable operators +- **Communication**: Multi-worker coordination +- **Storage**: Persistent state management + +### Memory Management + +- **Custom Allocators**: mimalloc for performance +- **Zero-Copy**: rkyv serialization avoids allocations +- **Batch Processing**: Amortized allocation costs +- **Garbage Collection**: Automatic cleanup of unused data + +## Development Workflow + +### For Operator Development + +1. Define operator in `src/operator/` +2. Implement required traits (`Operator`, `StrictOperator`, etc.) +3. Add comprehensive tests in module +4. Add property tests with proptest +5. 
Update documentation and examples + +### For Algorithm Implementation + +1. Study existing patterns in `src/operator/` +2. Consider incremental vs non-incremental versions +3. Implement both typed and dynamic variants if needed +4. Add benchmarks for performance validation +5. Test with various data distributions + +### Testing Strategy + +#### Unit Tests +- Located alongside implementation code +- Test both correctness and incremental behavior +- Use `proptest` for property-based testing + +#### Integration Tests +- Complex multi-operator scenarios +- End-to-end pipeline testing +- Performance regression detection + +#### Benchmarks +- Located in `benches/` directory +- Real-world datasets (fraud detection, social networks) +- Performance tracking across versions + +### Performance Considerations + +#### Optimization Techniques +- **Batch Processing**: Process multiple elements together +- **Index Selection**: Choose appropriate data structures +- **Memory Layout**: Optimize for cache performance +- **Parallelization**: Leverage multi-core execution + +#### Profiling Tools +- Built-in performance monitoring +- CPU profiling integration +- Memory usage tracking +- Visual circuit graph generation + +### Configuration Files + +- `Cargo.toml` - Package configuration with extensive feature flags +- `Makefile.toml` - Build automation scripts +- `benches/` - Performance benchmark suite +- `proptest-regressions/` - Property test regression data + +### Key Features and Flags + +- `default` - Standard feature set +- `with-serde` - Serialization support +- `persistence` - Persistent storage backend +- `mimalloc` - High-performance allocator +- `tokio` - Async runtime integration + +### Dependencies + +#### Core Dependencies +- `rkyv` - Zero-copy serialization +- `crossbeam` - Lock-free data structures +- `hashbrown` - Fast hash maps +- `num-traits` - Numeric abstractions + +#### Development Dependencies +- `proptest` - Property-based testing +- `criterion` - Benchmarking 
framework +- `tempfile` - Temporary file management + +### Advanced Features + +#### Dynamic Typing +- Runtime type flexibility +- Operator composition at runtime +- Schema evolution support + +#### Storage Backend +- Persistent state management +- Checkpoint/recovery mechanisms +- File-based and cloud storage options + +#### Multi-threading +- Automatic work distribution +- Lock-free data structures +- NUMA-aware scheduling + +### Best Practices + +- **Incremental First**: Always consider incremental behavior +- **Type Safety**: Leverage Rust's type system for correctness +- **Performance**: Profile before optimizing +- **Testing**: Property tests for complex invariants +- **Documentation**: Comprehensive examples and tutorials \ No newline at end of file diff --git a/crates/dbsp/src/CLAUDE.md b/crates/dbsp/src/CLAUDE.md new file mode 100644 index 00000000000..125c43a3276 --- /dev/null +++ b/crates/dbsp/src/CLAUDE.md @@ -0,0 +1,427 @@ +## Overview + +The `crates/dbsp/src/` directory contains the core implementation of DBSP (Database Stream Processor), a computational engine for incremental computation on changing datasets. DBSP enables processing changes in time proportional to the size of changes rather than the entire dataset, making it ideal for continuous analysis of large, frequently-changing data. + +## Architecture Overview + +### **Computational Model** + +DBSP is based on a formal theoretical foundation that provides: + +1. **Semantics**: Formal language of streaming operators with precise stream transformation specifications +2. 
**Algorithm**: Incremental dataflow program generation that processes input events proportional to input size rather than database state size + +### **Core Design Principles** + +- **Incremental Processing**: Changes propagate through circuits in time proportional to change size, not dataset size +- **Stream Processing**: Continuous analysis of changing data through operator circuits +- **Differential Computation**: Maintains differences between consecutive states using mathematical foundations +- **Circuit Model**: Computation expressed as circuits of interconnected operators with formal semantics + +## Directory Structure Overview + +### **Direct Subdirectories of `dbsp/src/`** + +#### **`algebra/`** +Mathematical foundations and algebraic structures underlying DBSP's incremental computation model. Contains abstract algebraic concepts including monoids, groups, rings, and lattices that provide the theoretical basis for change propagation and differential computation. The `zset/` subdirectory implements Z-sets (collections with multiplicities) which are fundamental to representing insertions and deletions in incremental computation. Includes specialized number types (`checked_int.rs`, `floats.rs`) and ordering abstractions (`order.rs`, `lattice.rs`) essential for maintaining mathematical correctness in streaming operations. + +#### **`circuit/`** +Core circuit infrastructure providing the runtime execution engine for DBSP computations. Contains circuit construction (`circuit_builder.rs`), execution control (`runtime.rs`), and scheduling (`schedule/`) components. The `dbsp_handle.rs` provides the main API for multi-worker circuit execution, while `checkpointer.rs` handles state persistence for fault tolerance. Includes performance monitoring (`metrics.rs`), circuit introspection (`trace.rs`), and integration with async runtimes (`tokio.rs`). The scheduler supports dynamic work distribution with NUMA-aware thread management. 
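The incremental-processing principle described above (work proportional to the size of the change, not the dataset) can be illustrated with a minimal std-only toy. This is not the actual DBSP API; `IncrementalSum` and its methods are invented for illustration, but the shape of the computation — a maintained state updated from a batch of weighted changes — mirrors what circuits do on each `step()`.

```rust
use std::collections::HashMap;

/// Toy incremental sum-by-key. The maintained state (`totals`) is updated
/// from a batch of (key, weight) changes; only changed keys are touched.
struct IncrementalSum {
    totals: HashMap<String, i64>,
}

impl IncrementalSum {
    fn new() -> Self {
        Self { totals: HashMap::new() }
    }

    /// Apply a delta: cost is O(|delta|), independent of |totals|.
    /// Positive weights are insertions, negative weights retractions.
    fn step(&mut self, delta: &[(String, i64)]) {
        for (key, weight) in delta {
            let entry = self.totals.entry(key.clone()).or_insert(0);
            *entry += weight;
            if *entry == 0 {
                // Keys whose weight cancels to zero leave the collection.
                self.totals.remove(key);
            }
        }
    }
}

fn main() {
    let mut sum = IncrementalSum::new();
    sum.step(&[("a".into(), 3), ("b".into(), 1)]);
    sum.step(&[("a".into(), -3)]); // retraction cancels "a"
    assert_eq!(sum.totals.get("b"), Some(&1));
    assert!(!sum.totals.contains_key("a"));
    println!("{:?}", sum.totals);
}
```

The same retraction-based update is what the real operators perform over Z-sets, with the scheduler in `circuit/` deciding when each operator's `step` runs.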
+ +#### **`dynamic/`** +Dynamic typing system that enables runtime flexibility while maintaining performance. Implements trait object architecture to avoid excessive monomorphization during compilation. Contains core dynamic types (`data.rs`, `pair.rs`, `vec.rs`), serialization support (`rkyv.rs`), and factory patterns (`factory.rs`) for creating trait objects. The `erase.rs` module handles type erasure, while `downcast.rs` provides safe downcasting mechanisms. Essential for SQL compiler integration where concrete types are not known at compile time. + +#### **`monitor/`** +Circuit monitoring and visualization tools for debugging and performance analysis. Provides circuit graph generation (`circuit_graph.rs`) for visual representation of operator connectivity and data flow. The `visual_graph.rs` module creates GraphViz-compatible output for circuit visualization. Essential for understanding complex circuit behavior, debugging performance bottlenecks, and validating circuit construction correctness during development. + +#### **`operator/`** +Complete implementation of DBSP's operator library containing 50+ streaming operators. Organized into basic operators (`apply.rs`, `filter_map.rs`, `plus.rs`), aggregation operators (`aggregate.rs`, `group/`), join operators (`join.rs`, `asof_join.rs`, `semijoin.rs`), and specialized operators (`time_series/`, `recursive.rs`). The `dynamic/` subdirectory provides dynamically-typed versions of all operators for runtime flexibility. Communication operators (`communication/`) handle multi-worker data distribution. Input/output handling (`input.rs`, `output.rs`) provides type-safe interfaces for data ingestion and emission. + +#### **`profile/`** +Performance profiling and monitoring infrastructure for circuit execution analysis. Contains CPU profiling support (`cpu.rs`) with integration to standard Rust profiling tools. Provides DBSP-specific performance counters, memory usage tracking, and execution time analysis. 
Essential for performance optimization, identifying bottlenecks, and ensuring efficient resource utilization in production deployments. + +#### **`storage/`** +Multi-tier persistent storage system supporting various backend implementations. The `backend/` directory contains storage abstractions with memory (`memory_impl.rs`) and POSIX file system (`posixio_impl.rs`) implementations. Buffer caching (`buffer_cache/`) provides intelligent memory management with LRU eviction policies. File format handling (`file/`) implements zero-copy serialization with rkyv, while `dirlock/` provides file system synchronization. Supports both memory-optimized and disk-based storage for different deployment scenarios. + +#### **`time/`** +Temporal abstractions and time-based computation support for streaming operations. Implements timestamp types and ordering relationships (`antichain.rs`, `product.rs`) essential for maintaining temporal consistency in incremental computation. Provides the foundation for time-based operators, windowing operations, and watermark handling. Critical for ensuring correct temporal semantics in streaming analytics and maintaining consistency across distributed computations. + +#### **`trace/`** +Core data structures for storing and indexing collections with efficient access patterns. Contains batch and trace implementations (`ord/`) supporting both in-memory and file-based storage. The `cursor/` subdirectory provides iteration interfaces, while `layers/` implements hierarchical data organization. Spine-based temporal organization (`spine_async/`) enables efficient merge operations for time-ordered data. Supports various specialized trace types including key-value batches, weighted sets, and indexed collections optimized for different query patterns. + +#### **`utils/`** +Utility functions and helper modules supporting core DBSP functionality. 
Contains sorting algorithms (`sort.rs`, `unstable_sort.rs`), data consolidation (`consolidation/`), and specialized data structures. Tuple generation (`tuple/`) provides compile-time tuple creation, while `vec_ext.rs` extends vector functionality. The `sample.rs` module implements sampling algorithms for statistical operations. Includes property-based testing utilities and performance optimization helpers. + +## Module Architecture + +### **Core Infrastructure Layers** + +#### **1. Dynamic Dispatch System** (`dynamic/`) +**Purpose**: Provides dynamic typing to limit monomorphization and balance compilation speed with runtime performance. + +**Key Components**: +```rust +// Core trait hierarchy for dynamic dispatch +trait Data: Clone + Eq + Ord + Hash + SizeOf + Send + Sync + Debug + 'static {} +trait DataTrait: DowncastTrait + ClonableTrait + ArchiveTrait + Data {} + +// Factory pattern for creating trait objects +// (simplified sketch; the real trait is generic over the erased type) +trait Factory<T: ?Sized> { + fn default_box(&self) -> Box<T>; +} +``` + +**Dynamic Type System**: +- `DynData`: Dynamically typed data with concrete type erasure +- `DynPair`: Dynamic tuples for key-value relationships +- `DynVec`: Dynamic vectors with efficient batch operations +- `DynWeightedPairs`: Collections of weighted data for incremental computation + +**Safety Considerations**: Uses unsafe code for performance by eliding TypeId checks in release builds while maintaining type safety through careful API design. + +#### **2. Algebraic Foundations** (`algebra/`) +**Purpose**: Mathematical abstractions underlying incremental computation.
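The group structure at the heart of `algebra/` is what makes retraction cheap: monoid addition folds a batch of changes, and group negation turns deletion into ordinary addition. A toy, self-contained illustration (simplified trait names echoing the crate's, not its real definitions):

```rust
// Hypothetical minimal versions of the algebraic traits, for illustration.
trait HasZero { fn zero() -> Self; }
trait MonoidValue: HasZero { fn add(&self, other: &Self) -> Self; }
trait GroupValue: MonoidValue { fn neg(&self) -> Self; }

impl HasZero for i64 { fn zero() -> Self { 0 } }
impl MonoidValue for i64 { fn add(&self, other: &Self) -> Self { self + other } }
impl GroupValue for i64 { fn neg(&self) -> Self { -self } }

// Fold a batch of changes with monoid addition.
fn accumulate<T: GroupValue>(changes: &[T]) -> T {
    changes.iter().fold(T::zero(), |acc, c| acc.add(c))
}

fn main() {
    let total = accumulate(&[3i64, 4, 5]); // 12
    // Retracting the "4" is just adding its group inverse:
    // no recomputation over the full input is needed.
    let after_retraction = total.add(&4i64.neg()); // 8
    println!("{after_retraction}");
}
```

The same pattern scales from scalar weights to whole Z-sets, which is why every incremental operator can be expressed in terms of these algebraic operations.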
+ +**Core Algebraic Structures**: +```rust +// Trait hierarchy for mathematical structures +trait SemigroupValue: Clone + Eq + SizeOf + AddByRef + 'static {} +trait MonoidValue: SemigroupValue + HasZero {} +trait GroupValue: MonoidValue + Neg + NegByRef {} +trait RingValue: GroupValue + Mul + MulByRef + HasOne {} +``` + +**Z-Set Implementation** (`algebra/zset/`): +- **ZSet**: Collections with multiplicities (positive/negative weights) +- **IndexedZSet**: Indexed collections for efficient lookups and joins +- **Mathematical Operations**: Addition, subtraction, multiplication following group theory +- **Change Propagation**: Differential computation using +/- weights for insertions/deletions + +#### **3. Trace System** (`trace/`) +**Purpose**: Core data structures for storing and indexing collections with time-based organization. + +**Batch and Trace Abstractions**: +```rust +trait BatchReader { + type Key: DBData; + type Val: DBData; + type Time: Timestamp; + type Diff: MonoidValue; + + fn cursor(&self) -> Self::Cursor; +} + +trait Trace: BatchReader { + fn append_batch(&mut self, batch: &Self::Batch); + fn map_batches(&self, f: F); +} +``` + +**Storage Implementations**: +- **In-Memory**: `VecBatch`, `OrdBatch` for fast access +- **File-Based**: `FileBatch` for persistent storage with memory efficiency +- **Fallback**: `FallbackBatch` combining memory and file storage +- **Spine**: Efficient merge trees for temporal data organization + +#### **4. Circuit Infrastructure** (`circuit/`) +**Purpose**: Runtime execution engine with scheduling, state management, and multi-threading. 
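Conceptually, a circuit runs in discrete steps: each step ingests one batch of input deltas, evaluates every operator once in topological order, and emits output deltas. A toy, dependency-free sketch of that loop (for intuition only, not the real `dbsp` API):

```rust
// Toy "circuit": a chain of delta-processing operators evaluated once per step.
type Delta = Vec<i64>;

struct Circuit {
    operators: Vec<Box<dyn FnMut(Delta) -> Delta>>,
}

impl Circuit {
    // One step: push a batch of input changes through every operator in order.
    fn step(&mut self, input: Delta) -> Delta {
        self.operators.iter_mut().fold(input, |delta, op| op(delta))
    }
}

fn main() {
    let mut circuit = Circuit {
        operators: vec![
            Box::new(|d: Delta| d.into_iter().map(|x| x * 2).collect()),      // map
            Box::new(|d: Delta| d.into_iter().filter(|x| *x > 2).collect()),  // filter
        ],
    };
    // Each call processes only the new delta, not the full input history.
    println!("{:?}", circuit.step(vec![1, 2, 3])); // map -> [2, 4, 6], filter -> [4, 6]
}
```

The real runtime adds multi-worker scheduling, state (traces) behind stateful operators, and checkpointing, but the per-step delta-in/delta-out contract is the same.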
+ +**Circuit Execution Model**: +```rust +trait Operator { + fn eval(&mut self); + fn commit(&mut self); + fn get_metadata(&self) -> OperatorMeta; +} + +// Circuit construction and execution +let (circuit, handles) = Runtime::init_circuit(workers, |circuit| { + let input = circuit.add_input_zset::(); + let output = input.map(|r| transform(r)); + Ok((input_handle, output.output())) +})?; +``` + +**Key Circuit Components**: +- **Runtime**: Multi-worker execution with NUMA-aware scheduling +- **Scheduler**: Dynamic scheduling with work-stealing parallelization +- **Handles**: Type-safe input/output interfaces (`InputHandle`, `OutputHandle`) +- **Checkpointing**: State persistence for fault tolerance +- **Metrics**: Performance monitoring and circuit introspection + +## Operator System + +### **Operator Categories and Implementations** + +#### **Core Transformation Operators** (`operator/`) + +**Basic Operators**: +```rust +// Element-wise transformations +.map(|x| f(x)) // Map operator +.filter(|x| p(x)) // Filter operator +.filter_map(|x| f(x)) // Combined filter and map + +// Arithmetic operations +.plus(&other) // Set union with weight addition +.minus(&other) // Set difference with weight subtraction +``` + +**Aggregation Operators**: +```rust +// Group-by aggregation with incremental maintenance +.aggregate_generic::( + |k, v| key_func(k, v), // Key extraction + |acc, k, v, w| agg_func(acc, k, v, w), // Aggregation function +) + +// Window aggregation +.window_aggregate(window_spec, agg_func) +``` + +**Join Operators**: +```rust +// Hash join with incremental updates +.join::(&other, |k, v1, v2| result(k, v1, v2)) + +// Index join for efficient lookups +.index_join(&indexed_stream, join_key) + +// As-of join for temporal relationships +.asof_join(&other, time_key, join_condition) +``` + +#### **Advanced Operators** + +**Recursive Computation**: +```rust +// Fixed-point iteration for recursive queries +circuit.recursive(|child| { + let (feedback, output) = 
child.add_feedback(z_set_factory); + let result = base_case.plus(&feedback.delay().recursive_step()); + feedback.connect(&result); + Ok(output) +}) +``` + +**Time Series Operators** (`operator/time_series/`): +- **Windowing**: Time-based and row-based windows with expiration +- **Watermarks**: Late data handling and progress tracking +- **Rolling Aggregates**: Efficient sliding window computations +- **Range Queries**: Time range-based data retrieval + +**Communication Operators** (`operator/communication/`): +- **Exchange**: Data redistribution across workers +- **Gather**: Collecting distributed results +- **Shard**: Partitioning data for parallel processing + +### **Dynamic vs Static APIs** + +#### **Static API** (Top-level `operator/` modules) +```rust +// Type-safe, compile-time checked operations +let result: Stream<_, OrdZSet<(String, i32), isize>> = input + .map(|(k, v)| (k.to_uppercase(), v * 2)) + .filter(|(_, v)| *v > 0); +``` + +#### **Dynamic API** (`operator/dynamic/`) +```rust +// Runtime-flexible operations with dynamic typing +let result = input_dyn + .map_generic(&map_factory, &output_factory) + .filter_generic(&filter_factory); +``` + +**Trade-offs**: +- **Static**: Better performance, compile-time type checking, less flexibility +- **Dynamic**: Faster compilation, runtime flexibility, SQL compiler integration + +## Storage Architecture + +### **Multi-Tier Storage System** (`storage/`) + +#### **Storage Backends** (`storage/backend/`) +```rust +trait StorageBackend { + fn create_file(&self, path: &StoragePath) -> Result; + fn open_file(&self, path: &StoragePath) -> Result; +} + +// Implementations: +// - MemoryBackend: In-memory for testing and small datasets +// - PosixBackend: File system storage for production +``` + +#### **Buffer Cache System** (`storage/buffer_cache/`) +- **LRU Caching**: Intelligent buffer management with configurable cache sizes +- **Async I/O**: Non-blocking file operations with efficient prefetching +- **Memory 
Mapping**: Direct memory access for large files when beneficial +- **Cache Statistics**: Performance monitoring and cache hit rate tracking + +#### **File Format and Serialization** (`storage/file/`) +```rust +// Zero-copy serialization with rkyv +trait Serializable { + fn serialize(&self, serializer: &mut S) -> Result<(), S::Error>; +} + +// File-based batch implementations +FileBatch { + reader: FileReader, + metadata: BatchMetadata, + cache_stats: CacheStats, +} +``` + +## Performance Architecture + +### **Multi-Threading and Parallelization** + +#### **Worker-Based Execution Model**: +```rust +// Multi-worker runtime with work-stealing scheduler +let workers = 8; +let (circuit, handles) = Runtime::init_circuit(workers, circuit_constructor)?; + +// Automatic work distribution across operators +// NUMA-aware memory allocation +// Lock-free data structures for synchronization +``` + +#### **Operator Parallelization**: +- **Embarrassingly Parallel**: Map, filter operations across workers +- **Hash-Based Partitioning**: Join operations with consistent hashing +- **Pipeline Parallelism**: Different operators executing concurrently +- **Batch Processing**: Amortized costs through efficient batching + +### **Memory Management Optimization** + +#### **Custom Allocation Strategies**: +```rust +// mimalloc integration for high-performance allocation +#[cfg(feature = "mimalloc")] +use mimalloc::MiMalloc; +#[global_allocator] +static GLOBAL: MiMalloc = MiMalloc; +``` + +#### **Zero-Copy Techniques**: +- **rkyv Serialization**: Zero-copy deserialization for file I/O +- **Reference Sharing**: `Rc`/`Arc` for immutable data sharing +- **In-Place Updates**: Mutation when safe for performance +- **Batch Reuse**: Buffer recycling to minimize allocations + +### **Optimization Techniques** + +#### **Data Structure Selection**: +- **Indexed Collections**: B-trees for sorted data with range queries +- **Hash Tables**: Hash maps for point lookups and equi-joins +- **Vectors**: 
Sequential access patterns with cache efficiency +- **Sparse Representations**: Efficient storage for sparse datasets + +#### **Algorithm Optimization**: +- **Incremental Algorithms**: Differential computation for all operators +- **Index Selection**: Automatic index creation for query optimization +- **Lazy Evaluation**: Defer computation until results are needed +- **Batch Consolidation**: Merge operations for efficiency + +## Testing and Validation + +### **Testing Architecture** + +#### **Property-Based Testing**: +```rust +use proptest::prelude::*; + +proptest! { + #[test] + fn test_incremental_correctness( + initial_data in vec((any::(), any::()), 0..100), + updates in vec((any::(), any::()), 0..50) + ) { + // Verify incremental result matches batch computation + let incremental_result = incremental_computation(&initial_data, &updates); + let batch_result = batch_computation(&[&initial_data[..], &updates[..]].concat()); + prop_assert_eq!(incremental_result, batch_result); + } +} +``` + +#### **Integration Testing**: +- **End-to-End Pipelines**: Complete circuit execution validation +- **Correctness Verification**: Incremental vs batch result comparison +- **Performance Regression**: Benchmark tracking across versions +- **Fault Tolerance**: Checkpoint/recovery scenario testing + +### **Debugging and Profiling** + +#### **Circuit Introspection** (`monitor/`): +```rust +// Visual circuit graph generation +circuit.generate_graphviz(&mut output); + +// Performance metrics collection +let metrics = circuit.gather_metrics(); +println!("Operator throughput: {} records/sec", metrics.throughput); +``` + +#### **Profiling Integration** (`profile/`): +- **CPU Profiling**: Integration with standard Rust profiling tools +- **Memory Analysis**: Allocation tracking and memory usage patterns +- **Custom Metrics**: DBSP-specific performance counters +- **Tracing Support**: Distributed tracing for complex circuits + +## Development Patterns and Best Practices + +### **Operator 
Development Guidelines** + +#### **Implementing New Operators**: +1. **Define Traits**: Specify operator behavior through trait definitions +2. **Static Implementation**: Create type-safe static version first +3. **Dynamic Wrapper**: Add dynamic dispatch for SQL compiler integration +4. **Testing**: Comprehensive unit tests with property-based validation +5. **Documentation**: Examples and performance characteristics + +#### **Performance Optimization**: +- **Profile First**: Use built-in profiling before optimizing +- **Batch Processing**: Prefer batch operations over single-record processing +- **Memory Layout**: Consider cache-friendly data arrangements +- **Incremental Logic**: Ensure algorithms process only changes when possible + +### **Integration Points** + +#### **SQL Compiler Integration**: +- Dynamic operators provide runtime flexibility for compiled SQL +- Factory pattern enables type-erased circuit construction +- Metadata system supports query optimization and introspection + +#### **Storage System Integration**: +- Pluggable storage backends for different deployment scenarios +- Checkpointing support for fault-tolerant long-running computations +- File-based batches for memory-efficient large dataset processing + +### **Error Handling Patterns** + +#### **Structured Error Types**: +```rust +#[derive(Debug, Error)] +pub enum Error { + #[error("Storage error: {0}")] + Storage(#[from] StorageError), + + #[error("Runtime error: {0}")] + Runtime(#[from] RuntimeError), + + #[error("Scheduler error: {0}")] + Scheduler(#[from] SchedulerError), +} +``` + +#### **Recovery Strategies**: +- **Graceful Degradation**: Continue processing when possible +- **State Restoration**: Checkpoint-based recovery for critical errors +- **Error Propagation**: Structured error context through computation stack + +This DBSP source code represents a sophisticated computational engine that successfully implements incremental computation theory in a high-performance, 
production-ready system. The modular architecture, extensive use of Rust's type system, and careful performance optimization create a powerful foundation for streaming analytics and incremental view maintenance. \ No newline at end of file diff --git a/crates/fda/CLAUDE.md b/crates/fda/CLAUDE.md new file mode 100644 index 00000000000..e7e9e6cf63d --- /dev/null +++ b/crates/fda/CLAUDE.md @@ -0,0 +1,303 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build the fda crate +cargo build -p fda + +# Run the FDA CLI tool +cargo run -p fda + +# Run interactive shell +cargo run -p fda -- shell + +# Run benchmarks +cargo run -p fda -- bench + +# Test bash integration +./test.bash +``` + +## Architecture Overview + +### Technology Stack + +- **CLI Framework**: Command-line interface for development tasks +- **Interactive Shell**: REPL-style development environment +- **Benchmarking**: Performance testing utilities +- **API Integration**: OpenAPI specification and testing + +### Core Purpose + +FDA provides **development tools and utilities** for Feldera: + +- **CLI Commands**: Development workflow automation +- **Interactive Shell**: Exploratory development environment +- **Benchmarking Tools**: Performance measurement and testing +- **API Testing**: OpenAPI specification validation + +### Project Structure + +#### Core Modules + +- `src/main.rs` - CLI entry point and argument parsing +- `src/cli.rs` - Command-line interface implementation +- `src/shell.rs` - Interactive shell and REPL +- `src/bench/` - Benchmarking utilities and API +- `src/adhoc.rs` - Ad-hoc development utilities + +## Important Implementation Details + +### CLI Interface + +#### Command Structure +```rust +#[derive(Parser)] +pub enum Command { + Shell, + Bench(BenchArgs), + Adhoc(AdhocArgs), +} +``` + +#### CLI Features +- **Subcommands**: Organized development tasks +- **Interactive Mode**: Shell-like development environment +- **Configuration**: Flexible 
configuration options +- **Help System**: Comprehensive help and documentation + +### Interactive Shell + +#### Shell Features +```rust +pub struct Shell { + history: Vec, + context: ShellContext, +} + +impl Shell { + pub fn run(&mut self) -> Result<(), ShellError>; + pub fn execute_command(&mut self, cmd: &str) -> Result; +} +``` + +#### Shell Capabilities +- **Command History**: Navigate previous commands +- **Tab Completion**: Smart command completion +- **Context Awareness**: Maintain development context +- **Script Execution**: Run shell scripts + +### Benchmarking System + +#### Benchmark API +```rust +pub struct BenchmarkSuite { + tests: Vec, + config: BenchConfig, +} + +impl BenchmarkSuite { + pub fn run(&self) -> BenchmarkResults; + pub fn add_test(&mut self, test: BenchmarkTest); +} +``` + +#### Benchmarking Features +- **Performance Testing**: Measure execution time and throughput +- **Statistical Analysis**: Confidence intervals and variance +- **Comparison**: Compare against baseline performance +- **Reporting**: Detailed benchmark reports + +### OpenAPI Integration + +#### API Testing +```rust +pub struct ApiTester { + spec: OpenApiSpec, + client: HttpClient, +} + +impl ApiTester { + pub fn validate_spec(&self) -> Result<(), ValidationError>; + pub fn test_endpoints(&self) -> Result; +} +``` + +#### API Features +- **Spec Validation**: OpenAPI specification validation +- **Endpoint Testing**: Automated endpoint testing +- **Schema Validation**: Request/response schema validation +- **Documentation Generation**: API documentation updates + +## Development Workflow + +### Using the CLI + +#### Development Tasks +```bash +# Start interactive development session +fda shell + +# Run performance benchmarks +fda bench --suite performance + +# Execute ad-hoc development tasks +fda adhoc --task generate_test_data +``` + +#### Shell Commands +```bash +# In interactive shell +> help # Show available commands +> bench list # List available benchmarks +> api 
validate # Validate API specifications +> test run integration # Run integration tests +``` + +### For New CLI Commands + +1. Add command to `Command` enum in `src/cli.rs` +2. Implement command handler function +3. Add argument parsing with clap +4. Add help text and examples +5. Test command with various inputs +6. Update documentation + +### For New Benchmarks + +1. Add benchmark test to `src/bench/` +2. Define benchmark parameters and metrics +3. Implement benchmark execution logic +4. Add statistical analysis and reporting +5. Test with various workloads +6. Document benchmark purpose and interpretation + +### Testing Strategy + +#### CLI Testing +```bash +# Test CLI commands +./test.bash + +# Test interactive shell +printf 'help\nquit\n' | fda shell + +# Test benchmarks +fda bench --dry-run +``` + +#### Integration Testing +- **End-to-End**: Complete workflow testing +- **API Integration**: Test with real Feldera services +- **Performance**: Validate benchmark accuracy +- **Error Handling**: Test error conditions + +### Configuration + +#### CLI Configuration +```rust +pub struct FdaConfig { + pub default_endpoint: String, + pub benchmark_config: BenchConfig, + pub shell_config: ShellConfig, +} +``` + +#### Configuration Sources +- **Config Files**: TOML configuration files +- **Environment Variables**: Runtime configuration +- **Command Line**: Override configuration options +- **Interactive**: Set options in shell mode + +### Build Configuration + +#### Build Script Integration +```rust +// build.rs +fn main() { + // Generate CLI completion scripts + generate_completions(); + + // Build benchmark assets + build_bench_assets(); +} +``` + +#### Features +- **Completion Scripts**: Shell completion generation +- **Asset Bundling**: Embed benchmark data +- **OpenAPI Integration**: API specification handling + +### Configuration Files + +- `Cargo.toml` - CLI and benchmarking dependencies +- `build.rs` - Build-time asset generation +- `test.bash` - Integration test
script +- `bench_openapi.json` - OpenAPI specification for testing + +### Dependencies + +#### Core Dependencies +- `clap` - Command-line argument parsing +- `rustyline` - Interactive readline support +- `serde` - Configuration serialization +- `tokio` - Async runtime for API calls + +#### Benchmarking Dependencies +- `criterion` - Statistical benchmarking +- `reqwest` - HTTP client for API testing +- `openapi` - OpenAPI specification handling + +### Best Practices + +#### CLI Design +- **Consistent Interface**: Follow standard CLI conventions +- **Helpful Errors**: Provide actionable error messages +- **Progressive Disclosure**: Simple defaults, advanced options +- **Documentation**: Comprehensive help and examples + +#### Interactive Shell +- **User-Friendly**: Intuitive commands and feedback +- **Discoverable**: Tab completion and help system +- **Persistent**: Save history and context +- **Scriptable**: Support for automation + +#### Benchmarking +- **Statistical Rigor**: Proper statistical analysis +- **Reproducible**: Consistent benchmark conditions +- **Meaningful Metrics**: Relevant performance indicators +- **Comparative**: Easy comparison across versions + +### Usage Examples + +#### Development Workflow +```bash +# Start development session +fda shell + +# Run quick benchmark +> bench quick + +# Validate API changes +> api validate --endpoint http://localhost:8080 + +# Generate test data +> adhoc generate_data --size 1000 +``` + +#### Performance Testing +```bash +# Run full benchmark suite +fda bench --suite full --iterations 10 + +# Compare with baseline +fda bench --compare baseline.json + +# Profile specific operations +fda bench --profile pipeline_creation +``` + +This crate serves as the development companion tool, providing essential utilities and automation for Feldera development workflows. 
\ No newline at end of file diff --git a/crates/feldera-types/CLAUDE.md b/crates/feldera-types/CLAUDE.md new file mode 100644 index 00000000000..fbe54ca2f7c --- /dev/null +++ b/crates/feldera-types/CLAUDE.md @@ -0,0 +1,282 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build the feldera-types crate +cargo build -p feldera-types + +# Run tests +cargo test -p feldera-types + +# Check documentation +cargo doc -p feldera-types --open + +# Run with all features +cargo build -p feldera-types --all-features +``` + +## Architecture Overview + +### Technology Stack + +- **Serialization**: serde with JSON support +- **Type System**: Rust's type system with trait-based abstractions +- **Configuration**: Structured configuration types +- **Error Handling**: Comprehensive error types and conversions + +### Core Purpose + +Feldera Types provides **shared type definitions** and **configuration structures** used across the entire Feldera platform: + +- **Configuration Types**: Pipeline, connector, and transport configurations +- **Data Format Types**: Schema definitions for supported data formats +- **Error Types**: Standardized error handling across components +- **Transport Types**: Configuration for various transport mechanisms + +### Project Structure + +#### Core Modules + +- `src/config.rs` - Pipeline and runtime configuration +- `src/transport/` - Transport-specific configuration types +- `src/format/` - Data format configuration and schemas +- `src/error.rs` - Error types and conversions +- `src/query.rs` - Query and program schema definitions + +## Important Implementation Details + +### Configuration Architecture + +#### Pipeline Configuration +```rust +use feldera_types::config::PipelineConfig; + +let config = PipelineConfig { + workers: Some(4), + storage: Some(storage_config), + resources: Some(resource_limits), + ..Default::default() +}; +``` + +#### Transport Configuration +```rust +use 
feldera_types::transport::{KafkaInputConfig, HttpOutputConfig}; + +// Kafka input configuration +let kafka_config = KafkaInputConfig { + brokers: vec!["localhost:9092".to_string()], + topic: "input_topic".to_string(), + group_id: Some("consumer_group".to_string()), + ..Default::default() +}; +``` + +### Data Format Types + +#### Format Configuration +- **CSV**: Field delimiters, headers, escaping rules +- **JSON**: Schema validation, type coercion settings +- **Avro**: Schema registry integration, evolution policies +- **Parquet**: Compression, column selection, batch sizing + +#### Schema Definitions +```rust +use feldera_types::program_schema::{Field, SqlType, Relation}; + +let table_schema = Relation { + name: "users".to_string(), + fields: vec![ + Field { + name: "id".to_string(), + columntype: SqlType::Integer { nullable: false }, + case_sensitive: false, + }, + Field { + name: "name".to_string(), + columntype: SqlType::Varchar { nullable: true, precision: None }, + case_sensitive: false, + }, + ], + ..Default::default() +}; +``` + +### Transport Types + +#### Supported Transports +- **Kafka**: Broker configuration, topic settings, consumer groups +- **HTTP**: Endpoint URLs, authentication, rate limiting +- **File**: Path specifications, file formats, polling intervals +- **PostgreSQL**: Connection strings, table mappings, CDC settings +- **Delta Lake**: Storage locations, partition schemes, versioning + +### Error Handling + +#### Error Categories +- **Configuration Errors**: Invalid settings, missing required fields +- **Validation Errors**: Schema mismatches, type conflicts +- **Runtime Errors**: Transport failures, format conversion errors +- **System Errors**: Resource exhaustion, permission issues + +```rust +use feldera_types::error::DetailedError; + +// Structured error with context +let error = DetailedError::invalid_configuration( + "Invalid Kafka broker configuration", + Some("brokers field cannot be empty"), +); +``` + +## Development Workflow + 
+### For New Configuration Types + +1. Add configuration struct in appropriate module +2. Implement `Default`, `Serialize`, `Deserialize` traits +3. Add validation logic with `Validate` trait +4. Add comprehensive documentation with examples +5. Add unit tests for serialization/deserialization +6. Update dependent crates to use new configuration + +### For New Data Types + +1. Define type in appropriate module +2. Implement required traits (Clone, Debug, etc.) +3. Add serde support for JSON serialization +4. Add conversion methods to/from other representations +5. Add validation logic if needed +6. Test edge cases and error conditions + +### Testing Strategy + +#### Unit Tests +- Serialization round-trip testing +- Configuration validation testing +- Error message formatting +- Default value behavior + +#### Integration Tests +- Cross-crate compatibility +- Real configuration file parsing +- Error propagation across boundaries + +### Validation Framework + +The crate includes a validation framework for configuration: + +```rust +use feldera_types::config::Validate; + +impl Validate for MyConfig { + fn validate(&self) -> Result<(), DetailedError> { + if self.workers == 0 { + return Err(DetailedError::invalid_configuration( + "workers must be greater than 0", + None, + )); + } + Ok(()) + } +} +``` + +### Serialization Patterns + +#### JSON Serialization +- **Snake Case**: Field names use snake_case convention +- **Optional Fields**: Use `Option` for optional configuration +- **Default Values**: Implement sensible defaults +- **Validation**: Validate after deserialization + +#### Custom Serialization +```rust +use serde::{Deserialize, Serialize}; + +#[derive(Serialize, Deserialize, Clone, Debug)] +#[serde(tag = "type", content = "config")] +pub enum TransportConfig { + Kafka(KafkaConfig), + Http(HttpConfig), + File(FileConfig), +} +``` + +### Configuration Files + +- `Cargo.toml` - Minimal dependencies for type definitions +- Feature flags for optional functionality 
+- Version compatibility across Feldera components + +### Key Design Principles + +- **Backward Compatibility**: Schema evolution without breaking changes +- **Type Safety**: Leverage Rust's type system for correctness +- **Documentation**: Comprehensive field documentation +- **Validation**: Runtime validation with helpful error messages +- **Modularity**: Separate concerns by transport/format type + +### Dependencies + +#### Core Dependencies +- `serde` - Serialization framework +- `serde_json` - JSON format support +- `chrono` - Date/time types +- `uuid` - UUID generation + +#### Optional Dependencies +- `url` - URL parsing and validation +- `regex` - Pattern matching for validation +- `base64` - Encoding/decoding support + +### Best Practices + +#### Configuration Design +- **Sensible Defaults**: Most fields should have reasonable defaults +- **Clear Naming**: Field names should be self-documenting +- **Validation**: Validate configuration at construction time +- **Documentation**: Include examples in field documentation + +#### Error Handling +- **Structured Errors**: Use DetailedError for rich error information +- **Context**: Provide helpful context in error messages +- **Recovery**: Design errors to be actionable +- **Consistency**: Use consistent error patterns across types + +#### Type Design +- **Composability**: Types should compose well together +- **Extensibility**: Design for future extension +- **Performance**: Avoid unnecessary allocations +- **Testing**: Include comprehensive test coverage + +### Usage Patterns + +#### Configuration Loading +```rust +use feldera_types::config::PipelineConfig; + +// Load from JSON file +let config: PipelineConfig = serde_json::from_str(&json_content)?; +config.validate()?; + +// Merge with defaults +let final_config = PipelineConfig { + workers: config.workers.or(Some(1)), + ..config +}; +``` + +#### Schema Validation +```rust +use feldera_types::program_schema::ProgramSchema; + +// Validate program schema +let 
schema: ProgramSchema = serde_json::from_str(&schema_json)?; +schema.validate_consistency()?; +``` + +This crate is foundational to the Feldera platform, providing the type system backbone for configuration, data formats, and error handling across all components. \ No newline at end of file diff --git a/crates/fxp/CLAUDE.md b/crates/fxp/CLAUDE.md new file mode 100644 index 00000000000..9d0177153e4 --- /dev/null +++ b/crates/fxp/CLAUDE.md @@ -0,0 +1,293 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build the fxp crate +cargo build -p fxp + +# Run tests +cargo test -p fxp + +# Run with specific precision +cargo test -p fxp test_decimal_precision + +# Check documentation +cargo doc -p fxp --open +``` + +## Architecture Overview + +### Technology Stack + +- **Fixed-Point Arithmetic**: High-precision decimal calculations +- **DBSP Integration**: Native integration with DBSP operators +- **Serialization**: rkyv and serde support +- **Performance**: Optimized arithmetic operations + +### Core Purpose + +FXP provides **high-precision fixed-point arithmetic** for financial and decimal computations: + +- **Decimal Precision**: Exact decimal arithmetic without floating-point errors +- **DBSP Integration**: Native support for DBSP circuits +- **SQL Compatibility**: Match SQL DECIMAL semantics +- **Performance**: Optimized for high-throughput operations + +### Project Structure + +#### Core Modules + +- `src/lib.rs` - Public API and core types +- `src/fixed.rs` - Fixed-point arithmetic implementation +- `src/u256.rs` - 256-bit integer arithmetic backend +- `src/dynamic.rs` - Dynamic precision support +- `src/dbsp_impl.rs` - DBSP integration +- `src/serde_impl.rs` - Serialization support +- `src/rkyv_impl.rs` - Zero-copy serialization + +## Important Implementation Details + +### Fixed-Point Types + +#### Core Types +```rust +// Fixed-point decimal with compile-time precision +pub struct Fixed { + value: I256, +} + +// Dynamic precision 
decimal +pub struct DynamicFixed { + value: I256, + scale: i8, + precision: u8, +} + +// SQL DECIMAL type +pub type SqlDecimal = DynamicFixed; +``` + +#### Type Features +- **Compile-Time Scale**: Zero-cost fixed scale at compile time +- **Dynamic Scale**: Runtime configurable precision +- **Range**: Support for very large numbers (256-bit backend) +- **Exact Arithmetic**: No precision loss in calculations + +### Arithmetic Operations + +#### Core Operations +```rust +impl<const S: i8> Fixed<S> { + pub fn add(self, other: Self) -> Self; + pub fn sub(self, other: Self) -> Self; + pub fn mul(self, other: Self) -> Self; + pub fn div(self, other: Self) -> Option<Self>; + + // Scaling operations + pub fn rescale<const T: i8>(self) -> Fixed<T>; + pub fn round_to_scale<const T: i8>(self) -> Fixed<T>; +} +``` + +#### Advanced Operations +- **Rounding Modes**: Various rounding strategies (banker's rounding, etc.) +- **Scale Conversion**: Safe conversion between different scales +- **Overflow Handling**: Saturating or checked arithmetic +- **Comparison**: Total ordering with proper decimal semantics + +### SQL Compatibility + +#### SQL DECIMAL Semantics +```rust +// SQL DECIMAL(precision, scale) +pub fn sql_decimal(precision: u8, scale: i8) -> DynamicFixed { + DynamicFixed::new(0, scale, precision) +} + +// SQL arithmetic follows SQL standard rules +impl DynamicFixed { + // Addition: max(scale1, scale2) + pub fn sql_add(&self, other: &Self) -> Self; + + // Multiplication: scale1 + scale2 + pub fn sql_mul(&self, other: &Self) -> Self; + + // Division: configurable result scale + pub fn sql_div(&self, other: &Self, result_scale: i8) -> Option<Self>; +} +``` + +#### SQL Standard Compliance +- **Precision Rules**: Follow SQL standard precision inference +- **Scale Rules**: Appropriate scale handling for all operations +- **Rounding**: SQL-compliant rounding behavior +- **Overflow**: Handle overflow according to SQL semantics + +### DBSP Integration + +#### DBSP Operator Support +```rust +use dbsp::operator::Fold; + +impl<const S: i8> Fold for Fixed<S> {
+ fn fold(&mut self, other: Self) { + *self = self.add(other); + } +} + +// Aggregation support +impl<const S: i8> Sum for Fixed<S> { + fn sum<I: Iterator<Item = Self>>(iter: I) -> Self { + iter.fold(Fixed::zero(), |a, b| a.add(b)) + } +} +``` + +#### Zero-Copy Serialization +```rust +// rkyv support for zero-copy serialization +impl<const S: i8> Archive for Fixed<S> { + type Archived = ArchivedFixed<S>; + type Resolver = (); + + fn resolve(&self, _: (), out: &mut Self::Archived); +} +``` + +### Performance Optimization + +#### Arithmetic Optimization +- **Specialized Algorithms**: Optimized multiplication and division +- **Branch Prediction**: Minimize conditional branches +- **SIMD**: Vector operations where applicable +- **Memory Layout**: Optimal data structure layout + +#### Scale Handling +- **Compile-Time Scale**: Zero runtime cost for fixed scales +- **Scale Caching**: Cache scale calculations +- **Batch Operations**: Optimize for batch processing +- **Precision Selection**: Choose optimal precision for operations + +## Development Workflow + +### For Arithmetic Extensions + +1. Implement operation following SQL standard +2. Add appropriate overflow handling +3. Test with boundary conditions +4. Add performance benchmarks +5. Validate against SQL databases +6. Document precision and scale behavior + +### For DBSP Integration + +1. Implement required DBSP traits +2. Add serialization support +3. Test with DBSP circuits +4. Validate incremental behavior +5. Optimize for performance +6.
Add comprehensive tests + +### Testing Strategy + +#### Arithmetic Testing +- **Precision**: Test precision preservation +- **Boundary Conditions**: Test with extreme values +- **SQL Compatibility**: Compare with SQL database results +- **Rounding**: Validate rounding behavior +- **Overflow**: Test overflow handling + +#### DBSP Testing +- **Serialization**: Test rkyv round-trip +- **Incremental**: Test incremental computation +- **Aggregation**: Test aggregation operations +- **Performance**: Benchmark DBSP operations + +### Precision Management + +#### Scale Selection +```rust +// Choose appropriate scale for operations +pub fn optimal_scale_for_operation( + op: ArithmeticOp, + left_scale: i8, + right_scale: i8, +) -> i8 { + match op { + ArithmeticOp::Add | ArithmeticOp::Sub => left_scale.max(right_scale), + ArithmeticOp::Mul => left_scale + right_scale, + ArithmeticOp::Div => left_scale - right_scale, + } +} +``` + +#### Precision Control +- **Automatic Scaling**: Automatic scale selection for operations +- **Manual Control**: Explicit scale control when needed +- **Validation**: Validate precision requirements +- **Optimization**: Optimize precision for performance + +### Configuration Files + +- `Cargo.toml` - Fixed-point arithmetic dependencies +- Feature flags for different backends and optimizations + +### Dependencies + +#### Core Dependencies +- `num-bigint` - Large integer arithmetic +- `serde` - Serialization support +- `rkyv` - Zero-copy serialization + +#### DBSP Dependencies +- `dbsp` - DBSP integration +- DBSP traits and operators + +### Best Practices + +#### Arithmetic Design +- **SQL Compliance**: Follow SQL standard precisely +- **Overflow Safety**: Handle overflow conditions safely +- **Performance**: Optimize critical arithmetic paths +- **Precision**: Maintain appropriate precision throughout + +#### API Design +- **Type Safety**: Use compile-time scale where possible +- **Ergonomics**: Provide convenient conversion functions +- 
**Documentation**: Document precision and scale behavior +- **Testing**: Comprehensive test coverage + +#### DBSP Integration +- **Zero-Copy**: Minimize data copying in serialization +- **Incremental**: Design for incremental computation +- **Aggregation**: Efficient aggregation operations +- **Memory**: Optimize memory usage patterns + +### Usage Examples + +#### Basic Arithmetic +```rust +use fxp::Fixed; + +// Fixed scale at compile time +let a = Fixed::<2>::from_str("123.45").unwrap(); // 2 decimal places +let b = Fixed::<2>::from_str("67.89").unwrap(); +let sum = a.add(b); // 191.34 + +// Dynamic scale +let decimal = SqlDecimal::new(12345, 2, 10); // 123.45 with precision 10, scale 2 +``` + +#### SQL Operations +```rust +use fxp::DynamicFixed; + +let price = DynamicFixed::from_sql_decimal(999, 2, 5); // $9.99 +let quantity = DynamicFixed::from_sql_decimal(3, 0, 3); // 3 +let total = price.sql_mul(&quantity); // $29.97 +``` + +This crate provides the mathematical foundation for precise decimal arithmetic in financial and scientific applications within the DBSP ecosystem. 
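As a concrete illustration of the SQL scale rules described above (addition keeps the maximum of the two scales, multiplication sums them), here is a minimal, self-contained sketch over a scaled integer. The `Dec` type and its methods are hypothetical stand-ins for illustration only, not the fxp crate's actual API:

```rust
fn pow10(s: u32) -> i128 {
    10i128.pow(s)
}

/// Value stored as `mantissa * 10^-scale` (a stand-in for a fixed-point type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Dec {
    mantissa: i128,
    scale: u32,
}

impl Dec {
    /// Widen to a larger scale without changing the value.
    fn rescale(self, to: u32) -> Dec {
        assert!(to >= self.scale, "only widening is shown in this sketch");
        Dec { mantissa: self.mantissa * pow10(to - self.scale), scale: to }
    }

    /// SQL addition: result scale = max(scale1, scale2).
    fn add(self, other: Dec) -> Dec {
        let s = self.scale.max(other.scale);
        Dec { mantissa: self.rescale(s).mantissa + other.rescale(s).mantissa, scale: s }
    }

    /// SQL multiplication: result scale = scale1 + scale2.
    fn mul(self, other: Dec) -> Dec {
        Dec { mantissa: self.mantissa * other.mantissa, scale: self.scale + other.scale }
    }
}

fn main() {
    let price = Dec { mantissa: 999, scale: 2 }; // 9.99
    let qty = Dec { mantissa: 3, scale: 0 };     // 3
    let total = price.mul(qty);                  // scale 2 + 0 = 2
    assert_eq!(total, Dec { mantissa: 2997, scale: 2 }); // 29.97

    let a = Dec { mantissa: 12345, scale: 2 }; // 123.45
    let b = Dec { mantissa: 6789, scale: 2 };  // 67.89
    assert_eq!(a.add(b).mantissa, 19134);      // 191.34

    println!("total = {}.{:02}", total.mantissa / 100, total.mantissa % 100);
}
```

Because every value is an exact scaled integer, `9.99 * 3` yields exactly `29.97` with no floating-point drift, which is the property the crate's 256-bit backend provides at much larger precisions.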
\ No newline at end of file diff --git a/crates/iceberg/CLAUDE.md b/crates/iceberg/CLAUDE.md new file mode 100644 index 00000000000..c0e6a8e82de --- /dev/null +++ b/crates/iceberg/CLAUDE.md @@ -0,0 +1,279 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build the iceberg crate +cargo build -p iceberg + +# Run tests (requires test environment) +cargo test -p iceberg + +# Set up test environment +cd src/test +python create_test_table_s3.py + +# Install test dependencies +pip install -r requirements.txt +``` + +## Architecture Overview + +### Technology Stack + +- **Apache Iceberg**: Open table format for large analytic datasets +- **S3 Integration**: AWS S3 and S3-compatible storage +- **Async I/O**: Non-blocking I/O operations with tokio +- **Python Integration**: Test utilities with Python ecosystem + +### Core Purpose + +Iceberg provides **Apache Iceberg table format support** for Feldera: + +- **Table Format**: Support for Iceberg's open table format +- **Cloud Storage**: Integration with S3 and cloud storage +- **Schema Evolution**: Handle schema changes over time +- **Time Travel**: Support for historical data queries + +### Project Structure + +#### Core Modules + +- `src/lib.rs` - Core Iceberg functionality +- `src/input.rs` - Iceberg table input adapter +- `src/test/` - Test utilities and setup scripts + +## Important Implementation Details + +### Iceberg Integration + +#### Table Format Support +```rust +pub struct IcebergTable { + pub metadata: TableMetadata, + pub schema: Schema, + pub partition_spec: PartitionSpec, +} + +impl IcebergTable { + pub async fn load_from_catalog(&self, path: &str) -> Result<Self, IcebergError>; + pub async fn scan(&self, filter: Option<Expression>) -> Result<FileScan, IcebergError>; +} +``` + +#### Features +- **Metadata Management**: Handle Iceberg metadata files +- **Schema Evolution**: Support schema changes over time +- **Partitioning**: Efficient data partitioning schemes +- **File Management**: Track data files and manifests + +### 
Input Adapter + +#### Data Ingestion +```rust +pub struct IcebergInputAdapter { + table: IcebergTable, + scan: FileScan, + reader: ParquetReader, +} + +impl InputAdapter for IcebergInputAdapter { + async fn read_batch(&mut self) -> Result<RecordBatch, AdapterError> { + let files = self.scan.next_files().await?; + let batch = self.reader.read_files(files).await?; + Ok(batch) + } +} +``` + +#### Adapter Features +- **Incremental Reading**: Read only new/changed data +- **Parallel Processing**: Concurrent file reading +- **Format Support**: Parquet and other Iceberg-supported formats +- **Filter Pushdown**: Push filters to storage layer + +### Cloud Storage Integration + +#### S3 Support +```rust +pub struct S3IcebergCatalog { + client: S3Client, + warehouse_location: String, +} + +impl IcebergCatalog for S3IcebergCatalog { + async fn load_table(&self, name: &str) -> Result<IcebergTable, CatalogError>; + async fn create_table(&self, name: &str, schema: Schema) -> Result<(), CatalogError>; +} +``` + +#### Storage Features +- **Multi-Cloud**: Support for AWS S3, Azure, GCS +- **Authentication**: Handle cloud credentials securely +- **Performance**: Optimized for cloud storage patterns +- **Cost Optimization**: Minimize storage operations + +### Test Infrastructure + +#### Python Test Setup +```python +# create_test_table_s3.py +import pyiceberg +from pyiceberg.catalog import load_catalog + +def create_test_table(): + catalog = load_catalog("test") + schema = pyiceberg.schema.Schema( + pyiceberg.types.NestedField(1, "id", pyiceberg.types.LongType()), + pyiceberg.types.NestedField(2, "name", pyiceberg.types.StringType()), + ) + + catalog.create_table("test.table", schema) +``` + +#### Test Features +- **Realistic Data**: Generate realistic test datasets +- **Schema Variations**: Test various schema configurations +- **Performance Testing**: Measure ingestion performance +- **Integration Testing**: End-to-end pipeline testing + +## Development Workflow + +### For Iceberg Features + +1. Study Apache Iceberg specification +2.
Implement feature following Iceberg standards +3. Add comprehensive tests with real data +4. Test with various storage backends +5. Validate performance characteristics +6. Document compatibility and limitations + +### For Storage Integration + +1. Implement storage backend interface +2. Add authentication and configuration +3. Test with real cloud storage +4. Optimize for performance and cost +5. Add error handling and retry logic +6. Document setup and configuration + +### Testing Strategy + +#### Unit Tests +- **Metadata Parsing**: Test Iceberg metadata handling +- **Schema Evolution**: Test schema change scenarios +- **Partitioning**: Test partition pruning logic +- **Error Handling**: Test various error conditions + +#### Integration Tests +- **Real Storage**: Test with actual S3 buckets +- **Large Data**: Test with realistic data sizes +- **Concurrent Access**: Test parallel reading +- **Schema Evolution**: Test with evolving schemas + +### Configuration + +#### Iceberg Configuration +```rust +pub struct IcebergConfig { + pub catalog_type: CatalogType, + pub warehouse_location: String, + pub s3_endpoint: Option<String>, + pub credentials: CredentialsConfig, +} + +pub enum CatalogType { + Hive, + Hadoop, + S3, + Custom(Box<dyn IcebergCatalog>), +} +``` + +#### Storage Configuration +- **Credentials**: AWS credentials, IAM roles, access keys +- **Endpoints**: S3 endpoints, regions, custom endpoints +- **Performance**: Connection pooling, retry policies +- **Security**: Encryption, access control + +### Performance Optimization + +#### Read Optimization +- **Parallel Reading**: Read multiple files concurrently +- **Filter Pushdown**: Apply filters at storage level +- **Column Pruning**: Read only required columns +- **Vectorized Processing**: Efficient data processing + +#### Memory Management +- **Streaming**: Stream large datasets without loading entirely +- **Buffer Management**: Optimize memory usage +- **Resource Control**: Limit concurrent operations +- **Garbage Collection**:
Efficient memory cleanup + +### Configuration Files + +- `Cargo.toml` - Iceberg and cloud storage dependencies +- `src/test/requirements.txt` - Python test dependencies +- `src/test/requirements.ci.txt` - CI-specific Python dependencies + +### Dependencies + +#### Core Dependencies +- `iceberg-rs` - Rust Iceberg implementation +- `tokio` - Async runtime +- `aws-sdk-s3` - AWS S3 integration +- `parquet` - Parquet file format support + +#### Test Dependencies +- `tempfile` - Temporary test files +- `uuid` - Test data generation +- Python ecosystem for test data setup + +### Best Practices + +#### Iceberg Usage +- **Standards Compliance**: Follow Apache Iceberg specification +- **Schema Design**: Design schemas for evolution +- **Partitioning**: Choose appropriate partitioning strategies +- **Metadata Management**: Handle metadata efficiently + +#### Cloud Integration +- **Cost Awareness**: Minimize cloud storage costs +- **Performance**: Optimize for cloud storage patterns +- **Security**: Follow cloud security best practices +- **Reliability**: Handle transient cloud failures + +#### Error Handling +- **Transient Errors**: Retry transient cloud failures +- **Schema Errors**: Handle schema incompatibilities gracefully +- **Resource Errors**: Handle resource exhaustion +- **Data Errors**: Handle corrupted or missing data + +### Usage Examples + +#### Basic Table Access +```rust +use iceberg::{IcebergTable, S3IcebergCatalog}; + +let catalog = S3IcebergCatalog::new(s3_config); +let table = catalog.load_table("warehouse.orders").await?; + +let scan = table.scan(None).await?; +let batches = scan.collect().await?; +``` + +#### Filtered Reading +```rust +use iceberg::expressions::Expression; + +let filter = Expression::gt("order_date", "2023-01-01"); +let scan = table.scan(Some(filter)).await?; + +for batch in scan { + process_batch(batch?).await?; +} +``` + +This crate enables Feldera to work with modern data lake architectures using the Apache Iceberg table format for 
large-scale analytics. \ No newline at end of file diff --git a/crates/ir/CLAUDE.md b/crates/ir/CLAUDE.md new file mode 100644 index 00000000000..0c1baab0fdd --- /dev/null +++ b/crates/ir/CLAUDE.md @@ -0,0 +1,343 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build the ir crate +cargo build -p ir + +# Run tests +cargo test -p ir + +# Regenerate test samples +./test/regen.bash + +# Check documentation +cargo doc -p ir --open +``` + +## Architecture Overview + +### Technology Stack + +- **Compiler IR**: Multi-level intermediate representation +- **SQL Analysis**: SQL program analysis and transformation +- **Type System**: Rich type information and inference +- **Serialization**: JSON-based IR serialization + +### Core Purpose + +IR provides **intermediate representation layers** for SQL compilation: + +- **HIR (High-level IR)**: Close to original SQL structure +- **MIR (Mid-level IR)**: Optimized and normalized representation +- **LIR (Low-level IR)**: Target-specific optimizations +- **Analysis**: Program analysis and transformation utilities + +### Project Structure + +#### Core Modules + +- `src/hir.rs` - High-level intermediate representation +- `src/mir.rs` - Mid-level intermediate representation +- `src/lir.rs` - Low-level intermediate representation +- `src/lib.rs` - Common IR utilities and traits +- `test/` - Test samples and regeneration scripts + +## Important Implementation Details + +### IR Hierarchy + +#### High-Level IR (HIR) +```rust +pub struct HirProgram { + pub tables: Vec<TableDefinition>, + pub views: Vec<ViewDefinition>, + pub functions: Vec<FunctionDefinition>, +} + +pub struct ViewDefinition { + pub name: String, + pub query: Query, + pub schema: Schema, +} +``` + +HIR Features: +- **SQL-Close**: Maintains SQL structure and semantics +- **Type Information**: Rich type annotations +- **Metadata**: Preserves source location and comments +- **Validation**: Semantic validation and error reporting + +#### Mid-Level IR (MIR) +```rust +pub struct MirProgram { +
pub operators: Vec<Operator>, + pub data_flow: DataFlowGraph, + pub optimizations: Vec<OptimizationPass>, +} + +pub enum Operator { + Filter(FilterOp), + Map(MapOp), + Join(JoinOp), + Aggregate(AggregateOp), +} +``` + +MIR Features: +- **Normalized**: Canonical operator representation +- **Optimized**: Applied optimization transformations +- **Data Flow**: Explicit data flow representation +- **Target Independent**: Platform-agnostic representation + +#### Low-Level IR (LIR) +```rust +pub struct LirProgram { + pub circuits: Vec<Circuit>, + pub schedule: ExecutionSchedule, + pub resources: ResourceRequirements, +} + +pub struct Circuit { + pub operators: Vec<PhysicalOperator>, + pub connections: Vec<Connection>, +} +``` + +LIR Features: +- **Physical**: Target-specific operator selection +- **Scheduled**: Execution order and parallelization +- **Optimized**: Target-specific optimizations +- **Executable**: Ready for code generation + +### Transformation Pipeline + +#### HIR → MIR Transformation +```rust +pub struct HirToMirTransform { + optimizer: Optimizer, + normalizer: Normalizer, +} + +impl HirToMirTransform { + pub fn transform(&self, hir: HirProgram) -> Result<MirProgram, TransformError> { + let normalized = self.normalizer.normalize(hir)?; + let optimized = self.optimizer.optimize(normalized)?; + Ok(optimized) + } +} +``` + +Transformation Steps: +1. **Normalization**: Convert to canonical form +2. **Type Inference**: Infer missing type information +3. **Optimization**: Apply high-level optimizations +4. **Validation**: Ensure correctness preservation + +#### MIR → LIR Transformation +```rust +pub struct MirToLirTransform { + target: CompilationTarget, + scheduler: Scheduler, +} + +impl MirToLirTransform { + pub fn transform(&self, mir: MirProgram) -> Result<LirProgram, TransformError> { + let physical = self.select_operators(mir)?; + let scheduled = self.scheduler.schedule(physical)?; + Ok(scheduled) + } +} +``` + +Transformation Steps: +1. **Operator Selection**: Choose physical operators +2. **Scheduling**: Determine execution order +3.
**Resource Planning**: Allocate computational resources +4. **Code Generation**: Prepare for target code generation + +### Analysis Framework + +#### Program Analysis +```rust +pub trait ProgramAnalysis<IR> { + type Result; + type Error; + + fn analyze(&self, program: &IR) -> Result<Self::Result, Self::Error>; +} + +// Data flow analysis +pub struct DataFlowAnalysis; +impl ProgramAnalysis<MirProgram> for DataFlowAnalysis { + type Result = DataFlowInfo; + type Error = AnalysisError; + + fn analyze(&self, program: &MirProgram) -> Result<Self::Result, Self::Error> { + // Compute data flow information + } +} +``` + +Analysis Types: +- **Type Analysis**: Type checking and inference +- **Data Flow**: Variable definitions and uses +- **Control Flow**: Program control structure +- **Dependency Analysis**: Operator dependencies + +### Serialization Support + +#### JSON Serialization +```rust +use serde::{Deserialize, Serialize}; + +#[derive(Serialize, Deserialize)] +pub struct SerializableProgram { + pub version: String, + pub hir: Option<HirProgram>, + pub mir: Option<MirProgram>, + pub lir: Option<LirProgram>, +} +``` + +Serialization Features: +- **Version Control**: Track IR format versions +- **Partial Serialization**: Serialize individual IR levels +- **Human Readable**: JSON format for debugging +- **Round-Trip**: Preserve all information through serialization + +## Development Workflow + +### For IR Extensions + +1. Define new IR nodes or transformations +2. Update serialization support +3. Add comprehensive tests +4. Update transformation passes +5. Regenerate test samples +6. Document changes and compatibility + +### For Analysis Passes + +1. Define analysis trait implementation +2. Add analysis-specific data structures +3. Implement analysis algorithm +4. Add validation and error handling +5. Test with various program patterns +6.
Integrate with compilation pipeline + +### Testing Strategy + +#### Round-Trip Testing +- **Serialization**: Test JSON serialization round-trip +- **Transformation**: Test HIR→MIR→LIR transformations +- **Preservation**: Ensure semantic preservation +- **Error Handling**: Test error conditions + +#### Sample Programs +```bash +# Regenerate test samples +cd test +./regen.bash + +# Test with specific samples +cargo test test_sample_a +cargo test test_sample_b +``` + +### IR Validation + +#### Semantic Validation +```rust +pub trait IrValidator<IR> { + fn validate(&self, program: &IR) -> Vec<ValidationError>; +} + +pub struct HirValidator; +impl IrValidator<HirProgram> for HirValidator { + fn validate(&self, program: &HirProgram) -> Vec<ValidationError> { + let mut errors = Vec::new(); + + // Check type consistency + errors.extend(self.check_types(program)); + + // Check name resolution + errors.extend(self.check_names(program)); + + errors + } +} +``` + +#### Validation Categories +- **Type Safety**: Type consistency checking +- **Name Resolution**: Variable and function binding +- **Control Flow**: Reachability and termination +- **Resource Usage**: Memory and computation bounds + +### Configuration Files + +- `Cargo.toml` - IR processing dependencies +- `test/regen.bash` - Test sample regeneration script +- Sample files: `sample_*.sql` and `sample_*.json` + +### Dependencies + +#### Core Dependencies +- `serde` - Serialization framework +- `serde_json` - JSON support +- `thiserror` - Error handling + +### Best Practices + +#### IR Design +- **Immutable**: Design IR nodes as immutable +- **Type Rich**: Include comprehensive type information +- **Serializable**: Ensure all IR is serializable +- **Validated**: Include validation at each level + +#### Transformation Design +- **Correctness**: Preserve program semantics +- **Composable**: Design transformations to compose +- **Reversible**: Consider round-trip transformations +- **Tested**: Comprehensive transformation testing + +#### Analysis Design +- **Modular**:
Design analyses to be composable +- **Incremental**: Support incremental analysis +- **Error Rich**: Provide detailed error information +- **Performance**: Optimize for large programs + +### Usage Examples + +#### Basic Transformation +```rust +use ir::{HirProgram, HirToMirTransform, MirToLirTransform}; + +// Load HIR from JSON +let hir: HirProgram = serde_json::from_str(&json_content)?; + +// Transform through pipeline +let hir_to_mir = HirToMirTransform::new(); +let mir = hir_to_mir.transform(hir)?; + +let mir_to_lir = MirToLirTransform::new(target); +let lir = mir_to_lir.transform(mir)?; +``` + +#### Analysis Usage +```rust +use ir::{DataFlowAnalysis, ProgramAnalysis}; + +let analysis = DataFlowAnalysis::new(); +let flow_info = analysis.analyze(&mir_program)?; + +// Use analysis results for optimization +let optimizer = Optimizer::new(flow_info); +let optimized = optimizer.optimize(mir_program)?; +``` + +This crate provides the compiler infrastructure that enables sophisticated analysis and optimization of SQL programs during the compilation process. 
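The staged HIR → MIR → LIR lowering described above can be sketched as a pair of typed passes, where the type system enforces that each stage consumes the previous one. The types and trait below are illustrative stand-ins, not the ir crate's actual definitions:

```rust
// Hypothetical stand-ins for the three IR levels.
#[derive(Debug, PartialEq)]
struct Hir(String);
#[derive(Debug, PartialEq)]
struct Mir(String);
#[derive(Debug, PartialEq)]
struct Lir(String);

/// A lowering pass from `Self` into `Out`, with a pass-specific error type.
trait Lower<Out> {
    type Error;
    fn lower(self) -> Result<Out, Self::Error>;
}

impl Lower<Mir> for Hir {
    type Error = String;
    fn lower(self) -> Result<Mir, String> {
        // Normalization and high-level optimization would run here.
        Ok(Mir(format!("{} -> mir", self.0)))
    }
}

impl Lower<Lir> for Mir {
    type Error = String;
    fn lower(self) -> Result<Lir, String> {
        // Operator selection and scheduling would run here.
        Ok(Lir(format!("{} -> lir", self.0)))
    }
}

fn main() -> Result<(), String> {
    // Chaining the two passes: HIR -> MIR -> LIR.
    let lir: Lir = Hir("select".into()).lower()?.lower()?;
    assert_eq!(lir, Lir("select -> mir -> lir".into()));
    println!("{:?}", lir);
    Ok(())
}
```

Encoding each level as a distinct type means a pass that skips a stage simply does not compile, which mirrors how the real pipeline keeps the HIR, MIR, and LIR representations separate.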
\ No newline at end of file diff --git a/crates/nexmark/CLAUDE.md b/crates/nexmark/CLAUDE.md new file mode 100644 index 00000000000..d9c35f6885e --- /dev/null +++ b/crates/nexmark/CLAUDE.md @@ -0,0 +1,263 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build the nexmark crate +cargo build -p nexmark + +# Run data generation example +cargo run --example generate -p nexmark + +# Run benchmarks +cargo bench -p nexmark + +# Generate NEXMark data +cargo run -p nexmark --bin generate -- --events 1000000 +``` + +## Architecture Overview + +### Technology Stack + +- **Benchmark Suite**: Industry-standard streaming benchmark +- **Data Generation**: Realistic auction data simulation +- **Query Implementation**: 22 standard benchmark queries +- **Performance Testing**: Throughput and latency measurement + +### Core Purpose + +NEXMark provides **streaming benchmark capabilities** for the DBSP engine: + +- **Data Generation**: Realistic auction, bidder, and person data +- **Query Suite**: 22 standardized streaming queries +- **Performance Measurement**: Benchmark execution and metrics +- **Testing Framework**: Validate DBSP streaming performance + +### Project Structure + +#### Core Modules + +- `src/generator/` - Data generation for auctions, bids, and people +- `src/queries/` - Implementation of all 22 NEXMark queries +- `src/model.rs` - Data model definitions +- `src/config.rs` - Benchmark configuration parameters + +## Important Implementation Details + +### Data Model + +#### Core Entities +```rust +// Person entity (bidders and sellers) +pub struct Person { + pub id: usize, + pub name: String, + pub email: String, + pub credit_card: String, + pub city: String, + pub state: String, + pub date_time: u64, +} + +// Auction entity +pub struct Auction { + pub id: usize, + pub seller: usize, + pub category: usize, + pub initial_bid: usize, + pub date_time: u64, + pub expires: u64, +} + +// Bid entity +pub struct Bid { + pub auction: 
usize, + pub bidder: usize, + pub price: usize, + pub date_time: u64, +} +``` + +### Data Generation + +#### Realistic Data Patterns +The generator produces realistic auction data: +- **Temporal Patterns**: Realistic time distributions +- **Price Models**: Market-based pricing behavior +- **Geographic Distribution**: Realistic location data +- **Correlation**: Realistic relationships between entities + +#### Configuration Options +```rust +pub struct Config { + pub events_per_second: usize, + pub auction_proportion: f64, + pub bid_proportion: f64, + pub person_proportion: f64, + pub num_categories: usize, + pub auction_length_seconds: usize, +} +``` + +### Query Suite + +#### Standard NEXMark Queries + +**Q0: Pass Through** +- Simple data throughput measurement +- No computation, pure I/O benchmark + +**Q1: Currency Conversion** +- Convert bid prices to different currency +- Tests map operations performance + +**Q2: Selection** +- Filter auctions by category +- Tests filtering performance + +**Q3: Local Item Suggestion** +- Join people and auctions by location +- Tests join performance + +**Q4: Average Price for Category** +- Windowed aggregation over auction categories +- Tests windowing and aggregation + +**Q5-Q22**: Complex streaming queries testing various aspects: +- Complex joins across multiple streams +- Windowed aggregations with various time bounds +- Pattern matching and sequence detection +- Multi-stage processing pipelines + +### Benchmark Execution + +#### Performance Metrics +```rust +pub struct BenchmarkResults { + pub throughput_events_per_second: f64, + pub latency_percentiles: LatencyDistribution, + pub memory_usage: MemoryStats, + pub cpu_utilization: f64, +} +``` + +#### Query Categories +- **Simple Queries (Q0-Q2)**: Basic operations +- **Join Queries (Q3, Q5, Q7-Q11)**: Multi-stream joins +- **Aggregation Queries (Q4, Q6, Q12-Q15)**: Windowed aggregates +- **Complex Queries (Q16-Q22)**: Advanced streaming patterns + +## Development Workflow + 
+### For Query Implementation + +1. Study NEXMark specification for query semantics +2. Implement query using DBSP operators +3. Add comprehensive test cases with known results +4. Benchmark performance against reference implementations +5. Optimize for DBSP's incremental computation model +6. Validate correctness with various data distributions + +### For Data Generation + +1. Modify generator in `src/generator/` +2. Ensure realistic data distributions +3. Test with various configuration parameters +4. Validate data consistency and relationships +5. Benchmark generation performance +6. Test with different event rates and patterns + +### Testing Strategy + +#### Correctness Testing +- **Known Results**: Test queries with pre-computed results +- **Cross-Validation**: Compare with reference implementations +- **Edge Cases**: Empty streams, single events, boundary conditions +- **Data Consistency**: Validate generated data relationships + +#### Performance Testing +- **Throughput**: Events processed per second +- **Latency**: End-to-end processing delay +- **Memory Usage**: Peak and steady-state memory +- **Scalability**: Performance across different data rates + +### Configuration Options + +#### Data Generation Config +```rust +let config = Config { + events_per_second: 10_000, + auction_proportion: 0.1, + bid_proportion: 0.8, + person_proportion: 0.1, + num_categories: 100, + auction_length_seconds: 600, +}; +``` + +#### Benchmark Config +- **Event Rate**: Target events per second +- **Duration**: Benchmark runtime +- **Warmup**: Warmup period before measurement +- **Data Size**: Total events to generate + +### Query Implementations + +Each query demonstrates different DBSP capabilities: + +#### Stream Processing Patterns +- **Filtering**: Select relevant events +- **Mapping**: Transform event data +- **Joining**: Correlate events across streams +- **Aggregating**: Compute statistics over windows +- **Windowing**: Time-based event grouping + +#### Advanced Features 
+- **Watermarks**: Handle out-of-order events +- **Late Data**: Process delayed events +- **State Management**: Maintain query state +- **Result Updates**: Incremental result computation + +### Configuration Files + +- `Cargo.toml` - Benchmark and data generation dependencies +- Benchmark configuration in code (no external config files) +- Query-specific parameters in individual query modules + +### Dependencies + +#### Core Dependencies +- `rand` - Random data generation +- `chrono` - Date/time handling +- `serde` - Data serialization +- `csv` - Data export formats + +#### DBSP Integration +- `dbsp` - Core streaming engine +- Benchmark harness integration +- Performance measurement tools + +### Best Practices + +#### Query Implementation +- **Incremental Friendly**: Design for incremental computation +- **Resource Aware**: Consider memory and CPU usage +- **Realistic**: Match real-world query patterns +- **Tested**: Comprehensive correctness validation + +#### Data Generation +- **Realistic Distribution**: Match real auction behavior +- **Configurable**: Support various benchmark scenarios +- **Repeatable**: Deterministic generation for testing +- **Scalable**: Handle various event rates efficiently + +#### Performance Measurement +- **Warm-up**: Allow system to reach steady state +- **Statistical Significance**: Multiple runs with confidence intervals +- **Resource Monitoring**: Track all relevant metrics +- **Reproducible**: Consistent measurement methodology + +This crate provides industry-standard benchmarking for streaming systems, enabling performance validation and optimization of DBSP-based applications. 
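The configured event proportions and the "repeatable, deterministic generation" practice above can be sketched as a seeded generator loop. The `Lcg` and `pick` helpers are hypothetical illustrations of the idea, not the crate's actual generator API:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum EventKind {
    Person,
    Auction,
    Bid,
}

/// Tiny linear congruential generator so runs are repeatable for a given seed.
struct Lcg(u64);

impl Lcg {
    /// Uniform sample in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Map a uniform sample to an event kind according to the configured proportions.
fn pick(r: f64, person_p: f64, auction_p: f64) -> EventKind {
    if r < person_p {
        EventKind::Person
    } else if r < person_p + auction_p {
        EventKind::Auction
    } else {
        EventKind::Bid
    }
}

fn main() {
    let (person_p, auction_p) = (0.1, 0.1); // remaining 0.8 of events are bids
    let mut rng = Lcg(42); // same seed => same event sequence
    let mut counts = [0usize; 3];
    for _ in 0..100_000 {
        match pick(rng.next_f64(), person_p, auction_p) {
            EventKind::Person => counts[0] += 1,
            EventKind::Auction => counts[1] += 1,
            EventKind::Bid => counts[2] += 1,
        }
    }
    // Bids should dominate roughly 8:1:1, matching the configured mix.
    assert!(counts[2] > counts[0] && counts[2] > counts[1]);
    println!("persons={} auctions={} bids={}", counts[0], counts[1], counts[2]);
}
```

Seeding the generator explicitly is what makes benchmark runs comparable: the same seed and configuration reproduce the exact same event stream, so performance differences come from the engine, not the data.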
\ No newline at end of file diff --git a/crates/pipeline-manager/CLAUDE.md b/crates/pipeline-manager/CLAUDE.md new file mode 100644 index 00000000000..05da6db1a82 --- /dev/null +++ b/crates/pipeline-manager/CLAUDE.md @@ -0,0 +1,309 @@ +## Overview + +## Key Development Commands + +### Building and Testing + +```bash +# Build the pipeline manager +cargo build -p pipeline-manager + +# Run tests +cargo test -p pipeline-manager + +# Run integration tests +cargo test -p pipeline-manager --test integration_test + +# Build with all features +cargo build -p pipeline-manager --all-features +``` + +### Running the Service + +```bash +# Run the pipeline manager server +cargo run -p pipeline-manager + +# Run with specific configuration +RUST_LOG=debug cargo run -p pipeline-manager -- --config config.toml + +# Run database migrations +cargo run -p pipeline-manager -- --migrate-database + +# Dump OpenAPI specification +cargo run -p pipeline-manager -- --dump-openapi +``` + +### Development Tools + +```bash +# Run with hot reload using cargo-watch +cargo watch -x "run -p pipeline-manager" + +# Check database connectivity +cargo run -p pipeline-manager -- --probe-db + +# Generate banner +cargo run -p pipeline-manager -- --print-banner +``` + +## Architecture Overview + +### Technology Stack + +- **Web Framework**: Actix-web for HTTP server +- **Database**: PostgreSQL with SQLx for async database operations +- **Authentication**: JWT-based authentication with configurable providers +- **API Documentation**: OpenAPI/Swagger specification generation +- **Build Tools**: Custom build scripts for banner generation + +### Service Components + +- **API Server**: RESTful HTTP API for pipeline management +- **Database Layer**: PostgreSQL storage for pipelines, programs, and metadata +- **Compiler Integration**: SQL-to-DBSP and Rust compilation orchestration +- **Runner Service**: Pipeline execution and lifecycle management +- **Authentication**: Multi-provider authentication system + +### 
Project Structure + +#### Core Directories + +- `src/api/` - HTTP API endpoints and request handling +- `src/db/` - Database operations and schema management +- `src/compiler/` - SQL and Rust compilation integration +- `src/runner/` - Pipeline execution and management +- `src/auth.rs` - Authentication and authorization +- `migrations/` - Database migration scripts + +#### Key Components + +- **API Endpoints**: CRUD operations for pipelines and programs +- **Database Abstraction**: Type-safe database operations +- **Compiler Services**: Integration with SQL-to-DBSP compiler +- **Pipeline Execution**: Runtime management and monitoring + +## Important Implementation Details + +### Database Schema + +The service uses PostgreSQL with versioned migrations: + +- **Programs**: SQL program definitions and compilation status +- **Pipelines**: Pipeline configurations and runtime state +- **API Keys**: Authentication credentials and permissions +- **Tenants**: Multi-tenancy support + +### API Structure + +#### Core Endpoints + +``` +GET /api/programs - List programs +POST /api/programs - Create program +GET /api/programs/{id} - Get program details +PATCH /api/programs/{id} - Update program +DELETE /api/programs/{id} - Delete program + +GET /api/pipelines - List pipelines +POST /api/pipelines - Create pipeline +GET /api/pipelines/{id} - Get pipeline details +PATCH /api/pipelines/{id} - Update pipeline +DELETE /api/pipelines/{id} - Delete pipeline + +GET /v0/config - Get configuration (includes tenant info) +GET /config/authentication - Get authentication provider configuration +GET /config/demos - Get list of available demos +``` + +#### Pipeline Lifecycle + +``` +POST /api/pipelines/{id}/start - Start pipeline +POST /api/pipelines/{id}/pause - Pause pipeline +POST /api/pipelines/{id}/shutdown - Stop pipeline +GET /api/pipelines/{id}/stats - Get runtime statistics +``` + +### Compilation Pipeline + +1. **SQL Parsing**: Validate SQL program syntax +2. 
**DBSP Generation**: Convert SQL to DBSP circuit
+3. **Rust Compilation**: Compile generated Rust code
+4. **Binary Packaging**: Create executable pipeline binary
+5. **Deployment**: Deploy and manage pipeline execution
+
+### Authentication System
+
+The pipeline-manager supports multiple authentication providers through a unified OIDC/OAuth2 framework:
+
+#### **Supported Providers**
+- **None**: No authentication (development/testing)
+- **AWS Cognito**
+- **Generic OIDC** (e.g., Okta)
+
+#### **Authentication Mechanisms**
+- **OIDC Tokens**: OIDC-compliant access token validation with RS256 signature verification
+- **API Keys**: User-generated keys for programmatic access without an OIDC token
+
+HTTP API requests are authenticated via the `Authorization: Bearer <token>` header.
+
+Additional features:
+- JWK caching: Automatic public key fetching and caching from provider endpoints
+- Bearer token authorization: The client sends the OIDC access token as a Bearer token; all claims (tenant, groups, etc.) are extracted from this access token
+
+#### **Configuration**
+```bash
+# Environment variables for OIDC providers
+FELDERA_AUTH_ISSUER=https://your-domain.okta.com/oauth2/
+FELDERA_AUTH_CLIENT_ID=your-client-id
+
+# For AWS Cognito (additional variables)
+AWS_COGNITO_LOGIN_URL=https://your-domain.auth.region.amazoncognito.com/login
+AWS_COGNITO_LOGOUT_URL=https://your-domain.auth.region.amazoncognito.com/logout
+```
+
+#### **Authorization mechanisms**
+
+The authentication system supports flexible tenant assignment strategies across all supported OIDC providers (AWS Cognito, Okta):
+
+**Tenant Assignment Strategies:**
+
+1. 
**Multi-tenant Access** (new)
+   - OIDC Access token contains `tenants` claim with array of authorized tenant names
+   - User must include `Feldera-Tenant` HTTP header to select which tenant to access
+   - Web console provides tenant selector dropdown
+   - Example token claim: `{"tenants": ["feldera-engineering", "feldera-dev", "feldera-staging"]}`
+   - Alternative format: `{"tenants": "feldera-engineering,feldera-dev,feldera-staging"}`
+
+2. **Single-tenant Access** (traditional)
+   - OIDC Access token contains single `tenant` claim
+   - No header required, tenant is automatically selected
+   - Example token claim: `{"tenant": "feldera-engineering"}`
+
+3. **Fallback Strategies** (when no explicit tenant claims present)
+   - `--individual-tenant` (default: true) - Creates individual tenants based on user's `sub` claim
+   - `--issuer-tenant` (default: false) - Derives tenant name from auth issuer hostname (e.g., `company.okta.com`)
+
+**Tenant Resolution Priority (all providers):**
+1. `tenants` array claim + `Feldera-Tenant` header (multi-tenant access)
+2. `tenant` claim (single tenant assignment via OIDC provider)
+3. Issuer domain extraction (when `--issuer-tenant` enabled)
+4. User `sub` claim (when `--individual-tenant` enabled)
+
+**Group-based authorization:**
+If `--authorized-groups` is configured, the user must have at least one of these groups in the `groups` claim of the OIDC Access token.
+
+**HTTP Headers:**
+- `Authorization: Bearer <token>` - Required for all authenticated requests
+- `Feldera-Tenant: <tenant-name>` - Required only when the access token contains a `tenants` array with multiple values
+
+#### **Enterprise Features**
+- **Fault tolerance**: Recovers from a crash by taking periodic checkpoints, then identifying and replaying state lost since the last checkpoint
+
+## Development Workflow
+
+### For API Changes
+
+1. Modify endpoint handlers in `src/api/endpoints/`
+2. Update database operations in `src/db/operations/`
+3. Add/update database migrations in `migrations/`
+4. 
Update OpenAPI specification +5. Add integration tests + +### For Database Changes + +1. Create migration in `migrations/V{n}__{description}.sql` +2. Update corresponding types in `src/db/types/` +3. Modify database operations in `src/db/operations/` +4. Test migration rollback scenarios +5. Update integration tests + +### Testing Strategy + +#### Unit Tests +- Database operation testing with test fixtures +- API endpoint testing with mock dependencies +- Authentication and authorization testing + +#### Integration Tests +- End-to-end API testing with real database +- Pipeline lifecycle testing +- Multi-tenant isolation verification + +### Configuration Management + +The service supports multiple configuration sources: + +- **Environment Variables**: Runtime configuration +- **Configuration Files**: TOML-based configuration +- **Command Line Arguments**: Override configuration +- **Database Configuration**: Dynamic configuration storage + +### Error Handling + +- **Structured Errors**: Type-safe error propagation +- **HTTP Error Mapping**: Appropriate HTTP status codes +- **Database Error Handling**: Transaction rollback and recovery +- **Compilation Error Reporting**: Detailed error messages + +### Configuration Files + +- `Cargo.toml` - Package configuration with database features +- `build.rs` - Build-time banner and asset generation +- `migrations/` - Database schema evolution +- `openapi.json` - Generated API specification + +### Key Features + +- **Multi-tenancy**: Isolated environments per tenant +- **High Availability**: Database connection pooling and retry logic +- **Monitoring**: Structured logging and metrics +- **Security**: Input validation and SQL injection prevention + +### Dependencies + +#### Core Dependencies +- `actix-web` - HTTP server framework +- `sqlx` - Async PostgreSQL client +- `serde` - Serialization/deserialization +- `tokio` - Async runtime + +#### Database Dependencies +- `sqlx-postgres` - PostgreSQL driver +- `uuid` - UUID generation 
+- `chrono` - Date/time handling
+
+#### Authentication Dependencies
+- `jsonwebtoken` - JWT token handling
+- `argon2` - Password hashing
+- `oauth2` - OAuth2 client support
+
+### Performance Considerations
+
+- **Connection Pooling**: Database connection management
+- **Async Processing**: Non-blocking I/O operations
+- **Caching**: In-memory caching of frequently accessed data
+- **Batch Operations**: Efficient bulk database operations
+
+### Security Best Practices
+
+- **Input Validation**: Comprehensive request validation
+- **SQL Injection Prevention**: Parameterized queries
+- **Authentication**: Secure token management
+- **Authorization**: Fine-grained access control
+- **Audit Logging**: Security event tracking
+
+### Development Tools Integration
+
+- **Hot Reload**: Development server with automatic restart
+- **Database Migrations**: Version-controlled schema changes
+- **OpenAPI Generation**: Automatic API documentation
+- **Banner Generation**: Build-time asset creation
+
+### Monitoring and Observability
+
+- **Structured Logging**: JSON-formatted log output
+- **Metrics Collection**: Performance and usage metrics
+- **Health Checks**: Service health monitoring endpoints
+- **Distributed Tracing**: Request tracing across services
\ No newline at end of file
diff --git a/crates/rest-api/CLAUDE.md b/crates/rest-api/CLAUDE.md
new file mode 100644
index 00000000000..c5f9857fe64
--- /dev/null
+++ b/crates/rest-api/CLAUDE.md
@@ -0,0 +1,234 @@
+## Overview
+
+The rest-api crate defines the shared REST API types for Feldera and generates the OpenAPI specification consumed by the pipeline-manager and client SDKs.
+
+## Key Development Commands
+
+### Building and Testing
+
+```bash
+# Build the rest-api crate
+cargo build -p rest-api
+
+# Generate OpenAPI specification
+cargo run -p rest-api --bin generate_openapi
+
+# Validate OpenAPI spec
+cargo test -p rest-api
+```
+
+## Architecture Overview
+
+### Technology Stack
+
+- **OpenAPI Generation**: Automatic API specification generation
+- **Type Definitions**: Shared types for REST API
+- **Build Integration**: Build-time API specification generation
+
+### 
Core Purpose
+
+The rest-api crate provides **API specification and type definitions**:
+
+- **OpenAPI Spec**: Machine-readable API specification
+- **Type Safety**: Shared types between server and client
+- **Documentation**: API documentation generation
+- **Code Generation**: Support for client SDK generation
+
+### Project Structure
+
+#### Core Files
+
+- `src/lib.rs` - Core API type definitions
+- `build.rs` - Build-time OpenAPI generation
+- `openapi.json` - Generated OpenAPI specification
+
+## Important Implementation Details
+
+### OpenAPI Specification
+
+#### Generated Specification
+The crate generates a complete OpenAPI 3.0 specification including:
+- **Endpoints**: All REST API endpoints
+- **Schemas**: Request/response data structures
+- **Authentication**: Security scheme definitions
+- **Examples**: Request/response examples
+
+#### Specification Features
+```json
+{
+  "openapi": "3.0.0",
+  "info": {
+    "title": "Feldera API",
+    "version": "0.115.0"
+  },
+  "paths": {
+    "/api/pipelines": {
+      "get": {
+        "summary": "List pipelines",
+        "responses": { ... }
+      }
+    }
+  },
+  "components": {
+    "schemas": { ... }
+  }
+}
```
+
+### Type Definitions
+
+#### Core API Types
+```rust
+// Pipeline management types (sketch; type parameters reconstructed)
+pub struct Pipeline {
+    pub id: PipelineId,
+    pub name: String,
+    pub description: Option<String>,
+    pub config: PipelineConfig,
+    pub status: PipelineStatus,
+}
+
+// Program management types
+pub struct Program {
+    pub id: ProgramId,
+    pub name: String,
+    pub code: String,
+    pub schema: Option<ProgramSchema>,
+}
```
+
+#### Type Features
+- **Serialization**: Full serde support for JSON
+- **Validation**: Input validation with detailed errors
+- **Documentation**: Comprehensive field documentation
+- **Compatibility**: Version compatibility across releases
+
+### Build-Time Generation
+
+#### Build Script Integration
+```rust
+// build.rs
+fn main() {
+    generate_openapi_spec();
+    validate_spec_completeness();
+    update_client_bindings();
+}
```
+
+#### Generation Process
+1. 
**Extract Types**: Analyze Rust type definitions +2. **Generate Schemas**: Convert to OpenAPI schemas +3. **Validate Spec**: Ensure specification completeness +4. **Write Output**: Generate `openapi.json` file + +## Development Workflow + +### For API Changes + +1. Update type definitions in `src/lib.rs` +2. Add appropriate serde annotations +3. Add documentation comments +4. Run build to regenerate OpenAPI spec +5. Validate specification completeness +6. Update client SDKs if needed + +### For New Endpoints + +1. Define request/response types +2. Add appropriate validation +3. Document all fields and examples +4. Ensure consistent naming patterns +5. Test serialization/deserialization +6. Update API documentation + +### Testing Strategy + +#### Type Testing +- **Serialization**: Round-trip JSON serialization +- **Validation**: Input validation edge cases +- **Schema**: OpenAPI schema compliance +- **Compatibility**: Backward compatibility testing + +#### Specification Testing +- **Completeness**: All endpoints documented +- **Validity**: Valid OpenAPI 3.0 specification +- **Examples**: All examples are valid +- **Consistency**: Consistent naming and patterns + +### OpenAPI Features + +#### Schema Generation +- **Automatic**: Generated from Rust types +- **Comprehensive**: Complete type information +- **Validated**: Ensures spec correctness +- **Examples**: Includes realistic examples + +#### Documentation Integration +- **Interactive**: Swagger UI integration +- **Searchable**: Full-text search support +- **Versioned**: Version-specific documentation +- **Client Generation**: Support for multiple languages + +### Configuration Files + +- `Cargo.toml` - Minimal dependencies for type definitions +- `build.rs` - OpenAPI generation logic +- `openapi.json` - Generated API specification + +### Dependencies + +#### Core Dependencies +- `serde` - Serialization/deserialization +- `serde_json` - JSON support +- `uuid` - UUID type support +- `chrono` - Date/time types + 
+#### Build Dependencies +- `openapi` - OpenAPI specification generation +- `schemars` - JSON schema generation +- `utoipa` - OpenAPI derive macros + +### Best Practices + +#### Type Design +- **Consistency**: Follow consistent naming patterns +- **Documentation**: Document all public types and fields +- **Validation**: Include appropriate validation rules +- **Examples**: Provide realistic examples + +#### API Design +- **RESTful**: Follow REST principles +- **Consistent**: Consistent response formats +- **Versioned**: Support for API versioning +- **Error Handling**: Structured error responses + +#### Documentation +- **Comprehensive**: Document all endpoints and types +- **Examples**: Include request/response examples +- **Error Cases**: Document error conditions +- **Changelog**: Track API changes across versions + +### Usage Examples + +#### Type Usage +```rust +use rest_api::{Pipeline, PipelineStatus}; + +let pipeline = Pipeline { + id: PipelineId::new(), + name: "my-pipeline".to_string(), + status: PipelineStatus::Running, + ..Default::default() +}; + +let json = serde_json::to_string(&pipeline)?; +``` + +#### OpenAPI Integration +```bash +# Generate client SDK from spec +openapi-generator generate \ + -i openapi.json \ + -g typescript-fetch \ + -o client/typescript +``` + +This crate provides the foundational API types and specifications that enable consistent API usage across all Feldera components and client SDKs. 
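The round-trip serialization testing recommended above can be sketched without pulling in serde: each status value maps to a canonical string and must parse back to the same variant. The variant set and spellings below are hypothetical, not the crate's actual `PipelineStatus` definition:

```rust
// Sketch of the round-trip (serialize -> parse) property the type tests check.
// Variant names and string spellings are illustrative assumptions.

use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PipelineStatus {
    Shutdown,
    Provisioning,
    Running,
    Paused,
}

impl PipelineStatus {
    fn as_str(self) -> &'static str {
        match self {
            PipelineStatus::Shutdown => "shutdown",
            PipelineStatus::Provisioning => "provisioning",
            PipelineStatus::Running => "running",
            PipelineStatus::Paused => "paused",
        }
    }
}

impl FromStr for PipelineStatus {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "shutdown" => Ok(PipelineStatus::Shutdown),
            "provisioning" => Ok(PipelineStatus::Provisioning),
            "running" => Ok(PipelineStatus::Running),
            "paused" => Ok(PipelineStatus::Paused),
            other => Err(format!("unknown status: {other}")),
        }
    }
}

fn main() {
    // Round-trip every variant, as a schema-compliance test would.
    for s in [PipelineStatus::Shutdown, PipelineStatus::Provisioning,
              PipelineStatus::Running, PipelineStatus::Paused] {
        assert_eq!(s.as_str().parse::<PipelineStatus>().unwrap(), s);
    }
    println!("all statuses round-trip");
}
```

A property check like this catches renamed or removed variants that would otherwise silently break API compatibility across releases.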
\ No newline at end of file
diff --git a/crates/sqllib/CLAUDE.md b/crates/sqllib/CLAUDE.md
new file mode 100644
index 00000000000..6930ee9a976
--- /dev/null
+++ b/crates/sqllib/CLAUDE.md
@@ -0,0 +1,309 @@
+## Overview
+
+The sqllib crate is Feldera's SQL runtime library: it implements the functions, operators, and type representations that code generated by the SQL-to-DBSP compiler calls at runtime.
+
+## Key Development Commands
+
+### Building and Testing
+
+```bash
+# Build the sqllib crate
+cargo build -p sqllib
+
+# Run tests
+cargo test -p sqllib
+
+# Run specific test modules
+cargo test -p sqllib test_decimal
+cargo test -p sqllib test_string_functions
+
+# Check documentation
+cargo doc -p sqllib --open
+```
+
+## Architecture Overview
+
+### Technology Stack
+
+- **Numeric Computing**: High-precision decimal arithmetic
+- **String Processing**: Unicode-aware string operations
+- **Date/Time**: Comprehensive temporal data support
+- **Spatial Data**: Geographic point operations
+- **Type System**: Rich SQL type mapping to Rust types
+
+### Core Purpose
+
+SQL Library provides **runtime support functions** for SQL operations compiled by the SQL-to-DBSP compiler:
+
+- **Aggregate Functions**: SUM, COUNT, AVG, MIN, MAX, etc.
+- **Scalar Functions**: String manipulation, date/time operations, mathematical functions
+- **Type System**: SQL type representations with null handling
+- **Operators**: Arithmetic, comparison, logical operations
+- **Casting**: Type conversions between SQL types
+
+### Project Structure
+
+#### Core Modules
+
+- `src/aggregates.rs` - SQL aggregate function implementations
+- `src/operators.rs` - Arithmetic and comparison operators
+- `src/casts.rs` - Type conversion functions
+- `src/string.rs` - String manipulation functions
+- `src/timestamp.rs` - Date and time operations
+- `src/decimal.rs` - High-precision decimal arithmetic
+- `src/array.rs` - SQL array operations
+- `src/map.rs` - SQL map (key-value) operations
+
+## Important Implementation Details
+
+### Type System
+
+#### Core SQL Types
+```rust
+// Nullable wrapper for all SQL types
+// (element types below are representative; see the module sources for exact definitions)
+pub type SqlOption<T> = Option<T>;
+
+// String types with proper null handling
+pub type SqlString = Option<String>;
+pub type SqlChar = Option<String>;
+
+// Numeric types
+pub type SqlInt = Option<i32>;
+pub type SqlBigInt = Option<i64>;
+pub type SqlDecimal = Option<Decimal>;
+
+// Temporal types
+pub type SqlTimestamp = Option<Timestamp>;
+pub type SqlDate = Option<Date>;
+pub type SqlTime = Option<Time>