@ByteBaker
Contributor

Summary

Implements alert correlation that groups related alerts into incidents based on semantic field matching and temporal proximity. Migrated from PR #9011 (separated from deduplication in PR #9209).

Core Features

  • Semantic Correlation: Match alerts by dimensions (service, host, k8s cluster, etc.)
  • Incident Lifecycle: Track incidents (Open → Acknowledged → Resolved)
  • Confidence Scoring: High/Medium/Low based on match quality
  • Transaction Safety: Atomic incident creation/updates to prevent race conditions
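
The incident lifecycle above can be sketched as a small state machine. This is only an illustration: the `IncidentStatus` enum, the `apply_status` helper, and the exact set of allowed transitions (for example, resolving an Open incident directly, or treating Resolved as terminal) are assumptions and are not taken from the PR's code.

```rust
/// Hypothetical status enum mirroring the Open -> Acknowledged -> Resolved lifecycle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IncidentStatus {
    Open,
    Acknowledged,
    Resolved,
}

impl IncidentStatus {
    /// Returns true if moving from `self` to `next` is allowed in this sketch.
    /// Assumption: Open may be resolved directly, and Resolved is terminal.
    pub fn can_transition_to(self, next: IncidentStatus) -> bool {
        use IncidentStatus::*;
        matches!(
            (self, next),
            (Open, Acknowledged) | (Open, Resolved) | (Acknowledged, Resolved)
        )
    }
}

/// Validates a requested status change and returns the new status or an error message.
pub fn apply_status(
    current: IncidentStatus,
    next: IncidentStatus,
) -> Result<IncidentStatus, String> {
    if current.can_transition_to(next) {
        Ok(next)
    } else {
        Err(format!("invalid transition: {:?} -> {:?}", current, next))
    }
}
```

A handler for the status endpoint could run a check like `apply_status` inside the same transaction that writes the row, so a concurrent update cannot move an incident backwards.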

API Endpoints (6)

  • GET /{org_id}/incidents - List incidents with status filter
  • GET /{org_id}/incidents/{id} - Get incident details
  • PUT /{org_id}/incidents/{id}/status - Update status (ack/resolve)
  • GET/POST/DELETE /{org_id}/correlation/config - Manage correlation settings

Database Schema

  • alert_incidents: Incident records with correlation metadata
  • alert_incident_alerts: Many-to-many alert-to-incident mapping
  • Migration: m20251107_000003_create_alert_correlation_schema

UI

  • Incidents tab in Alerts page
  • IncidentList.vue with status filtering
  • IncidentDetailsDrawer.vue for viewing associated alerts

Implementation

  • Scheduler integration: Auto-correlate on alert firing
  • 7 correlation metrics for observability
  • Feature-gated with #[cfg(feature = "enterprise")]

Fixes

  • Fix OrganizationSettingResponse test
  • Fix metering init call signature
  • Comment out data retention code pending enterprise update

@github-actions
Contributor

Failed to generate code suggestions for PR

…groups

Implements org-level deduplication configuration with semantic field groups
for intelligent alert suppression and batched notifications.

Core capabilities:
- Semantic field groups: Map field name variations (`hostname`/`host`/`node`) to
  canonical dimensions for consistent deduplication across data sources
- Org-level dedup config: Global settings with cross-alert deduplication
  support to suppress alerts sharing semantic dimensions
- Alert grouping: Wait-and-collect batching with three send strategies
  (`FirstWithCount`, `Summary`, `All`)
- HTTP API: Endpoints for org-level deduplication configuration management

Features:
- Per-alert fingerprint-based deduplication with TTL expiration
- Cross-alert semantic matching using org-defined dimension groups
- Configurable time windows (per-alert override or org default)
- Background job processes expired batches every 1 second
- Three notification strategies for grouped alerts
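
Fingerprint-based deduplication with TTL expiration could look roughly like the sketch below. The `fingerprint` function, the `DedupState` type, and the use of plain integer seconds for time are illustrative assumptions; the enterprise implementation may hash and store state differently.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Builds a fingerprint from the values of the configured fingerprint fields.
/// Fields missing from the alert contribute an empty string.
pub fn fingerprint(alert: &HashMap<String, String>, fields: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for field in fields {
        field.hash(&mut hasher);
        alert
            .get(*field)
            .map(String::as_str)
            .unwrap_or("")
            .hash(&mut hasher);
    }
    hasher.finish()
}

/// Hypothetical dedup state: fingerprint -> time the current window started.
pub struct DedupState {
    window_secs: u64,
    window_start: HashMap<u64, u64>,
}

impl DedupState {
    pub fn new(window_secs: u64) -> Self {
        Self { window_secs, window_start: HashMap::new() }
    }

    /// Returns true if the alert should be sent, false if it is suppressed.
    /// An entry older than the window is treated as expired and replaced.
    pub fn should_send(&mut self, fp: u64, now_secs: u64) -> bool {
        match self.window_start.get(&fp) {
            Some(&start) if now_secs.saturating_sub(start) < self.window_secs => false,
            _ => {
                self.window_start.insert(fp, now_secs);
                true
            }
        }
    }
}
```

Passing the clock in as `now_secs` keeps the sketch deterministic and easy to test; a real service would read the time itself.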

Configuration:
- `SemanticFieldGroup`: Define field equivalences with optional normalization
- `GlobalDeduplicationConfig`: Org-level settings stored at `/alert_config/{org_id}/deduplication`
- `DeduplicationConfig`: Per-alert settings with fingerprint fields and grouping options
- Default presets for common semantic groups (host, IP, service, K8s resources)
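
To illustrate how a semantic field group maps field name variations to one canonical dimension, here is a minimal sketch. The struct layout (`id`, `fields`, `normalize`) and the `extract_dimensions` helper are assumptions for illustration, and normalization is assumed to mean trimming and lowercasing; the actual `SemanticFieldGroup` definition in the PR may differ.

```rust
use std::collections::HashMap;

/// Hypothetical shape of a semantic field group.
pub struct SemanticFieldGroup {
    /// Canonical dimension name, e.g. "host".
    pub id: String,
    /// Field name variations that map to this dimension.
    pub fields: Vec<String>,
    /// Assumption: normalization trims whitespace and lowercases the value.
    pub normalize: bool,
}

/// Extracts canonical dimensions from an alert's raw fields.
/// The first field name listed in a group that is present in the alert wins.
pub fn extract_dimensions(
    alert: &HashMap<String, String>,
    groups: &[SemanticFieldGroup],
) -> HashMap<String, String> {
    let mut dims = HashMap::new();
    for group in groups {
        if let Some(value) = group.fields.iter().find_map(|f| alert.get(f)) {
            let value = if group.normalize {
                value.trim().to_lowercase()
            } else {
                value.clone()
            };
            dims.insert(group.id.clone(), value);
        }
    }
    dims
}
```

With a `host` group covering `hostname`, `host`, and `node`, an alert carrying `hostname = " Web-1 "` and another carrying `node = "web-1"` resolve to the same `host` dimension, which is what makes cross-alert matching possible.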

API Endpoints:
- `GET/POST/DELETE /{org_id}/alerts/deduplication/config`

Implementation:
- Business logic: Pure algorithms in enterprise layer for fingerprinting and matching
- Service layer: Orchestrates DB operations with algorithm delegation
- HTTP handlers: Feature-gated dual implementations for OSS/enterprise builds
- Background jobs: Batch processor for grouped notification delivery

Implemented wait-and-batch mechanism to group multiple alerts with the same fingerprint before sending a single notification, reducing alert fatigue and improving visibility.

**Alert Grouping/Batching:**
- Added `grouping.rs` module with in-memory batch storage using `DashMap`
- Implemented background worker in `alert_grouping.rs` polling every 1s for expired batches
- Integrated grouping logic in `scheduler/handlers.rs` after deduplication
- Supported all three `SendStrategy` variants: `FirstWithCount`, `Summary`, `All`
- Auto-send when `max_group_size` reached or timer expires after `group_wait_seconds`
- Registered background worker in `job/mod.rs`
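
The wait-and-batch flow described above might be sketched as follows. This version uses a plain `HashMap` and explicit timestamps to stay dependency-free, whereas the PR uses `DashMap` and a background worker; the `Grouper` type, its methods, and the exact output of each `SendStrategy` are assumptions inferred from the names only.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SendStrategy {
    FirstWithCount,
    Summary,
    All,
}

struct Batch {
    started_at: u64,
    alerts: Vec<String>,
}

/// Hypothetical in-memory grouper keyed by alert fingerprint.
pub struct Grouper {
    group_wait_seconds: u64,
    max_group_size: usize,
    batches: HashMap<u64, Batch>,
}

impl Grouper {
    pub fn new(group_wait_seconds: u64, max_group_size: usize) -> Self {
        Self { group_wait_seconds, max_group_size, batches: HashMap::new() }
    }

    /// Adds an alert to its batch. Returns the whole batch when it reaches max_group_size.
    pub fn add(&mut self, fp: u64, alert: String, now: u64) -> Option<Vec<String>> {
        let batch = self
            .batches
            .entry(fp)
            .or_insert_with(|| Batch { started_at: now, alerts: Vec::new() });
        batch.alerts.push(alert);
        if batch.alerts.len() >= self.max_group_size {
            return self.batches.remove(&fp).map(|b| b.alerts);
        }
        None
    }

    /// Worker tick: removes and returns every batch whose wait time has elapsed.
    pub fn take_expired(&mut self, now: u64) -> Vec<Vec<String>> {
        let wait = self.group_wait_seconds;
        let expired: Vec<u64> = self
            .batches
            .iter()
            .filter(|(_, b)| now.saturating_sub(b.started_at) >= wait)
            .map(|(fp, _)| *fp)
            .collect();
        expired
            .into_iter()
            .filter_map(|fp| self.batches.remove(&fp))
            .map(|b| b.alerts)
            .collect()
    }
}

/// Turns a finished batch into notification bodies (assumed semantics per strategy).
pub fn render(strategy: SendStrategy, alerts: &[String]) -> Vec<String> {
    match strategy {
        SendStrategy::FirstWithCount => alerts
            .first()
            .map(|first| format!("{} ({} alerts)", first, alerts.len()))
            .into_iter()
            .collect(),
        SendStrategy::Summary => vec![format!("{} alerts grouped", alerts.len())],
        SendStrategy::All => alerts.to_vec(),
    }
}
```

In this sketch the scheduler would call `add` after deduplication, and the 1s worker would call `take_expired` and pass each returned batch to `render`.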

**Observability - Prometheus Metrics:**
- Added 8 metrics to `metrics.rs`: dedup suppressions/passed/errors, grouping batches pending/sent/size/wait-time/errors
- Instrumented `deduplication.rs` to track suppressions and passed alerts by type (same-alert vs cross-alert)
- Instrumented `grouping.rs` and `alert_grouping.rs` to track batch lifecycle
- All metrics registered in Prometheus registry and exposed at `/metrics` endpoint

**UI Visibility:**
- Added dedup badges to alert names in `AlertList.vue` showing configuration status
- Added dedup column in `AlertHistory.vue` with visual indicators for sent/suppressed/grouped alerts
- Created `DedupSummaryCards.vue` component displaying org-wide stats (total alerts, dedup enabled count, suppression rate, pending batches)
- Added backend API `dedup_stats.rs` with `/alerts/dedup/summary` endpoint
- Removed legacy View History button from alert list page
- Extended `AlertHistoryEntry` with dedup fields: `dedup_enabled`, `dedup_suppressed`, `dedup_count`, `grouped`, `group_size`

**Logging & Debugging:**
- Comprehensive logging throughout grouping flow with `[grouping]` and `[alert_grouping_worker]` prefixes
- Enhanced deduplication logging with `[dedup]` prefix showing fingerprints and occurrence counts
- Added `get_pending_batch_count()` helper for API consumption

**Technical Details:**
- All features properly gated behind `#[cfg(feature = "enterprise")]`
- Backward compatible: grouping disabled by default
- In-memory batches are not persisted and are cleared on restart (acceptable for a 30s window)
- Thread-safe implementation using `DashMap` and atomic operations

Implements alert correlation that groups related alerts into incidents based on semantic field matching and temporal proximity.

**Backend:**
- Add `correlation.rs` config with validation for correlation dimensions and matching strategies
- Add `alert_incidents` and `alert_incident_alerts` database entities with SeaORM
- Add database migration `m20251107_000003_create_alert_correlation_schema`
- Add `correlation.rs` service with transaction-safe incident creation and matching
- Add `incidents.rs` HTTP handlers for incident CRUD operations (6 endpoints)
- Integrate correlation into alert scheduler to auto-correlate on alert firing
- Add 7 correlation metrics for observability (including incidents created, alerts matched, confidence distribution, processing duration, and MTTR)
- Update `org_config.rs` with correlation config persistence functions
- Update organization settings to include deduplication config in response
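
As a rough picture of matching on semantic dimensions plus temporal proximity, the sketch below scores an incoming alert against an existing incident. The `Incident` struct, the `correlate` function, and the thresholds for High/Medium/Low are assumptions chosen for illustration; the PR does not state how confidence is actually computed.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Confidence {
    High,
    Medium,
    Low,
}

/// Hypothetical in-memory view of an incident.
pub struct Incident {
    pub dimensions: HashMap<String, String>,
    pub last_alert_at: u64,
}

/// Returns a confidence level if the alert correlates with the incident, else None.
/// Assumed rules: the alert must fall within `window_secs` of the incident's last
/// alert and share at least one dimension value. All dimensions matching is High,
/// at least half is Medium, anything less is Low.
pub fn correlate(
    incident: &Incident,
    alert_dims: &HashMap<String, String>,
    alert_at: u64,
    window_secs: u64,
) -> Option<Confidence> {
    if alert_at.abs_diff(incident.last_alert_at) > window_secs {
        return None;
    }
    let total = incident.dimensions.len();
    let matched = incident
        .dimensions
        .iter()
        .filter(|(k, v)| alert_dims.get(*k) == Some(*v))
        .count();
    if matched == 0 {
        return None;
    }
    Some(if matched == total {
        Confidence::High
    } else if matched * 2 >= total {
        Confidence::Medium
    } else {
        Confidence::Low
    })
}
```

A `None` result would mean opening a new incident, while `Some(_)` would attach the alert to the existing one; doing both the lookup and the write in one database transaction is what keeps two concurrent alerts from creating duplicate incidents.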

**Frontend:**
- Add `IncidentList.vue` component with status filtering and sortable table
- Add `IncidentDetailsDrawer.vue` for viewing incident details and associated alerts
- Add Incidents tab to `AlertList.vue` for accessing incident management UI
- Add 6 incident API methods to `alerts.ts` service (list, get, update status, config CRUD)

**Fixes:**
- Fix `OrganizationSettingResponse` test to include `deduplication_config` field
- Fix metering init call signature (remove extra argument)
- Comment out data retention usage code pending enterprise module update

Migrated from PR #9011, separated from deduplication feature (PR #9209).