feat: add alert correlation and incident management #9297
Open: ByteBaker wants to merge 3 commits into `main` from `feat/alert/corr`
…groups
Implements org-level deduplication configuration with semantic field groups
for intelligent alert suppression and batched notifications.
Core capabilities:
- Semantic field groups: Map field name variations (`hostname`/`host`/`node`) to
canonical dimensions for consistent deduplication across data sources
- Org-level dedup config: Global settings with cross-alert deduplication
support to suppress alerts sharing semantic dimensions
- Alert grouping: Wait-and-collect batching with three send strategies
(`FirstWithCount`, `Summary`, `All`)
- HTTP API: Endpoints for org-level deduplication configuration management
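The semantic field group idea can be illustrated with a minimal sketch. The struct layout, the `canonicalize` helper, and the use of lowercasing as the normalization step are assumptions made for illustration; they are not taken from the PR's code.

```rust
use std::collections::HashMap;

/// Hypothetical shape of a semantic field group: one canonical dimension
/// plus the field names treated as equivalent to it.
struct SemanticFieldGroup {
    canonical: &'static str,
    aliases: &'static [&'static str],
}

/// Rewrites an alert's fields so every known alias is replaced by its
/// canonical dimension. Unknown fields pass through under their own name.
/// If two aliases of the same group are present, one value wins arbitrarily.
fn canonicalize(
    fields: &HashMap<String, String>,
    groups: &[SemanticFieldGroup],
) -> HashMap<String, String> {
    let mut out = HashMap::new();
    for (name, value) in fields {
        let key = groups
            .iter()
            .find(|g| g.aliases.iter().any(|a| *a == name.as_str()))
            .map(|g| g.canonical.to_string())
            .unwrap_or_else(|| name.clone());
        // Lowercasing stands in for the "optional normalization" step (assumption).
        out.insert(key, value.to_lowercase());
    }
    out
}

fn main() {
    let groups = [SemanticFieldGroup {
        canonical: "host",
        aliases: &["hostname", "host", "node"],
    }];
    let a = HashMap::from([("hostname".to_string(), "Web-01".to_string())]);
    let b = HashMap::from([("node".to_string(), "web-01".to_string())]);
    // Alerts from different sources collapse onto the same dimension and value.
    assert_eq!(canonicalize(&a, &groups), canonicalize(&b, &groups));
    println!("{:?}", canonicalize(&a, &groups));
}
```

Once both alerts carry the same `host` dimension, a cross-alert deduplication step can compare them without knowing which data source used which field name.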
Features:
- Per-alert fingerprint-based deduplication with TTL expiration
- Cross-alert semantic matching using org-defined dimension groups
- Configurable time windows (per-alert override or org default)
- Background job processes expired batches every 1 second
- Three notification strategies for grouped alerts
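As a rough sketch of fingerprint-based deduplication with a time window: the `fingerprint` and `Deduplicator` names below are hypothetical, the hash function is the standard library's default rather than whatever the PR uses, and the window is assumed to start at the first occurrence.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::time::{Duration, Instant};

/// Hashes the chosen fingerprint fields in the order given, so alerts that
/// agree on those fields get the same fingerprint regardless of other fields.
fn fingerprint(alert: &HashMap<String, String>, fields: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for field in fields {
        field.hash(&mut hasher);
        alert.get(*field).hash(&mut hasher);
    }
    hasher.finish()
}

/// In-memory dedup state: fingerprint -> (first seen, occurrence count).
struct Deduplicator {
    window: Duration,
    seen: HashMap<u64, (Instant, u64)>,
}

impl Deduplicator {
    fn new(window: Duration) -> Self {
        Self { window, seen: HashMap::new() }
    }

    /// Returns true if the alert should be sent, false if it is suppressed.
    /// An entry older than the window is treated as expired and replaced.
    fn should_send(&mut self, fp: u64, now: Instant) -> bool {
        let window = self.window;
        if let Some((first, count)) = self.seen.get_mut(&fp) {
            if now.duration_since(*first) < window {
                *count += 1;
                return false;
            }
        }
        self.seen.insert(fp, (now, 1));
        true
    }
}

fn main() {
    let alert = HashMap::from([("host".to_string(), "web-01".to_string())]);
    let fp = fingerprint(&alert, &["host"]);
    let mut dedup = Deduplicator::new(Duration::from_secs(300));
    let t0 = Instant::now();
    println!("first: {}", dedup.should_send(fp, t0));
    println!("repeat: {}", dedup.should_send(fp, t0 + Duration::from_secs(60)));
}
```

Passing `now` in explicitly keeps the logic testable without sleeping; a per-alert override of the window would simply construct the `Deduplicator` (or look up the duration) with a different value.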
Configuration:
- `SemanticFieldGroup`: Define field equivalences with optional normalization
- `GlobalDeduplicationConfig`: Org-level settings stored at `/alert_config/{org_id}/deduplication`
- `DeduplicationConfig`: Per-alert settings with fingerprint fields and grouping options
- Default presets for common semantic groups (host, IP, service, K8s resources)
API Endpoints:
- `GET/POST/DELETE /{org_id}/alerts/deduplication/config`
Implementation:
- Business logic: Pure algorithms in enterprise layer for fingerprinting and matching
- Service layer: Orchestrates DB operations with algorithm delegation
- HTTP handlers: Feature-gated dual implementations for OSS/enterprise builds
- Background jobs: Batch processor for grouped notification delivery
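The "feature-gated dual implementations" pattern can be sketched as two functions with the same name and signature, only one of which is compiled. The function name, the tuple response type, and the 403 response for OSS builds are all hypothetical; the PR does not state what the OSS handler returns.

```rust
/// Minimal stand-in for an HTTP response: (status code, body).
type Response = (u16, String);

/// Enterprise build: a real handler would delegate to the service layer.
#[cfg(feature = "enterprise")]
fn get_dedup_config(org_id: &str) -> Response {
    // Hypothetical body; the stored config would be loaded here.
    (200, format!("{{\"org_id\":\"{org_id}\",\"enabled\":true}}"))
}

/// OSS build: same name and signature, so the routing code stays identical.
#[cfg(not(feature = "enterprise"))]
fn get_dedup_config(org_id: &str) -> Response {
    // Assumed behaviour for illustration only.
    (403, format!("deduplication config for {org_id} requires the enterprise build"))
}

fn main() {
    let (status, body) = get_dedup_config("default");
    println!("{status} {body}");
}
```

Because the choice is made at compile time, callers never branch on the feature; building without the `enterprise` feature selects the second function.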
Implemented wait-and-batch mechanism to group multiple alerts with the same fingerprint before sending a single notification, reducing alert fatigue and improving visibility.
**Alert Grouping/Batching:**
- Added `grouping.rs` module with in-memory batch storage using `DashMap`
- Implemented background worker in `alert_grouping.rs` polling every 1s for expired batches
- Integrated grouping logic in `scheduler/handlers.rs` after deduplication
- Supported all three `SendStrategy` variants: `FirstWithCount`, `Summary`, `All`
- Auto-send when `max_group_size` reached or timer expires after `group_wait_seconds`
- Registered background worker in `job/mod.rs`
**Observability - Prometheus Metrics:**
- Added 8 metrics to `metrics.rs`: dedup suppressions/passed/errors, grouping batches pending/sent/size/wait-time/errors
- Instrumented `deduplication.rs` to track suppressions and passed alerts by type (same-alert vs cross-alert)
- Instrumented `grouping.rs` and `alert_grouping.rs` to track batch lifecycle
- All metrics registered in Prometheus registry and exposed at `/metrics` endpoint
**UI Visibility:**
- Added dedup badges to alert names in `AlertList.vue` showing configuration status
- Added dedup column in `AlertHistory.vue` with visual indicators for sent/suppressed/grouped alerts
- Created `DedupSummaryCards.vue` component displaying org-wide stats (total alerts, dedup enabled count, suppression rate, pending batches)
- Added backend API `dedup_stats.rs` with `/alerts/dedup/summary` endpoint
- Removed legacy View History button from alert list page
- Extended `AlertHistoryEntry` with dedup fields: `dedup_enabled`, `dedup_suppressed`, `dedup_count`, `grouped`, `group_size`
**Logging & Debugging:**
- Comprehensive logging throughout grouping flow with `[grouping]` and `[alert_grouping_worker]` prefixes
- Enhanced deduplication logging with `[dedup]` prefix showing fingerprints and occurrence counts
- Added `get_pending_batch_count()` helper for API consumption
**Technical Details:**
- All features properly gated behind `#[cfg(feature = "enterprise")]`
- Backward compatible: grouping disabled by default
- In-memory batches cleared on restart (acceptable for 30s window)
- Thread-safe implementation using `DashMap` and atomic operations
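The wait-and-batch flow described in this commit might look roughly like the sketch below. It uses a plain `HashMap` instead of `DashMap` to stay dependency-free, and the `Grouper`, `Batch`, and `render` names, as well as the exact output of each `SendStrategy` variant, are assumptions rather than the PR's actual behaviour.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Clone, Copy)]
enum SendStrategy {
    FirstWithCount,
    Summary,
    All,
}

/// Alerts collected for one fingerprint, plus the time the batch must be sent.
struct Batch {
    alerts: Vec<String>,
    deadline: Instant,
}

struct Grouper {
    group_wait: Duration,
    max_group_size: usize,
    strategy: SendStrategy,
    pending: HashMap<u64, Batch>,
}

impl Grouper {
    /// Adds an alert to its batch. Returns notifications only when the batch
    /// has just reached `max_group_size`; otherwise the alert keeps waiting.
    fn add(&mut self, fp: u64, alert: String, now: Instant) -> Vec<String> {
        let deadline = now + self.group_wait;
        let batch = self
            .pending
            .entry(fp)
            .or_insert_with(|| Batch { alerts: Vec::new(), deadline });
        batch.alerts.push(alert);
        if batch.alerts.len() >= self.max_group_size {
            let full = self.pending.remove(&fp).expect("batch was just inserted");
            return render(self.strategy, &full.alerts);
        }
        Vec::new()
    }

    /// What a background worker would call on every poll tick.
    fn flush_expired(&mut self, now: Instant) -> Vec<String> {
        let expired: Vec<u64> = self
            .pending
            .iter()
            .filter(|(_, b)| b.deadline <= now)
            .map(|(fp, _)| *fp)
            .collect();
        let mut out = Vec::new();
        for fp in expired {
            if let Some(batch) = self.pending.remove(&fp) {
                out.extend(render(self.strategy, &batch.alerts));
            }
        }
        out
    }
}

/// Turns a non-empty batch into notifications (output format is illustrative).
fn render(strategy: SendStrategy, alerts: &[String]) -> Vec<String> {
    match strategy {
        SendStrategy::FirstWithCount => {
            vec![format!("{} (+{} more)", alerts[0], alerts.len() - 1)]
        }
        SendStrategy::Summary => {
            vec![format!("{} alerts: {}", alerts.len(), alerts.join(", "))]
        }
        SendStrategy::All => alerts.to_vec(),
    }
}

fn main() {
    let mut grouper = Grouper {
        group_wait: Duration::from_secs(30),
        max_group_size: 3,
        strategy: SendStrategy::Summary,
        pending: HashMap::new(),
    };
    let t0 = Instant::now();
    grouper.add(1, "cpu high".to_string(), t0);
    grouper.add(1, "cpu still high".to_string(), t0);
    println!("{:?}", grouper.flush_expired(t0 + Duration::from_secs(30)));
}
```

The deadline is fixed when the first alert of a batch arrives, so later alerts with the same fingerprint do not extend the wait; whether the PR behaves the same way is not stated.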
Implements alert correlation that groups related alerts into incidents based on semantic field matching and temporal proximity.
**Backend:**
- Add `correlation.rs` config with validation for correlation dimensions and matching strategies
- Add `alert_incidents` and `alert_incident_alerts` database entities with SeaORM
- Add database migration `m20251107_000003_create_alert_correlation_schema`
- Add `correlation.rs` service with transaction-safe incident creation and matching
- Add `incidents.rs` HTTP handlers for incident CRUD operations (6 endpoints)
- Integrate correlation into alert scheduler to auto-correlate on alert firing
- Add 7 correlation metrics for observability (incidents created, alerts matched, confidence distribution, processing duration, MTTR)
- Update `org_config.rs` with correlation config persistence functions
- Update organization settings to include deduplication config in response
**Frontend:**
- Add `IncidentList.vue` component with status filtering and sortable table
- Add `IncidentDetailsDrawer.vue` for viewing incident details and associated alerts
- Add Incidents tab to `AlertList.vue` for accessing incident management UI
- Add 6 incident API methods to `alerts.ts` service (list, get, update status, config CRUD)
**Fixes:**
- Fix `OrganizationSettingResponse` test to include `deduplication_config` field
- Fix metering init call signature (remove extra argument)
- Comment out data retention usage code pending enterprise module update
Migrated from PR #9011, separated from deduplication feature (PR #9209).
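One way to picture correlation by semantic field matching and temporal proximity is the in-memory sketch below. The `Alert` and `Incident` structs, the exact-match rule on every correlation dimension, and the sliding time window are assumptions; the PR's actual matching strategies, confidence scoring, and database layer are not modelled.

```rust
use std::collections::HashMap;

struct Alert {
    name: String,
    ts_secs: u64,
    dims: HashMap<String, String>,
}

struct Incident {
    id: usize,
    dims: HashMap<String, String>,
    last_ts: u64,
    alerts: Vec<String>,
}

/// An alert matches an incident when it fired within `window_secs` of the
/// incident's latest alert and both agree on every correlation dimension.
/// A dimension missing on either side counts as a mismatch.
fn matches(incident: &Incident, alert: &Alert, dims: &[&str], window_secs: u64) -> bool {
    let close_in_time = alert.ts_secs.saturating_sub(incident.last_ts) <= window_secs;
    let same_dims = dims.iter().all(|d| {
        match (incident.dims.get(*d), alert.dims.get(*d)) {
            (Some(a), Some(b)) => a == b,
            _ => false,
        }
    });
    close_in_time && same_dims
}

/// Attaches the alert to the first matching incident, or opens a new one.
/// Returns the id of the incident the alert ended up in.
fn correlate(
    incidents: &mut Vec<Incident>,
    alert: Alert,
    dims: &[&str],
    window_secs: u64,
) -> usize {
    if let Some(incident) = incidents
        .iter_mut()
        .find(|i| matches(i, &alert, dims, window_secs))
    {
        incident.last_ts = alert.ts_secs;
        incident.alerts.push(alert.name);
        return incident.id;
    }
    let id = incidents.len() + 1;
    incidents.push(Incident {
        id,
        dims: alert.dims,
        last_ts: alert.ts_secs,
        alerts: vec![alert.name],
    });
    id
}

fn main() {
    let mut incidents = Vec::new();
    let alert = |name: &str, ts: u64, host: &str| Alert {
        name: name.to_string(),
        ts_secs: ts,
        dims: HashMap::from([("host".to_string(), host.to_string())]),
    };
    let first = correlate(&mut incidents, alert("cpu high", 0, "web-01"), &["host"], 300);
    let second = correlate(&mut incidents, alert("disk full", 100, "web-01"), &["host"], 300);
    println!("same incident: {}", first == second);
}
```

In a persistent implementation the lookup and the insert would need to happen in one transaction, which is presumably what "transaction-safe incident creation and matching" refers to.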
e933661 to 5dead49
Summary
Implements alert correlation that groups related alerts into incidents based on semantic field matching and temporal proximity. Migrated from PR #9011 (separated from deduplication in PR #9209).
Core Features
API Endpoints (6)
- `GET /{org_id}/incidents` - List incidents with status filter
- `GET /{org_id}/incidents/{id}` - Get incident details
- `PUT /{org_id}/incidents/{id}/status` - Update status (ack/resolve)
- `GET/POST/DELETE /{org_id}/correlation/config` - Manage correlation settings
Database Schema
- `alert_incidents`: Incident records with correlation metadata
- `alert_incident_alerts`: Many-to-many alert-to-incident mapping
- `m20251107_000003_create_alert_correlation_schema`
UI
- `IncidentList.vue` with status filtering
- `IncidentDetailsDrawer.vue` for viewing associated alerts
Implementation
- `#[cfg(feature = "enterprise")]`
Fixes
- `OrganizationSettingResponse` test