perf: Remove GetRegularWorkspaceCreateMetrics, calculate metric values inline #21281
Conversation
Add two new Prometheus metrics to track workspace creations without expensive
database queries:
1. workspace_creation_attempts_total - Tracks when regular (non-prebuilt)
workspace creation attempts begin (first 'start' build initiated)
- Labels: organization_name, template_name, preset_name
- Increments in coderd/workspaces.go when workspace build is created
2. workspace_creation_outcomes_total - Tracks outcomes of workspace first builds
- Labels: organization_name, template_name, preset_name, status (success/failure)
- Increments in coderd/provisionerdserver when provisioner job completes
Benefits:
- Real-time metrics (no polling delay)
- No expensive database queries
- Event-driven counters in packages where events occur
- Can calculate success rate: outcomes{status="success"} / attempts
Tests:
- Added comprehensive tests for both metrics using prometheus testutil
- Tests verify metrics increment correctly on first builds only
- Tests verify labels track organization, template, and preset names
- Tests verify outcomes track success/failure status
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit removes the GetRegularWorkspaceCreateMetrics database query and replaces it with inline event-driven Prometheus counters that track workspace creation attempts and outcomes.
Changes:
- Removed SQL query from coderd/database/queries/workspaces.sql
- Removed generated method implementations from dbmock, dbauthz, dbmetrics
- Removed old metric collection from coderd/prometheusmetrics/prometheusmetrics.go
- Fixed missing imports in coderd/database/queries.sql.go (pre-existing issue)
- Fixed method names to use GetPresetByID instead of GetTemplateVersionPresetByID
The new metrics (WorkspaceCreationAttemptsTotal and WorkspaceCreationOutcomesTotal) track workspace creations inline as they occur, eliminating the need for expensive database aggregations.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Simplified the tests to verify basic metric functionality without requiring a full PostgreSQL database. The new tests:
- Verify metrics can be incremented and read correctly
- Test that different label values (org, template, preset, status) are tracked
- Run quickly without needing Docker/PostgreSQL
The original integration tests required full workspace creation flows with PostgreSQL. These simplified unit tests focus on verifying that the Prometheus metric itself works correctly.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Rewrote metric tests to use dbmock instead of requiring PostgreSQL:
- Created table-driven tests covering all conditional logic scenarios
- Mock database calls with gomock expectations
- Test build numbers, transitions, initiators, presets, success/failure
- Tests run quickly without Docker/PostgreSQL dependencies
Created new test files:
- coderd/workspaces_metric_test.go: Tests WorkspaceCreationAttemptsTotal
- coderd/provisionerdserver/provisionerdserver_metric_test.go: Tests WorkspaceCreationOutcomesTotal
Removed old integration tests that required PostgreSQL and cleaned up unused imports.
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Extracted the metric increment logic into a reusable function `incrementWorkspaceCreationOutcomesMetric` to avoid duplicating the conditional logic in tests. This ensures the test verifies the actual production code path.
Changes:
- Added `incrementWorkspaceCreationOutcomesMetric()` function in provisionerdserver.go
- Updated test to call the extracted function instead of duplicating logic
- Tests now verify the actual implementation, not a recreation of it
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Extracted the metric increment logic into a reusable function `incrementWorkspaceCreationAttemptsMetric` to avoid duplicating the conditional logic in tests. This ensures the test verifies the actual production code path.
Changes:
- Added `incrementWorkspaceCreationAttemptsMetric()` function in workspaces.go
- Updated test to call the extracted function instead of duplicating logic
- Tests now verify the actual implementation, not a recreation of it
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed metrics registration so they are properly exported via the /metrics endpoint. Previously, metrics were created with promauto but never registered with the custom PrometheusRegistry, so they weren't exported in multi-replica deployments.
Changes:
- WorkspaceCreationAttemptsTotal: Now created with promauto.With(registry) in API initialization
- WorkspaceCreationOutcomesTotal: Added to provisionerdserver.Metrics and registered with the registry
- Renamed test files to follow the *_internal_test.go naming pattern for internal package tests
- Updated tests to create test metric instances and server/API objects
Metrics are now properly registered and will be scraped from all coderd replicas.
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The workspace creation metrics were not appearing on the /metrics endpoint due to registry conflicts:
1. WorkspaceCreationAttemptsTotal was only initialized when options.DeploymentValues.Prometheus.Enable was true, but the registry is always available. Removed the unnecessary check.
2. ProvisionerdServerMetrics was never initialized. Added initialization and registration with options.PrometheusRegistry.
3. CRITICAL: The global WorkspaceCreationOutcomesTotal variable used promauto.NewCounterVec(), which registers with the DEFAULT registry, not the custom options.PrometheusRegistry. This caused duplicate registration attempts and prevented the metric from appearing. Removed the global variable entirely and made the metric instance-based only, matching the pattern used by the working workspaceCreationTimings and workspaceClaimTimings metrics.
Both metrics now register correctly with the custom PrometheusRegistry and will appear on the /metrics endpoint after workspace creation events.
Refactored provisionerdserver.NewMetrics() to follow the same pattern as notifications.NewMetrics():
- Takes a prometheus.Registerer parameter
- Uses promauto.With(reg) to automatically register metrics on creation
- Removed the separate Register() method (no longer needed)
This matches the existing pattern in the codebase and ensures metrics are registered with the correct custom PrometheusRegistry, not the global default registry.
Changes:
- Updated the NewMetrics signature to accept prometheus.Registerer
- Changed metric creation from prometheus.New* to promauto.With(reg).New*
- Removed the Register() method
- Updated coderd.go to pass options.PrometheusRegistry when creating metrics
- Updated all tests to pass prometheus.NewRegistry()
Both workspace creation metrics will now appear correctly on /metrics.
Updated cli/server.go to pass prometheus.Registerer to provisionerdserver.NewMetrics() and removed the separate Register() call since promauto handles registration automatically now.
Add a nil check before creating ProvisionerdServerMetrics to avoid duplicate registration when metrics are already provided via options. This fixes a panic that occurred because metrics were being created in both cli/server.go and coderd.go.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Additionally, this is one of our metrics that IMO is semantically incorrect/not following Prometheus metric best practices due to every replica of coderd exposing the same metric value.
Completely agree. However, the reason coderd_workspace_creation_total was implemented as a counter updated with the all-time total is that we already have other metrics, like coderd_prebuilt_workspaces_claimed_total, with the same issue: they're also queried from the database, and every replica exports the same values. This was a requirement to be able to compare regular workspaces vs prebuilt workspaces (dashboard link) to understand whether prebuilds are being adopted.
With this PR's removal of coderd_workspace_creation_total, we lose this comparison since coderd_prebuilt_workspaces_claimed_total is all-time whereas coderd_workspace_creation_attempts_total is since the metric introduction.
I think the main problem with GetRegularWorkspaceCreateMetrics is the first_success_build subquery that attempts to find the first successful build (even if the first build failed). We could follow a simpler approach like we do for coderd_workspace_creation_duration_seconds, where we only report if the first build is successful. This should improve query performance while maintaining consistency with the prebuild metrics (even though we don't get exact values for workspace creation).
Alternatively, we could change the prebuild metrics to also be reported since server start instead of all-time, but that would require more extensive changes and would be a breaking change for those metrics 😕
    if workspaceBuild.BuildNumber == 1 &&
        workspaceBuild.Transition == database.WorkspaceTransitionStart &&
        workspaceBuild.InitiatorID != database.PrebuildsSystemUserID {
We already do similar checks for the timing metrics coderd_workspace_creation_duration_seconds and coderd_prebuilt_workspace_claim_duration_seconds on completeWorkspaceBuildJob:
    if s.metrics != nil && workspaceBuild.Transition == database.WorkspaceTransitionStart {
As well as, getting the presetID from the database, so maybe we could update all metrics in the same place?
I'll have a look 👍
        workspaceBuild.Transition == database.WorkspaceTransitionStart &&
        workspaceBuild.InitiatorID != database.PrebuildsSystemUserID {
        // Determine status based on job completion.
        status := "success"
This metric already encompasses coderd_workspace_creation_attempts_total, right? Since it reports both success and failure status. Could we consolidate into a single metric instead of having two separate ones?
What I had in mind (and it's not dealt with here yet) is that technically there can be an error for the workspace build job prior to the insertion of the job into the database. So "attempts" should potentially be renamed and should also track failures at that point.
Then, by comparing the two metrics, we can determine whether workspace builds are failing before they reach a provisioner or within the provisioner. And if there's a large discrepancy between the sums of these two metrics, we would be able to detect a growing backlog of build jobs, or build jobs being lost somehow.
Move the workspace creation outcomes metric increment into the same section as the timing metrics in provisionerdserver.go. This consolidates all metric updates in one place, reuses the same preset lookup logic, and uses fresh job data from the database.
Changes:
- Add UpdateWorkspaceCreationOutcomesMetric method to the Metrics struct
- Update provisionerdserver.go to call both timing and outcomes metrics in the same section (lines 2399-2458)
- Remove the separate incrementWorkspaceCreationOutcomesMetric method
- Remove the workspaceCreationOutcomesTotal field from the server struct
- Simplify tests to use the new consolidated approach
- Remove the unused prometheus import from provisionerdserver.go
The consolidation addresses feedback to update all metrics in the same place where preset lookups and job status checks already occur for the timing metrics.
Implemented via tasks.
As of Dec 15, even with only ~30k calls per week, the current GetRegularWorkspaceCreateMetrics query is either the first or second most expensive. The average execution time is ~185ms, but the plan P99 is over 2.5s and the execution P99 over 4s. Additionally, this is one of our metrics that IMO is semantically incorrect/not following Prometheus metric best practices, due to every replica of coderd exposing the same metric value.
This changes the "regular" workspace metric to be split into two: one for workspace creation 'events' and another for workspace creation job successes/failures.
To query correctly, end users will now need to sum across their coderd replicas, by some subset or all of the labels, to get the total count of workspace creations and successes/failures. IMO this also more nicely provides insight into failure cases when they happen, as I believe ATM you can only see that the initial creation code path failed by looking at logs.