
Conversation

@cstyan (Contributor) commented Dec 15, 2025

Implemented via tasks.

As of Dec 15, even with only ~30k calls per week, the current GetRegularWorkspaceCreateMetrics query is either our first or second most expensive query. The average execution time is ~185ms, but the plan P99 is over 2.5s and the execution P99 is over 4s. Additionally, this is one of our metrics that IMO is semantically incorrect and does not follow Prometheus metric best practices, because every replica of coderd exposes the same metric value.

This splits the "regular" workspace metric into two: one for workspace creation 'events' and another for workspace creation job successes/failures.

To query correctly, end users will now need to sum across their coderd replicas, by some subset (or all) of the labels, to get the total count of workspace creations and successes/failures. IMO this also gives nicer insight into failure cases when they happen; as far as I can tell, at the moment you can only see that the initial creation code path failed by looking at logs.
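For example, queries along these lines would do the aggregation (a sketch: the attempts metric name appears later in this thread, while the outcomes name and its `coderd_` prefix are assumptions based on the commit messages below):

```promql
# Total workspace creation attempts per template, summed across coderd replicas.
sum by (template_name) (coderd_workspace_creation_attempts_total)

# Fraction of first builds that succeeded, aggregated across replicas.
sum(coderd_workspace_creation_outcomes_total{status="success"})
  /
sum(coderd_workspace_creation_outcomes_total)
```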

@github-actions bot commented Dec 15, 2025


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


1 out of 2 committers have signed the CLA.
✅ [cstyan](https://github.com/cstyan)
❌ Callum Styan
Callum Styan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting `recheck` in this Pull Request. Posted by the CLA Assistant Lite bot.

@cstyan force-pushed the callum/workspace-create-metrics branch 2 times, most recently from 8eb6651 to 40720fd on December 15, 2025 21:32
Callum Styan and others added 12 commits December 16, 2025 00:02
Add two new Prometheus metrics to track workspace creations without expensive
database queries:

1. workspace_creation_attempts_total - Tracks when regular (non-prebuilt)
   workspace creation attempts begin (first 'start' build initiated)
   - Labels: organization_name, template_name, preset_name
   - Increments in coderd/workspaces.go when workspace build is created

2. workspace_creation_outcomes_total - Tracks outcomes of workspace first builds
   - Labels: organization_name, template_name, preset_name, status (success/failure)
   - Increments in coderd/provisionerdserver when provisioner job completes

Benefits:
- Real-time metrics (no polling delay)
- No expensive database queries
- Event-driven counters in packages where events occur
- Can calculate success rate: outcomes{status="success"} / attempts

Tests:
- Added comprehensive tests for both metrics using prometheus testutil
- Tests verify metrics increment correctly on first builds only
- Tests verify labels track organization, template, and preset names
- Tests verify outcomes track success/failure status

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
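As a rough sketch of what the two counter definitions could look like (the metric and label names come from the commit message above; the helper name, package, and `coderd` namespace are assumptions, not the PR's actual code):

```go
package prometheusmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// NewWorkspaceCreationCounters is a hypothetical helper illustrating the two
// counters described above.
func NewWorkspaceCreationCounters(reg prometheus.Registerer) (attempts, outcomes *prometheus.CounterVec) {
	attempts = promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
		Namespace: "coderd",
		Name:      "workspace_creation_attempts_total",
		Help:      "Incremented when a regular (non-prebuilt) workspace's first 'start' build is initiated.",
	}, []string{"organization_name", "template_name", "preset_name"})

	outcomes = promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
		Namespace: "coderd",
		Name:      "workspace_creation_outcomes_total",
		Help:      "Incremented when the provisioner job for a workspace's first build completes.",
	}, []string{"organization_name", "template_name", "preset_name", "status"})
	return attempts, outcomes
}
```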
This commit removes the GetRegularWorkspaceCreateMetrics database query
and replaces it with inline event-driven Prometheus counters that track
workspace creation attempts and outcomes.

Changes:
- Removed SQL query from coderd/database/queries/workspaces.sql
- Removed generated method implementations from dbmock, dbauthz, dbmetrics
- Removed old metric collection from coderd/prometheusmetrics/prometheusmetrics.go
- Fixed missing imports in coderd/database/queries.sql.go (pre-existing issue)
- Fixed method names to use GetPresetByID instead of GetTemplateVersionPresetByID

The new metrics (WorkspaceCreationAttemptsTotal and WorkspaceCreationOutcomesTotal)
track workspace creations inline as they occur, eliminating the need for expensive
database aggregations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Simplified the tests to verify basic metric functionality without requiring
a full PostgreSQL database. The new tests:
- Verify metrics can be incremented and read correctly
- Test that different label values (org, template, preset, status) are tracked
- Run quickly without needing Docker/PostgreSQL

The original integration tests required full workspace creation flows with
PostgreSQL. These simplified unit tests focus on verifying the prometheus
metric itself works correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
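A minimal sketch of such a registry-only unit test, using the Prometheus testutil package mentioned earlier (the test name, package, and label values are made up for illustration):

```go
package prometheusmetrics_test

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/stretchr/testify/require"
)

func TestWorkspaceCreationAttemptsCounter(t *testing.T) {
	t.Parallel()

	reg := prometheus.NewRegistry()
	attempts := prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "coderd",
		Name:      "workspace_creation_attempts_total",
	}, []string{"organization_name", "template_name", "preset_name"})
	require.NoError(t, reg.Register(attempts))

	// Increment one label combination and read the value back directly,
	// without any database or HTTP scrape involved.
	attempts.WithLabelValues("default", "docker", "large").Inc()
	got := testutil.ToFloat64(attempts.WithLabelValues("default", "docker", "large"))
	require.Equal(t, float64(1), got)
}
```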
Rewrote metric tests to use dbmock instead of requiring PostgreSQL:
- Created table-driven tests covering all conditional logic scenarios
- Mock database calls with gomock expectations
- Test build numbers, transitions, initiators, presets, success/failure
- Tests run quickly without Docker/PostgreSQL dependencies

Created new test files:
- coderd/workspaces_metric_test.go: Tests WorkspaceCreationAttemptsTotal
- coderd/provisionerdserver/provisionerdserver_metric_test.go: Tests WorkspaceCreationOutcomesTotal

Removed old integration tests that required PostgreSQL and cleaned up unused imports.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Extracted the metric increment logic into a reusable function
`incrementWorkspaceCreationOutcomesMetric` to avoid duplicating the
conditional logic in tests. This ensures the test verifies the actual
production code path.

Changes:
- Added `incrementWorkspaceCreationOutcomesMetric()` function in provisionerdserver.go
- Updated test to call the extracted function instead of duplicating logic
- Tests now verify the actual implementation, not a recreation of it

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Extracted the metric increment logic into a reusable function
`incrementWorkspaceCreationAttemptsMetric` to avoid duplicating the
conditional logic in tests. This ensures the test verifies the actual
production code path.

Changes:
- Added `incrementWorkspaceCreationAttemptsMetric()` function in workspaces.go
- Updated test to call the extracted function instead of duplicating logic
- Tests now verify the actual implementation, not a recreation of it

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
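A rough sketch of the extracted attempts helper these two commits describe; the signature and parameters are assumptions, and the conditions mirror the diff hunk quoted later in the review:

```go
package coderd

import (
	"github.com/prometheus/client_golang/prometheus"

	// Import path assumed from the repository layout referenced in this PR.
	"github.com/coder/coder/v2/coderd/database"
)

// incrementWorkspaceCreationAttemptsMetric applies the "first regular start
// build" conditions in one place, so tests can exercise the same code path as
// production instead of re-implementing the checks.
func incrementWorkspaceCreationAttemptsMetric(
	counter *prometheus.CounterVec,
	build database.WorkspaceBuild,
	organizationName, templateName, presetName string,
) {
	if build.BuildNumber != 1 ||
		build.Transition != database.WorkspaceTransitionStart ||
		build.InitiatorID == database.PrebuildsSystemUserID {
		return
	}
	counter.WithLabelValues(organizationName, templateName, presetName).Inc()
}
```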
Fixed metrics registration so they are properly exported via /metrics endpoint.
Previously, metrics were created with promauto but never registered with the
custom PrometheusRegistry, so they weren't exported in multi-replica deployments.

Changes:
- WorkspaceCreationAttemptsTotal: Now created with promauto.With(registry) in API initialization
- WorkspaceCreationOutcomesTotal: Added to provisionerdserver.Metrics and registered with registry
- Renamed test files to follow *_internal_test.go naming pattern for internal package tests
- Updated tests to create test metric instances and server/API objects

Metrics are now properly registered and will be scraped from all coderd replicas.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The workspace creation metrics were not appearing on the /metrics endpoint
due to registry conflicts:

1. WorkspaceCreationAttemptsTotal was only initialized when
   options.DeploymentValues.Prometheus.Enable was true, but the registry
   is always available. Removed unnecessary check.

2. ProvisionerdServerMetrics was never initialized. Added initialization
   and registration with options.PrometheusRegistry.

3. CRITICAL: Global WorkspaceCreationOutcomesTotal variable used
   promauto.NewCounterVec() which registers with the DEFAULT registry,
   not the custom options.PrometheusRegistry. This caused duplicate
   registration attempts and prevented the metric from appearing.

   Removed the global variable entirely and made the metric
   instance-based only, matching the pattern used by the working
   workspaceCreationTimings and workspaceClaimTimings metrics.

Both metrics now register correctly with the custom PrometheusRegistry
and will appear on the /metrics endpoint after workspace creation events.
Refactored provisionerdserver.NewMetrics() to follow the same pattern as
notifications.NewMetrics():
- Takes prometheus.Registerer parameter
- Uses promauto.With(reg) to automatically register metrics on creation
- Removed separate Register() method (no longer needed)

This matches the existing pattern in the codebase and ensures metrics are
registered with the correct custom PrometheusRegistry, not the global
default registry.

Changes:
- Updated NewMetrics signature to accept prometheus.Registerer
- Changed metric creation from prometheus.New* to promauto.With(reg).New*
- Removed Register() method
- Updated coderd.go to pass options.PrometheusRegistry when creating metrics
- Updated all tests to pass prometheus.NewRegistry()

Both workspace creation metrics will now appear correctly on /metrics.
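A minimal sketch of the constructor pattern described here, using promauto.With(reg) so the collectors land on the custom registry rather than the global default (the struct field name and namespace are assumptions):

```go
package provisionerdserver

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Metrics holds provisionerdserver-side Prometheus collectors. Only the
// workspace creation outcomes counter is sketched here.
type Metrics struct {
	workspaceCreationOutcomesTotal *prometheus.CounterVec
}

// NewMetrics registers its collectors on the provided registerer (e.g.
// options.PrometheusRegistry), mirroring notifications.NewMetrics.
func NewMetrics(reg prometheus.Registerer) *Metrics {
	return &Metrics{
		workspaceCreationOutcomesTotal: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
			Namespace: "coderd",
			Name:      "workspace_creation_outcomes_total",
			Help:      "Outcomes (success/failure) of regular workspaces' first builds.",
		}, []string{"organization_name", "template_name", "preset_name", "status"}),
	}
}
```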
Updated cli/server.go to pass prometheus.Registerer to
provisionerdserver.NewMetrics() and removed the separate Register() call
since promauto handles registration automatically now.
Add nil check before creating ProvisionerdServerMetrics to avoid
duplicate registration when metrics are already provided via options.

This fixes a panic that occurred because the metrics were being created in
both cli/server.go and coderd.go.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan force-pushed the callum/workspace-create-metrics branch from 4454f50 to 1aca85a on December 16, 2025 00:29
@ssncferreira (Contributor) left a comment


> Additionally, this is one of our metrics that IMO is semantically incorrect/not following Prometheus metric best practices due to every replica of coderd exposing the same metric value.

Completely agree. However, the reason coderd_workspace_creation_total was implemented as a counter updated with the all-time total (not just since server start) is that we already have other metrics, like coderd_prebuilt_workspaces_claimed_total, with the same issue: they're also queried from the database, and every replica exports the same values. This was a requirement for being able to compare regular workspaces vs prebuilt workspaces (dashboard link) to understand whether prebuilds are being adopted.

(Screenshot: dashboard comparison of regular vs. prebuilt workspaces, 2025-12-16)

With this PR's removal of coderd_workspace_creation_total, we lose this comparison since coderd_prebuilt_workspaces_claimed_total is all-time whereas coderd_workspace_creation_attempts_total is since server start.

I think the main problem with GetRegularWorkspaceCreateMetrics is the first_success_build subquery that attempts to find the first successful build (even if the first build failed). We could follow a simpler approach like we do for coderd_workspace_creation_duration_seconds, where we only report if the first build is successful. This should improve query performance while maintaining consistency with the prebuild metrics (even though we don't get exact values for workspace creation).

Alternatively, we could change the prebuild metrics to also be reported since server start instead of all-time, but that would require more extensive changes and would be a breaking change for those metrics 😕
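For reference, the kind of dashboard ratio being lost is roughly the following (a sketch only; the actual dashboard query is not shown in this thread):

```promql
# Hypothetical prebuild adoption ratio: claims of prebuilt workspaces vs. all
# regular workspace creations. Only meaningful if both counters cover the same
# time span (e.g. both all-time), which is the concern raised above.
sum(coderd_prebuilt_workspaces_claimed_total)
  /
(sum(coderd_prebuilt_workspaces_claimed_total) + sum(coderd_workspace_creation_total))
```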

Comment on lines 81 to 83
```go
if workspaceBuild.BuildNumber == 1 &&
	workspaceBuild.Transition == database.WorkspaceTransitionStart &&
	workspaceBuild.InitiatorID != database.PrebuildsSystemUserID {
```
Contributor:

We already do similar checks for the timing metrics coderd_workspace_creation_duration_seconds and coderd_prebuilt_workspace_claim_duration_seconds on completeWorkspaceBuildJob:

```go
if s.metrics != nil && workspaceBuild.Transition == database.WorkspaceTransitionStart {
```

We also get the presetID from the database there, so maybe we could update all metrics in the same place?

Contributor Author:

I'll have a look 👍

```go
	workspaceBuild.Transition == database.WorkspaceTransitionStart &&
	workspaceBuild.InitiatorID != database.PrebuildsSystemUserID {
	// Determine status based on job completion.
	status := "success"
```
Contributor:

This metric already encompasses coderd_workspace_creation_attempts_total, right? Since it reports both success and failure status. Could we consolidate into a single metric instead of having two separate ones?

Contributor Author:

What I had in mind (and it's not dealt with here yet) is that technically there can be an error for the workspace build job prior to the insertion of the job into the database. So attempts should potentially be renamed, and should also track failures there.

Then, by comparing the two metrics, we can determine whether workspace builds are failing before they reach a provisioner or within the provisioner. If there's a large discrepancy between the sums of these two metrics, we would be able to spot a growing backlog of build jobs, or potentially build jobs being lost somehow.
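A sketch of the kind of discrepancy check described here, assuming the metric names from the commits above:

```promql
# If this difference keeps growing, first builds are being accepted but their
# jobs are not completing: either a backlog is forming or jobs are being lost
# before they reach a provisioner.
sum(coderd_workspace_creation_attempts_total)
  -
sum(coderd_workspace_creation_outcomes_total)
```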

Move workspace creation outcomes metric increment into the same section as
timing metrics in provisionerdserver.go. This consolidates all metric
updates in one place, reuses the same preset lookup logic, and uses fresh
job data from the database.

Changes:
- Add UpdateWorkspaceCreationOutcomesMetric method to Metrics struct
- Update provisionerdserver.go to call both timing and outcomes metrics
  in the same section (lines 2399-2458)
- Remove separate incrementWorkspaceCreationOutcomesMetric method
- Remove workspaceCreationOutcomesTotal field from server struct
- Simplify tests to use the new consolidated approach
- Remove unused prometheus import from provisionerdserver.go

The consolidation addresses feedback to update all metrics in the same
place where preset lookups and job status checks already occur for timing
metrics.
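Building on the Metrics sketch above, the consolidated method could look roughly like this (the method name comes from the commit message; its signature and the field name are assumptions):

```go
// UpdateWorkspaceCreationOutcomesMetric records the outcome of a workspace's
// first build. It is intended to be called from the same section of
// completeWorkspaceBuildJob that records the timing metrics, reusing the
// preset lookup and fresh job data already available there.
func (m *Metrics) UpdateWorkspaceCreationOutcomesMetric(
	organizationName, templateName, presetName string,
	jobSucceeded bool,
) {
	status := "success"
	if !jobSucceeded {
		status = "failure"
	}
	m.workspaceCreationOutcomesTotal.
		WithLabelValues(organizationName, templateName, presetName, status).
		Inc()
}
```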