
Conversation

@cstyan (Contributor) commented Dec 15, 2025

Implemented via tasks.

As of Dec 15, even with only ~30k calls per week, the current GetRegularWorkspaceCreateMetrics query is either our first or second most expensive query. The average execution time is ~185ms, but the plan P99 is over 2.5s and the execution P99 is over 4s. Additionally, this is one of our metrics that IMO is semantically incorrect and does not follow Prometheus metric best practices, because every replica of coderd exposes the same metric value.

This splits the "regular" workspace metric into two: one for workspace creation 'events' and another for workspace creation job successes/failures.

To query correctly, end users will now need to sum across their coderd replicas, by some subset (or all) of the labels, to get the total count of workspace creations and successes/failures. IMO this also gives nicer insight into failure cases when they happen; as far as I can tell, at the moment you can only see that the initial creation code path failed by looking at logs.
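For example, queries along these lines would do the aggregation (a sketch: the attempts metric name appears later in this thread, while the outcomes name and its `coderd_` prefix are assumptions based on the commit messages below):

```promql
# Total workspace creation attempts per template, summed across coderd replicas.
sum by (template_name) (coderd_workspace_creation_attempts_total)

# Fraction of first builds that succeeded, aggregated across replicas.
sum(coderd_workspace_creation_outcomes_total{status="success"})
  /
sum(coderd_workspace_creation_outcomes_total)
```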

@github-actions bot commented Dec 15, 2025


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


1 out of 2 committers have signed the CLA.
✅ [cstyan](https://github.com/cstyan)
❌ Callum Styan
Callum Styan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting `recheck` in this Pull Request. Posted by the CLA Assistant Lite bot.

@cstyan force-pushed the callum/workspace-create-metrics branch 2 times, most recently from 8eb6651 to 40720fd on December 15, 2025 21:32
Callum Styan and others added 12 commits December 16, 2025 00:02
Add two new Prometheus metrics to track workspace creations without expensive
database queries:

1. workspace_creation_attempts_total - Tracks when regular (non-prebuilt)
   workspace creation attempts begin (first 'start' build initiated)
   - Labels: organization_name, template_name, preset_name
   - Increments in coderd/workspaces.go when workspace build is created

2. workspace_creation_outcomes_total - Tracks outcomes of workspace first builds
   - Labels: organization_name, template_name, preset_name, status (success/failure)
   - Increments in coderd/provisionerdserver when provisioner job completes

Benefits:
- Real-time metrics (no polling delay)
- No expensive database queries
- Event-driven counters in packages where events occur
- Can calculate success rate: outcomes{status="success"} / attempts

Tests:
- Added comprehensive tests for both metrics using prometheus testutil
- Tests verify metrics increment correctly on first builds only
- Tests verify labels track organization, template, and preset names
- Tests verify outcomes track success/failure status

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
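As a rough sketch of what the two counter definitions could look like (the metric and label names come from the commit message above; the helper name, package, and `coderd` namespace are assumptions, not the PR's actual code):

```go
package prometheusmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// NewWorkspaceCreationCounters is a hypothetical helper illustrating the two
// counters described above.
func NewWorkspaceCreationCounters(reg prometheus.Registerer) (attempts, outcomes *prometheus.CounterVec) {
	attempts = promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
		Namespace: "coderd",
		Name:      "workspace_creation_attempts_total",
		Help:      "Incremented when a regular (non-prebuilt) workspace's first 'start' build is initiated.",
	}, []string{"organization_name", "template_name", "preset_name"})

	outcomes = promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
		Namespace: "coderd",
		Name:      "workspace_creation_outcomes_total",
		Help:      "Incremented when the provisioner job for a workspace's first build completes.",
	}, []string{"organization_name", "template_name", "preset_name", "status"})
	return attempts, outcomes
}
```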
This commit removes the GetRegularWorkspaceCreateMetrics database query
and replaces it with inline event-driven Prometheus counters that track
workspace creation attempts and outcomes.

Changes:
- Removed SQL query from coderd/database/queries/workspaces.sql
- Removed generated method implementations from dbmock, dbauthz, dbmetrics
- Removed old metric collection from coderd/prometheusmetrics/prometheusmetrics.go
- Fixed missing imports in coderd/database/queries.sql.go (pre-existing issue)
- Fixed method names to use GetPresetByID instead of GetTemplateVersionPresetByID

The new metrics (WorkspaceCreationAttemptsTotal and WorkspaceCreationOutcomesTotal)
track workspace creations inline as they occur, eliminating the need for expensive
database aggregations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Simplified the tests to verify basic metric functionality without requiring
a full PostgreSQL database. The new tests:
- Verify metrics can be incremented and read correctly
- Test that different label values (org, template, preset, status) are tracked
- Run quickly without needing Docker/PostgreSQL

The original integration tests required full workspace creation flows with
PostgreSQL. These simplified unit tests focus on verifying the prometheus
metric itself works correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
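A minimal sketch of such a registry-only unit test, using the Prometheus testutil package mentioned earlier (the test name, package, and label values are made up for illustration):

```go
package prometheusmetrics_test

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/stretchr/testify/require"
)

func TestWorkspaceCreationAttemptsCounter(t *testing.T) {
	t.Parallel()

	reg := prometheus.NewRegistry()
	attempts := prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "coderd",
		Name:      "workspace_creation_attempts_total",
	}, []string{"organization_name", "template_name", "preset_name"})
	require.NoError(t, reg.Register(attempts))

	// Increment one label combination and read the value back directly,
	// without any database or HTTP scrape involved.
	attempts.WithLabelValues("default", "docker", "large").Inc()
	got := testutil.ToFloat64(attempts.WithLabelValues("default", "docker", "large"))
	require.Equal(t, float64(1), got)
}
```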
Rewrote metric tests to use dbmock instead of requiring PostgreSQL:
- Created table-driven tests covering all conditional logic scenarios
- Mock database calls with gomock expectations
- Test build numbers, transitions, initiators, presets, success/failure
- Tests run quickly without Docker/PostgreSQL dependencies

Created new test files:
- coderd/workspaces_metric_test.go: Tests WorkspaceCreationAttemptsTotal
- coderd/provisionerdserver/provisionerdserver_metric_test.go: Tests WorkspaceCreationOutcomesTotal

Removed old integration tests that required PostgreSQL and cleaned up unused imports.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Extracted the metric increment logic into a reusable function
`incrementWorkspaceCreationOutcomesMetric` to avoid duplicating the
conditional logic in tests. This ensures the test verifies the actual
production code path.

Changes:
- Added `incrementWorkspaceCreationOutcomesMetric()` function in provisionerdserver.go
- Updated test to call the extracted function instead of duplicating logic
- Tests now verify the actual implementation, not a recreation of it

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Extracted the metric increment logic into a reusable function
`incrementWorkspaceCreationAttemptsMetric` to avoid duplicating the
conditional logic in tests. This ensures the test verifies the actual
production code path.

Changes:
- Added `incrementWorkspaceCreationAttemptsMetric()` function in workspaces.go
- Updated test to call the extracted function instead of duplicating logic
- Tests now verify the actual implementation, not a recreation of it

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
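A rough sketch of the extracted attempts helper these two commits describe; the signature and parameters are assumptions, and the conditions mirror the diff hunk quoted later in the review:

```go
package coderd

import (
	"github.com/prometheus/client_golang/prometheus"

	// Import path assumed from the repository layout referenced in this PR.
	"github.com/coder/coder/v2/coderd/database"
)

// incrementWorkspaceCreationAttemptsMetric applies the "first regular start
// build" conditions in one place, so tests can exercise the same code path as
// production instead of re-implementing the checks.
func incrementWorkspaceCreationAttemptsMetric(
	counter *prometheus.CounterVec,
	build database.WorkspaceBuild,
	organizationName, templateName, presetName string,
) {
	if build.BuildNumber != 1 ||
		build.Transition != database.WorkspaceTransitionStart ||
		build.InitiatorID == database.PrebuildsSystemUserID {
		return
	}
	counter.WithLabelValues(organizationName, templateName, presetName).Inc()
}
```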
Fixed metrics registration so they are properly exported via /metrics endpoint.
Previously, metrics were created with promauto but never registered with the
custom PrometheusRegistry, so they weren't exported in multi-replica deployments.

Changes:
- WorkspaceCreationAttemptsTotal: Now created with promauto.With(registry) in API initialization
- WorkspaceCreationOutcomesTotal: Added to provisionerdserver.Metrics and registered with registry
- Renamed test files to follow *_internal_test.go naming pattern for internal package tests
- Updated tests to create test metric instances and server/API objects

Metrics are now properly registered and will be scraped from all coderd replicas.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The workspace creation metrics were not appearing on the /metrics endpoint
due to registry conflicts:

1. WorkspaceCreationAttemptsTotal was only initialized when
   options.DeploymentValues.Prometheus.Enable was true, but the registry
   is always available. Removed unnecessary check.

2. ProvisionerdServerMetrics was never initialized. Added initialization
   and registration with options.PrometheusRegistry.

3. CRITICAL: Global WorkspaceCreationOutcomesTotal variable used
   promauto.NewCounterVec() which registers with the DEFAULT registry,
   not the custom options.PrometheusRegistry. This caused duplicate
   registration attempts and prevented the metric from appearing.

   Removed the global variable entirely and made the metric
   instance-based only, matching the pattern used by the working
   workspaceCreationTimings and workspaceClaimTimings metrics.

Both metrics now register correctly with the custom PrometheusRegistry
and will appear on the /metrics endpoint after workspace creation events.
Refactored provisionerdserver.NewMetrics() to follow the same pattern as
notifications.NewMetrics():
- Takes prometheus.Registerer parameter
- Uses promauto.With(reg) to automatically register metrics on creation
- Removed separate Register() method (no longer needed)

This matches the existing pattern in the codebase and ensures metrics are
registered with the correct custom PrometheusRegistry, not the global
default registry.

Changes:
- Updated NewMetrics signature to accept prometheus.Registerer
- Changed metric creation from prometheus.New* to promauto.With(reg).New*
- Removed Register() method
- Updated coderd.go to pass options.PrometheusRegistry when creating metrics
- Updated all tests to pass prometheus.NewRegistry()

Both workspace creation metrics will now appear correctly on /metrics.
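A minimal sketch of the constructor pattern described here, using promauto.With(reg) so the collectors land on the custom registry rather than the global default (the struct field name and namespace are assumptions):

```go
package provisionerdserver

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Metrics holds provisionerdserver-side Prometheus collectors. Only the
// workspace creation outcomes counter is sketched here.
type Metrics struct {
	workspaceCreationOutcomesTotal *prometheus.CounterVec
}

// NewMetrics registers its collectors on the provided registerer (e.g.
// options.PrometheusRegistry), mirroring notifications.NewMetrics.
func NewMetrics(reg prometheus.Registerer) *Metrics {
	return &Metrics{
		workspaceCreationOutcomesTotal: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
			Namespace: "coderd",
			Name:      "workspace_creation_outcomes_total",
			Help:      "Outcomes (success/failure) of regular workspaces' first builds.",
		}, []string{"organization_name", "template_name", "preset_name", "status"}),
	}
}
```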
Updated cli/server.go to pass prometheus.Registerer to
provisionerdserver.NewMetrics() and removed the separate Register() call
since promauto handles registration automatically now.
Add nil check before creating ProvisionerdServerMetrics to avoid
duplicate registration when metrics are already provided via options.

This fixes a panic that occurred because the metrics were being created in
both cli/server.go and coderd.go.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan force-pushed the callum/workspace-create-metrics branch from 4454f50 to 1aca85a on December 16, 2025 00:29
@ssncferreira (Contributor) left a comment


> Additionally, this is one of our metrics that IMO is semantically incorrect/not following Prometheus metric best practices due to every replica of coderd exposing the same metric value.

Completely agree. However, the reason coderd_workspace_creation_total was implemented as a counter updated with the all-time total (not just since server start) is that we already have other metrics, like coderd_prebuilt_workspaces_claimed_total, with the same issue: they're also queried from the database, and every replica exports the same values. This was a requirement for being able to compare regular workspaces vs prebuilt workspaces (dashboard link) to understand whether prebuilds are being adopted.

(Screenshot: dashboard comparison of regular vs. prebuilt workspaces, 2025-12-16)

With this PR's removal of coderd_workspace_creation_total, we lose this comparison since coderd_prebuilt_workspaces_claimed_total is all-time whereas coderd_workspace_creation_attempts_total is since server start.

I think the main problem with GetRegularWorkspaceCreateMetrics is the first_success_build subquery that attempts to find the first successful build (even if the first build failed). We could follow a simpler approach like we do for coderd_workspace_creation_duration_seconds, where we only report if the first build is successful. This should improve query performance while maintaining consistency with the prebuild metrics (even though we don't get exact values for workspace creation).

Alternatively, we could change the prebuild metrics to also be reported since server start instead of all-time, but that would require more extensive changes and would be a breaking change for those metrics 😕
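For reference, the kind of dashboard ratio being lost is roughly the following (a sketch only; the actual dashboard query is not shown in this thread):

```promql
# Hypothetical prebuild adoption ratio: claims of prebuilt workspaces vs. all
# regular workspace creations. Only meaningful if both counters cover the same
# time span (e.g. both all-time), which is the concern raised above.
sum(coderd_prebuilt_workspaces_claimed_total)
  /
(sum(coderd_prebuilt_workspaces_claimed_total) + sum(coderd_workspace_creation_total))
```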

Comment on lines 81 to 83
```go
if workspaceBuild.BuildNumber == 1 &&
	workspaceBuild.Transition == database.WorkspaceTransitionStart &&
	workspaceBuild.InitiatorID != database.PrebuildsSystemUserID {
```
Contributor:

We already do similar checks for the timing metrics coderd_workspace_creation_duration_seconds and coderd_prebuilt_workspace_claim_duration_seconds on completeWorkspaceBuildJob:

```go
if s.metrics != nil && workspaceBuild.Transition == database.WorkspaceTransitionStart {
```

We also get the presetID from the database there, so maybe we could update all metrics in the same place?

Contributor Author:

I'll have a look 👍

```go
	workspaceBuild.Transition == database.WorkspaceTransitionStart &&
	workspaceBuild.InitiatorID != database.PrebuildsSystemUserID {
	// Determine status based on job completion.
	status := "success"
```
Contributor:

This metric already encompasses coderd_workspace_creation_attempts_total, right? Since it reports both success and failure status. Could we consolidate into a single metric instead of having two separate ones?

Contributor Author:

What I had in mind (and it's not dealt with here yet) is that technically there can be an error for the workspace build job prior to the insertion of the job into the database. So attempts should potentially be renamed, and should also track failures there.

Then, by comparing the two metrics, we can determine whether workspace builds are failing before they reach a provisioner or within the provisioner. If there's a large discrepancy between the sums of these two metrics, we would be able to spot a growing backlog of build jobs, or potentially build jobs being lost somehow.
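A sketch of the kind of discrepancy check described here, assuming the metric names from the commits above:

```promql
# If this difference keeps growing, first builds are being accepted but their
# jobs are not completing: either a backlog is forming or jobs are being lost
# before they reach a provisioner.
sum(coderd_workspace_creation_attempts_total)
  -
sum(coderd_workspace_creation_outcomes_total)
```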

Move workspace creation outcomes metric increment into the same section as
timing metrics in provisionerdserver.go. This consolidates all metric
updates in one place, reuses the same preset lookup logic, and uses fresh
job data from the database.

Changes:
- Add UpdateWorkspaceCreationOutcomesMetric method to Metrics struct
- Update provisionerdserver.go to call both timing and outcomes metrics
  in the same section (lines 2399-2458)
- Remove separate incrementWorkspaceCreationOutcomesMetric method
- Remove workspaceCreationOutcomesTotal field from server struct
- Simplify tests to use the new consolidated approach
- Remove unused prometheus import from provisionerdserver.go

The consolidation addresses feedback to update all metrics in the same
place where preset lookups and job status checks already occur for timing
metrics.
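Building on the Metrics sketch above, the consolidated method could look roughly like this (the method name comes from the commit message; its signature and the field name are assumptions):

```go
// UpdateWorkspaceCreationOutcomesMetric records the outcome of a workspace's
// first build. It is intended to be called from the same section of
// completeWorkspaceBuildJob that records the timing metrics, reusing the
// preset lookup and fresh job data already available there.
func (m *Metrics) UpdateWorkspaceCreationOutcomesMetric(
	organizationName, templateName, presetName string,
	jobSucceeded bool,
) {
	status := "success"
	if !jobSucceeded {
		status = "failure"
	}
	m.workspaceCreationOutcomesTotal.
		WithLabelValues(organizationName, templateName, presetName, status).
		Inc()
}
```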