Skip to content

Conversation

@cstyan
Copy link
Contributor

@cstyan cstyan commented Dec 18, 2025

In a number of the AgentAPI sub APIs we have frequent calls to AgentFn (GetWorkspaceAgentByID) even when the functions present in that API will really only use the agent ID (and sometimes the agent Name).

We already have caching of the workspace itself when the workspace is not a prebuild, and use it when needing values from the workspace row and for RBAC fast path purposes. However, the agent can also be cached in cases when we only need to use static fields, unless there is an additional RBAC/auth related purpose to the AgentFn calls that I am missing here.

The main outcome of this PR is that for frequently called AgentAPI endpoints, especially the metadata updates, we will eliminate the need to call GetWorkspaceAgentByID completely. In the case of recent scaletesting for 600 workspaces with 15 metadata each at a 10s interval this was ~300 requests per second.

Callum Styan and others added 13 commits December 18, 2025 01:13
Previously, MetadataAPI, AppsAPI, StatsAPI, and ConnectionLogAPI called
GetWorkspaceAgentByID on every request to retrieve the agent ID and Name.
For deployments with 1000+ agents, this resulted in 4000+ queries/second.

This change introduces agent field caching to eliminate these redundant
database queries. Since agent ID and Name are static fields that never
change during an agent connection lifetime, we can safely cache them
when the agent API is initialized.

Changes:
- Add CachedAgentFields struct with thread-safe ID and Name getters
- Update agentapi.New() to accept agent parameter and initialize cache
- Modify MetadataAPI, AppsAPI, StatsAPI, and ConnectionLogAPI to use
  cached fields instead of calling AgentFn
- Update all test files to initialize agent cache properly
- Add fallback to AgentFn if cache is not populated (safety)

This reduces database query load by ~4000 queries/second for
deployments with 1000 agents, while maintaining backward compatibility
with existing code paths.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit refactors the agent caching implementation to ensure type
safety and prevent functions from accessing non-cached agent fields.

Key changes:

1. **CachedAgentFields API Pattern**: Updated CachedAgentFields to
   follow the same pattern as CachedWorkspaceFields:
   - Added AsAgentID() returning (uuid.UUID, bool)
   - Added AsAgentFields() returning (id, name, ok)
   - Removed IsPopulated() in favor of the new pattern

2. **publishWorkspaceUpdate Signature**: Changed from taking
   *database.WorkspaceAgent to *CachedAgentFields. This enforces at
   compile time that only cached fields can be accessed.

3. **Agent Caching Safety**: Added documentation clarifying that
   agent ID and Name are immutable fields that never change during
   an agent's lifetime, unlike workspace fields which can change
   during prebuild claims.

4. **Sub-API Updates**: Updated AppsAPI, LifecycleAPI, and LogsAPI
   to use the new signature. APIs that need full agent data
   (LifecycleAPI, LogsAPI) create temporary CachedAgentFields
   instances from retrieved agents.

5. **Test Updates**: Updated all test mock functions to match the
   new PublishWorkspaceUpdateFn signature.

All tests pass and the implementation maintains backward compatibility
while improving type safety.
This commit fixes the issues identified in code review:

1. **Removed RWMutex**: CachedAgentFields no longer needs a mutex
   since values are set once at initialization and never modified
   after that, unlike CachedWorkspaceFields which can be updated
   via the refresh loop.

2. **Added proper ok checks**: All API methods now use the ok pattern
   to check if cached fields are populated:
   - BatchUpdateAppHealths: Uses AsAgentID() with ok check
   - UpdateStats: Uses AsAgentFields() with ok check
   - BatchUpdateMetadata: Uses AsAgentID() with ok check
   - Connection: Uses AsAgentFields() with ok check

3. **Removed ID() and Name() methods**: These were removed in favor
   of AsAgentID() and AsAgentFields() which return (value, ok) pairs
   following the same pattern as CachedWorkspaceFields.

4. **Cleaned up imports**: Removed unused uuid import from stats.go.

All tests pass and the implementation now properly follows the
established patterns in the codebase.
When the cache is not populated and we fall back to querying the agent,
we need to create a CachedAgentFields from the fallback agent to pass
to PublishWorkspaceUpdateFn. Previously we were still passing a.Agent
which could be unpopulated, causing the publish function to receive an
empty cache.

Now we create a local agentCache variable that is either set to a.Agent
(if populated) or a newly created CachedAgentFields from the fallback
agent.
LogsAPI and LifecycleAPI don't have cached agent fields, so they
should continue using the database.WorkspaceAgent signature.
Added publishWorkspaceUpdateLegacy wrapper to handle these APIs.
- Remove publishWorkspaceUpdateLegacy wrapper
- publishWorkspaceUpdate now takes uuid.UUID directly instead of *CachedAgentFields
- AppsAPI.PublishWorkspaceUpdateFn signature simplified to take uuid.UUID
- BatchUpdateAppHealths gets agent ID from cache or fallback and passes it directly
- LogsAPI and LifecycleAPI use inline closures to extract agent.ID
- Change cachedAgentFields from *CachedAgentFields to CachedAgentFields (value type)
- Remove AsAgentID() and AsAgentFields() methods with ok pattern
- Add simple ID() and Name() accessors
- Remove all fallback logic in sub-APIs since agent cache is always initialized
- Agent ID and Name are set once at initialization and never change
…-APIs

Removes AgentFn field from AppsAPI, StatsAPI, MetadataAPI, and ConnLogAPI
since all these APIs now get agent ID and name from the cached agent
fields. The AgentFn callback was only needed as a fallback mechanism, but
since the cache is always populated on initialization, the fallback was
dead code.

Updates:
- coderd/agentapi/apps.go: Remove AgentFn field
- coderd/agentapi/stats.go: Remove AgentFn field
- coderd/agentapi/metadata.go: Remove AgentFn field
- coderd/agentapi/connectionlog.go: Remove AgentFn field
- coderd/agentapi/api.go: Remove AgentFn from sub-API initialization
- coderd/agentapi/*_test.go: Update tests to remove AgentFn references

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixes two issues from previous commit:

1. In stats.go and connectionlog.go: Remove unnecessary RBAC wrapping since
   when AsWorkspaceIdentity() returns !ok (cache is empty), calling it again
   inside the fallback block will also return !ok, making the RBAC wrapping
   dead code. The RBAC fast path is only useful when the cache HAS a
   workspace identity to inject.

2. In connectionlog_test.go: Correctly initialize workspaceAsCacheFields to
   populate the cache, preventing unnecessary GetWorkspaceByAgentID calls.
   The test comment explicitly states "With agent/workspace caching,
   GetWorkspaceByAgentID is not called", so we need a populated cache.

Changes:
- coderd/agentapi/stats.go: Remove RBAC wrapping, use base ctx directly
- coderd/agentapi/connectionlog.go: Remove RBAC wrapping, use base ctx
- coderd/agentapi/connectionlog_test.go: Restore workspaceAsCacheFields

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Since CachedAgentFields and all its usage are within the agentapi package,
we don't need accessor methods. Make the fields public (ID, Name) and access
them directly instead of using ID() and Name() methods.

This simplifies the code while maintaining the same functionality. The fields
remain read-only in practice since UpdateValues is only called once at
initialization.

Changes:
- coderd/agentapi/cached_agent.go: Remove ID() and Name() methods, make fields public
- coderd/agentapi/apps.go: Use a.Agent.ID instead of a.Agent.ID()
- coderd/agentapi/stats.go: Use a.Agent.ID and a.Agent.Name directly
- coderd/agentapi/metadata.go: Use a.Agent.ID directly
- coderd/agentapi/connectionlog.go: Use a.Agent.ID and a.Agent.Name directly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove the UpdateValues method from CachedAgentFields and use struct
literal initialization instead. This is more idiomatic Go since the
fields are now public.

Changes:
- Remove UpdateValues method from cached_agent.go
- Update api.go to use struct literal initialization
- Update all test files to use inline initialization instead of
  empty struct + UpdateValues call
- Remove unused agentCache variable from metadata_test.go

All tests pass successfully.
Use a.Agent.ID and a.Agent.Name directly instead of allocating
intermediate variables. This reduces unnecessary allocations and
makes the code more concise.

Changes:
- connectionlog.go: Remove agentID and agentName variables
- stats.go: Remove agentID and agentName variables
- metadata.go: Remove agentID variable
- apps.go: Remove agentID variable

All tests pass successfully.
Remove unnecessary database.WorkspaceAgent allocation in stats reporting
by passing only the required fields (agentID and agentName) directly to
ReportAgentStats.

Changes:
- ReportAgentStats: Accept agentID and agentName parameters instead of
  full database.WorkspaceAgent object
- stats.go: Remove minimalAgent variable allocation, pass cached fields
  directly
- agentapi.New: Keep full database.WorkspaceAgent parameter, extract
  fields internally for caching

This eliminates the minimalAgent struct allocation while keeping the
agentapi.New API clean by accepting the full agent object.

All tests pass successfully.
@github-actions
Copy link


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


Callum Styan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants