Skip to content

perf(insights): aggregate insights data in Postgres instead of pulling rows into Node#114

Merged
gkhngyk merged 2 commits into
mainfrom
perf/insights-db-aggregation
May 28, 2026
Merged

perf(insights): aggregate insights data in Postgres instead of pulling rows into Node#114
gkhngyk merged 2 commits into
mainfrom
perf/insights-db-aggregation

Conversation

@gkhngyk

@gkhngyk gkhngyk commented May 28, 2026

Copy link
Copy Markdown
Contributor

Implements phase 1 of #93. The largest, no-tradeoff lever from the issue: move the three insights server-action reducers out of Node and into Postgres without introducing any TTL-bound cache that would delay on-demand refresh updates.

The shape of the bug

`getInsightsSummary`, `getShareOfVoiceData`, and `getCompetitorComparison` all followed the same pattern: `select('*')` from `prompt_results` for the brand + filter set, then `reduce`/`Map`/`for…of` the rows in JS. Each row carries the full AI response text (kilobytes), `citations` jsonb, and `competitor_mentions` jsonb. For a brand with thousands of results that's a multi-hundred-MB transfer + GC-heavy reducers on every insights page load — and the page fires 4–5 of these actions in parallel on mount.

What this PR does

1. Three new RPCs

In a single new migration `supabase/migrations/00006_insights_aggregates.sql`:

Function Replaces Returns
`insights_aggregates` `getInsightsSummary`'s reducer Totals, sums, last_checked_at, per-model breakdown
`competitor_aggregates` `getCompetitorComparison`'s reducer Brand + per-competitor sums, per-provider grids for the table
`share_of_voice_aggregates` `getShareOfVoiceData`'s reducer Totals, per (model_used, platform), per-day buckets

Each returns a single JSONB row of raw sums + counts, never pre-divided averages. The final `/` + `Math.round` still happens in JS exactly as before — guarantees the displayed numbers don't drift by even ±1 from a half-rule mismatch between JS's `Math.round` and Postgres `ROUND`. The provider mapping (`resolveProvider`) also stays in JS so the SQL doesn't ship a duplicate lookup that has to be kept in sync as engines land.

`SECURITY DEFINER` to match the existing `get_latest_prompt_results` pattern; route layer still verifies brand-belongs-to-org access before calling.

2. Composite index

`CREATE INDEX (brand_id, created_at DESC) ON prompt_results` — the exact shape every insights filter needs. Existing indexes are single-column on `brand_id` and `created_at` separately, forcing the planner to pick one and re-sort/filter on the other in the heap. The composite serves both the old query path (still used by listing endpoints like `getPromptResults`, `getVisibilityTrend`) and the new RPC `WHERE`s.

3. Refactor of three action functions

`web/src/lib/actions/tracking.ts` — the three problem functions now do:

  • One RPC call for the displayed window
  • Two parallel RPC calls (current 7-day vs previous 7-day) for the delta
  • A trivial fold over the small grouped arrays the RPC returns
  • `resolveProvider` over those grouped rows for provider buckets

`buildResultsQuery` and `applyModelFilter` stay — they're still used by row-listing endpoints (`getPromptResults`, `getVisibilityTrend`).

Parity verification

Before merging anything, ran the RPCs against a real brand with 5054 prompt_results in prod and diffed every field against the equivalent inline-SQL aggregation that mirrors the JS reducer arithmetic. Every field matched byte-for-byte at decimal precision (e.g. `sum_visibility = 384907.7615259740259744489904` from both paths). Verified for:

  • `insights_aggregates` — unfiltered, plus with a 30-day date filter
  • `competitor_aggregates` — brand_total_mentions, brand_total_citations, sum-of-competitor-mentions, provider combo counts, competitor × provider combo counts (70 distinct combos checked)
  • `share_of_voice_aggregates` — totals, distinct day count, distinct (model_used, platform) combos

Local smoke load of `/dashboard/insights` then renders the same KPIs, breakdowns, and trend lines with the new path — no console errors, no server-side RPC errors, filters work.

Why this is the safe rewrite

  • Pure addition at the schema level: new index, new functions. No table altered, no column changed, no row touched. Drop the functions and the index → table is identical.
  • Read-side only: tracking worker writes, on-demand Run All / Refresh / Test Single Prompt writes — none of these change. `prompt_results.shopping_cards` capture and every other write path stays exactly as-is.
  • No TTL cache introduced: explicit constraint in the issue (Run All / Refresh must reflect immediately). This PR achieves the speedup without hiding fresh data behind any cache window.
  • JS-side rounding preserved: the only practical drift risk on a SQL refactor like this is rounding/half-rule mismatch. By returning raw sums + counts and doing the final divide + round in JS, that risk class is closed.
  • Reversible: `git revert` on the code commit alone returns behavior to the old `select('*')` path; the migration's functions and index sit idle if unused.

Files

File Change
`supabase/migrations/00006_insights_aggregates.sql` New migration: composite index + three RPCs
`web/src/lib/actions/tracking.ts` Three insights actions refactored to call the RPCs; new `modelFilterArray` helper + RPC return-shape interfaces
`web/src/types/supabase.ts` Added `Functions` entries for the three new RPCs so the `.rpc()` call sites type-check

Deploy order

The migration has already been applied to the cloud project (additive, reversible). Self-hosters pick it up at the next release via `supabase db push` — standard prosedür per CONTRIBUTING.

Out of scope (deferred, from #93)

  • Precomputed daily summary table (phase 2) — only worth it if phase 1 doesn't move the needle enough; measure first.
  • Use cache / cacheTag invalidation on insights reads (phase 3) — complement to this PR, layers on top of these RPCs.

…g rows into Node

Closes #93.

The three insights server actions (getInsightsSummary, getShareOfVoiceData,
getCompetitorComparison) all followed the same shape: pull every prompt_results
row matching the brand + filters with select('*'), then reduce in JS. For an
active brand that's thousands of rows transferring on every page load — each
carrying the full AI response text, citations jsonb, and competitor_mentions
jsonb — driving a multi-hundred-MB transfer + GC-heavy reducers per request.

This swap moves the aggregation server-side via three CREATE OR REPLACE FUNCTIONs
in a new migration:

- insights_aggregates    — totals, sums, last_checked_at, per-model breakdown
- competitor_aggregates  — brand + per-competitor sums, plus per-provider grids
                            for the comparison table
- share_of_voice_aggregates — totals, per (model_used, platform) and per-day buckets

Each returns a single JSONB row with raw SUMS + COUNTS so the final divide +
round still happens in JS exactly as before — guarantees the displayed numbers
don't drift by even ±1 from a Math.round half-rule mismatch. The provider
mapping (resolveProvider) stays in JS too, so the SQL doesn't ship a duplicate
lookup that has to be kept in sync as engines land.

Also adds composite index (brand_id, created_at DESC) on prompt_results — the
exact shape every insights query needs, but only single-column indexes existed
before, forcing planner choices that required heap fetch + sort.

Verified end-to-end against a brand with 5054 prompt_results in prod: every
RPC field matches the equivalent inline-SQL aggregation byte-for-byte
(decimal precision preserved). Local smoke load of /dashboard/insights
renders the same KPIs, breakdowns, and trend lines with the new path; no
console errors, no server-side RPC errors.

Migration is purely additive and reversible: dropping the functions and the
index leaves the table untouched. Old server-action code path (buildResultsQuery
+ JS reduce) stays available to non-insights callers (getPromptResults,
getVisibilityTrend) that fetch row lists for display.

Out of scope for this PR (issue's phase 2 + 3, deferred):
- Precomputed daily summary table
- use cache / cacheTag invalidation on insights reads

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements phase 1 of the insights performance work by moving heavy, row-based reducers from Node into Postgres RPC aggregates, reducing transferred data and server-side GC pressure for /dashboard/insights.

Changes:

  • Added three Postgres aggregate RPCs (insights_aggregates, competitor_aggregates, share_of_voice_aggregates) plus a composite (brand_id, created_at DESC) index for prompt_results.
  • Refactored insights server actions to use the RPCs and keep final divide/round + provider bucketing in JS for parity.
  • Updated Supabase generated types to include the new RPC signatures.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
supabase/migrations/00006_insights_aggregates.sql Adds composite index and introduces three JSONB-returning aggregate RPC functions with grants.
web/src/lib/actions/tracking.ts Refactors insights actions to call new RPCs; adds model filter helper and RPC return-shape interfaces.
web/src/types/supabase.ts Extends generated Database types with the new RPC function definitions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread supabase/migrations/00006_insights_aggregates.sql
Comment thread web/src/lib/actions/tracking.ts Outdated
Comment on lines +507 to +518
const [{ data: curWinData }, { data: prevWinData }] = await Promise.all([
supabase.rpc('insights_aggregates', {
...baseArgs,
p_date_from: opts?.dateFrom ?? currentFrom.toISOString(),
p_date_to: opts?.dateTo ?? null,
}),
supabase.rpc('insights_aggregates', {
...baseArgs,
p_date_from: prevFrom.toISOString(),
p_date_to: currentFrom.toISOString(),
}),
]);
Comment thread web/src/lib/actions/tracking.ts Outdated
Comment on lines +912 to +923
const [{ data: curWinData }, { data: prevWinData }] = await Promise.all([
supabase.rpc('competitor_aggregates', {
...baseArgs,
p_date_from: opts?.dateFrom ?? currentFrom.toISOString(),
p_date_to: opts?.dateTo ?? null,
}),
supabase.rpc('competitor_aggregates', {
...baseArgs,
p_date_from: prevFrom.toISOString(),
p_date_to: currentFrom.toISOString(),
}),
]);
Comment thread web/src/lib/actions/tracking.ts Outdated
Comment on lines +1097 to +1109
const [{ data: curData, error: curErr }, { data: prevData }] = await Promise.all([
supabase.rpc('share_of_voice_aggregates', {
...baseArgs,
p_date_from: opts?.dateFrom ?? null,
p_date_to: opts?.dateTo ?? null,
}),
supabase.rpc('share_of_voice_aggregates', {
...baseArgs,
p_date_from: prevFrom.toISOString(),
p_date_to: currentFrom.toISOString(),
}),
]);
if (curErr) throw new Error(curErr.message);
Comment thread web/src/lib/actions/tracking.ts Outdated
Comment on lines +1001 to +1009
// compName -> provider -> agg (JS key by display name to match the old
// shape exactly, since the provider row columns are keyed by competitor
// name).
const compByProvider = new Map<string, Map<string, Agg>>();
for (const cp of agg.by_competitor_provider) {
const provider = resolveProvider(cp.model_used, cp.platform);
const name = cp.competitor_name ?? '';
if (!compByProvider.has(name)) compByProvider.set(name, new Map());
const pm = compByProvider.get(name)!;
Comment on lines +101 to +104
(SELECT jsonb_agg(jsonb_build_object(
'model_used', bm.model_used,
'sum_visibility', bm.sum_visibility,
'result_count', bm.result_count))
Five inline fixes; one item explicitly deferred to a follow-up hygiene
issue (out of scope for this PR — covers existing functions too).

Addressed in this commit:

- supabase/migrations/00006: add ORDER BY inside every jsonb_agg(by_*).
  jsonb_agg is otherwise free to return rows in any order, which would
  let the platform list / chart jitter between requests as Postgres
  reordered the grouped rows. by_model orders by result_count DESC then
  model_used; by_competitor by row_count DESC then competitor_id;
  provider grids by (platform, model_used) with NULLS LAST; by_platform
  on SoV same.

- web/src/lib/actions/tracking.ts: the previous-period RPC calls in all
  three insights actions were destructuring only { data }, swallowing
  errors. Capture and throw on .error for both current and previous
  windows so a server-side failure surfaces as an error instead of a
  silent null delta. Three spots: insights_summary delta, competitor
  delta, share_of_voice prev period.

- web/src/lib/actions/tracking.ts: competitor display-name fallback.
  compByProvider was keyed by competitor_name, so two unnamed
  competitors (null/empty name from bad data) would collide under the
  empty-string key and have their per-provider scores silently
  combined. Centralized a competitorDisplayName(name, id) helper that
  falls back to the competitor_id when the name is empty, and applied
  it to both the entries[] name field and the compByProvider key so the
  providerRows column header still lines up with the table row label.

Deferred to a separate hygiene issue:

- SECURITY DEFINER on the aggregate RPCs. Valid concern — the function
  can be called with an arbitrary p_brand_id by any authenticated
  client. This matches the existing pattern used by
  get_latest_prompt_results (same project, same SECURITY DEFINER
  property) and the new functions inherit it deliberately for symmetry.
  The right fix sweeps all aggregate / row-fetch RPCs together with one
  org-membership check pattern, which doesn't belong in this perf PR.
@gkhngyk

gkhngyk commented May 28, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @copilot-pull-request-reviewer — useful pass, all six points are valid. Addressed five in $(git rev-parse --short HEAD) and tracked the sixth as a follow-up.

# Concern Status
1 `jsonb_agg(by_model)` non-deterministic order → platform chart jitter Fixed — added `ORDER BY` inside every `jsonb_agg(by_*)` across all three RPCs (by_model on result_count DESC, by_competitor on row_count DESC, provider grids on (platform, model_used) NULLS LAST, by_platform on SoV same).
2 `insights_aggregates` delta calls swallow RPC errors Fixed — capture `error` from both current- and previous-window calls, throw on either.
3 `competitor_aggregates` delta calls swallow errors Fixed — same pattern.
4 `share_of_voice_aggregates` previous-period call swallows error Fixed — same pattern.
5 `compByProvider` keyed by `competitor_name` collides under `''` for unnamed competitors Fixed — centralized a `competitorDisplayName(name, id)` helper that falls back to the competitor id when the name is empty, applied consistently to both the `entries[]` name field and the `compByProvider` key so the providerRows column header still matches the table row label.
6 `SECURITY DEFINER` on the RPCs allows cross-org reads with arbitrary `p_brand_id` Deferred — issue #115

On #6 specifically: the concern is real and worth fixing, but the same property exists on the pre-existing `get_latest_prompt_results` RPCs from `00001_initial_schema.sql` — sweeping all of them with one consistent membership-check pattern (or moving them to `SECURITY INVOKER`) belongs in a hygiene PR rather than this perf-focused one. Tracking in #115.

Re-applied the migration to the cloud project via `CREATE OR REPLACE`; parity test against the same 5054-row brand still matches byte-for-byte after the ORDER BY additions.

@gkhngyk gkhngyk merged commit 7849564 into main May 28, 2026
4 checks passed
@gkhngyk gkhngyk deleted the perf/insights-db-aggregation branch May 28, 2026 14:17
gkhngyk added a commit that referenced this pull request May 28, 2026
…) + REST endpoint (#116)

Adds the third A-group MCP read tool from the assistant-readiness list,
after the content-opportunity tools shipped in #74 / #109 / #110.

Combines two existing RPCs (competitor_aggregates and
share_of_voice_aggregates, both DB-side aggregated since #114) so the
tool answers "how do I compare to my competitors?" and "who's gaining
share of voice?" in a single round trip. Ownership-check first via
supabaseAdmin (wrong-org brand → null → 404, no RPC fire). Snapshot
shape — deltas are out of scope for V1 and consistent with the existing
get_visibility_summary / list_topics / list_prompts shape; a caller that
wants a delta issues a second call with an earlier window.

- web/src/lib/mcp/data.ts: getCompetitorComparisonFor(auth, params) —
  parallel RPC calls, computes brand + per-competitor averages
  (when-present semantics matching the UI), and per-(model_used,
  platform) SoV split. Uses the same competitor display-name fallback
  pattern as the existing competitor logic (name ?? competitor_id) so
  unnamed competitors don't collide.

- web/src/lib/mcp/server.ts: registers the get_competitor_comparison
  tool with a description that covers both questions the tool answers
  and the snapshot-vs-delta caveat.

- web/src/app/api/mcp/competitor-comparison/route.ts: GET handler
  sharing the same data fn; query params mirror the MCP tool args.

Verified end-to-end against the largest brand in the cloud project
(5054 prompt result appearances, 7 competitors, 10 distinct model/
platform combos) — returned shape matches the documented interface,
SoV totals + per-platform breakdown reconcile with the existing UI.

Closes #98.
gkhngyk added a commit that referenced this pull request May 28, 2026
Adds the fourth A-group MCP read tool: cited URLs and domains classified
by source type (news / review / owned / social / forum / competitor / you),
plus totals and a per-source breakdown.

Ports the JS-side aggregation from getCitationsOverview in
actions/citations.ts (citations live as JSONB inside prompt_results.citations
and need the classifyDomain + classifyArticleType helpers to label each
URL, so this can't be folded into a Postgres aggregate the way the
competitor / SoV surfaces were in #114).

Ownership-check first — the prompt_results scan only runs after the brand
belongs to the caller's org. Wrong-org or missing brand returns null
(→ 404) before any rows are pulled.

Shape mirrors the existing CitationsOverview but slimmed for tool
consumption:

- totals (domains, urls, citations, results_with_citations, avg)
- source_type_breakdown (per category count + pct)
- top_domains (default 50, max 200) with citation count, results
  citing, usage %, models, article-type guesses
- top_urls (same caps) with title + article type

Reuses the URL-normalization (strip query/fragment + trailing slash) so
the same article cited with tracking params aggregates as one URL.

- web/src/lib/mcp/data.ts: listCitationsFor(auth, params)
- web/src/lib/mcp/server.ts: registers list_citations
- web/src/app/api/mcp/citations/route.ts: GET handler

Closes #97.
gkhngyk added a commit that referenced this pull request May 31, 2026
Bundles the in-product Agent epic (#120, #121), the BYOK rollout on
cloud, the MCP toolkit expansion (#109, #110, #116, #117, #118), the
insights performance fix (#114), and the invite-accept flow rebuild
(#127, #129, #130) into a tagged release.

See CHANGELOG.md for the full notes.
@gkhngyk gkhngyk mentioned this pull request May 31, 2026
gkhngyk added a commit that referenced this pull request May 31, 2026
Bundles the in-product Agent epic (#120, #121), the BYOK rollout on
cloud, the MCP toolkit expansion (#109, #110, #116, #117, #118), the
insights performance fix (#114), and the invite-accept flow rebuild
(#127, #129, #130) into a tagged release.

See CHANGELOG.md for the full notes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants