
A more scalable buffer cache#5788

Merged
gz merged 1 commit into main from sieve on Mar 16, 2026

Conversation

@gz
Contributor

@gz gz commented Mar 10, 2026

storage: add s3-fifo buffer cache, more config options

The buffer cache used to be simple. A single mutex protected it.
That design worked because only one thread accessed the cache.

We now want multiple threads to run merges in parallel.
That requires a cache that many threads can access without collapsing under contention.

This change introduces a new multi-threaded buffer cache. It also adds a new,
supposedly better eviction policy: the S3-FIFO algorithm.
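As a rough illustration of how S3-FIFO differs from LRU, here is a hypothetical, single-threaded sketch (not this PR's implementation, which builds on quick_cache): new keys enter a small probationary FIFO; keys re-referenced there are promoted to a main FIFO; keys evicted from the small FIFO unseen are remembered in a ghost FIFO, so a quick return skips probation.

```rust
use std::collections::{HashMap, VecDeque};

/// Simplified S3-FIFO sketch: a small FIFO for new keys, a main FIFO for
/// keys re-referenced while in small, and a ghost FIFO that remembers
/// recently evicted keys so one-hit-wonders don't pollute main.
struct S3Fifo<V> {
    small: VecDeque<u64>,
    main: VecDeque<u64>,
    ghost: VecDeque<u64>,
    entries: HashMap<u64, (V, u8)>, // value + access counter (capped at 3)
    small_cap: usize,
    main_cap: usize,
}

impl<V> S3Fifo<V> {
    fn new(capacity: usize) -> Self {
        let small_cap = (capacity / 10).max(1); // ~10% goes to the small queue
        Self {
            small: VecDeque::new(),
            main: VecDeque::new(),
            ghost: VecDeque::new(),
            entries: HashMap::new(),
            small_cap,
            main_cap: capacity - small_cap,
        }
    }

    fn get(&mut self, key: u64) -> Option<&V> {
        let e = self.entries.get_mut(&key)?;
        e.1 = (e.1 + 1).min(3); // bump frequency on hit
        Some(&e.0)
    }

    fn insert(&mut self, key: u64, value: V) {
        if self.entries.contains_key(&key) {
            return;
        }
        self.entries.insert(key, (value, 0));
        if self.ghost.contains(&key) {
            // Seen recently: skip probation, go straight to main.
            self.ghost.retain(|k| *k != key);
            self.main.push_back(key);
            self.evict_main();
        } else {
            self.small.push_back(key);
            self.evict_small();
        }
    }

    fn evict_small(&mut self) {
        while self.small.len() > self.small_cap {
            let key = self.small.pop_front().unwrap();
            if self.entries[&key].1 > 0 {
                // Re-referenced while in small: promote to main.
                self.main.push_back(key);
                self.evict_main();
            } else {
                // One-hit wonder: drop the value, remember the key in ghost.
                self.entries.remove(&key);
                self.ghost.push_back(key);
                if self.ghost.len() > self.main_cap {
                    self.ghost.pop_front();
                }
            }
        }
    }

    fn evict_main(&mut self) {
        while self.main.len() > self.main_cap {
            let key = self.main.pop_front().unwrap();
            let freq = self.entries[&key].1;
            if freq > 0 {
                // Second chance: decrement and reinsert at the tail.
                self.entries.get_mut(&key).unwrap().1 = freq - 1;
                self.main.push_back(key);
            } else {
                self.entries.remove(&key);
            }
        }
    }
}
```

The ghost queue is the key difference from plain FIFO or LRU: scan-like traffic flows through the small queue without displacing the main queue's working set.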

For compatibility, this change also adds configuration flags to revert the behavior:

```
"dev_tweaks": {
    "buffer_cache_allocation_strategy": "per_thread" | "global" | "shared_per_worker_pair",
    "buffer_cache_strategy": "s3_fifo" | "lru"
},

// new defaults: s3_fifo AND shared_per_worker_pair
// previously: lru AND per_thread
```
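For example, restoring the previous behavior should just be a matter of setting both tweaks back to their old defaults:

```
"dev_tweaks": {
    "buffer_cache_allocation_strategy": "per_thread",
    "buffer_cache_strategy": "lru"
}
```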

Describe Manual Test Plan

Ran a few pipelines, wrote lots of tests and benchmark programs.

Checklist

  • Unit tests added/updated

Breaking Changes?

Potential for performance regressions, since this changes a critical piece of our infrastructure. Benchmarks are promising, though.
We can revert to the old cache with a dev-tweak.

@gz gz requested a review from blp March 10, 2026 06:06

@mythical-fred mythical-fred left a comment


Two issues to resolve before merge.

@gz gz force-pushed the sieve branch 3 times, most recently from fac284f to 9137a0f Compare March 10, 2026 22:47
@gz

This comment was marked as outdated.

@gz gz marked this pull request as draft March 11, 2026 04:12
@gz

This comment was marked as outdated.

@lalithsuresh

This comment was marked as outdated.

@gz

This comment was marked as outdated.

@blp
Member

blp commented Mar 11, 2026

@gz If you can put more of the PR description into the commit messages, then that would be great (the PR description is very good). I understand that the graphs, etc. wouldn't be able to go in there.

@gz

This comment was marked as outdated.

@blp

This comment was marked as outdated.

@gz gz force-pushed the sieve branch 8 times, most recently from bdc4f4d to ffdefb1 Compare March 12, 2026 17:21
@gz gz marked this pull request as ready for review March 12, 2026 17:25
@gz gz requested a review from mythical-fred March 12, 2026 18:11
@gz
Contributor Author

gz commented Mar 12, 2026

New benchmarks for s3-fifo eviction; still promising:

Single-threaded:

[graphs: single_threaded_speedup, single_threaded_hit_rates]

Multi-threaded:

[graphs: multi_threaded_speedup, multi_threaded_scaling, multi_threaded_hit_rates]


@mythical-fred mythical-fred left a comment


LGTM. `publish = true` is in. I was wrong about the memory default — the old code was also 256 MiB per thread instance, so the total is unchanged. Don't forget to pre-create the crate on crates.io with Trusted Publishing set up (both the repository and workflow entries) before the release runs.

@lalithsuresh
Contributor

@gz these numbers look weaker than sieve right? What changed?


@mythical-fred mythical-fred left a comment


Three minor nits below — none are blockers. Approving.

@gz gz force-pushed the sieve branch 2 times, most recently from cb6acef to 96d3439 Compare March 12, 2026 20:38

@mythical-fred mythical-fred left a comment


Two blockers, see inline.

@gz gz force-pushed the sieve branch 2 times, most recently from 4e7bcc3 to d6a59bb Compare March 12, 2026 23:09
@gz
Contributor Author

gz commented Mar 12, 2026

@gz these numbers look weaker than sieve right? What changed?

The absolute hit rate is lower because I used a smaller cache size to regenerate the graph.
I earlier regenerated the graphs with sieve included, and s3-fifo matched sieve's performance for a zipf distribution.

For scalability, I think (not confirmed) s3-fifo is weaker because the quick_cache underneath uses a regular RwLock rather than a sharded RwLock. I opened an issue about it: arthurprs/quick-cache#108. I can probably fix it myself, but let's see what the maintainer says. This is the same problem I initially had with the sieve cache prototype.
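The sharding workaround mentioned above can be sketched like this (a hypothetical illustration of the technique, not quick_cache's internals): hash each key to one of N independently locked shards, so writers on different shards don't serialize behind a single RwLock.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// Hypothetical sharded map: one RwLock per shard instead of one global lock.
struct ShardedCache<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedCache<K, V> {
    fn new(num_shards: usize) -> Self {
        let shards = (0..num_shards).map(|_| RwLock::new(HashMap::new())).collect();
        Self { shards }
    }

    // Pick a shard from the key's hash; different keys mostly land on
    // different locks, so concurrent writers rarely contend.
    fn shard(&self, key: &K) -> &RwLock<HashMap<K, V>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    fn insert(&self, key: K, value: V) {
        self.shard(&key).write().unwrap().insert(key, value);
    }

    fn get(&self, key: &K) -> Option<V> {
        self.shard(key).read().unwrap().get(key).cloned()
    }
}
```

With a single RwLock, every write takes the lock exclusively and stalls all readers; sharding bounds that contention to 1/N of the keyspace.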

@gz
Contributor Author

gz commented Mar 12, 2026

$some-workload 16GiB CACHE

LRU:
    "buffer_cache_strategy": "lru"
foreground_cache_hit_rate_percent	78.6%	79.3%	79.3%	78.5%	79.2%	79.5%	79.4%	79.4%	78.5%	79.5%
background_cache_hit_rate_percent	71.1%	66.5%	63.8%	77.0%	68.2%	69.5%	67.7%	65.0%	63.8%	77.0%

s3_fifo:
    "buffer_cache_strategy": "s3_fifo"
    "buffer_cache_allocation_strategy": "per_thread"
foreground_cache_hit_rate_percent	78.7%	78.4%	78.4%	78.7%	77.7%	78.0%	78.2%	78.3%	77.7%	78.7%
background_cache_hit_rate_percent	36.7%	36.3%	36.1%	36.9%	36.9%	35.9%	36.1%	36.9%	35.9%	36.9%

    "buffer_cache_strategy": "s3_fifo",
    "buffer_cache_allocation_strategy": "global"
foreground_cache_hit_rate_percent	94.9%	95.4%	94.8%	95.0%	94.7%	95.4%	95.0%	94.9%	94.7%	95.4%
background_cache_hit_rate_percent	76.1%	77.0%	72.6%	75.7%	75.1%	75.2%	73.8%	77.8%	72.6%	77.8%

    "buffer_cache_strategy": "s3_fifo"
    "buffer_cache_allocation_strategy": "shared_per_worker_pair"
foreground_cache_hit_rate_percent	95.6%	96.0%	96.5%	94.6%	96.3%	97.0%	97.4%	96.6%	94.6%	97.4%
background_cache_hit_rate_percent	77.8%	79.8%	79.4%	77.0%	80.2%	79.6%	78.5%	78.4%	77.0%	80.2%

    "buffer_cache_strategy": "s3_fifo"
    "buffer_cache_allocation_strategy": "shared_per_worker_pair"
    "merger": "push_merger"
foreground_cache_hit_rate_percent	97.4%	97.9%	97.8%	98.5%	97.9%	97.4%	97.7%	97.5%	97.4%	98.5%
background_cache_hit_rate_percent	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%

$some-workload cache_mib: null, 8 workers = 8*256*2 (so set to 4096 MiB)

    "buffer_cache_strategy": "lru"
foreground_cache_hit_rate_percent	80.0%	78.8%	79.1%	79.2%	79.6%	78.9%	78.8%	79.0%	78.8%	80.0%
background_cache_hit_rate_percent	21.2%	21.6%	23.6%	21.9%	24.8%	22.2%	22.2%	22.4%	21.2%	24.8%

    "buffer_cache_strategy": "s3_fifo"
    "buffer_cache_allocation_strategy": "per_thread"
foreground_cache_hit_rate_percent	78.9%	79.3%	78.9%	79.0%	78.9%	79.4%	79.5%	78.8%	78.8%	79.5%
background_cache_hit_rate_percent	8.4%	8.6%	8.3%	8.2%	8.6%	9.2%	8.5%	8.6%	8.2%	9.2%

    "buffer_cache_strategy": "s3_fifo",
    "buffer_cache_allocation_strategy": "global"
foreground_cache_hit_rate_percent	80.3%	80.4%	80.3%	80.2%	79.8%	80.3%	81.1%	79.2%	79.2%	81.1%
background_cache_hit_rate_percent	21.0%	20.6%	21.5%	19.7%	20.9%	23.6%	23.9%	21.2%	19.7%	23.9%

    "buffer_cache_strategy": "s3_fifo",
    "buffer_cache_allocation_strategy": "shared_per_worker_pair",
foreground_cache_hit_rate_percent	80.0%	80.5%	80.4%	80.1%	80.4%	80.0%	79.7%	79.9%	79.7%	80.5%
background_cache_hit_rate_percent	20.3%	22.7%	21.4%	21.2%	21.1%	22.2%	22.6%	23.0%	20.3%	23.0%

    "buffer_cache_allocation_strategy": "shared_per_worker_pair",
    "buffer_cache_strategy": "s3_fifo",
    "merger": "push_merger"
foreground_cache_hit_rate_percent	81.8%	82.0%	82.1%	81.7%	81.3%	81.6%	81.6%	81.7%	81.3%	82.1%
background_cache_hit_rate_percent	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%


$some-workload cache_mib: null, 1 worker = 256*2 (so set to 512 MiB)

    "buffer_cache_strategy": "lru"
foreground_cache_hit_rate_percent	77.4%	77.4%	77.4%
background_cache_hit_rate_percent	0.8%	0.8%	0.8%

    "buffer_cache_strategy": "s3_fifo",
    "buffer_cache_allocation_strategy": "per_thread"
foreground_cache_hit_rate_percent	72.6%	72.6%	72.6%
background_cache_hit_rate_percent	5.8%	5.8%	5.8%

    "buffer_cache_strategy": "s3_fifo",
    "buffer_cache_allocation_strategy": "global"
foreground_cache_hit_rate_percent	70.0%	70.0%	70.0%
background_cache_hit_rate_percent	14.5%	14.5%	14.5%

    "buffer_cache_strategy": "s3_fifo",
    "buffer_cache_allocation_strategy": "shared_per_worker_pair",
foreground_cache_hit_rate_percent	72.6%	72.6%	72.6%
background_cache_hit_rate_percent	21.4%	21.4%	21.4%

    "buffer_cache_allocation_strategy": "shared_per_worker_pair",
    "buffer_cache_strategy": "s3_fifo",
    "merger": "push_merger"
foreground_cache_hit_rate_percent	73.5%	73.5%	73.5%
background_cache_hit_rate_percent	100.0%	100.0%	100.0%


u64Njoin-no-match

    "buffer_cache_strategy": "lru"
 foreground_cache_hit_rate_percent	99.2%	98.9%	99.1%	99.0%	99.3%	99.2%	99.0%	99.2%	98.9%	99.3%
background_cache_hit_rate_percent	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%
 
    "buffer_cache_allocation_strategy": "shared_per_worker_pair",
    "buffer_cache_strategy": "sieve"
foreground_cache_hit_rate_percent	45.4%	39.3%	48.3%	50.2%	42.5%	47.2%	44.3%	46.5%	39.3%	50.2%
background_cache_hit_rate_percent	10.7%	11.0%	11.3%	12.7%	13.4%	12.2%	11.6%	11.7%	10.7%	13.4%

    "buffer_cache_allocation_strategy": "shared_per_worker_pair",
    "buffer_cache_strategy": "s3_fifo"
foreground_cache_hit_rate_percent	96.0%	95.4%	94.9%	95.7%	94.9%	94.9%	95.1%	95.9%	94.9%	96.0%
background_cache_hit_rate_percent	21.7%	20.7%	20.3%	21.2%	19.4%	12.2%	14.1%	13.3%	12.2%	21.7%

    "buffer_cache_allocation_strategy": "global",
    "buffer_cache_strategy": "s3_fifo"
foreground_cache_hit_rate_percent	67.0%	59.0%	50.3%	72.0%	53.8%	59.2%	62.4%	60.6%	50.3%	72.0%
background_cache_hit_rate_percent	22.1%	32.2%	31.7%	22.3%	21.9%	29.3%	27.5%	20.8%	20.8%	32.2%

    "buffer_cache_allocation_strategy": "global",
    "buffer_cache_strategy": "sieve"
foreground_cache_hit_rate_percent	28.4%	28.5%	20.6%	24.2%	26.0%	21.1%	20.4%	23.8%	20.4%	28.5%
background_cache_hit_rate_percent	11.9%	12.3%	11.0%	13.2%	12.9%	11.8%	9.9%	11.3%	9.9%	13.2%

Here are some metrics from pipelines.

The one interesting takeaway: for the ingest-heavy workload u64Njoin-no-match, a global cache is worse (with s3_fifo) than shared_per_worker_pair.
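For reference, the default sizing used in the runs above (`cache_mib: null`) appears to work out as workers × 2 threads × 256 MiB. A small helper capturing that arithmetic (the constants are inferred from the notes above, not taken from the code):

```rust
// Assumed defaults, inferred from the benchmark notes:
// 256 MiB per thread instance, two threads (foreground + background)
// per worker when the cache size is left unset (cache_mib: null).
const MIB_PER_THREAD: u64 = 256;
const THREADS_PER_WORKER: u64 = 2;

fn default_cache_mib(workers: u64) -> u64 {
    workers * THREADS_PER_WORKER * MIB_PER_THREAD
}
```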

@gz gz requested a review from blp March 12, 2026 23:32

@mythical-fred mythical-fred left a comment


Retracting the prior REQUEST_CHANGES — publish = true is already in the Cargo.toml (I was wrong). One non-blocking nit below. LGTM.

Member

@blp blp left a comment


I read most of this in detail (not all of the tests, and not some of the code that just moved) and it's good work. I especially appreciate how many tests it adds, and the benchmark.

None of my suggestions are important.

@gz gz force-pushed the sieve branch 2 times, most recently from 654d787 to 23e4c36 Compare March 16, 2026 19:26
@gz gz enabled auto-merge March 16, 2026 19:26
The buffer cache used to be simple. A single mutex protected it.
That design worked because only one thread accessed the cache.

We now want multiple threads to run merges in parallel.
That requires a cache that many threads can access without collapsing under contention.

This change introduces a new multi-threaded buffer cache. It also adds a new,
supposedly better eviction policy: the S3-FIFO algorithm.

For compatibility, this change also adds configuration flags to revert the behavior:

```
"dev_tweaks": {
    "buffer_cache_allocation_strategy": "per_thread" | "global" | "shared_per_worker_pair",
    "buffer_cache_strategy": "s3_fifo" | "lru"
  },

// new defaults: s3_fifo AND shared_per_worker_pair
// previously: lru AND per_thread
```

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
@gz gz added this pull request to the merge queue Mar 16, 2026
Merged via the queue into main with commit 1f4159b Mar 16, 2026
1 check passed
@gz gz deleted the sieve branch March 16, 2026 23:08