Conversation
This is a very cheap filter that can speed up ingest significantly. The idea is to track the min/max key for every batch. For seek_key_exact, if the value we're seeking is not inside the range, we skip the batch.

Some ingest-heavy benchmarks:

just bloom filter:

╭────────────────────────┬───────────┬──────────┬───────╮
│ Metric                 │ Value     │ Lower    │ Upper │
├────────────────────────┼───────────┼──────────┼───────┤
│ Throughput (records/s) │ 2695496   │ -        │ -     │
│ Memory                 │ 14.28 GiB │ 1.78 GiB │ -     │
│ Storage                │ 79.45 GiB │ 110 B    │ -     │
│ Uptime [ms]            │ 302742    │ -        │ -     │
│ State Amplification    │ 0.43      │ -        │ -     │
╰────────────────────────┴───────────┴──────────┴───────╯

range+bloom filter:

╭────────────────────────┬────────────┬──────────┬───────╮
│ Metric                 │ Value      │ Lower    │ Upper │
├────────────────────────┼────────────┼──────────┼───────┤
│ Throughput (records/s) │ 4035088    │ -        │ -     │
│ Memory                 │ 23.4 GiB   │ 2.42 GiB │ -     │
│ Storage                │ 112.51 GiB │ 110 B    │ -     │
│ Uptime [ms]            │ 303292     │ -        │ -     │
│ State Amplification    │ 0.41       │ -        │ -     │
╰────────────────────────┴────────────┴──────────┴───────╯

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
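The check itself is tiny. A minimal sketch of the idea, where `KeyRange` and `contains` are illustrative names, not the actual feldera types:

```rust
// Minimal sketch of the min/max batch filter described above.
// KeyRange and contains() are illustrative names, not the real implementation.
#[derive(Debug)]
struct KeyRange<K: Ord> {
    min: K,
    max: K,
}

impl<K: Ord> KeyRange<K> {
    /// A point lookup can skip the batch entirely when this returns false.
    fn contains(&self, key: &K) -> bool {
        self.min <= *key && *key <= self.max
    }
}

fn main() {
    let range = KeyRange { min: 10u64, max: 99 };
    assert!(range.contains(&42)); // inside [10, 99]: must consult the batch
    assert!(!range.contains(&7)); // outside [10, 99]: skip without any I/O
}
```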
let mut max = key_factory.default_box();

match root.read::<K, A>(&self.file)? {
    TreeBlock::Data(data_block) => unsafe {
I think this should be a trait.

More importantly, this needs some eyes from @blp to tell me if it's doing the right thing.

Getting the range should be a trait.

It could be a method on TreeBlock, but I don't see the value of defining a trait.
            )
            .unwrap_storage(),
            weight: factories.weight_factory().default_box(),
            key_range: None,
What does None mean? Empty or unknown?

Both; I'm not sure if it can ever happen, but the APIs are written in such a way that it's possible.
}

/// Extends the upper bound when keys arrive in sorted order.
pub(crate) fn extend_to(&mut self, max: &K) {
I hope this is called only for the last value in a batch.
}

fn push_key(&mut self, key: &K) {
    if let Some(range) = &mut self.key_range {
So it is called for every element added? How do you know that this is larger than the min? Could you have a debug_assert for that?

There is an assert in extend_to; maybe it can be done in a less naive way. I'll have a look.

Writer1::write0() asserts that the keys it's called with are in order.
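Under that invariant (keys are pushed in sorted order), extend_to only ever needs to bump the upper bound. A sketch of the debug_assert discussed above, simplified to a plain Ord + Clone key rather than the boxed dynamic keys the file layer uses:

```rust
// Sketch of extend_to with the sorted-order debug_assert discussed above.
// Simplified: plain Ord + Clone keys instead of boxed dynamic keys.
struct KeyRange<K: Ord + Clone> {
    min: K,
    max: K,
}

impl<K: Ord + Clone> KeyRange<K> {
    /// Extends the upper bound; callers must supply keys in sorted order.
    fn extend_to(&mut self, max: &K) {
        debug_assert!(*max >= self.max, "keys must arrive in sorted order");
        self.max = max.clone();
    }
}

fn main() {
    let mut range = KeyRange { min: 1u64, max: 3 };
    range.extend_to(&7); // ok: 7 >= 3
    assert_eq!(range.max, 7);
    assert_eq!(range.min, 1); // the lower bound never moves
}
```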
fn seek_key_exact(&mut self, key: &K, hash: Option<u64>) -> bool {
    let hash = hash.unwrap_or_else(|| key.default_hash());
-   if !self.wset.maybe_contains_key(hash) {
+   if !self.wset.filters.maybe_contains_key(key, hash) {
Is the default_hash() still computed someplace?
mythical-fred left a comment:

Two blockers: (1) two unsafe blocks in reader.rs are missing // SAFETY: comments (see inline), and (2) docs.feldera.com/docs/operations/metrics.md should be updated with entries for the four new range filter metrics (range_filter_size_bytes, range_filter_hits_count, range_filter_misses_count, range_filter_hit_rate_percent). Looking at past PRs (e.g., the transaction metrics PR), the pattern is a manual edit to metrics.md alongside the code change.
let mut max = key_factory.default_box();

match root.read::<K, A>(&self.file)? {
    TreeBlock::Data(data_block) => unsafe {
Missing // SAFETY: comment. key_range() is a safe function, so each unsafe {} block inside it needs to document the invariant. For this DataBlock arm: why is calling data_block.key(0, ...) and data_block.key(n_values() - 1, ...) sound? (Bounds validity, factory type match.)

Fair point from @blp: if the convention in this file is to propagate unsafe up to the public function signature (rather than wrapping it in a safe function with // SAFETY:), making key_range() an unsafe fn with a /// # Safety doc comment works equally well. Either approach satisfies the requirement; I just want the safety contract stated somewhere.
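For reference, the two conventions under discussion look like this (toy example, not the reader.rs code):

```rust
// Toy illustration of the two conventions discussed above; not reader.rs code.

// Option 1: keep the function safe and justify each unsafe block with a
// SAFETY comment stating the invariant that makes it sound.
fn first_byte(bytes: &[u8]) -> Option<u8> {
    if bytes.is_empty() {
        return None;
    }
    // SAFETY: the slice is non-empty (checked above), so index 0 is in bounds.
    Some(unsafe { *bytes.get_unchecked(0) })
}

// Option 2: propagate the unsafety to the signature and document the
// caller's obligations in a `# Safety` section.
/// Returns the first byte without a bounds check.
///
/// # Safety
///
/// `bytes` must be non-empty.
unsafe fn first_byte_unchecked(bytes: &[u8]) -> u8 {
    unsafe { *bytes.get_unchecked(0) }
}
```

Either way the safety contract ends up written down; the difference is whether the function discharges it internally or pushes it onto callers.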
        data_block.key(&factories, 0, min.as_mut());
        data_block.key(&factories, data_block.n_values() - 1, max.as_mut());
    },
    TreeBlock::Index(index_block) => unsafe {
Missing // SAFETY: comment. For the IndexBlock arm: why is index_block.get_bound(0, ...) and get_bound(n_children()*2-1, ...) sound? (Bounds validity, factory type match.)
We add stats for the range filter. This led to some refactoring: since we now have two filters (with another one on the way), we consolidate the stats into a single struct that can be re-used across filters. It also revealed a performance issue with the current filter stats: because this function is extremely hot, wrapping the hit and miss atomics in CachePadded led to a 25% increase for the ingest benchmark.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Some benchmarking revealed that this optimization alone does not help much for, e.g., Delta Lake connectors, because keys get ingested mostly at random (maybe some Z-ordering or liquid clustering stuff would help, but who knows). It helps for connectors that ingest in semi-linear order (e.g., datagen; maybe Postgres/Kafka as well).
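A toy model of why arrival order matters (illustrative code, not feldera APIs; the interleaved order stands in for random arrival):

```rust
// Toy model of the observation above: sorted ingest gives each batch a
// narrow, disjoint key range, so a point lookup can skip almost every batch;
// out-of-order ingest (interleaved here, as a stand-in for random arrival)
// makes every batch span nearly the whole key space, so nothing is skippable.

fn batch_ranges(keys: &[u64], batch: usize) -> Vec<(u64, u64)> {
    keys.chunks(batch)
        .map(|c| (*c.iter().min().unwrap(), *c.iter().max().unwrap()))
        .collect()
}

fn skippable(ranges: &[(u64, u64)], key: u64) -> usize {
    // A batch can be skipped for a point lookup when the key is outside [lo, hi].
    ranges.iter().filter(|&&(lo, hi)| key < lo || key > hi).count()
}

fn main() {
    // 1000 distinct keys of the form r * 1000 + q, r in 0..10, q in 0..100.
    let sorted: Vec<u64> = (0..1000u64).map(|i| (i / 100) * 1000 + i % 100).collect();
    // Same keys, but cycling through the ten "partitions" on every insert.
    let interleaved: Vec<u64> = (0..1000u64).map(|i| (i % 10) * 1000 + i / 10).collect();

    let query = 5000; // one of the ingested keys

    // Sorted ingest: 9 of the 10 batches can be skipped for this lookup.
    assert_eq!(skippable(&batch_ranges(&sorted, 100), query), 9);
    // Interleaved ingest: every batch spans roughly [10k, 9000+10k], so none can.
    assert_eq!(skippable(&batch_ranges(&interleaved, 100), query), 0);
}
```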
let mut max = key_factory.default_box();

match root.read::<K, A>(&self.file)? {
    TreeBlock::Data(data_block) => unsafe {
///
/// The bounds are loaded from the root node when first requested and can
/// then be cached by higher-level batch types.
pub fn key_range(&self) -> Result<Option<(Box<K>, Box<K>)>, Error> {
I think we've been making the public functions in the reader unsafe if they are unsafe internally, because these functions don't mask the unsafety; they are as unsafe as the functions they call. (The unsafety is because rkyv deserialization is unsafe.)
}

#[test]
fn one_column_key_range() {
It's good to have a test.
I think it would be better to add to the existing tests, too. The new addition would be analogous to test_bloom(), which is also a check that only applies to the first column in a file. I'd expect that it would use expected0 to get the expected first and last key and then call key_range and verify that the results are the same.
if let Some(range) = &mut self.key_range {
    range.extend_to(key);
} else {
    self.key_range = Some(KeyRange::from_refs(key, key));
}
This clones every key we write, which will be expensive for large keys. It's the easy approach, but it's not necessary; we have at least two ways to avoid it:
- The Writer could recover the key range from what it wrote, which is still in memory in Writer1::close and Writer2::close, since it writes the top-level index or data block as the last thing it does there, and then return it from Writer1::close and Writer2::close along with the bloom filter, or from Writer1::into_reader or Writer2::into_reader.
- Read it back in Reader::new() since it's probably still in the cache (we just wrote it).
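The first option might look roughly like this (a toy sketch with illustrative names, not the real Writer1):

```rust
// Toy sketch of the first suggestion above: derive the key range once at
// close() from data the writer still buffers, instead of cloning every key
// into the range as it is pushed. Names are illustrative, not the real Writer1.
#[derive(Default)]
struct ToyWriter {
    block: Vec<String>, // stand-in for the last block, still in memory at close
}

impl ToyWriter {
    fn push(&mut self, key: &str) {
        // No per-key range bookkeeping here.
        self.block.push(key.to_owned());
    }

    /// Recovers the key range from the buffered block: two clones total,
    /// rather than one clone per key written.
    fn close(self) -> Option<(String, String)> {
        let first = self.block.first()?.clone();
        let last = self.block.last()?.clone();
        Some((first, last))
    }
}
```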
impl Add for FilterStats {
    type Output = Self;

    fn add(mut self, rhs: Self) -> Self::Output {
        self.add_assign(rhs);
        self
    }
}

impl AddAssign for FilterStats {
    fn add_assign(&mut self, rhs: Self) {
        self.size_byte += rhs.size_byte;
        self.hits += rhs.hits;
        self.misses += rhs.misses;
    }
}
I noticed the other day that we use derive_more. I think we could just write #[derive(Add, Sum)] to get these. It's a matter of taste whether you like that, though. (I just noticed this; I think we could use it elsewhere and don't.)
hits: CachePadded<AtomicUsize>,
misses: CachePadded<AtomicUsize>,
The padding makes this structure big, probably 384 bytes?
Do you expect hits and misses to be accessed by different CPUs? If not, they could go in the same CachePadded.
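If both counters are indeed bumped by the same CPU, sharing one padded slot halves the footprint. A self-contained sketch, where `Padded` stands in for crossbeam_utils::CachePadded (whose alignment is 64 or 128 bytes depending on the architecture):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for crossbeam_utils::CachePadded so the sketch is self-contained.
#[repr(align(128))]
struct Padded<T>(T);

// Both counters share one padded slot: 128 bytes total instead of 256,
// which is fine when the same CPU updates hits and misses.
struct Counters {
    hits: AtomicUsize,
    misses: AtomicUsize,
}

struct FilterCounters {
    inner: Padded<Counters>,
}

impl FilterCounters {
    fn new() -> Self {
        Self {
            inner: Padded(Counters {
                hits: AtomicUsize::new(0),
                misses: AtomicUsize::new(0),
            }),
        }
    }

    fn record(&self, hit: bool) {
        let c = &self.inner.0;
        if hit {
            c.hits.fetch_add(1, Ordering::Relaxed);
        } else {
            c.misses.fetch_add(1, Ordering::Relaxed);
        }
    }
}
```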
Describe Manual Test Plan
Tested manually with a few pipelines.
Checklist
Describe Incompatible Changes
No incompatible changes.