
Add Python SDK utilities for benchmarking (similar to fda bench)#5716

Open
Karakatiza666 wants to merge 4 commits into main from python-sdk-bench

Conversation

@Karakatiza666
Contributor

Add benchmarking utilities to the Python SDK (feldera.benchmarking)

The fda bench --upload CLI command collects pipeline performance metrics, formats them as Bencher Metric Format (BMF), and uploads results to a Bencher-compatible server. Until now there was no Python equivalent — users working with Python-based benchmark workloads (e.g. test_tpch.py) had to use the CLI or roll their own polling loop.

This PR adds a feldera/benchmarking.py module that mirrors that functionality.

New public API (all exported from feldera):

  • collect_metrics(pipeline, duration_secs=None) — polls pipeline.stats() in a 1-second loop until pipeline_complete is True or the optional duration elapses. Validates incarnation UUID consistency across samples and raises RuntimeError if any input connector reported errors.
  • BenchmarkMetrics.from_samples(samples) — aggregates the raw snapshots into throughput, peak/min memory, peak/min storage, buffered-input-record statistics, and state amplification ratio.
  • BenchmarkResult — wraps a name, metrics, and timing. Provides to_bmf() (dict), to_json() (pretty-printed BMF string), and format_table() (ASCII table).
  • bench(pipeline, name=None, duration_secs=None) — convenience wrapper that calls collect_metrics and returns a BenchmarkResult.
  • upload_to_bencher(result, project, *, host, token, branch, feldera_client, ...) — POSTs the BMF report to a Bencher-compatible server. Reads BENCHER_API_TOKEN, BENCHER_PROJECT, and BENCHER_HOST from the environment. When feldera_client is provided, enriches the run context with the Feldera instance edition and revision (matching what fda does via get_config()).
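A minimal sketch of the polling contract that collect_metrics is described to implement (a 1-second loop with completion and optional-duration stop conditions). The `collect_samples` helper, the `get_stats` callable, and the dict-shaped snapshots below are illustrative stand-ins, not the actual SDK API:

```python
import time

def collect_samples(get_stats, duration_secs=None, poll_interval=1.0):
    """Poll get_stats() until the snapshot reports pipeline_complete or the
    optional duration elapses; return all collected snapshots."""
    samples = []
    start = time.monotonic()
    while True:
        snapshot = get_stats()
        samples.append(snapshot)
        if snapshot.get("pipeline_complete"):
            break
        if duration_secs is not None and time.monotonic() - start >= duration_secs:
            break
        time.sleep(poll_interval)
    return samples
```

Because the loop only reads stats, the caller keeps full control of the pipeline lifecycle, matching the observation-only design described below.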

Design notes:

  • These utilities are observation-only — they do not start, stop, or otherwise manage pipeline lifetime. Callers retain full control over the pipeline lifecycle, which allows them to configure compilation profile, storage, transactions, etc. before the benchmark window.
  • The metric aggregation logic and BMF structure are a direct translation of crates/fda/src/bench.rs, keeping the two outputs compatible.

@Karakatiza666 Karakatiza666 changed the title from "Add Python SDK utilities for equivalent" to "Add Python SDK utilities for benchmarking (same as fda bench)" on Feb 27, 2026
@Karakatiza666 Karakatiza666 changed the title from "Add Python SDK utilities for benchmarking (same as fda bench)" to "Add Python SDK utilities for benchmarking (similar to fda bench)" on Feb 27, 2026
last = samples[-1]

uptime_s = last.runtime_elapsed_msecs / 1000.0
throughput = int(last.total_processed_records / uptime_s) if uptime_s > 0 else 0
Collaborator

This computes average throughput since pipeline start, not throughput during the measurement window. If the pipeline was running for minutes before collect_metrics was called, this dramatically understates the throughput seen during the benchmark.

The correct formula is delta-based:

first = samples[0]
delta_records = last.total_processed_records - first.total_processed_records
delta_secs = (last.runtime_elapsed_msecs - first.runtime_elapsed_msecs) / 1000.0
throughput = int(delta_records / delta_secs) if delta_secs > 0 else 0

Similarly the state_amplification denominator (input_bytes) is a cumulative total — so it has the same issue when the pipeline was pre-warmed.
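The delta-based approach above can be packaged as a small pure function; the dict-shaped samples and field names mirror the snippet's attributes but are illustrative assumptions, not the PR's actual types:

```python
def window_throughput(first, last):
    """Records/sec over the measurement window, computed from deltas so that
    processing done before the window opened is excluded."""
    delta_records = last["total_processed_records"] - first["total_processed_records"]
    delta_secs = (last["runtime_elapsed_msecs"] - first["runtime_elapsed_msecs"]) / 1000.0
    return int(delta_records / delta_secs) if delta_secs > 0 else 0
```

With a single sample (first == last) the deltas are zero and the function returns 0 instead of dividing by zero.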

Contributor

@gz left a comment

looks good but let's not make this public documentation because it likely has little use for someone using it outside of our org

  • so maybe put it e.g., under the testutils module?
  • we should have at least one test that benchmarks something before we put this in

@gz
Contributor

gz commented Feb 27, 2026

for @abhizer and (@snkas @swanandx): it may be useful to release a utilities Python package on PyPI that contains the testing/qa/benchmarking code that's not core Feldera Python SDK (but depends on SDK functionality)

:header-rows: 1
:widths: 40 60

* - ``fda`` flag
Contributor

I wouldn't explain these as "fda equivalent" args; just document the args.

Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
Signed-off-by: Heorhii Bulakh <bulakh.96@gmail.com>
@Karakatiza666 Karakatiza666 force-pushed the python-sdk-bench branch 3 times, most recently from f4216b7 to e5cc06d on March 9, 2026 11:26
@snkas
Contributor

snkas commented Mar 9, 2026

Could an explanation be added of what is being benchmarked and why? Is it the Feldera instance itself, or is it about a user pipeline? Is this more like an additional monitoring-service helper that does some regular polling?

Signed-off-by: Heorhii Bulakh <bulakh.96@gmail.com>
@Karakatiza666
Contributor Author

Karakatiza666 commented Mar 9, 2026

@snkas are you talking about this description? Not sure what is ambiguous here?

The :mod:`feldera.benchmarking` module provides utilities to collect and upload
benchmark metrics for Feldera pipelines.  It polls :meth:`.Pipeline.stats` in a
loop, aggregates the snapshots into :class:`.BenchmarkMetrics`, and can
optionally upload a
`Bencher Metric Format (BMF) <https://bencher.dev/docs/reference/test-harnesses/>`_
report to a Bencher-compatible server.

.. note::
   These utilities only **observe** a running pipeline — they do not start,
   stop, or otherwise manage pipeline lifetime.  The caller is responsible for
   starting the pipeline before calling :func:`.bench` or
   :func:`.collect_metrics`, and for stopping it afterwards.

This is not a standalone service; rather, it is designed as SDK utilities for use in monitoring tools, tests, etc.

…time_revision to Python SDK

Signed-off-by: Heorhii Bulakh <bulakh.96@gmail.com>
@Karakatiza666 Karakatiza666 marked this pull request as ready for review March 9, 2026 13:33
@Karakatiza666
Contributor Author

Karakatiza666 commented Mar 9, 2026

I tested this PR privately and found it useful when benchmarking a pipeline during a test.

I can move it under feldera.testutils.benchmarking;

we should have at least one test that benchmarks something before we put this in

A chicken-and-egg problem; I suggest merging under e.g. testutils and proceeding from there. I don't have a strong opinion on whether it needs to be in a separate Python package, but that would be a new artifact (which I did not expect to introduce), and "Feldera Python SDK" implies some useful tools, not just a thin API wrapper.

@snkas
Contributor

snkas commented Mar 9, 2026

It seems to me the equivalent of fda bench would be to have pipeline.benchmark(), what's the motivation to have it be separate?

If the pipeline start and stop are not done by the function itself, the utilities seem to be more about monitoring with a specific end condition rather than benchmarking.

One nice-to-have would be for the end condition to also support a user-defined predicate (e.g., a lambda over the pipeline): for benchmarks that are about completely processing some data, many connectors never report completed (there is no guarantee that they end), yet the benchmark should still finish on a meaningful condition rather than just a timeout.
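The user-defined end condition suggested above could be sketched as a predicate over the stats snapshot; `collect_until` and the snapshot shape are hypothetical illustrations, not part of the PR:

```python
import time

def collect_until(get_stats, done, poll_interval=1.0, timeout_secs=None):
    """Poll get_stats() until done(snapshot) is truthy or the optional
    timeout expires; return the collected snapshots."""
    samples = []
    start = time.monotonic()
    while True:
        snapshot = get_stats()
        samples.append(snapshot)
        if done(snapshot):
            break
        if timeout_secs is not None and time.monotonic() - start >= timeout_secs:
            break
        time.sleep(poll_interval)
    return samples
```

A caller benchmarking "process exactly N records" could then pass `done=lambda s: s["total_processed_records"] >= N` instead of relying on connector completion or a fixed duration.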

@snkas
Contributor

snkas commented Mar 9, 2026

It's generally difficult to capture what "benchmarking" means across pipelines: benchmarking implies capturing how well something performs, which can be subjective. It makes sense to keep it in a separate module or testutils until it's more settled, so that it doesn't become an API that we need to keep backward compatible. In the latter case it might be worthwhile to prefix the functions with _ to indicate they are for internal use.

Collaborator

@mythical-fred left a comment

No tests. 723 lines of new logic with zero test coverage. Functions like _stddev, _human_readable_bytes, BenchmarkMetrics.from_samples, _averaged_metrics, and format_table are pure functions — no pipeline, no infrastructure needed. They should have unit tests before this ships.



def _stddev(values: list[float]) -> float:
"""Population standard deviation."""
Collaborator

_stddev, _human_readable_bytes, BenchmarkMetrics.from_samples, _averaged_metrics, and format_table are all pure functions with no external dependencies. These should have unit tests. Edge cases worth covering: empty sample list, 1-sample list (delta = 0, throughput = 0), runs with mismatched state_amplification = None, formatting with zero bytes, and multi-run stddev correctness.

if edition == "Open source":
    context["bencher.dev/v0/repo/hash"] = (
        "de8879fbda0c9e9392e3b94064c683a1b4bae216"
    )
Collaborator

What are these hardcoded hashes? They look like git commit SHAs but bencher.dev/v0/repo/hash is a permanent identifier — these will be wrong the moment anything changes. If this is an internal Bencher convention, add a comment explaining what they represent and why they're static.
