Add Python SDK utilities for benchmarking (similar to fda bench) #5716
Karakatiza666 wants to merge 4 commits into main

Conversation
Force-pushed 5665e3e to 918052b
python/feldera/benchmarking.py
Outdated
```python
last = samples[-1]
uptime_s = last.runtime_elapsed_msecs / 1000.0
throughput = int(last.total_processed_records / uptime_s) if uptime_s > 0 else 0
```
This computes average throughput since pipeline start, not throughput during the measurement window. If the pipeline was running for minutes before collect_metrics was called, this dramatically understates the throughput seen during the benchmark.
The correct formula is delta-based:

```python
first = samples[0]
delta_records = last.total_processed_records - first.total_processed_records
delta_secs = (last.runtime_elapsed_msecs - first.runtime_elapsed_msecs) / 1000.0
throughput = int(delta_records / delta_secs) if delta_secs > 0 else 0
```

Similarly, the state_amplification denominator (input_bytes) is a cumulative total, so it has the same issue when the pipeline was pre-warmed.
Looks good, but let's not make this public documentation, because it likely has little use for someone outside of our org:
- so maybe put it, e.g., under the testutils module?
- we should have at least one test that benchmarks something before we put this in
python/docs/examples.rst
Outdated
```rst
:header-rows: 1
:widths: 40 60

* - ``fda`` flag
```
I wouldn't explain these as "fda equivalent" args; just document the args.
Force-pushed 918052b to 4bb7827
Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
Force-pushed 4bb7827 to ccf7f2f
Signed-off-by: Heorhii Bulakh <bulakh.96@gmail.com>
Force-pushed f4216b7 to e5cc06d
Could an explanation be added of what is being benchmarked and why? Is it the Feldera instance itself, or is it about a user pipeline? Is this more like an additional monitoring service helper that does some regular polling?
Signed-off-by: Heorhii Bulakh <bulakh.96@gmail.com>
Force-pushed e5cc06d to 11efd85
@snkas are you talking about this description? Not sure what is ambiguous here. This is not a standalone service; rather, it is designed to be SDK utils used as part of monitoring tools, tests, etc.
…time_revision to Python SDK
Signed-off-by: Heorhii Bulakh <bulakh.96@gmail.com>
Force-pushed bea556e to d59b4f7
I tested this PR privately; I found it useful when benchmarking a pipeline during a test. I can move it under

Chicken-and-egg problem; I suggest merging under e.g.
It seems to me the equivalent of

If the pipeline start and stop are not done by the function itself, the utilities seem to be more about monitoring with a specific end condition rather than benchmarking. One nice-to-have would be to have the ending condition also support a user-defined one, via a lambda function or so on the pipeline, since for benchmarks that are about completely processing some data, many connectors do not become
It's generally difficult to capture what "benchmarking" means across pipelines: benchmarking implies capturing how well a pipeline performs, which can be subjective. It makes sense to keep this in a separate module or testutils until it's more settled, so that it doesn't become an API that we need to keep backward compatible. It might be worthwhile for the latter case to prefix
mythical-fred left a comment
No tests. 723 lines of new logic with zero test coverage. Functions like _stddev, _human_readable_bytes, BenchmarkMetrics.from_samples, _averaged_metrics, and format_table are pure functions — no pipeline, no infrastructure needed. They should have unit tests before this ships.
```python
def _stddev(values: list[float]) -> float:
    """Population standard deviation."""
```
_stddev, _human_readable_bytes, BenchmarkMetrics.from_samples, _averaged_metrics, and format_table are all pure functions with no external dependencies. These should have unit tests. Edge cases worth covering: empty sample list, 1-sample list (delta = 0, throughput = 0), runs with mismatched state_amplification = None, formatting with zero bytes, and multi-run stddev correctness.
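A sketch of what such unit tests might look like. Since the module under review isn't available here, a reference `_stddev` implementing the population standard deviation named in the diff's docstring is included purely for illustration; the SDK's actual implementation may differ (e.g. in how it treats an empty list):

```python
import math

def _stddev(values: list[float]) -> float:
    """Population standard deviation (reference version for these tests).
    Assumes an empty list yields 0.0; the real helper may raise instead."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def test_stddev_empty():
    assert _stddev([]) == 0.0

def test_stddev_single_sample():
    # One sample: no spread, so the deviation is zero.
    assert _stddev([42.0]) == 0.0

def test_stddev_known_value():
    # Population stddev of [2, 4, 4, 4, 5, 5, 7, 9] is exactly 2.
    assert _stddev([2, 4, 4, 4, 5, 5, 7, 9]) == 2.0
```

The same pattern extends to the other pure helpers: feed each edge case (empty samples, one sample, `None` amplification, zero bytes) through the function and assert the exact expected output.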
```python
if edition == "Open source":
    context["bencher.dev/v0/repo/hash"] = (
        "de8879fbda0c9e9392e3b94064c683a1b4bae216"
    )
```
What are these hardcoded hashes? They look like git commit SHAs but bencher.dev/v0/repo/hash is a permanent identifier — these will be wrong the moment anything changes. If this is an internal Bencher convention, add a comment explaining what they represent and why they're static.
Add benchmarking utilities to the Python SDK (feldera.benchmarking)
The fda bench --upload CLI command collects pipeline performance metrics, formats them as Bencher Metric Format (BMF), and uploads results to a Bencher-compatible server. Until now there was no Python equivalent — users working with Python-based benchmark workloads (e.g. test_tpch.py) had to use the CLI or roll their own polling loop.
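For context, a BMF document is plain JSON keyed by benchmark name, then by measure, with a `value` and optional `lower_value`/`upper_value` bounds. The benchmark and measure names below are made up for illustration and are not what this PR emits:

```python
import json

# Minimal Bencher Metric Format (BMF) payload. Top-level keys are benchmark
# names; each maps measure names to a value with optional bounds.
bmf = {
    "tpch_q1": {  # hypothetical benchmark name
        "throughput": {
            "value": 1_250_000.0,       # e.g. records/s over the window
            "lower_value": 1_100_000.0,  # optional lower bound
            "upper_value": 1_400_000.0,  # optional upper bound
        }
    }
}
print(json.dumps(bmf, indent=2))
```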
This PR adds a feldera/benchmarking.py module that mirrors that functionality.
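For illustration, the polling loop such a module builds on might look like the sketch below; `collect_samples` and the `get_stats` callable are hypothetical names, not the SDK's actual API:

```python
import time
from typing import Callable, Any

def collect_samples(
    get_stats: Callable[[], Any],
    duration_s: float = 30.0,
    interval_s: float = 1.0,
) -> list:
    """Poll a stats callable at a fixed interval for the given duration.

    `get_stats` stands in for whatever SDK call returns pipeline metrics;
    the returned list of samples can then be reduced to windowed metrics.
    """
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(get_stats())
        time.sleep(interval_s)
    return samples
```

Using `time.monotonic()` rather than wall-clock time keeps the window length stable even if the system clock is adjusted mid-run.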
New public API (all exported from feldera):
Design notes: