Support configuring flush_count and max_row_bytes of WriteToBigTable #34761

Merged
liferoad merged 1 commit into apache:master from nguymin4:nguymin4/python-bigtable
Jun 11, 2025

@nguymin4

Contributor

Support configuring flush_count and max_row_bytes of WriteToBigTable


Related to this issue: #34760

Our pipeline writes 1.3 million data points every 5 minutes, in a short burst lasting less than 1 minute. Since upgrading to 2.64.0 we have been seeing this error: google.api_core.exceptions.ResourceExhausted: 429 You have reached the limit of total mutations in your queue. Throttle your usage and wait for operations to finish

I suspect the current FLUSH_COUNT=1000 is too low for our usage, so we want to be able to configure flush_count, which will probably resolve the 429 error above.
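
For readers who want to see what the proposed knobs might look like in a pipeline, here is a minimal sketch. It assumes the two new keyword arguments are named flush_count and max_row_bytes, as in the PR title, and that WriteToBigTable accepts them next to project_id, instance_id and table_id. The helper names bigtable_write_kwargs and build_write_transform are hypothetical and not part of Beam.

```python
def bigtable_write_kwargs(project_id, instance_id, table_id,
                          flush_count=None, max_row_bytes=None):
    """Collect keyword arguments for WriteToBigTable and validate the knobs.

    Knobs left as None are omitted so that the library defaults apply.
    """
    kwargs = {
        "project_id": project_id,
        "instance_id": instance_id,
        "table_id": table_id,
    }
    for name, value in (("flush_count", flush_count),
                        ("max_row_bytes", max_row_bytes)):
        if value is None:
            continue
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
        kwargs[name] = value
    return kwargs


def build_write_transform(**options):
    """Build the transform; assumes a Beam release that includes this PR."""
    # Imported lazily so the helper above can be used without Beam installed.
    from apache_beam.io.gcp.bigtableio import WriteToBigTable
    return WriteToBigTable(**bigtable_write_kwargs(**options))
```

A call such as build_write_transform(project_id="my-project", instance_id="my-instance", table_id="my-table", flush_count=100) would then be applied to a PCollection of DirectRow objects; the identifiers and the value 100 are placeholders, not recommendations.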

@github-actions

Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@nguymin4 nguymin4 force-pushed the nguymin4/python-bigtable branch 6 times, most recently from 7262e6d to 19c2071 Compare April 28, 2025 14:57
@nguymin4 nguymin4 force-pushed the nguymin4/python-bigtable branch from 19c2071 to b8ecbb0 Compare April 28, 2025 15:05
@mutianf

mutianf commented Apr 28, 2025

Contributor

Hi @nguymin4, the 429 error comes from the Bigtable server side because there are too many outstanding requests, and I don't think updating flush_count will solve this issue. In the issue description you mentioned that retrying the mutations is causing the increased latencies. Should the mutations be retried? Would disabling retries help?

@nguymin4

nguymin4 commented Apr 28, 2025

Contributor Author

Hi @mutianf, @andre-sampaio, the output size is the same for all our Dataflow jobs, but with 2.46.0 there are no errors at all, whereas with 2.64.0 there are. So I don't know what is wrong here, and the only thing I can think of is to increase flush_count -> reduce the number of requests -> avoid the 429 errors.

Does this make sense to you? In my opinion, if upgrading the SDK version causes a performance issue, then the problem is in the SDK and not in the Bigtable server itself, isn't it?

I'm out of ideas as well. This is the biggest blocker for us, and it is why I need to pin our version to 2.46.0.
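
The "increase flush_count -> reduce number of requests" reasoning can be made concrete with simple arithmetic. The sketch below assumes that each data point is one row, that every flush is triggered by the row count alone (ignoring max_row_bytes, retries and bundle boundaries), and that one flush corresponds to one write request. The function name flushes_needed is hypothetical.

```python
import math


def flushes_needed(total_rows, flush_count):
    """Estimate how many count-triggered flushes total_rows would produce."""
    if flush_count <= 0:
        raise ValueError("flush_count must be positive")
    return math.ceil(total_rows / flush_count)


ROWS_PER_BURST = 1_300_000  # figure quoted in the PR description

for flush_count in (100, 1_000, 10_000):
    print(f"flush_count={flush_count}: "
          f"{flushes_needed(ROWS_PER_BURST, flush_count)} flushes")
```

This only counts requests. The total number of mutations sent during the burst is the same in every case, which is the concern raised in the replies that follow.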

@andre-sampaio


(Sorry, I accidentally deleted my last comment; adding it again.)

Thanks for your contribution!

The error you are seeing indicates that you are sending more mutations than your Bigtable cluster can handle. Requests pile up on the server side until they can be processed, and once too many are queued you start seeing this error message.

Generally speaking, increasing your batch sizes can make the problem worse, especially in bursty workloads. A good signal for whether this is what is happening is the CPU usage of your cluster during these bursts: check whether it is at ~100%.

I don't see anything wrong with this PR, but you may want to add knobs for MAX_OUTSTANDING_ELEMENTS and MAX_OUTSTANDING_BYTES instead and try reducing those. Which one depends on your use case: if you are writing many small rows, reduce the outstanding elements; if you are writing large rows, reduce the outstanding bytes. Alternatively, reduce the number of workers on your Beam job.

Let me know if this explanation helps.
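
To illustrate what limits on outstanding elements and outstanding bytes mean in practice, here is a toy flow controller. It is only a sketch of the general idea, not the python-bigtable implementation; the class OutstandingLimiter and its methods are hypothetical names.

```python
import threading


class OutstandingLimiter:
    """Toy flow controller that caps in-flight elements and bytes."""

    def __init__(self, max_elements, max_bytes):
        self._max_elements = max_elements
        self._max_bytes = max_bytes
        self._elements = 0
        self._bytes = 0
        self._cond = threading.Condition()

    def _fits(self, elements, nbytes):
        return (self._elements + elements <= self._max_elements
                and self._bytes + nbytes <= self._max_bytes)

    def reserve(self, elements, nbytes, timeout=None):
        """Wait until the batch fits under both caps; return False on timeout.

        A batch larger than either cap can never fit, so callers should
        pass a timeout or split such batches first.
        """
        with self._cond:
            ok = self._cond.wait_for(
                lambda: self._fits(elements, nbytes), timeout)
            if ok:
                self._elements += elements
                self._bytes += nbytes
            return ok

    def release(self, elements, nbytes):
        """Give capacity back once a batch has been acknowledged."""
        with self._cond:
            self._elements -= elements
            self._bytes -= nbytes
            self._cond.notify_all()
```

A writer would call reserve before sending a batch and release once the batch is acknowledged. Lowering max_elements throttles workloads with many small rows, while lowering max_bytes throttles workloads with large rows, which matches the distinction made in the comment above.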

@nguymin4

nguymin4 commented Apr 28, 2025

Contributor Author

@andre-sampaio Could you check whether my answer above makes sense to you?

In my opinion, with the same output size, upgrading the SDK version SHOULD NOT cause any performance degradation. As for the two settings you suggested, I honestly don't know; they may or may not help. But in general I feel that the python-bigtable library is not welcoming to contributors (I tried in the past), so it could take months or years, or forever, for this issue to be resolved.

I also created a Google support ticket for this, as it is considered to have a huge business impact on us. Should I escalate this to someone else?

@andre-sampaio


hey there @nguymin4!

I agree with you that the SDK is likely the culprit here. I'm sorry about your experience with python-bigtable. I'll create an issue for exposing those 2 parameters and will link it here in a bit, though this isn't a guarantee that we can do it (there might be some reason we haven't exposed them that I haven't thought of).

In the meantime I don't mind if we merge this PR, but I'm afraid it won't help much. I would also suggest temporarily reducing the number of workers in your job (if that is feasible).

If you want to open a support ticket for this, it would help us investigate your particular case, which would let us make more directed suggestions.

@nguymin4

Contributor Author

Thanks @andre-sampaio, I believe you understand apache-beam and python-bigtable better than I do, so I will let you decide the fate of this PR. As I mentioned, I'm totally out of ideas now, because this has been a known issue for me for more than 1.5 years. This PR is simply a desperate attempt on my part, as I haven't seen anything new being added.

I did contact Google support with details about the successful and erroneous Dataflow jobs. I do hope to get this resolved soon. For now I have rolled back to 2.46.0 to avoid triggering our error alerts.

@liferoad

Contributor

@nguymin4 can you ask the support ticket owner to consult the BigTable team?

@nguymin4

Contributor Author

@andre-sampaio @liferoad I got a suggestion from Vasant D (Bigtable team):

Reduce the Batch Size: Consider lowering the number of mutations sent in each batch. This can be an effective way to stay within the limit.

So I guess he is referring to lowering the current flush_count from 1000 to 100? But again, I don't know why this doesn't work as expected after 2.46.0.

@liferoad

Contributor

> @andre-sampaio @liferoad I got a suggestion from Vasant D (Bigtable team):
>
> Reduce the Batch Size: Consider lowering the number of mutations sent in each batch. This can be an effective way to stay within the limit.
>
> So I guess he is referring to lowering the current flush_count from 1000 to 100? But again, I don't know why this doesn't work as expected after 2.46.0.

Since I do not know the support ticket number to get more details about your job, I guess you could lower flush_count or reduce the number of workers as @andre-sampaio mentioned above.

If you want to check whether your PR works, you could easily test this by overriding start_bundle with your own WriteToBigTable.

@nguymin4

Contributor Author

@liferoad @andre-sampaio We run at most one Dataflow job at a time, and each job has at most one worker, with machine type t2d-standard-48. We cannot lower this any further because it would hurt our SLO and performance.

@liferoad

Contributor

> @liferoad @andre-sampaio We run at most one Dataflow job at a time, and each job has at most one worker, with machine type t2d-standard-48. We cannot lower this any further because it would hurt our SLO and performance.

We have the support ticket now and will check these jobs in more detail. Please also share the IDs of the successful and failed jobs through the support ticket.

@andre-sampaio


We will migrate the batcher that Beam is using to the new MutationsBatcher, which provides control over the total number of in-flight mutations.

In the meantime, I think we can merge this PR so that this mitigation gets into the next release in case the migration doesn't happen by then.

SchemaAwareExternalTransform.discover_config(
self._expansion_service, self.URN))

self._flush_count = flush_count

Contributor

We probably need to log a warning when the cross-language implementation is used, since these two parameters are not supported there.
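
A minimal sketch of what such a warning could look like follows. It assumes that unset options are represented by None, which may not match how the real parameters default, and the function name ignored_cross_language_options is hypothetical rather than taken from the PR.

```python
import logging

_LOGGER = logging.getLogger(__name__)


def ignored_cross_language_options(use_cross_language, **options):
    """Warn about options that were set but that cross-language mode ignores.

    Returns the sorted names of the ignored options (empty if none).
    """
    if not use_cross_language:
        return []
    ignored = sorted(
        name for name, value in options.items() if value is not None)
    if ignored:
        _LOGGER.warning(
            "Ignoring %s: not supported by the cross-language implementation.",
            ", ".join(ignored))
    return ignored
```

Returning the ignored names as well as logging them keeps the check easy to unit test without capturing log output.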

@liferoad liferoad requested a review from derrickaw June 4, 2025 14:13
@liferoad liferoad added this to the 2.66.0 Release milestone Jun 4, 2025
@liferoad

liferoad commented Jun 4, 2025

Contributor

> We will migrate the batcher that Beam is using to the new MutationsBatcher, which provides control over the total number of in-flight mutations.
>
> In the meantime, I think we can merge this PR so that this mitigation gets into the next release in case the migration doesn't happen by then.

Note that if you migrate to MutationsBatcher, these public parameters should not be removed.

@liferoad

Contributor

I am going to merge this for now to unblock the users.

@liferoad liferoad merged commit 8a0c08b into apache:master Jun 11, 2025
@nguymin4 nguymin4 deleted the nguymin4/python-bigtable branch June 22, 2025 12:29