Support configuring flush_count and max_row_bytes of WriteToBigTable by nguymin4 · Pull Request #34761 · apache/beam

nguymin4 · 2025-04-28T09:47:44Z

Support configuring flush_count and max_row_bytes of WriteToBigTable

Related to this issue: #34760

Our pipeline writes 1.3 million data points every 5 minutes, in a short burst lasting under 1 minute. Since upgrading to 2.64.0 we started observing this error: google.api_core.exceptions.ResourceExhausted: 429 You have reached the limit of total mutations in your queue. Throttle your usage and wait for operations to finish

I suspect the current FLUSH_COUNT=1000 is too low for our usage, so we want to be able to configure flush_count, which will probably resolve the 429 error above.
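As a rough illustration of the reasoning above, the sketch below counts how many flushes one burst would produce at different flush_count values. It assumes one row per data point and that a batch is flushed only when it reaches flush_count rows (the max_row_bytes limit is ignored). The 1.3 million figure and the default of 1000 come from the description; the other values and the helper name `flushes_per_burst` are arbitrary choices for illustration.

```python
import math

ROWS_PER_BURST = 1_300_000  # figure quoted in the description above


def flushes_per_burst(rows: int, flush_count: int) -> int:
    """Number of batches needed if every batch holds at most flush_count rows."""
    if flush_count <= 0:
        raise ValueError("flush_count must be positive")
    return math.ceil(rows / flush_count)


# Compare the current default (1000) with a smaller and a larger setting.
for flush_count in (100, 1000, 10_000):
    print(flush_count, flushes_per_burst(ROWS_PER_BURST, flush_count))
```

The total number of rows written is the same in every case; only the number of requests changes.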

github-actions · 2025-04-28T11:22:50Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

mutianf · 2025-04-28T15:07:07Z

Hi @nguymin4, the 429 error comes from the Bigtable server side because there are too many outstanding requests, and I don't think updating flush_count will solve this issue. In the issue description you mentioned that retrying the mutations is causing the increased latencies. Should the mutations be retried? Would disabling retries help?

nguymin4 · 2025-04-28T15:23:05Z

Hi @mutianf, @andre-sampaio, we have the same output size for all Dataflow jobs, but with 2.46.0 there is no error at all, compared to 2.64.0. So I don't know what is wrong, and the only thing I can think of is to increase flush_count -> reduce the number of requests -> avoid 429 errors.

Does this make sense to you? In my opinion, if upgrading the SDK version causes a performance issue, then the problem is in the SDK and not the Bigtable server itself, isn't it?

I'm out of ideas as well. This is the biggest blocker, and the reason I need to pin our version to 2.46.0.

andre-sampaio · 2025-04-28T15:26:18Z

(Sorry, I accidentally deleted my last comment; adding it again.)

Thanks for your contribution!

The error you are seeing indicates you are sending more mutations than your Bigtable cluster can handle. Requests pile up on the server side until they can be processed, and once too many are queued you start seeing this error message.

Generally speaking, increasing your batch sizes can make the problem worse, especially in bursty workloads. A good signal for whether this is what is happening to you is to check the CPU usage of your cluster during these bursts and see whether it is at ~100%.

I don't see anything wrong with this PR, but you may want to instead add knobs for MAX_OUTSTANDING_ELEMENTS and MAX_OUTSTANDING_BYTES and try reducing those. Which one depends on your use case: if you are writing many small rows, reduce outstanding elements; if you are writing large rows, reduce outstanding bytes. Alternatively, reduce the number of workers on your Beam job.

Let me know if this explanation helps
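To make the idea of a cap on outstanding elements concrete, here is a minimal sketch of client-side flow control. It is not the python-bigtable implementation: `InflightLimiter` and its methods are hypothetical names, and the limit of 2000 is an arbitrary value chosen for the demo. A producer reserves room before sending a batch and blocks while too many elements are still awaiting acknowledgement.

```python
import threading
from typing import Optional


class InflightLimiter:
    """Blocks producers while too many elements are awaiting acknowledgement."""

    def __init__(self, max_outstanding_elements: int):
        if max_outstanding_elements <= 0:
            raise ValueError("max_outstanding_elements must be positive")
        self._max = max_outstanding_elements
        self._outstanding = 0
        self._cond = threading.Condition()

    def acquire(self, n: int, timeout: Optional[float] = None) -> bool:
        """Reserve room for n elements; returns False if the wait timed out."""
        if n > self._max:
            raise ValueError("batch is larger than the outstanding limit")
        with self._cond:
            fits = self._cond.wait_for(
                lambda: self._outstanding + n <= self._max, timeout)
            if fits:
                self._outstanding += n
            return fits

    def release(self, n: int) -> None:
        """Call once the server has acknowledged n elements."""
        with self._cond:
            self._outstanding -= n
            self._cond.notify_all()

    @property
    def outstanding(self) -> int:
        with self._cond:
            return self._outstanding


# Hypothetical usage: two batches fit, a third must wait for a release.
limiter = InflightLimiter(max_outstanding_elements=2000)
limiter.acquire(1000)
limiter.acquire(1000)
blocked = not limiter.acquire(1000, timeout=0.01)
limiter.release(1000)
print(blocked, limiter.outstanding)
```

Lowering the limit makes the client wait sooner rather than queueing more work on the server, which is why it may help where a larger batch size does not.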

nguymin4 · 2025-04-28T15:43:55Z

@andre-sampaio Could you check whether my answer above makes sense to you?

In my opinion, with the same output size, upgrading the SDK version SHOULD NOT cause any performance degradation. As for the two settings you suggested, I honestly don't know; they may or may not help. But in general I feel the python-bigtable library is not welcoming to contributors (I tried in the past), so it could take months or years, or never happen, for this issue to be resolved.

I also created a Google support ticket, as this is considered a huge business impact for us. Should I escalate this to someone else?

andre-sampaio · 2025-04-28T16:14:30Z

hey there @nguymin4!

I agree with you that the SDK is likely the culprit here. I'm sorry about your experience with python-bigtable. I'll create an issue for exposing those 2 parameters and will link it here in a bit, though this isn't a guarantee we can do it (there might be some reason we haven't exposed them that I haven't thought of).

In the meantime I don't mind if we merge this PR, but I'm afraid it won't help much. I would also suggest temporarily reducing the number of workers in your job (if that is feasible).

If you want to open a support ticket for this, it would help us investigate your particular case and make more targeted suggestions.

nguymin4 · 2025-04-28T16:43:29Z

Thanks @andre-sampaio, I believe you understand apache-beam and python-bigtable better than I do, so I will let you decide the fate of this PR. As I mentioned, I'm totally out of ideas now, because this has been a known issue for me for more than 1.5 years. This PR is simply a desperate attempt on my part, as I haven't seen anything new being added.

I did contact Google support with details about the successful and failing Dataflow jobs. I do hope to get this resolved soon. For now I have rolled back to 2.46.0 to avoid triggering our error alerts.

liferoad · 2025-04-28T17:43:04Z

@nguymin4 can you ask the support ticket owner to consult the BigTable team?

nguymin4 · 2025-04-29T20:49:44Z

@andre-sampaio @liferoad I got a suggestion from Vasant D of the Bigtable team:

Reduce the Batch Size: Consider lowering the number of mutations sent in each batch. This can be an effective way to stay within the limit.

So I guess he is referring to lowering the current flush_count from 1000 to 100? But again, I don't know why this doesn't work as expected in versions after 2.46.0.

liferoad · 2025-04-30T00:34:52Z

Since I do not know the support ticket number to get more details about your job, I guess you could lower flush_count or reduce the number of workers as @andre-sampaio mentioned above.

If you want to check whether your PR works, you could easily test this by overriding start_bundle with your own WriteToBigTable.
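The override pattern suggested here can be sketched without Beam installed. Every class below is a stand-in written for illustration: `StubBatcher`, `BaseWriteFn` and `TunedWriteFn` are hypothetical names and do not reproduce the SDK's write DoFn or the Bigtable batcher. The point is only that a subclass can replace start_bundle so the batcher is built with a caller-supplied flush_count.

```python
class StubBatcher:
    """Stand-in batcher: flushes once flush_count rows are buffered."""

    def __init__(self, flush_count):
        self.flush_count = flush_count
        self.rows = []
        self.flushed_batches = []

    def mutate(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.flush_count:
            self.flush()

    def flush(self):
        if self.rows:
            self.flushed_batches.append(self.rows)
            self.rows = []


class BaseWriteFn:
    """Stand-in for a write DoFn whose flush_count is hard-coded."""

    FLUSH_COUNT = 1000

    def start_bundle(self):
        self.batcher = StubBatcher(flush_count=self.FLUSH_COUNT)

    def process(self, row):
        self.batcher.mutate(row)

    def finish_bundle(self):
        self.batcher.flush()


class TunedWriteFn(BaseWriteFn):
    """Overrides start_bundle to build the batcher with a custom flush_count."""

    def __init__(self, flush_count):
        self._flush_count = flush_count

    def start_bundle(self):
        self.batcher = StubBatcher(flush_count=self._flush_count)


def run_bundle(fn, rows):
    """Drives one bundle through fn and returns the size of each flushed batch."""
    fn.start_bundle()
    for row in rows:
        fn.process(row)
    fn.finish_bundle()
    return [len(batch) for batch in fn.batcher.flushed_batches]


print(run_bundle(BaseWriteFn(), range(2500)))
print(run_bundle(TunedWriteFn(flush_count=100), range(250)))
```

In a real pipeline the subclass would extend the SDK's own DoFn and be wrapped in a custom transform, and the attribute names would need to be checked against the installed Beam version.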

nguymin4 · 2025-04-30T11:30:47Z

@liferoad @andre-sampaio We run at most one Dataflow job at a time, and each job has at most one worker of machine type t2d-standard-48. We cannot lower this any further because it would hurt our SLO and performance.

liferoad · 2025-04-30T13:22:46Z

We have the support ticket now and will check these jobs in more detail. Please also share the IDs of the successful and failed jobs through the support ticket.

andre-sampaio · 2025-06-04T14:02:47Z

We will migrate the batcher that Beam is using to the new MutationsBatcher, which provides control over total in-flight mutations.

In the meantime I think we can merge this PR to get this mitigation in for the next release in case the migration doesn't happen by then.

liferoad · 2025-06-04T14:13:30Z

          SchemaAwareExternalTransform.discover_config(
              self._expansion_service, self.URN))

+    self._flush_count = flush_count


We probably need to log a warning when cross-language is enabled, since these two parameters are not supported in that mode.

@andre-sampaio
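One way the suggested warning could look is sketched below. This is an assumption-laden illustration, not the code that was merged: `warn_if_ignored` is a hypothetical helper, the `use_cross_language` argument name is assumed, and the default byte limit is a placeholder (only FLUSH_COUNT=1000 is quoted in this thread).

```python
import logging

_LOGGER = logging.getLogger(__name__)

# FLUSH_COUNT=1000 is quoted in the PR description; the byte limit below is
# a placeholder chosen for illustration.
DEFAULT_FLUSH_COUNT = 1000
DEFAULT_MAX_ROW_BYTES = 5 * 1024 * 1024


def warn_if_ignored(use_cross_language, flush_count, max_row_bytes):
    """Logs a warning and returns True when custom limits would be ignored."""
    if not use_cross_language:
        return False
    custom = []
    if flush_count != DEFAULT_FLUSH_COUNT:
        custom.append("flush_count")
    if max_row_bytes != DEFAULT_MAX_ROW_BYTES:
        custom.append("max_row_bytes")
    if not custom:
        return False
    _LOGGER.warning(
        "%s not supported by the cross-language transform; value(s) ignored.",
        ", ".join(custom))
    return True


# Hypothetical call: a custom flush_count combined with cross-language mode.
print(warn_if_ignored(True, 100, DEFAULT_MAX_ROW_BYTES))
```

Warning only when a non-default value is passed keeps existing cross-language users from seeing a new log line.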

liferoad · 2025-06-04T14:15:16Z

Note that if you migrate to MutationsBatcher, you should not remove these public parameters.

liferoad · 2025-06-11T16:43:28Z

I am going to merge this for now to unblock the users.

github-actions Bot added python io gcp bigtable labels Apr 28, 2025

nguymin4 force-pushed the nguymin4/python-bigtable branch from c507fe4 to 483ba64 Compare April 28, 2025 09:49

nguymin4 force-pushed the nguymin4/python-bigtable branch 6 times, most recently from 7262e6d to 19c2071 Compare April 28, 2025 14:57

Support configuring flush_count and max_row_bytes of WriteToBigTable

b8ecbb0

nguymin4 force-pushed the nguymin4/python-bigtable branch from 19c2071 to b8ecbb0 Compare April 28, 2025 15:05

andre-sampaio approved these changes Jun 4, 2025

liferoad reviewed Jun 4, 2025

liferoad requested a review from derrickaw June 4, 2025 14:13

liferoad added this to the 2.66.0 Release milestone Jun 4, 2025

liferoad merged commit 8a0c08b into apache:master Jun 11, 2025

nguymin4 deleted the nguymin4/python-bigtable branch June 22, 2025 12:29
