Enable partial updates for CDC work payload #16354
Open
0 of 5 issues completed
Labels: from-jira, priority:high (Significant impact; potential bugs), status:pr-available (Pull request available), type:devtask (Development tasks and maintenance work)
OLTP workloads on upstream databases often update, delete, or insert different columns of a table on each operation. Currently, Hudi supports partial updates only when the same columns are mutated in a given write to Hudi (e.g., Spark SQL ETLs with MERGE INTO (MIT) or UPDATE statements). Here, we explore what it takes to support a smarter storage format that encodes only the changed columns into the log, along with the different implementations.
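To make the idea concrete, here is a minimal sketch in Python (not Hudi code) of a log that stores only the columns each operation changed, and of rebuilding the latest row by replaying that log over a base record. The `LogEntry` type, the `op` values, and `apply_log` are hypothetical names chosen for illustration; how Hudi would actually lay this out is exactly what this issue explores.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class LogEntry:
    """Hypothetical log entry: one CDC operation on one record key."""
    key: str
    op: str  # "insert", "update", or "delete"
    # Only the columns mutated by this operation, not the full row.
    changed: Dict[str, Any] = field(default_factory=dict)


def apply_log(base: Optional[Dict[str, Any]],
              entries: List[LogEntry]) -> Optional[Dict[str, Any]]:
    """Replay entries in order on top of the base row.

    Returns None when the row is absent or has been deleted.
    """
    row = dict(base) if base is not None else None  # never mutate the base
    for entry in entries:
        if entry.op == "delete":
            row = None
        elif entry.op == "insert":
            row = dict(entry.changed)
        elif entry.op == "update":
            # Assumption of this sketch: an update to a missing row is skipped.
            if row is not None:
                row.update(entry.changed)
        else:
            raise ValueError(f"unknown operation: {entry.op}")
    return row


base = {"id": "u1", "name": "Ann", "email": "ann@example.com", "city": "Oslo"}
log = [
    # Each operation touches a different column, as OLTP workloads often do.
    LogEntry("u1", "update", {"email": "ann@new.example.com"}),
    LogEntry("u1", "update", {"city": "Bergen"}),
]
latest = apply_log(base, log)
```

Each log entry here carries one column instead of four, which is where the storage and scan savings listed in the goals would come from.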
h2. Goals
Enable partial update functionality for all existing and potential future CDC workloads without huge modification or duplication.
Achieve performance parity with current full-record updates or with partial updates across the same set of columns.
Reduce storage costs by storing only the changed columns.
Reduce computation costs by scanning and processing less data.
Do not affect the scalability of the existing ingestion system: the number of files generated for partial updates should not increase dramatically.
JIRA info
Comments
03/May/24 02:13;vinoth;Punting this to 1.1
[1.1] Implement support on top of data blocks.
We need to pass the changed-column information and the operation all the way to the write handles, using a field in HoodieRecord.
...
[1.1] Implement support on top of cdc data blocks.
We can track similar bitmaps for CDC data blocks as well.
We need to extend the new file group reader to also merge base and CDC blocks (not just base and data blocks).
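The comment mentions bitmaps but does not specify their layout. As one possible reading, the sketch below (Python, not Hudi code) sets bit i when the column at schema ordinal i changed and stores the changed values in schema order, so a reader can merge a partial record onto a base row without storing column names per record. `SCHEMA`, `encode_partial`, and `merge_partial` are hypothetical names, and the bit layout is an assumption.

```python
from typing import Any, Dict, List, Tuple

# Assumed table schema, in ordinal order (illustrative only).
SCHEMA: List[str] = ["id", "name", "email", "city"]


def encode_partial(changed: Dict[str, Any],
                   schema: List[str] = SCHEMA) -> Tuple[int, List[Any]]:
    """Return (bitmap, values).

    Bit i of the bitmap is set when schema[i] changed; values holds the
    changed values in schema order.
    """
    unknown = set(changed) - set(schema)
    if unknown:
        raise KeyError(f"columns not in schema: {sorted(unknown)}")
    bitmap = 0
    values: List[Any] = []
    for i, col in enumerate(schema):
        if col in changed:
            bitmap |= 1 << i
            values.append(changed[col])
    return bitmap, values


def merge_partial(base: Dict[str, Any], bitmap: int, values: List[Any],
                  schema: List[str] = SCHEMA) -> Dict[str, Any]:
    """Overlay the changed values onto a copy of the base row."""
    merged = dict(base)
    remaining = iter(values)
    for i, col in enumerate(schema):
        if bitmap & (1 << i):
            merged[col] = next(remaining)
    return merged


base_row = {"id": "u1", "name": "Ann", "email": "ann@example.com", "city": "Oslo"}
bitmap, values = encode_partial({"city": "Bergen", "email": "ann@new.example.com"})
merged = merge_partial(base_row, bitmap, values)
```

A reader that merges base and CDC blocks could apply the same overlay step per block in commit order; that ordering is not shown here.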