Enable partial updates for CDC work payload #16354

@hudi-bot

Description

OLTP workloads on upstream databases often update, delete, or insert different columns of a table on each operation. Currently, Hudi can only support partial updates when the same columns are mutated in a given write to Hudi (e.g. Spark SQL ETLs with MERGE INTO (MIT) or UPDATE statements). Here, we explore what it takes to support a smarter storage format that encodes only the changed columns into the log, along with the different implementation options.
h2. Goals

Enable partial update functionality for all existing and potential future CDC workloads without huge modification or duplication.

Performance parity with current full-record updates or partial updates across the same set of columns

Reduce storage costs by storing only the changed columns.

Should also result in computation cost reductions by scanning/processing less data

Should not affect the scalability of the existing ingestion system. The number of files generated for partial updates should not increase dramatically.

 


Comments

03/May/24 02:13;vinoth;Punting this to 1.1 

[1.1] Implement support on top of data blocks.

We need to pass the changed-column information and the operation all the way to the write handles, using a field in HoodieRecord.
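To make the idea concrete, here is a rough sketch of a record that carries its operation and a changed-column bitmap alongside only the mutated values. It is plain Python for illustration, not Hudi code: `PartialRecord`, `Operation`, `SCHEMA`, and the helper functions are all hypothetical names, and the bitmap encoding (bit i set means column i of the schema changed) is an assumption, not the actual HoodieRecord layout.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List


class Operation(Enum):
    # Hypothetical operation codes; not Hudi's own enum.
    INSERT = "I"
    UPDATE = "U"
    DELETE = "D"


# Hypothetical table schema; column order defines bit positions.
SCHEMA: List[str] = ["id", "name", "email", "balance"]


def encode_changed_columns(schema: List[str], changed) -> int:
    """Return an int bitmap where bit i is set if schema[i] was changed."""
    bitmap = 0
    for i, col in enumerate(schema):
        if col in changed:
            bitmap |= 1 << i
    return bitmap


def decode_changed_columns(schema: List[str], bitmap: int) -> List[str]:
    """Return the column names whose bits are set, in schema order."""
    return [col for i, col in enumerate(schema) if bitmap & (1 << i)]


@dataclass(frozen=True)
class PartialRecord:
    # Assumed shape of what a write handle would receive per record.
    key: str
    operation: Operation
    changed_columns: int      # bitmap over SCHEMA
    values: Dict[str, Any]    # only the changed columns


def make_partial_record(key: str, operation: Operation,
                        schema: List[str], changes: Dict[str, Any]) -> PartialRecord:
    unknown = set(changes) - set(schema)
    if unknown:
        raise ValueError(f"columns not in schema: {sorted(unknown)}")
    bitmap = encode_changed_columns(schema, changes)
    return PartialRecord(key, operation, bitmap, dict(changes))


record = make_partial_record(
    "42", Operation.UPDATE, SCHEMA,
    {"email": "a@example.com", "balance": 10},
)
```

With this shape, a writer would only need `record.values` plus the bitmap to know which columns to encode, and the operation travels with the record instead of being inferred later.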

... 

[1.1] Implement support on top of cdc data blocks.

We can track similar bitmaps for CDC data blocks as well.

We need to extend the new file group reader to also merge base and CDC blocks (not just base and data blocks).
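A possible shape for the merge step, sketched in plain Python under stated assumptions: the base file is modeled as a dict of full rows keyed by record key, and each log entry is a hypothetical `(key, op, bitmap, values)` tuple applied in commit order. `merge_file_group` is an illustrative name, not the file group reader API, and the same loop is assumed to apply whether the entries come from data blocks or CDC blocks.

```python
from typing import Any, Dict, List, Tuple

# (record key, operation code, changed-column bitmap, changed values)
LogEntry = Tuple[str, str, int, Dict[str, Any]]


def merge_file_group(base: Dict[str, Dict[str, Any]],
                     log_entries: List[LogEntry],
                     schema: List[str]) -> Dict[str, Dict[str, Any]]:
    """Apply partial-update log entries, in order, on top of base rows."""
    # Copy rows so the caller's base snapshot is left untouched.
    merged = {key: dict(row) for key, row in base.items()}
    for key, op, bitmap, values in log_entries:
        if op == "D":
            merged.pop(key, None)
            continue
        # Inserts (or updates to keys missing from base) start from an all-null row.
        row = merged.setdefault(key, {col: None for col in schema})
        for i, col in enumerate(schema):
            if bitmap & (1 << i):   # only overwrite columns flagged as changed
                row[col] = values[col]
    return merged


schema = ["id", "name", "email"]
base = {
    "1": {"id": "1", "name": "Ann", "email": "ann@old.example"},
    "2": {"id": "2", "name": "Bob", "email": "bob@example.com"},
}
log = [
    ("1", "U", 0b100, {"email": "ann@new.example"}),   # only email changed
    ("2", "D", 0b000, {}),                              # delete
    ("3", "I", 0b111, {"id": "3", "name": "Cy", "email": "cy@example.com"}),
]
merged = merge_file_group(base, log, schema)
```

Columns whose bit is not set keep their base value, which is what lets the log store only the changed columns.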
