Skip to content

Consolidate the CDC Formats (changelog format, RFC-51) #16429

@hudi-bot

Description

@hudi-bot

For sake of more consistency, we need to consolidate the the changelog mode (currently supported for Flink MoR) and RFC-51 based CDC feature which is a debezium style change log (currently supported for CoW for Spark/Flink)

 
|Format Name|CDC Source Required|Resource Cost(writer)|Resource Cost(reader)|Friendly to Streaming|
|CDC|No|low/high|low/high (based on logging modes we choose)|No (the debezium style output is not what Flink needs for e.g)|
|Changelog|Yes|low|low|Yes|

This proposal is to converge onto "CDC" as the path going forward, with the following changes to incorporated for supporting existing users/usage of changelog. CDC format is more generalized in the database world. It offers advantages like not requiring further down-stream processing to say stitch together +U and -U, to update a downstream table. for e.g a field that changed is a key in a downstream table, so we need both +U and -U to compute the updates. 

 

(A) Introduce a new "changelog" output mode for CDC queries, which generates I,+U,-U,D format that changelog needs (this can be constructed easily by processing the output of CDC query as follows)

  • when before is null, emit I
  • when after is null, emit D
  • when both are non-null, emit two records +U and -U

(B) New writes in 1.0 will ONLY produce .cdc changelog format, and stops publishing to _hoodie_operation field 

this means, anyone querying this field, using a snapshot query, will break.

we will bring this back in 1.1 etc, based on user feedback as a hidden/field in the FlinkCatalog.

(C) To support backwards compatibilty, we fallback to reading _hoodie_operation in 0.X tables. 

For CDC reads, we use first use the CDC log if its avaible for that file slice. If not and base file schema has {{_hoodie_operation}} already, we fallback to reading {{_hoodie_operation}} from base file if mode=OP_KEY_ONLY.. Throw error for other modes. 

(D) Snapshot queries from spark, presto, trino etc all work with tables, that have _hoodie_operation published. 

 This is already completed for Spark. so others should be easy to do. 

 

(E) We need to complete a review of the CDC schema

ts - should be completion time or instant time?

 

 

JIRA info

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions