Consolidate the CDC Formats (changelog format, RFC-51)

For sake of more consistency, we need to consolidate the the changelog mode (currently supported for Flink MoR) and RFC-51 based CDC feature which is a debezium style change log (currently supported for CoW for Spark/Flink)

 
|Format Name|CDC Source Required|Resource Cost(writer)|Resource Cost(reader)|Friendly to Streaming|
|CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the debezium style output is not what Flink needs for e.g)|
|Changelog|Yes|low|low|Yes|

This proposal is to converge onto "CDC" as the path going forward, with the following changes to incorporated for supporting existing users/usage of changelog. CDC format is more generalized in the database world. It offers advantages like not requiring further down-stream processing to say stitch together +U and -U, to update a downstream table. for e.g a field that changed is a key in a downstream table, so we need both +U and -U to compute the updates. 

 

(A) Introduce a new "changelog" output mode for CDC queries, which generates I,+U,-U,D format that changelog needs (this can be constructed easily by processing the output of CDC query as follows)
 * when before is `null`, emit I
 * when after is `null`, emit D
 * when both are non-null, emit two records +U and -U

(B) New writes in 1.0 will *ONLY* produce .cdc changelog format, and stops publishing to _hoodie_operation field 
 # this means, anyone querying this field, using a snapshot query, will break.
 # we will bring this back in 1.1 etc, based on user feedback as a hidden/field in the FlinkCatalog.

(C) To support backwards compatibilty, we fallback to reading `_hoodie_operation` in 0.X tables. 

For CDC reads, we use first use the CDC log if its avaible for that file slice. If not and base file schema has {{_hoodie_operation}} already, we fallback to reading {{_hoodie_operation}} from base file if mode=OP_KEY_ONLY.. Throw error for other modes. 



(D) Snapshot queries from spark, presto, trino etc all work with tables, that have `_hoodie_operation` published. 

 This is already completed for Spark. so others should be easy to do. 

 

(E) We need to complete a review of the CDC schema

ts - should be completion time or instant time?

 

 

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-7538
- Type: Improvement
- Epic: https://issues.apache.org/jira/browse/HUDI-6242
- Fix version(s):
  - 1.2.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consolidate the CDC Formats (changelog format, RFC-51) #16429

this means, anyone querying this field, using a snapshot query, will break.

we will bring this back in 1.1 etc, based on user feedback as a hidden/field in the FlinkCatalog.

JIRA info

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Consolidate the CDC Formats (changelog format, RFC-51) #16429

Description

this means, anyone querying this field, using a snapshot query, will break.

we will bring this back in 1.1 etc, based on user feedback as a hidden/field in the FlinkCatalog.

JIRA info

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions