For consistency, we need to consolidate the changelog mode (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a Debezium-style change log (currently supported for CoW on Spark/Flink).
|Format Name|CDC Source Required|Resource Cost (writer)|Resource Cost (reader)|Friendly to Streaming|
|CDC|No|low/high|low/high (based on the logging mode we choose)|No (e.g. the Debezium-style output is not what Flink needs)|
|Changelog|Yes|low|low|Yes|
This proposal is to converge onto "CDC" as the path going forward, with the following changes to be incorporated to support existing users/usage of changelog. The CDC format is more general in the database world. It offers advantages like not requiring further downstream processing to, say, stitch together +U and -U in order to update a downstream table. For example, if a field that changed is a key in a downstream table, we need both +U and -U to compute the updates.
(A) Introduce a new "changelog" output mode for CDC queries, which generates the I, +U, -U, D format that changelog needs. This can be constructed easily by processing the output of a CDC query as follows:
- when before is null, emit I
- when after is null, emit D
- when both are non-null, emit two records, +U and -U
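The mapping in (A) can be sketched in a few lines of Python. This is only an illustration, not Hudi code: the function name `cdc_to_changelog` and the dict-based row type are hypothetical, and two details are assumptions not stated in this ticket: the -U record carries the before image and is emitted ahead of the +U record (the ordering Flink uses for update-before/update-after), and a record with both images null is rejected.

```python
from typing import Any, Dict, Iterator, Optional, Tuple

Row = Dict[str, Any]


def cdc_to_changelog(
    before: Optional[Row], after: Optional[Row]
) -> Iterator[Tuple[str, Row]]:
    """Turn one CDC record (before/after images) into changelog rows.

    Hypothetical helper: yields (operation, row) pairs using the
    I, +U, -U, D operations named in the proposal.
    """
    if before is None and after is None:
        # Assumption: such a record is invalid, so fail loudly.
        raise ValueError("CDC record has neither a before nor an after image")
    if before is None:
        # No previous image: the row was inserted.
        yield ("I", after)
    elif after is None:
        # No new image: the row was deleted.
        yield ("D", before)
    else:
        # Both images present: an update becomes two changelog rows.
        # Assumption: -U (old values) is emitted before +U (new values).
        yield ("-U", before)
        yield ("+U", after)


if __name__ == "__main__":
    for op, row in cdc_to_changelog({"id": 1, "v": "a"}, {"id": 1, "v": "b"}):
        print(op, row)
```

Because the update case carries both images, a consumer that keys on a changed field can retract the old key with -U and insert the new one with +U.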
(B) New writes in 1.0 will ONLY produce the .cdc changelog format and will stop publishing to the _hoodie_operation field.
This means any snapshot query that reads this field will break.
We will bring this back in 1.1 or later, based on user feedback, as a hidden field in the FlinkCatalog.
(C) To support backwards compatibility, we fall back to reading _hoodie_operation in 0.X tables.
For CDC reads, we first use the CDC log if it is available for that file slice. If it is not, and the base file schema already has {{_hoodie_operation}}, we fall back to reading {{_hoodie_operation}} from the base file if mode=OP_KEY_ONLY. We throw an error for other modes.
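The read-path decision in (C) can be written down as a small function. This is a sketch under stated assumptions, not the actual reader: `ReadSource` and `choose_cdc_read_source` are hypothetical names, the mode is passed as a plain string, and the case where there is no CDC log and the base file lacks `_hoodie_operation` is not covered by this ticket, so the sketch assumes it is also an error.

```python
from enum import Enum
from typing import Set

OPERATION_FIELD = "_hoodie_operation"


class ReadSource(Enum):
    """Hypothetical labels for where a CDC read gets its change data."""
    CDC_LOG = "cdc_log"
    BASE_FILE_OPERATION_FIELD = "base_file_operation_field"


def choose_cdc_read_source(
    has_cdc_log: bool, base_file_fields: Set[str], mode: str
) -> ReadSource:
    """Pick the source for a CDC read of one file slice."""
    if has_cdc_log:
        # Preferred path: the file slice has a CDC log.
        return ReadSource.CDC_LOG
    if OPERATION_FIELD in base_file_fields:
        if mode == "OP_KEY_ONLY":
            # 0.X table: fall back to the operation stored in the base file.
            return ReadSource.BASE_FILE_OPERATION_FIELD
        raise ValueError(
            f"mode={mode} cannot be served from {OPERATION_FIELD}; "
            "only OP_KEY_ONLY is supported for the fallback"
        )
    # Assumption: with neither a CDC log nor the field, the read fails.
    raise ValueError("file slice has no CDC log and no " + OPERATION_FIELD)
```

Keeping the decision in one place makes the error cases explicit, which is useful since other modes are expected to fail rather than silently return partial change data.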
(D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that have _hoodie_operation published.
This is already completed for Spark, so the others should be easy to do.
(E) We need to complete a review of the CDC schema
ts - should be completion time or instant time?
JIRA info