PARQUET-2042: Add support for unwrapping common Protobuf wrappers and…#900
Conversation
|
@mwong38, can you put more information in the Jira on why/what is changed? This is pretty big change and it would help people to review your code. |
|
@shangxinli @mwong38 any way I can help with merge? Currently, proto-to-parquet conversions do not support proper timestamp handling, and it looks like this work addresses the issue. Let me know. |
|
After proto3 made everything optional, there is no way to know whether a primitive has been set or not. That is, you could no longer represent a "nullable" primitive. (They later brought Parquet supports nullable primitives natively. If we were to represent these well-known Protobuf types directly, it will be nested inside a deeper data structure; very inconvenient and a waste of space. What I did here is "unwrap" these primitives and make use of Parquet's nullaibility. Additionally, I convert other well-known types such as Timestamp and Date to Parquet's. Again, rather than represent the data structure in its raw nested form (seconds/nanos or year, month, day, etc), they are converted to Parquet's native or logical representation of Timestamp and Date. You can turn on or off this features in the |
|
@shangxinli is there any reason this work shouldn't be merged, given @mwong38 explanation? Is there something I can help with? |
|
I don't see a reason why this cannot be merge. I will have a look soon. |
|
Can you squash all the commits? |
| @Override | ||
| public void addInt(int value) { | ||
| LocalDate localDate = LocalDate.ofEpochDay(value); | ||
| com.google.type.Date date = com.google.type.Date.newBuilder() |
There was a problem hiding this comment.
I think in this particular case it's more clear to be explicit that this is the Google Protobuf Date, rather than, say, the Java built-in Date.
There was a problem hiding this comment.
If there is no collision on the imports, we should use imports generally.
|
@mwong38 let me know if you want me to help in any way |
|
I am not an expert here by any means, but I have made (very small) changes related to default values in protobuf parquet before and the goal of this change seem good. I assume that the behavior for unwrapped primitives remains unchanged? @belugabehr might have thoughts too. |
1 similar comment
|
I am not an expert here by any means, but I have made (very small) changes related to default values in protobuf parquet before and the goal of this change seem good. I assume that the behavior for unwrapped primitives remains unchanged? @belugabehr might have thoughts too. |
|
@shangxinli any news about merging this version? Are there still any blockers? |
|
Will have another look soon. |
| if (unwrapProtoWrappers) { | ||
| String typeName = descriptor.getMessageType().getFullName(); | ||
| if (typeName.equals(PROTOBUF_TIMESTAMP_TYPE)) { | ||
| return builder.primitive(INT64, getRepetition(descriptor)).as(timestampType(true, TimeUnit.NANOS)); |
There was a problem hiding this comment.
should TimeUnit be something configurable? I know protobuf timestamps support nanosecond resolution but might not be what applications want to store.
There was a problem hiding this comment.
I don't think it's worth complicating the API. The Timestamp common Proto stores time in nanoseconds. There's no good reason to deviate from that or to truncate the resolution. If the user wishes to do more manipulation, it can be done downstream.
There was a problem hiding this comment.
In that case, you can use the default 'TimeUnit.NANOS' while having that configured.
There was a problem hiding this comment.
Isn't that outside the scope of ProtoSchemaConverter/ParquetWriter? I don't think it should go into the business of doing transformations. If the point of ProtoSchemaConverter is to give the closest representation of the Protobuf object in Parquet, then it should be NANOS and nothing more.
There was a problem hiding this comment.
@emkornfield Do you still have a comment on this?
| private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder) { | ||
| if (descriptor.getJavaType() == JavaType.MESSAGE) { | ||
| if (unwrapProtoWrappers) { | ||
| String typeName = descriptor.getMessageType().getFullName(); |
There was a problem hiding this comment.
is there a reason you are using type equality vs descriptor equality? (I haven't checked but I would think .Equals would work? Maybe add a comment if this is intentional.
There was a problem hiding this comment.
That's a good point. I'll change it to compare the Descriptor directly to the Wrapper's Descriptor rather than the name.
|
@mwong38 Can you address the feedback from @emkornfield before we can merge? |
|
@mwong38 anyway I can help with finalizing this PR? |
|
I think we are close to merge this PR. Resolve the conflict and use the imports , then we can merge. |
9bf1961 to
9672ee7
Compare
Done |
|
@shangxinli any reason why this PR hasn't been merged yet? |
|
any reason why this PR hasn't been merged yet? |
|
I just noticed this PR and sorry to see it does not check in. @mwong38 Could you try rebase it one last time? Thanks! |
… logical Timestamps, Date, TimeOfDay
9672ee7 to
95b45d3
Compare
Done. I really hope it's the last time. |
wgtmac
left a comment
There was a problem hiding this comment.
Thanks @mwong38!
@shangxinli @emkornfield I see you have reviewed this PR and there is no more comment to resolve. Do you need to take a look for another pass?
|
Merged |
… logical Timestamps, Date, TimeOfDay
Make sure you have checked all steps below.
Jira
Tests
Commits
Documentation