Skip to content

PARQUET-2042: Add support for unwrapping common Protobuf wrappers and…#900

Merged
shangxinli merged 1 commit into
apache:masterfrom
mwong38:PARQUET-2042
Nov 27, 2023
Merged

PARQUET-2042: Add support for unwrapping common Protobuf wrappers and…#900
shangxinli merged 1 commit into
apache:masterfrom
mwong38:PARQUET-2042

Conversation

@mwong38

@mwong38 mwong38 commented May 3, 2021

Copy link
Copy Markdown
Contributor

… logical Timestamps, Date, TimeOfDay

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@shangxinli

Copy link
Copy Markdown
Contributor

@mwong38, can you put more information in the Jira on why/what is changed? This is pretty big change and it would help people to review your code.

@sheinbergon

Copy link
Copy Markdown

@shangxinli @mwong38 any way I can help with merge? Currently, proto-to-parquet conversions do not support proper timestamp handling, and it looks like this work addresses the issue.

Let me know.

@mwong38

mwong38 commented Mar 6, 2022

Copy link
Copy Markdown
Contributor Author

After proto3 made everything optional, there is no way to know whether a primitive has been set or not. That is, you could no longer represent a "nullable" primitive. (They later brought optional keyword back, but the damage is done). The solution was to wrap a primitive in a message. For example, to represent a nullable double, there is the DoubleValue. It's analogous to Java's boxed Double.

Parquet supports nullable primitives natively. If we were to represent these well-known Protobuf types directly, it will be nested inside a deeper data structure; very inconvenient and a waste of space. What I did here is "unwrap" these primitives and make use of Parquet's nullaibility.

Additionally, I convert other well-known types such as Timestamp and Date to Parquet's. Again, rather than represent the data structure in its raw nested form (seconds/nanos or year, month, day, etc), they are converted to Parquet's native or logical representation of Timestamp and Date. You can turn on or off this features in the ProtoSchemaConverter constructor.

@sheinbergon

sheinbergon commented Mar 6, 2022

Copy link
Copy Markdown

@shangxinli is there any reason this work shouldn't be merged, given @mwong38 explanation? Is there something I can help with?

@shangxinli

Copy link
Copy Markdown
Contributor

I don't see a reason why this cannot be merge. I will have a look soon.

@shangxinli

Copy link
Copy Markdown
Contributor

Can you squash all the commits?

@Override
public void addInt(int value) {
LocalDate localDate = LocalDate.ofEpochDay(value);
com.google.type.Date date = com.google.type.Date.newBuilder()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use import?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this particular case it's more clear to be explicit that this is the Google Protobuf Date, rather than, say, the Java built-in Date.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no collision on the imports, we should use imports generally.

@sheinbergon

Copy link
Copy Markdown

@mwong38 let me know if you want me to help in any way

@dossett

dossett commented Mar 22, 2022

Copy link
Copy Markdown
Contributor

I am not an expert here by any means, but I have made (very small) changes related to default values in protobuf parquet before and the goal of this change seem good. I assume that the behavior for unwrapped primitives remains unchanged?

@belugabehr might have thoughts too.

1 similar comment
@dossett

dossett commented Mar 22, 2022

Copy link
Copy Markdown
Contributor

I am not an expert here by any means, but I have made (very small) changes related to default values in protobuf parquet before and the goal of this change seem good. I assume that the behavior for unwrapped primitives remains unchanged?

@belugabehr might have thoughts too.

@sheinbergon

Copy link
Copy Markdown

@shangxinli any news about merging this version? Are there still any blockers?

@shangxinli

Copy link
Copy Markdown
Contributor

Will have another look soon.

Comment thread parquet-protobuf/pom.xml
if (unwrapProtoWrappers) {
String typeName = descriptor.getMessageType().getFullName();
if (typeName.equals(PROTOBUF_TIMESTAMP_TYPE)) {
return builder.primitive(INT64, getRepetition(descriptor)).as(timestampType(true, TimeUnit.NANOS));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should TimeUnit be something configurable? I know protobuf timestamps support nanosecond resolution but might not be what applications want to store.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's worth complicating the API. The Timestamp common Proto stores time in nanoseconds. There's no good reason to deviate from that or to truncate the resolution. If the user wishes to do more manipulation, it can be done downstream.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, you can use the default 'TimeUnit.NANOS' while having that configured.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that outside the scope of ProtoSchemaConverter/ParquetWriter? I don't think it should go into the business of doing transformations. If the point of ProtoSchemaConverter is to give the closest representation of the Protobuf object in Parquet, then it should be NANOS and nothing more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield Do you still have a comment on this?

private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
if (descriptor.getJavaType() == JavaType.MESSAGE) {
if (unwrapProtoWrappers) {
String typeName = descriptor.getMessageType().getFullName();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you are using type equality vs descriptor equality? (I haven't checked but I would think .Equals would work? Maybe add a comment if this is intentional.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I'll change it to compare the Descriptor directly to the Wrapper's Descriptor rather than the name.

@shangxinli

Copy link
Copy Markdown
Contributor

@mwong38 Can you address the feedback from @emkornfield before we can merge?

@sheinbergon

Copy link
Copy Markdown

@mwong38 anyway I can help with finalizing this PR?

@shangxinli

Copy link
Copy Markdown
Contributor

I think we are close to merge this PR. Resolve the conflict and use the imports , then we can merge.

@mwong38

mwong38 commented Sep 4, 2022

Copy link
Copy Markdown
Contributor Author

I think we are close to merge this PR. Resolve the conflict and use the imports , then we can merge.

Done

@sheinbergon

Copy link
Copy Markdown

@shangxinli any reason why this PR hasn't been merged yet?

@zhaolihe

Copy link
Copy Markdown

any reason why this PR hasn't been merged yet?

@wgtmac

wgtmac commented Oct 18, 2023

Copy link
Copy Markdown
Member

I just noticed this PR and sorry to see it does not check in. @mwong38 Could you try rebase it one last time? Thanks!

@mwong38

mwong38 commented Nov 26, 2023

Copy link
Copy Markdown
Contributor Author

I just noticed this PR and sorry to see it does not check in. @mwong38 Could you try rebase it one last time? Thanks!

Done. I really hope it's the last time.

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mwong38!

@shangxinli @emkornfield I see you have reviewed this PR and there is no more comment to resolve. Do you need to take a look for another pass?

@shangxinli shangxinli merged commit c8487c7 into apache:master Nov 27, 2023
@shangxinli

Copy link
Copy Markdown
Contributor

Merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants