BUG: Write compliant Parquet with pyarrow #43690


Closed · wants to merge 6 commits

Conversation

@judahrand commented Sep 21, 2021

@jreback (Contributor) left a comment


Umm, we can discuss updating the pyarrow version, though I think we just did.

@jreback (Contributor) commented Sep 22, 2021

Please try to solve this conditionally based on the pyarrow version.

@judahrand (Author) commented Sep 22, 2021

Doesn't it feel a bit odd to have a different output format depending on the version of PyArrow that just happens to be installed?

(I believe fastparquet doesn't matter, as it doesn't support any nested types.)

@judahrand judahrand force-pushed the compliant-parquet branch 3 times, most recently from 6837a30 to 996452e Compare September 22, 2021 08:59
@pep8speaks commented Sep 22, 2021

Hello @judahrand! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-30 21:10:49 UTC

@judahrand judahrand requested a review from jreback September 22, 2021 09:06
@judahrand judahrand force-pushed the compliant-parquet branch 2 times, most recently from 20e8518 to 2c404b6 Compare September 22, 2021 09:12
@judahrand (Author)

Do you think this is something which can / should be back ported to 1.3.4 or should it only appear in 1.4.0?


(review thread on the whatsnew entry: notable_bug_fix3, "Write compliant Parquet nested types if possible")
@jreback (Contributor)


You don't need to add a whole sub-section on this; a single-line note is fine.

@judahrand (Author), Sep 22, 2021


I just figured that since it could break people's stuff it was worth calling out, but I'll remove it if you'd prefer?

Let me know.

@jreback (Contributor), Sep 22, 2021


The entire point is that this won't break anyone; yes, just a single entry is fine.

@judahrand (Author)

Do you want to expand on this? This change alters the data that pandas outputs. If a user is expecting and handling the current output (not necessarily reading it back into pandas, since Parquet is a cross-platform format), then this could break their code elsewhere.

Am I misunderstanding your point?

@jreback (Contributor)

Just a single line is fine here.

@jreback (Contributor) commented Sep 22, 2021

Do you think this is something which can / should be back ported to 1.3.4 or should it only appear in 1.4.0?

no

@jreback jreback added IO Parquet parquet, feather Bug labels Sep 22, 2021
@judahrand judahrand force-pushed the compliant-parquet branch 2 times, most recently from a0149d8 to 52e579c Compare September 22, 2021 18:15
df.to_parquet(path, pa)  # `pa` is the test fixture selecting the pyarrow engine
result = pyarrow.parquet.read_table(path)

assert str(result.schema.field_by_name("a").type) == "list<element: int64>"
@jreback (Contributor)

Use check_round_trip(df, pa), as this will assert the results.

@judahrand (Author), Sep 22, 2021


That isn't what I'm testing, though... I'm testing that the Parquet that is written is correct...

i.e. "list<element: int64>", not "list<item: int64>".

The serializer and deserializer always worked.

@jreback (Contributor)

Well, if the serialize and de-serialize work, then I am not sure I understand the problem. You can certainly assert the metadata if you want. The most important point is that you need this option to have correct results, yes?

@judahrand (Author), Sep 22, 2021


The problem is that to_parquet doesn't write compliant Parquet data. Parquet is a 'standard', after all, and pandas doesn't follow the standard.

What if I want to load data from Pandas into a different language?

@judahrand (Author), Sep 22, 2021


Here is an example of a place where this is an issue: googleapis/python-bigquery#980
And the use_compliant_nested_type flag has to be passed manually.

Really, use_compliant_nested_type should be the default in PyArrow, but it isn't, for compatibility reasons. It's all in the Arrow PR I linked.

@judahrand (Author), Sep 22, 2021


In summary: yes, use_compliant_nested_type is needed in order to output the correct, compliant Parquet format.

But then that comes back around to changing the output format, and hence my point about calling users' attention to it in What's New. If they're using the data elsewhere (not pandas), it could cause them problems.

@judahrand (Author)

Test timed out rather than failed.

@judahrand judahrand requested a review from jreback September 23, 2021 06:54

(review thread on the whatsnew entry: notable_bug_fix3, "Write compliant Parquet nested types if possible")
@jreback (Contributor)

Just a single line is fine here.

@judahrand (Author)

@jreback I've found your feedback on this change quite unhelpful so far. Are you happy with this change in general? I'm not going to put more time into this if the change isn't likely to be merged anyway. I've still not really had any acknowledgement that you understand the problem that this change addresses. Do you?

If I adjust the release note to a single entry and get the tests passing are you happy with this change?

@jreback (Contributor) commented Sep 29, 2021

@jreback I've found your feedback on this change quite unhelpful so far. Are you happy with this change in general? I'm not going to put more time into this if the change isn't likely to be merged anyway. I've still not really had any acknowledgement that you understand the problem that this change addresses. Do you?

If I adjust the release note to a single entry and get the tests passing are you happy with this change?

@judahrand yes the change is acceptable for code & tests. a 1-line whatsnew note is all that is needed here.

P.S. A helpful attitude is a good thing. We have many PRs.

@judahrand (Author) commented Sep 29, 2021

P.S. A helpful attitude is a good thing. We have many PRs.

Completely agree.

@judahrand yes the change is acceptable for code & tests. a 1-line whatsnew note is all that is needed here.

Great, I'll deal with this when I get a chance.

@judahrand judahrand requested a review from jreback September 30, 2021 21:00
@jreback jreback added this to the 1.4 milestone Sep 30, 2021
@jreback (Contributor) commented Sep 30, 2021

@jorisvandenbossche if good here

@jorisvandenbossche (Member)

Thanks for the ping.

My first reaction is: if we think this is the better default, we should bring that up to change it in pyarrow, not in pandas (otherwise you get inconsistencies between using pandas and using pyarrow directly, e.g. with dask, which can use pyarrow directly instead of going through the pandas layer).
In https://issues.apache.org/jira/browse/ARROW-11497 it was actually also mentioned that, long term, Arrow would like to switch to conforming to the spec by default, but in the short term it only exposed this keyword.

Now, it's certainly harder / a higher barrier to change this in Arrow (and you've already put a lot of effort into this PR, thanks for that!), but I would personally still prefer to at least bring it up first and see what the reaction of the other Arrow devs is.

@jorisvandenbossche (Member)

I opened https://issues.apache.org/jira/browse/ARROW-14196 for this

@judahrand (Author)

I opened https://issues.apache.org/jira/browse/ARROW-14196 for this

Haha! I didn't clock that it had only just been opened when I commented on it!

@github-actions (bot) commented Nov 2, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 2, 2021
@jreback (Contributor) commented Nov 28, 2021

Closing, as this is going to be addressed in pyarrow, as discussed above.

@jreback jreback closed this Nov 28, 2021
Successfully merging this pull request may close these issues.

BUG: DataFrame().to_parquet() does not write Parquet compliant data for nested arrays
5 participants