BUG: Write compliant Parquet with pyarrow
#43690
Conversation
umm we can discuss updating the pyarrow version, though i think we just did
pls try to solve this conditionally based on the pyarrow version
Doesn't it feel a bit odd to have a different output format depending on the version of PyArrow that just happens to be installed? (I believe …)
Hello @judahrand! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-09-30 21:10:49 UTC
Do you think this is something which can / should be backported to …?
doc/source/whatsnew/v1.4.0.rst (Outdated)
notable_bug_fix3
Write compliant Parquet nested types if possible
you don't need to add a whole sub-section on this. a single line note is fine.
I just figured that since it could break people's stuff it was worth calling out but shall remove if you'd prefer?
Let me know.
the entire point is this won't break anyone, yes just a single entry is fine
Do you want to expand on this? This change alters the data that Pandas outputs. If a user is expecting and handling the current output (not necessarily reading it back into Pandas, since Parquet is a cross-platform format), then this could break their code elsewhere?
Am I misunderstanding your point?
just a single line is fine here
no
df.to_parquet(path, pa)
result = pyarrow.parquet.read_table(path)

assert str(result.schema.field_by_name("a").type) == "list<element: int64>"
use check_round_trip(df, pa)
as this will assert the results
That isn't what I'm testing, though... I'm testing that the Parquet that is written is correct...
i.e. "list<element: int64>", not "list<item: int64>".
The serializer and deserializer always worked.
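To spell out what the test is checking, here is a minimal standalone sketch (the file path and column name are illustrative, and it assumes a pyarrow version where the non-compliant naming is still the write default):

import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"a": [[1, 2], [3, 4]]})
df.to_parquet("nested.parquet", engine="pyarrow")

# Reading the file back into pandas works, so the round trip itself is fine.
roundtripped = pd.read_parquet("nested.parquet")

# The on-disk schema, however, names the repeated field "item" rather than
# the "element" required by the Parquet LogicalTypes spec.
print(pq.read_table("nested.parquet").schema.field("a").type)
# -> list<item: int64> instead of list<element: int64>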
well if the serialization & de-serialization work then i am not sure i understand the problem. you can certainly assert the metadata if you want. the most important point is that you need this option to have correct results, yes?
The problem is that to_parquet doesn't write Parquet-compliant data. Parquet is a 'standard', after all, and Pandas doesn't follow it.
What if I want to load data written from Pandas into a different language?
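For anyone who needs compliant files before a fix lands, a hedged sketch of the manual workaround: it assumes pyarrow >= 4.0.0 (which introduced the option) and relies on to_parquet forwarding extra keyword arguments to pyarrow.parquet.write_table; the path and column are illustrative.

import pandas as pd

df = pd.DataFrame({"a": [[1, 2], [3, 4]]})

# Extra keyword arguments are passed through to pyarrow.parquet.write_table,
# so the spec-compliant nested naming can be requested explicitly.
df.to_parquet(
    "nested.parquet",
    engine="pyarrow",
    use_compliant_nested_type=True,
)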
Here is an example of a place where this is an issue: googleapis/python-bigquery#980. And use_compliant_nested_type has to be passed manually.
Really, use_compliant_nested_type should be the default in PyArrow, but it isn't for compatibility reasons. It's all in the Arrow PR I linked.
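The same thing at the pyarrow level, as a sketch (again assuming pyarrow >= 4.0.0, where the option exists but is not the default; table contents and file names are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [[1, 2], [3, 4]]})

# Default behaviour on these versions: the list field is written as "item".
pq.write_table(table, "default.parquet")

# Opting in to the Parquet-spec naming writes the field as "element".
pq.write_table(table, "compliant.parquet", use_compliant_nested_type=True)

print(pq.read_table("default.parquet").schema.field("a").type)
print(pq.read_table("compliant.parquet").schema.field("a").type)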
In summary: yes, use_compliant_nested_type is needed in order to output the correct, compliant Parquet format.
But then that comes back around to changing the output format, and so my point about calling users' attention to that in What's New. If they're using the data elsewhere (not Pandas) it could cause them problems.
Test timed out rather than failed.
doc/source/whatsnew/v1.4.0.rst (Outdated)
notable_bug_fix3
Write compliant Parquet nested types if possible
just a single line is fine here
@jreback I've found your feedback on this change quite unhelpful so far. Are you happy with this change in general? I'm not going to put more time into this if the change isn't likely to be merged anyway. I've still not really had any acknowledgement that you understand the problem that this change addresses. Do you? If I adjust the release note to a single entry and get the tests passing, are you happy with this change?
@judahrand yes the change is acceptable for code & tests. a 1-line whatsnew note is all that is needed here. ps. a helpful attitude is a good thing. We have many PRs.
Great, I'll deal with this when I get a chance.
@jorisvandenbossche if good here
Thanks for the ping. My first reaction is: if we think this is the better default, we should bring that up to change it in pyarrow, and not in pandas (otherwise you get inconsistencies between using pandas and pyarrow, or e.g. when using dask, which can use pyarrow directly instead of going through the pandas layer). Now, it's certainly harder / a higher barrier to change this in Arrow (and you've already put a lot of effort into this PR, thanks for that!), but I would personally still prefer to at least bring it up first and see what the reaction of the other Arrow devs is.
I opened https://issues.apache.org/jira/browse/ARROW-14196 for this
Haha! I didn't clock that it had only just been opened when I commented on it!
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
closing as this is going to be addressed as above in pyarrow
DataFrame().to_parquet() does not write Parquet compliant data for nested arrays #43689