TST: Fix test_parquet failures for pyarrow 1.0 #35814
Conversation
Fix tests for pyarrow 1.0.0
Revert "Add new core members" (this reverts commit 7ef7c12)
Might be best asking @jorisvandenbossche about the intended behaviour.
@alimcmaster1 thanks for looking into this! Yes, with pyarrow 1.0 and the new datasets implementation, we kept the category-type default if your partition field is a string, but not for integers (and indeed the default is now int32, not int64). So I suppose our other roundtrip tests are using string partition fields (since this test is the only one that is failing).
Actually, it doesn't seem we have other full roundtrip tests that use partitioning (only tests checking that it is written correctly).
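To make the described change concrete, here is a minimal sketch (not part of this PR, assuming pyarrow >= 1.0 and the pyarrow engine) of a pandas roundtrip with one string and one integer partition field:

```python
import tempfile

import pandas as pd

df = pd.DataFrame(
    {
        "part_str": ["a", "a", "b"],  # string partition field
        "part_int": [1, 1, 2],        # integer partition field
        "value": [1.0, 2.0, 3.0],
    }
)

with tempfile.TemporaryDirectory() as path:
    df.to_parquet(path, engine="pyarrow", partition_cols=["part_str", "part_int"])
    result = pd.read_parquet(path, engine="pyarrow")

# With pyarrow >= 1.0 (read_table uses the new datasets API):
#   part_str comes back as category (unchanged behaviour)
#   part_int comes back as plain int32 (previously category, built from int64 values)
print(result.dtypes)
```

With older pyarrow, both partition columns would be read back as categorical.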
pandas/tests/io/test_parquet.py (outdated diff)
    expected_df = df_compat.copy()

    # read_table uses the new Arrow Datasets API since pyarrow 1.0.0
    # Previous behaviour was pyarrow partitioned columns become 'categorical' dtypes
So only for integer columns did the behaviour change; for string columns it still uses the category type.
Gotcha, makes sense - thanks for the info!
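For reference, a rough sketch of how the test's expectation might be made version-dependent (this is not the actual diff in the PR; the partition column name "A" and the LooseVersion check are assumptions, and `df_compat` below is a stand-in for the fixture of the same name):

```python
from distutils.version import LooseVersion

import pandas as pd
import pyarrow

# Stand-in for the df_compat fixture used in pandas/tests/io/test_parquet.py.
df_compat = pd.DataFrame({"A": [1, 2, 3], "B": "foo"})

partition_col = "A"  # assumed: an integer column used as the partition field
expected_df = df_compat.copy()

if LooseVersion(pyarrow.__version__) >= LooseVersion("1.0.0"):
    # New datasets API: integer partition fields come back as plain int32.
    expected_df[partition_col] = expected_df[partition_col].astype("int32")
else:
    # Legacy behaviour: partition fields come back as categorical.
    expected_df[partition_col] = expected_df[partition_col].astype("category")
```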
@alimcmaster1 can you rebase? I marked this for 1.1.2, which I think is ok (not sure if it is failing on that).
@jorisvandenbossche lgtm.
@meeseeksdev backport to 1.1.x
Owee, I'm MrMeeseeks, look at me. There seems to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! If these instructions are inaccurate, feel free to suggest an improvement.
Co-authored-by: Ali McMaster <[email protected]>
`black pandas`
`git diff upstream/master -u -- "*.py" | flake8 --diff`
cc @martindurant -> looks like this is a pyarrow 1.0.0 compat issue (read_table uses the new API) - https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
I noticed the partition cols are cast from int64 -> int32; is that expected pyarrow behaviour? Looking at the version 1.0/2.0 options in the write_table docs, it suggests it is: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
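One way to check that at the pyarrow level (a sketch under the assumption of pyarrow >= 1.0 and hive-style directory partitioning, not taken from the PR):

```python
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"part": [1, 1, 2], "value": [1.0, 2.0, 3.0]})

with tempfile.TemporaryDirectory() as path:
    pq.write_to_dataset(table, root_path=path, partition_cols=["part"])
    read_back = pq.read_table(path)

# The partition field is re-inferred from the directory names; with the new
# datasets API it comes back as int32 rather than the original int64.
print(read_back.schema.field("part").type)
```

On older pyarrow, the legacy reader returned the partition field as a dictionary (categorical) column instead.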