TST: Fix test_parquet failures for pyarrow 1.0 #35814


Merged

merged 4 commits into pandas-dev:master on Aug 25, 2020

Conversation

alimcmaster1
Member

cc @martindurant -> looks like this is a pyarrow 1.0.0 compat issue (read_table uses the new API) - https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html

I noticed the partition cols are cast from int64 -> int32. Is that expected pyarrow behaviour? Looking at the write_table docs for version 1.0/2.0, it suggests it is: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table

Fix tests for pyarrow 1.0.0

Revert "Add new core members"

This reverts commit 7ef7c12
@alimcmaster1 alimcmaster1 added Testing pandas testing functions or related to the test suite IO Parquet parquet, feather labels Aug 20, 2020
@martindurant
Contributor

Might be best to ask @jorisvandenbossche about the intended behaviour

@jorisvandenbossche
Member

@alimcmaster1 thanks for looking into this!

Yes, so with pyarrow 1.0 and using the new datasets implementation, we kept the default of category type if your partition field is string, but not for integers (and indeed the default is now int32, not int64)

So I suppose our other roundtrip tests are using string partition fields (since this test is the only one that is failing)

@jorisvandenbossche
Member

So I suppose our other roundtrip tests are using string partition fields (since this test is the only one that is failing)

Actually, it doesn't seem we have other full roundtrip tests that use partitioning (only tests checking that it is written correctly)

expected_df = df_compat.copy()

# read_table uses the new Arrow Datasets API since pyarrow 1.0.0
# Previously, pyarrow partitioned columns became 'category' dtypes
Member

so only for integer columns did the behaviour change; for string columns it still uses category type

@alimcmaster1
Member Author

@alimcmaster1 thanks for looking into this!

Yes, so with pyarrow 1.0 and using the new datasets implementation, we kept the default of category type if your partition field is string, but not for integers (and indeed the default is now int32, not int64)

So I suppose our other roundtrip tests are using string partition fields (since this test is the only one that is failing)

Gotcha makes sense - thanks for the info!

@alimcmaster1 alimcmaster1 changed the title WIP TST: Fix test_parquet failures for pyarrow 1.0 TST: Fix test_parquet failures for pyarrow 1.0 Aug 20, 2020
@jreback jreback added this to the 1.2 milestone Aug 21, 2020
@jreback
Contributor

jreback commented Aug 21, 2020

@alimcmaster1 can you rebase? I marked this for 1.1.2, which I think is OK (not sure if it is failing on that branch).

@jreback jreback modified the milestones: 1.2, 1.1.2 Aug 21, 2020
@jreback
Contributor

jreback commented Aug 24, 2020

@jorisvandenbossche lgtm.

@jorisvandenbossche jorisvandenbossche merged commit d3d74c5 into pandas-dev:master Aug 25, 2020
@jorisvandenbossche
Member

@meeseeksdev backport to 1.1.x

@lumberbot-app

lumberbot-app bot commented Aug 25, 2020

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
$ git checkout 1.1.x
$ git pull
  2. Cherry-pick the first parent of this PR's merge commit on top of the older branch:
$ git cherry-pick -m1 d3d74c590e2578988a2be48d786ddafa89f91454
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
$ git commit -am 'Backport PR #35814: TST: Fix test_parquet failures for pyarrow 1.0'
  4. Push to a named branch:
$ git push YOURFORK 1.1.x:auto-backport-of-pr-35814-on-1.1.x
  5. Create a PR against branch 1.1.x. I would have named this PR:

"Backport PR #35814 on branch 1.1.x"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.
