Skip to content

BUG: avoid attribute error with pyarrow >=0.16.0 and <1.0.0 #38803

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 5, 2021

Conversation

ADraginda
Copy link
Contributor

@ADraginda ADraginda commented Dec 30, 2020

Problem: The minimum pyarrow listed as an optional dependency is 0.15.1. Pyarrow added the compute module that is imported in the change here with pyarrow 0.16.0 but attributes imported by pandas are not available in that module until 1.0.0 thus anyone using pyarrow [0.16.0, 1.0.0) will get an Attribute error.

This PR adds as try/expect around accessing the attributes such that they are only accessed if available (i.e. pyarrow >=1.0.0)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should just raise the min for string_arrow to 1.0 rather than do this

we do have multiple builds testing these versions of pyarrow, what exactly is failing to cause this change?

cc @jorisvandenbossche @xhochy @simonjayhawkins

@jreback jreback added Dependencies Required and optional dependencies ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data labels Dec 30, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.2.1 milestone Dec 30, 2020
@simonjayhawkins
Copy link
Member

Thanks @ADraginda for the report and the PR.

to having testing for this, and prevent future issues, we should pin one of the ci environments to the >=0.16.0 and <1.0.0 range.

indeed, creation of the ARROW_CMP_FUNCS was moved to the module level, #35259 (comment), but was initially guarded with a version check (in object creation).

the try/except is OK, or we could duplicate the version check or refactor the version check out to avoid duplication.

@jorisvandenbossche
Copy link
Member

@ADraginda Thanks for the PR! Your analysis sounds correct, and this seems a good fix for it.

Now, I am also curious how you ran into it, because in principle we should only import this file (at least in the current version of pandas) if you are explicitly importing it, since it's not yet being used internally or publicly exposed.

(but your fix is useful anyway, because once we will start exposing it publicly, it shouldn't error like that when having pyarrow 0.16 installed)

@simonjayhawkins
Copy link
Member

Now, I am also curious how you ran into it, because in principle we should only import this file (at least in the current version of pandas) if you are explicitly importing it

we have special casing in pandas/core/arrays/base.py that imports ArrowStringDtype

@jorisvandenbossche
Copy link
Member

Ah, indeed, I forgot about that one (so runtime, I was only thinking about on import)

@jorisvandenbossche
Copy link
Member

to having testing for this, and prevent future issues, we should pin one of the ci environments to the >=0.16.0 and <1.0.0 range.

The actions-37-locale.yaml now has >=0.17 (which results in 1.0.1 currently, but that version is pinned elsewhere), so maybe can change this to a pin to 0.17 (=0.17):

- pyarrow>=0.17

"le": pc.less_equal,
"ge": pc.greater_equal,
}
# pyarrow 0.16.0 adds a compute module (thus the above compute import
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than a try/except why don't we just simply check the version at this point. this nested try/except is very messy. or way better is just put a min of 1.0.0 on this functionaility.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed that on the original PR. The version check already happens elsewhere when actually trying to create an array (and already uses a minimum version of 1.0.0), so it's not needed to do a version check here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember, and obviously this is fragile. so this check needs to happen here as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still -1 on this soln. i would just check that we are >= 1.0 and call it a day.

@ADraginda
Copy link
Contributor Author

@jorisvandenbossche Thanks for the review! I added the pin from >= to ==

@simonjayhawkins I'd prefer to punt on a refactor in favor of passing off to someone who knows this code better if that's ok.

@jreback I'm hoping to avoid a 1.0.0 minimum version if at all possible.
You can reproduce with pyarrow 0.17.1 (that's what I'm on, but breaks for other versions too) with the following example. Seems to be related to merging tow tables, one on a column of type string and the other of type object.

import pandas as pd

example_dict = {'i': {0: 'foo'}, 'j': {0: 1}}

foo = pd.DataFrame(example_dict)
foo['i'] = foo['i'].astype(pd.StringDtype.name)
foo['j'] = foo['j'].astype(pd.Int64Dtype.name)

bar = pd.DataFrame(example_dict)
bar['j'] = bar['j'].astype(pd.Int64Dtype.name)

foo.merge(bar)

@jorisvandenbossche
Copy link
Member

Hmm, apparently my suggestion of pinning to 0.17 in that environment was not as simple as that .. as it gives an unsolvable environment. Will do a check locally of what might work.

@simonjayhawkins
Copy link
Member

@jorisvandenbossche

=========================== short test summary info ============================
FAILED pandas/tests/io/test_user_agent.py::test_server_and_default_headers[ParquetPyArrowUserAgentResponder-read_parquet-34268-pyarrow]
FAILED pandas/tests/io/test_user_agent.py::test_server_and_custom_headers[JSONUserAgentResponder-read_json-34264-None]
FAILED pandas/tests/io/test_user_agent.py::test_server_and_custom_headers[ParquetPyArrowUserAgentResponder-read_parquet-34270-pyarrow]
FAILED pandas/tests/io/test_user_agent.py::test_server_and_custom_headers[GzippedJSONUserAgentResponder-read_json-34266-None]
= 4 failed, 136679 passed, 14311 skipped, 1054 xfailed, 63 xpassed in 1148.83s (0:19:08) =

@jorisvandenbossche
Copy link
Member

Hmm the build we changed gives a different failure

FAILED pandas/tests/io/test_parquet.py::TestParquetPyArrow::test_filter_row_groups

I think that test can be skipped for pyarrow 0.17, it's complaining about file-like object not yet being implemented.

@jorisvandenbossche
Copy link
Member

I think the other failures on Azure are unrelated (it's not impacted by this PR), probably a flaky test that can be addressed elsewhere. Let's restart the build

@jreback
Copy link
Contributor

jreback commented Dec 31, 2020

@jorisvandenbossche Thanks for the review! I added the pin from >= to ==

@simonjayhawkins I'd prefer to punt on a refactor in favor of passing off to someone who knows this code better if that's ok.

@jreback I'm hoping to avoid a 1.0.0 minimum version if at all possible.
You can reproduce with pyarrow 0.17.1 (that's what I'm on, but breaks for other versions too) with the following example. Seems to be related to merging tow tables, one on a column of type string and the other of type object.

import pandas as pd

example_dict = {'i': {0: 'foo'}, 'j': {0: 1}}

foo = pd.DataFrame(example_dict)
foo['i'] = foo['i'].astype(pd.StringDtype.name)
foo['j'] = foo['j'].astype(pd.Int64Dtype.name)

bar = pd.DataFrame(example_dict)
bar['j'] = bar['j'].astype(pd.Int64Dtype.name)

foo.merge(bar)

ok pls add this as a test.

I don't know why anyone is trying to use StringArray before 1.0 but we need a test for this in any event.

put in tests/reshape/merge somewhere

@ADraginda
Copy link
Contributor Author

@jreback good point. The root of it all is actually just this call:

from pandas.core.arrays.string_arrow import ArrowStringDtype

So a test for the merge fn feels a bit incorrect for a unit test to me. Moreover, this import is already in the tests so is covered I think?

Tests: I fixed a failing test that had the wrong minimum version of pyarrow for what the test wanted to accomplish (test_filter_row_groups in test_parquet)

@@ -896,7 +896,7 @@ def test_timezone_aware_index(self, pa, timezone_aware_date_list):
# this use-case sets the resolution to 1 minute
check_round_trip(df, pa, check_dtype=False)

@td.skip_if_no("pyarrow", min_version="0.17")
@td.skip_if_no("pyarrow", min_version="1.0.0")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this mode of read_parquet was not available until 1.0.0

"le": pc.less_equal,
"ge": pc.greater_equal,
}
# pyarrow 0.16.0 adds a compute module (thus the above compute import
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still -1 on this soln. i would just check that we are >= 1.0 and call it a day.

@jorisvandenbossche
Copy link
Member

So a test for the merge fn feels a bit incorrect for a unit test to me. Moreover, this import is already in the tests so is covered I think?

Yes, this is already covered by the StringArray astype tests. We only didn't catch it, because we didn't have any CI build with this specific version of pyarrow. So no need to add an additional test.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @ADraginda can you add a whatsnew note for 1.2.1

can you put in bug fixes I/O section. ping on green.

update pyarrow

Update ci/deps/actions-37-locale.yaml

Co-authored-by: Joris Van den Bossche <[email protected]>

fixup

fixup

fixup
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates!

@jorisvandenbossche jorisvandenbossche changed the title BUG: GH38801 avoid attribute error with pyarrow >=0.16.0 and <1.0.0 BUG: avoid attribute error with pyarrow >=0.16.0 and <1.0.0 Jan 5, 2021
@jorisvandenbossche jorisvandenbossche merged commit 6168f99 into pandas-dev:master Jan 5, 2021
@simonjayhawkins
Copy link
Member

@meeseeksdev backport 1.2.x

@simonjayhawkins
Copy link
Member

can you put in bug fixes I/O section. ping on green.

for recent patch releases we have tended not to split into sub-sections, just regressions, bug-fixes and maybe other.

@jorisvandenbossche
Copy link
Member

(yep, since there were already others, I thought we have to clean it up anyway, so not worth doing another round of changes here)

jreback pushed a commit that referenced this pull request Jan 5, 2021
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dependencies Required and optional dependencies ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pandas 1.2.0 and Pyarrow [0.16.0, 1.0.0) are incompatible for some column types
4 participants