-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: avoid attribute error with pyarrow >=0.16.0 and <1.0.0 #38803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should just raise the min for string_arrow to 1.0 rather than do this
we do have multiple builds testing these versions of pyarrow, what exactly is failing to cause this change?
Thanks @ADraginda for the report and the PR. to having testing for this, and prevent future issues, we should pin one of the ci environments to the >=0.16.0 and <1.0.0 range. indeed, creation of the ARROW_CMP_FUNCS was moved to the module level, #35259 (comment), but was initially guarded with a version check (in object creation). the try/except is OK, or we could duplicate the version check or refactor the version check out to avoid duplication. |
@ADraginda Thanks for the PR! Your analysis sounds correct, and this seems a good fix for it. Now, I am also curious how you ran into it, because in principle we should only import this file (at least in the current version of pandas) if you are explicitly importing it, since it's not yet being used internally or publicly exposed. (but your fix is useful anyway, because once we will start exposing it publicly, it shouldn't error like that when having pyarrow 0.16 installed) |
we have special casing in pandas/core/arrays/base.py that imports ArrowStringDtype |
Ah, indeed, I forgot about that one (so runtime, I was only thinking about on import) |
The pandas/ci/deps/actions-37-locale.yaml Line 33 in 94810d1
|
pandas/core/arrays/string_arrow.py
Outdated
"le": pc.less_equal, | ||
"ge": pc.greater_equal, | ||
} | ||
# pyarrow 0.16.0 adds a compute module (thus the above compute import |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than a try/except why don't we just simply check the version at this point. this nested try/except is very messy. or way better is just put a min of 1.0.0 on this functionaility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed that on the original PR. The version check already happens elsewhere when actually trying to create an array (and already uses a minimum version of 1.0.0), so it's not needed to do a version check here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember, and obviously this is fragile. so this check needs to happen here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still -1 on this soln. i would just check that we are >= 1.0 and call it a day.
4ddfa93
to
71c4cf1
Compare
@jorisvandenbossche Thanks for the review! I added the pin from @simonjayhawkins I'd prefer to punt on a refactor in favor of passing off to someone who knows this code better if that's ok. @jreback I'm hoping to avoid a 1.0.0 minimum version if at all possible. import pandas as pd
example_dict = {'i': {0: 'foo'}, 'j': {0: 1}}
foo = pd.DataFrame(example_dict)
foo['i'] = foo['i'].astype(pd.StringDtype.name)
foo['j'] = foo['j'].astype(pd.Int64Dtype.name)
bar = pd.DataFrame(example_dict)
bar['j'] = bar['j'].astype(pd.Int64Dtype.name)
foo.merge(bar) |
Hmm, apparently my suggestion of pinning to 0.17 in that environment was not as simple as that .. as it gives an unsolvable environment. Will do a check locally of what might work. |
|
Hmm the build we changed gives a different failure
I think that test can be skipped for pyarrow 0.17, it's complaining about file-like object not yet being implemented. |
I think the other failures on Azure are unrelated (it's not impacted by this PR), probably a flaky test that can be addressed elsewhere. Let's restart the build |
ok pls add this as a test. I don't know why anyone is trying to use StringArray before 1.0 but we need a test for this in any event. put in tests/reshape/merge somewhere |
a376809
to
15f9833
Compare
@jreback good point. The root of it all is actually just this call: from pandas.core.arrays.string_arrow import ArrowStringDtype So a test for the merge fn feels a bit incorrect for a unit test to me. Moreover, this import is already in the tests so is covered I think? Tests: I fixed a failing test that had the wrong minimum version of pyarrow for what the test wanted to accomplish (test_filter_row_groups in test_parquet) |
@@ -896,7 +896,7 @@ def test_timezone_aware_index(self, pa, timezone_aware_date_list): | |||
# this use-case sets the resolution to 1 minute | |||
check_round_trip(df, pa, check_dtype=False) | |||
|
|||
@td.skip_if_no("pyarrow", min_version="0.17") | |||
@td.skip_if_no("pyarrow", min_version="1.0.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this mode of read_parquet
was not available until 1.0.0
pandas/core/arrays/string_arrow.py
Outdated
"le": pc.less_equal, | ||
"ge": pc.greater_equal, | ||
} | ||
# pyarrow 0.16.0 adds a compute module (thus the above compute import |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still -1 on this soln. i would just check that we are >= 1.0 and call it a day.
Yes, this is already covered by the StringArray astype tests. We only didn't catch it, because we didn't have any CI build with this specific version of pyarrow. So no need to add an additional test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @ADraginda can you add a whatsnew note for 1.2.1
can you put in bug fixes I/O section. ping on green.
update pyarrow Update ci/deps/actions-37-locale.yaml Co-authored-by: Joris Van den Bossche <[email protected]> fixup fixup fixup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates!
@meeseeksdev backport 1.2.x |
… >=0.16.0 and <1.0.0
for recent patch releases we have tended not to split into sub-sections, just regressions, bug-fixes and maybe other. |
(yep, since there were already others, I thought we have to clean it up anyway, so not worth doing another round of changes here) |
…and <1.0.0 (#38971) Co-authored-by: Ada Draginda <[email protected]>
…andas-dev#38803) Co-authored-by: Joris Van den Bossche <[email protected]>
Problem: The minimum pyarrow listed as an optional dependency is 0.15.1. Pyarrow added the compute module that is imported in the change here with pyarrow 0.16.0 but attributes imported by pandas are not available in that module until 1.0.0 thus anyone using pyarrow [0.16.0, 1.0.0) will get an Attribute error.
This PR adds as try/expect around accessing the attributes such that they are only accessed if available (i.e. pyarrow >=1.0.0)
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff