-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Use pyarrow.compute for unique, dropna #46725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
mroeschke
commented
Apr 10, 2022
•
edited
Loading
edited
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
Thanks @mroeschke for the PR. my understanding of #42613 is that we should no longer be implementing any fallback behavior. (there is not definitive policy on that there, so have not yet removed the fallbacks already implemented for StringArray) That issue applies specifically to StringArray but with the work you have done/doing to have a common base class for pyarrow backed EAs, we may be adding fallback behaviour for the pyarrow backed StringArray? |
Ah thanks, I wasn't aware of this discussion. My prior, related PR's so far have just been moving existing methods, so no fallback behavior should have been introduced I think. For this PR, I will just remove the super calls for now and reintroduce them later if we decide on a different fallback policy |
comments on the other issues is we can show a PerformanceWarning if we are falling back. But let's do that as a pre-cursor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. if you can move the warning tester to a more general location in a followup
pandas/tests/base/test_unique.py
Outdated
@@ -9,10 +14,20 @@ | |||
from pandas.tests.base.common import allow_na_ops | |||
|
|||
|
|||
def maybe_perf_warn(using_pyarrow): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally move to the _test_decorators.py (or similar) e.g. this is a general testing function.
actually if you can do that move in this PR and I think this also needs a whatsnew note. ping on green. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mroeschke lgtm except some questions. any benchmark results?
@@ -37,6 +38,8 @@ | |||
import pyarrow as pa | |||
import pyarrow.compute as pc | |||
|
|||
from pandas.core.arrays.arrow._arrow_utils import fallback_performancewarning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to be guarded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. _arrow_utils
doesn't guard import pyarrow
fallback_performancewarning(version="6") | ||
return super().dropna() | ||
else: | ||
return type(self)(pc.drop_null(self._data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we don't actually dispatch to this method from pandas?
I wonder whether there would be any performance gain if we refactored to call this array method instead? (from Series.dropna for example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not exactly sure what you mean here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see now. Yeah hooking this up to dropna might be a good idea in a future PR
result = obj.unique() | ||
with tm.maybe_produces_warning( | ||
PerformanceWarning, | ||
pa_version_under2p0 and str(index_or_series_obj.dtype) == "string[pyarrow]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that with a pyarrow backed StringSrray, we are only testing Index here, not Series?
Also don't need the str
cast, dtype equality to the string form should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that with a pyarrow backed StringSrray, we are only testing Index here, not Series?
Looks so based on the fixture in pandas/conftest.py
if has_pyarrow:
idx = Index(pd.array(tm.makeStringIndex(100), dtype="string[pyarrow]"))
indices_dict["string-pyarrow"] = idx
Fixed the comparison
Here are the perf differences (@simonjayhawkins is correct that the array's (or any ExtentionArray's?) dropna is not hooked up to PR
main
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mroeschke lgtm
on the casting to string for the dtype equality in the tests, I thought that we should be able to create a pyarrow StringDtype (but not an ArrowStringArray) without pyarrow installed, but I must be wrong here.
Also I think the index_or_series_obj fixture should include arrow backed Series, we don't have any testing for this at the moment. (or for DataFrame)
thanks @mroeschke really nice |