ENH: Use pyarrow.compute for unique, dropna #46725

mroeschke · 2022-04-10T03:08:10Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.

simonjayhawkins · 2022-04-10T12:22:10Z

Thanks @mroeschke for the PR. My understanding of #42613 is that we should no longer be implementing any fallback behavior. (There is no definitive policy on that there, so I have not yet removed the fallbacks already implemented for StringArray.)

That issue applies specifically to StringArray but with the work you have done/doing to have a common base class for pyarrow backed EAs, we may be adding fallback behaviour for the pyarrow backed StringArray?

simonjayhawkins · 2022-04-10T12:33:33Z

my understanding of #42613 is that ...

more of the discussion was actually in #42597

mroeschke · 2022-04-10T17:48:38Z

Ah thanks, I wasn't aware of this discussion.

My prior, related PRs so far have just moved existing methods, so I think no fallback behavior should have been introduced.

For this PR, I will just remove the super calls for now and reintroduce them later if we decide on a different fallback policy.

jreback · 2022-04-10T19:32:10Z

The comments on the other issues suggest that we can show a PerformanceWarning if we are falling back. But let's do that as a precursor.

jreback

lgtm. if you can move the warning tester to a more general location in a followup

jreback · 2022-04-26T00:24:01Z

pandas/tests/base/test_unique.py

@@ -9,10 +14,20 @@
 from pandas.tests.base.common import allow_na_ops


+def maybe_perf_warn(using_pyarrow):


Ideally move this to _test_decorators.py (or similar), i.e. this is a general testing function.

jreback · 2022-04-26T00:25:12Z

actually if you can do that move in this PR and I think this also needs a whatsnew note. ping on green.

simonjayhawkins

Thanks @mroeschke lgtm except some questions. any benchmark results?

simonjayhawkins · 2022-04-26T10:53:43Z

pandas/core/arrays/arrow/array.py

@@ -37,6 +38,8 @@
    import pyarrow as pa
    import pyarrow.compute as pc

+    from pandas.core.arrays.arrow._arrow_utils import fallback_performancewarning


does this need to be guarded?

I think so. _arrow_utils doesn't guard import pyarrow
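For illustration, guarding an optional import might look like the sketch below. The helper name `optional_import` and the `HAS_PYARROW` flag are hypothetical and are not taken from `_arrow_utils`; pandas itself gates these imports on its own version flags.

```python
import importlib
import importlib.util


def optional_import(name):
    """Return the imported top-level module, or None if it is not installed.

    Hypothetical helper for illustration only.
    """
    if importlib.util.find_spec(name) is None:
        return None
    return importlib.import_module(name)


# pyarrow may or may not be installed; module-level code must not raise.
pa = optional_import("pyarrow")
HAS_PYARROW = pa is not None

if HAS_PYARROW:
    # Anything that itself imports pyarrow belongs inside the guard.
    import pyarrow.compute as pc
```

The point of the guard is that a module which imports pyarrow unconditionally can only be imported from inside a block like the last one.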

simonjayhawkins · 2022-04-26T11:53:46Z

pandas/core/arrays/arrow/array.py

+            fallback_performancewarning(version="6")
+            return super().dropna()
+        else:
+            return type(self)(pc.drop_null(self._data))


so we don't actually dispatch to this method from pandas?

I wonder whether there would be any performance gain if we refactored to call this array method instead? (from Series.dropna for example)

Hmm not exactly sure what you mean here.

Ah I see now. Yeah hooking this up to dropna might be a good idea in a future PR
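A rough sketch of the version-gated fallback pattern in the diff above, written without pandas or pyarrow so it runs anywhere. The `PerformanceWarning` class is a stand-in for `pandas.errors.PerformanceWarning`, the message wording is an assumption, and the list comprehension stands in for both `super().dropna()` and `pc.drop_null`.

```python
import warnings


class PerformanceWarning(Warning):
    """Stand-in for pandas.errors.PerformanceWarning (assumption for this sketch)."""


def fallback_performancewarning(version=None):
    """Warn that a slower, non-pyarrow code path is being used.

    Mirrors the call fallback_performancewarning(version="6") in the diff;
    the message text is assumed, not copied from pandas.
    """
    msg = "Falling back on a non-pyarrow code path which may decrease performance."
    if version is not None:
        msg += f" Upgrade to pyarrow >={version} to possibly suppress this warning."
    warnings.warn(msg, PerformanceWarning, stacklevel=2)


def dropna(values, installed_major, required_major=6):
    """Drop None entries, warning when the installed version forces the slow path."""
    if installed_major < required_major:
        fallback_performancewarning(version=str(required_major))
        # Generic fallback (super().dropna() in the PR).
        return [v for v in values if v is not None]
    # Placeholder for the fast path (pc.drop_null in the PR).
    return [v for v in values if v is not None]
```

Both branches return the same result; only the old-version branch emits a warning, which is what the tests in this PR check for.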

simonjayhawkins · 2022-04-26T13:27:03Z

pandas/tests/base/test_unique.py

-    result = obj.unique()
+    with tm.maybe_produces_warning(
+        PerformanceWarning,
+        pa_version_under2p0 and str(index_or_series_obj.dtype) == "string[pyarrow]",


It appears that with a pyarrow backed StringArray, we are only testing Index here, not Series?

Also don't need the str cast, dtype equality to the string form should work.

It appears that with a pyarrow backed StringArray, we are only testing Index here, not Series?

Looks so based on the fixture in pandas/conftest.py

    if has_pyarrow:
        idx = Index(pd.array(tm.makeStringIndex(100), dtype="string[pyarrow]"))
        indices_dict["string-pyarrow"] = idx

Fixed the comparison
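As a simplified illustration of the conditional-warning helper used in this test, the sketch below requires a warning only when a condition holds. It uses only the standard library and is not the pandas implementation; in particular, it takes no extra keyword arguments and does not check for unexpected warnings when the condition is false.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def maybe_produces_warning(warning, condition):
    """Require `warning` inside the block only when `condition` is true."""
    if not condition:
        # No expectation: run the block unchanged.
        yield
        return
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        yield
    if not any(issubclass(w.category, warning) for w in caught):
        raise AssertionError(f"Did not see expected warning {warning.__name__}")
```

With a helper of this shape, one test body can cover both the old-pyarrow case, where a warning is expected, and the new-pyarrow case, where it is not, without duplicating the assertion logic.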

mroeschke · 2022-04-26T20:24:40Z

Here are the perf differences (@simonjayhawkins is correct that the array's (or any ExtensionArray's?) dropna is not hooked up to Series/Index/DataFrame.dropna() yet)

PR

In [1]: import pyarrow as pa
   ...: pa.__version__
   ...:
Out[1]: '7.0.0'

In [2]: ser = pd.Series(["1", np.nan] * 100_000, dtype="string[pyarrow]")

In [4]: %timeit ser.unique()
   ...:
1.3 ms ± 6.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit ser.array.dropna()
820 µs ± 1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

main

In [1]: ser = pd.Series(["1", np.nan] * 100_000, dtype="string[pyarrow]")
   ...:

In [2]: %timeit ser.unique()
   ...:
24.4 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [2]: %timeit ser.array.dropna()
1.21 ms ± 2.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
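
For a quick comparison, the speedup implied by the mean timings above can be computed as follows. The numbers are copied from the %timeit output; the dictionary layout is only for this sketch.

```python
# Mean timings from the %timeit output above, converted to seconds.
timings = {
    "unique": {"main": 24.4e-3, "pr": 1.3e-3},
    "dropna": {"main": 1.21e-3, "pr": 820e-6},
}

# Ratio of main to PR: values above 1 mean the PR branch is faster.
speedups = {op: t["main"] / t["pr"] for op, t in timings.items()}

for op, ratio in speedups.items():
    print(f"{op}: {ratio:.1f}x faster on the PR branch")
```

On these figures, unique is roughly 19 times faster and array-level dropna roughly 1.5 times faster with pyarrow 7.0.0.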

simonjayhawkins

Thanks @mroeschke lgtm

on the casting to string for the dtype equality in the tests, I thought that we should be able to create a pyarrow StringDtype (but not an ArrowStringArray) without pyarrow installed, but I must be wrong here.

Also I think the index_or_series_obj fixture should include arrow backed Series, we don't have any testing for this at the moment. (or for DataFrame)

jreback · 2022-04-27T22:21:26Z

thanks @mroeschke really nice

ENH: Use pyarrow.compute for unique, dropna

4c010aa

mroeschke added the Arrow pyarrow functionality label Apr 10, 2022

mroeschke added this to the 1.5 milestone Apr 10, 2022

mroeschke added 6 commits April 23, 2022 20:28

Merge remote-tracking branch 'upstream/main' into enh/more_arrow_compute

def3510

Add fallback warning

315f59a

Merge remote-tracking branch 'upstream/main' into enh/more_arrow_compute

3700867

Fix extra warning test

2dc5918

Fix again

ebf62e8

Test some warnings

ea4e9e9

jreback approved these changes Apr 26, 2022

mroeschke added 3 commits April 25, 2022 20:28

Merge remote-tracking branch 'upstream/main' into enh/more_arrow_compute

e254528

Add and use maybe_produces_warning

e2a093f

Add additional issue number

ecefbee

simonjayhawkins reviewed Apr 26, 2022

Address review

7a5d4fb

mroeschke added 3 commits April 26, 2022 21:31

Merge remote-tracking branch 'upstream/main' into enh/more_arrow_compute

527c0b7

Use str again

de815df

revert another check raises an error

40bd857

simonjayhawkins approved these changes Apr 27, 2022

Merge branch 'main' into enh/more_arrow_compute

26b4cdb

jreback merged commit ff51b2f into pandas-dev:main Apr 27, 2022

mroeschke deleted the enh/more_arrow_compute branch April 28, 2022 00:09

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

ENH: Use pyarrow.compute for unique, dropna (pandas-dev#46725)

a55ad54
