
PERF: ArrowExtensionArray.to_numpy #49973


Merged: 10 commits into pandas-dev:main, Dec 16, 2022

Conversation

@lukemanley (Member) commented on Nov 30, 2022:

Moved and refactored to_numpy from ArrowStringArray to ArrowExtensionArray to improve performance for other pyarrow types.

       before           after         ratio
     <main>           <arrow-ea-numpy>
-      63.3±0.5ms       3.17±0.2ms     0.05  array.ArrowExtensionArray.time_to_numpy('float64[pyarrow]', True)
-        63.8±1ms       3.10±0.2ms     0.05  array.ArrowExtensionArray.time_to_numpy('int64[pyarrow]', True)
-        62.1±1ms      1.42±0.02ms     0.02  array.ArrowExtensionArray.time_to_numpy('boolean[pyarrow]', True)
-        31.2±1ms        122±0.3μs     0.00  array.ArrowExtensionArray.time_to_numpy('boolean[pyarrow]', False)
-      31.0±0.4ms       33.9±0.8μs     0.00  array.ArrowExtensionArray.time_to_numpy('int64[pyarrow]', False)
-        31.7±1ms         30.6±1μs     0.00  array.ArrowExtensionArray.time_to_numpy('float64[pyarrow]', False)
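
For context, the call being timed is just the ExtensionArray-level conversion; a rough, hypothetical reproduction of one case looks like this (the array size is arbitrary and not the actual asv setup):

import numpy as np
import pandas as pd

arr = pd.array(np.random.randn(10**6), dtype="float64[pyarrow]")  # hypothetical size

%timeit arr.to_numpy()   # the conversion benchmarked above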

This results in perf improvements for a number of ops that go through numpy:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10**5, 10), dtype="float64[pyarrow]")

%timeit df.corr()
402 ms ± 17.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     <- main
26 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)    <- PR

%timeit df.rolling(100).sum()
378 ms ± 9.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     <- main
15.2 ms ± 503 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR

@lukemanley added the Performance and Arrow labels on Nov 30, 2022

pa_type = self._data.type
if pa.types.is_timestamp(pa_type) or pa.types.is_duration(pa_type):
    result = np.array(self.tolist(), dtype=np.object_)
Member Author:

I've looked at it and I think it would only be usable for a subset of cases. The inability to specify dtype and na_value make it difficult for general usage in this interface. Also, need to be careful with things like timestamps as it will strip off the timezone.
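
For example, a quick illustration of the timezone point (not code from this PR; pa.Array.to_numpy only takes zero_copy_only and writable, so there is no way to request a particular dtype or fill value):

import pandas as pd
import pyarrow as pa

arr = pa.array(pd.to_datetime(["2022-11-30 12:00"]).tz_localize("US/Eastern"))
arr.type                             # timestamp[ns, tz=US/Eastern]
arr.to_numpy(zero_copy_only=False)   # numpy has no tz-aware dtype, so the timezone is stripped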

Member:

Gotcha. Yeah, where EA.to_numpy() has more functionality (dtype and na_value) than pyarrow's to_numpy(), I agree we need to have our custom behavior.

But otherwise I would lean towards using self._data.to_numpy(). IMO, at this stage of the ArrowExtensionArray it's better to let pyarrow behavior pass through (and potentially improve) rather than modifying its behavior and possibly diverging.

Member Author:

Agreed with trying to limit divergence from pyarrow.

I was attempting to maintain status quo behavior with better perf here. The current default behavior of ArrowExtensionArray.to_numpy is mostly consistent with the other masked/nullable pandas dtypes. If it's preferred to let the pyarrow behavior flow through, the expected behavior in a number of tests would need to be changed (or xfailed).

Here is an example where behavior differs between the current ArrowExtensionArray.to_numpy and pyarrow's to_numpy:

import pandas as pd

data = [1, None]

pd.array(data, dtype="Int64").to_numpy()                    # -> array([1, <NA>], dtype=object)
pd.array(data, dtype="int64[pyarrow]").to_numpy()           # -> array([1, <NA>], dtype=object)
pd.array(data, dtype="int64[pyarrow]")._data.to_numpy()     # -> array([1.0, nan])

Let me know if you have thoughts or suggestions. Happy to modify or close this if you think it needs more consideration.

Member Author:

I'll also note that np.array(pa_array) goes through pyarrow's to_numpy method: pyarrow implements pa.Array.__array__, which is a small wrapper around the to_numpy method.

So these should be equivalent and both go through pa.Array.to_numpy:

np.array(pa_array)
pa_array.to_numpy(zero_copy_only=False, writable=True)
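
A quick sanity check of that equivalence (illustrative only, not from the PR):

import numpy as np
import pyarrow as pa

arr = pa.array([1.0, 2.5, None])
a = np.array(arr)                                      # goes through pa.Array.__array__
b = arr.to_numpy(zero_copy_only=False, writable=True)
np.testing.assert_array_equal(a, b)                    # both: array([1. , 2.5, nan])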

Member Author:

Sounds good. If we go with that change and get closer to pyarrow behavior, is it ok that (by default) to_numpy will no longer represent missing values with pd.NA (and cast to object)? Thinking specifically about string[pyarrow], which would behave differently from string[python] if this change were made.

In [2]: pd.array(["foo", pd.NA], dtype="string[python]").to_numpy()        
Out[2]: array(['foo', <NA>], dtype=object)

In [3]: pd.array(["foo", pd.NA], dtype="string[pyarrow]").to_numpy()        
Out[3]: array(['foo', <NA>], dtype=object)

In [4]: pd.array(["foo", pd.NA], dtype="string[pyarrow]")._data.to_numpy()
Out[4]: array(['foo', None], dtype=object)

Member Author:

I just pushed the changes discussed above but left ArrowStringArray behavior as-is with a TODO note (see my comment directly above regarding divergence from string[python]).

Agreed it would be nice if duration and timestamp followed the same default behavior. When I tried it, there were a number of failing tests due to loss of timezone, non-ns units, etc. I'd suggest we leave that for a follow-up.

Avoiding the cast to object further improves the timings in the OP.

Member:

> is it ok that (by default) to_numpy will no longer represent missing values with pd.NA (and cast to object)?

Oh sorry, I think one aspect we want to make consistent is that missing values from to_numpy get returned as pd.NA (if na_value isn't specified), so object dtype would then be necessary, i.e. we should diverge from pyarrow (unfortunately) when the data contains NA.
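
Roughly, that default amounts to something like the following sketch (illustrative only, not the actual PR code; the helper name is made up):

import numpy as np
import pandas as pd
import pyarrow as pa

def arrow_to_numpy_sketch(arr: pa.Array, na_value=pd.NA) -> np.ndarray:
    if arr.null_count > 0:
        # diverge from pyarrow: missing values become pd.NA, which forces object dtype
        result = np.array(arr.to_pylist(), dtype=object)
        result[np.asarray(arr.is_null())] = na_value
        return result
    # no missing values: let pyarrow's conversion pass through unchanged
    return arr.to_numpy(zero_copy_only=False)

arrow_to_numpy_sketch(pa.array([1, None]))   # array([1, <NA>], dtype=object)
arrow_to_numpy_sketch(pa.array([1, 2]))      # array([1, 2])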

Member:

Agreed with tackling the duration and timestamp stuff in a follow-up.

Member Author:

Got it. Updated to use pd.NA by default for missing values.

@mroeschke added this to the 2.0 milestone on Dec 2, 2022
@@ -1406,3 +1406,20 @@ def test_astype_from_non_pyarrow(data):
assert not isinstance(pd_array.dtype, ArrowDtype)
assert isinstance(result.dtype, ArrowDtype)
tm.assert_extension_array_equal(result, data)


def test_to_numpy_with_defaults(data, request):
Member:

Nit: Looks like request isn't needed anymore

Member Author:

Removed it, thanks.

@mroeschke (Member) left a comment:

One comment, otherwise LGTM.

@mroeschke (Member) left a comment:

LGTM, can merge on green.

@lukemanley merged commit 026a83e into pandas-dev:main on Dec 16, 2022
phofl pushed a commit to phofl/pandas that referenced this pull request Dec 17, 2022
@lukemanley deleted the arrow-ea-numpy branch on December 20, 2022