
PERF: ArrowExtensionArray.to_numpy #49973


Merged: 10 commits into pandas-dev:main, Dec 16, 2022

Conversation

@lukemanley (Member) commented on Nov 30, 2022:

Moved and refactored to_numpy from ArrowStringArray to ArrowExtensionArray to improve performance for other pyarrow types.

       before           after         ratio
     <main>           <arrow-ea-numpy>
-      63.3±0.5ms       3.17±0.2ms     0.05  array.ArrowExtensionArray.time_to_numpy('float64[pyarrow]', True)
-        63.8±1ms       3.10±0.2ms     0.05  array.ArrowExtensionArray.time_to_numpy('int64[pyarrow]', True)
-        62.1±1ms      1.42±0.02ms     0.02  array.ArrowExtensionArray.time_to_numpy('boolean[pyarrow]', True)
-        31.2±1ms        122±0.3μs     0.00  array.ArrowExtensionArray.time_to_numpy('boolean[pyarrow]', False)
-      31.0±0.4ms       33.9±0.8μs     0.00  array.ArrowExtensionArray.time_to_numpy('int64[pyarrow]', False)
-        31.7±1ms         30.6±1μs     0.00  array.ArrowExtensionArray.time_to_numpy('float64[pyarrow]', False)
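
For context, the call being timed is just the ExtensionArray-level conversion; a rough, hypothetical reproduction of one case looks like this (the array size is arbitrary and not the actual asv setup):

import numpy as np
import pandas as pd

arr = pd.array(np.random.randn(10**6), dtype="float64[pyarrow]")  # hypothetical size

%timeit arr.to_numpy()   # the conversion benchmarked above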

This results in perf improvements for a number of ops that go through numpy:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10**5, 10), dtype="float64[pyarrow]")

%timeit df.corr()
402 ms ± 17.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     <- main
26 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)    <- PR

%timeit df.rolling(100).sum()
378 ms ± 9.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     <- main
15.2 ms ± 503 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR

@lukemanley added the Performance and Arrow labels on Nov 30, 2022

pa_type = self._data.type
if pa.types.is_timestamp(pa_type) or pa.types.is_duration(pa_type):
    result = np.array(self.tolist(), dtype=np.object_)
Member Author:

I've looked at it and I think it would only be usable for a subset of cases. The inability to specify dtype and na_value make it difficult for general usage in this interface. Also, need to be careful with things like timestamps as it will strip off the timezone.
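
For example, a quick illustration of the timezone point (not code from this PR; pa.Array.to_numpy only takes zero_copy_only and writable, so there is no way to request a particular dtype or fill value):

import pandas as pd
import pyarrow as pa

arr = pa.array(pd.to_datetime(["2022-11-30 12:00"]).tz_localize("US/Eastern"))
arr.type                             # timestamp[ns, tz=US/Eastern]
arr.to_numpy(zero_copy_only=False)   # numpy has no tz-aware dtype, so the timezone is stripped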

Member:

Gotcha. Yeah, where EA.to_numpy() has more functionality (dtype and na_value) than pyarrow's to_numpy(), I agree we need to have our custom behavior.

But otherwise I would lean towards using self._data.to_numpy(). IMO, at this stage of the ArrowExtensionArray it's better to let pyarrow behavior pass through (and potentially improve) rather than modifying its behavior and possibly diverging.

Member Author:

Agreed with trying to limit divergence from pyarrow.

I was attempting to maintain status quo behavior with better perf here. The current default behavior of ArrowExtensionArray.to_numpy is mostly consistent with the other masked/nullable pandas dtypes. If it's preferred to let the pyarrow behavior flow through, the expected behavior in a number of tests would need to be changed (or xfailed).

Here is an example where behavior differs between the current ArrowExtensionArray.to_numpy and pyarrow's to_numpy:

import pandas as pd

data = [1, None]

pd.array(data, dtype="Int64").to_numpy()                    # -> array([1, <NA>], dtype=object)
pd.array(data, dtype="int64[pyarrow]").to_numpy()           # -> array([1, <NA>], dtype=object)
pd.array(data, dtype="int64[pyarrow]")._data.to_numpy()     # -> array([1.0, nan])

Let me know if you have thoughts or suggestions. Happy to modify or close this if you think it needs more consideration.

Member Author:

I'll also note that np.array(pa_array) goes through pyarrow's to_numpy method: pyarrow implements pa.Array.__array__, which is a small wrapper around the to_numpy method.

So these should be equivalent and both go through pa.Array.to_numpy:

np.array(pa_array)
pa_array.to_numpy(zero_copy_only=False, writable=True)
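
A quick sanity check of that equivalence (illustrative only, not from the PR):

import numpy as np
import pyarrow as pa

arr = pa.array([1.0, 2.5, None])
a = np.array(arr)                                      # goes through pa.Array.__array__
b = arr.to_numpy(zero_copy_only=False, writable=True)
np.testing.assert_array_equal(a, b)                    # both: array([1. , 2.5, nan])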

Member Author:

Sounds good. If we go with that change and get closer to pyarrow behavior, is it ok that (by default) to_numpy will no longer represent missing values with pd.NA (and cast to object)? Thinking specifically about string[pyarrow], which would behave differently from string[python] if this change were made.

In [2]: pd.array(["foo", pd.NA], dtype="string[python]").to_numpy()        
Out[2]: array(['foo', <NA>], dtype=object)

In [3]: pd.array(["foo", pd.NA], dtype="string[pyarrow]").to_numpy()        
Out[3]: array(['foo', <NA>], dtype=object)

In [4]: pd.array(["foo", pd.NA], dtype="string[pyarrow]")._data.to_numpy()
Out[4]: array(['foo', None], dtype=object)

Member Author:

I just pushed the changes discussed above but left ArrowStringArray behavior as-is with a TODO note (see my comment directly above regarding divergence from string[python]).

Agreed it would be nice if duration and timestamp followed the same default behavior. When I tried it, there were a number of failing tests due to loss of timezone, non-ns units, etc. I'd suggest we leave that for a follow-up.

Avoiding the cast to object further improves the timings in the OP.

Member:

> is it ok that (by default) to_numpy will no longer represent missing values with pd.NA (and cast to object)?

Oh sorry, I think one aspect we want to make consistent is that missing values from to_numpy get returned as pd.NA (if na_value isn't specified), so object dtype would then be necessary, i.e. we should diverge from pyarrow (unfortunately) when the data contains NA.
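
Roughly, that default amounts to something like the following sketch (illustrative only, not the actual PR code; the helper name is made up):

import numpy as np
import pandas as pd
import pyarrow as pa

def arrow_to_numpy_sketch(arr: pa.Array, na_value=pd.NA) -> np.ndarray:
    if arr.null_count > 0:
        # diverge from pyarrow: missing values become pd.NA, which forces object dtype
        result = np.array(arr.to_pylist(), dtype=object)
        result[np.asarray(arr.is_null())] = na_value
        return result
    # no missing values: let pyarrow's conversion pass through unchanged
    return arr.to_numpy(zero_copy_only=False)

arrow_to_numpy_sketch(pa.array([1, None]))   # array([1, <NA>], dtype=object)
arrow_to_numpy_sketch(pa.array([1, 2]))      # array([1, 2])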

Member:

Agreed with tackling the duration and timestamp stuff in a follow-up.

Member Author:

Got it. Updated to use pd.NA by default for missing values.

@mroeschke added this to the 2.0 milestone on Dec 2, 2022
@@ -1406,3 +1406,20 @@ def test_astype_from_non_pyarrow(data):
assert not isinstance(pd_array.dtype, ArrowDtype)
assert isinstance(result.dtype, ArrowDtype)
tm.assert_extension_array_equal(result, data)


def test_to_numpy_with_defaults(data, request):
Member:

Nit: Looks like request isn't needed anymore

Member Author:

Removed it, thanks.

@mroeschke (Member) left a comment:

One comment, otherwise LGTM.

@mroeschke (Member) left a comment:

LGTM, can merge on green.

@lukemanley merged commit 026a83e into pandas-dev:main on Dec 16, 2022
phofl pushed a commit to phofl/pandas that referenced this pull request Dec 17, 2022
@lukemanley deleted the arrow-ea-numpy branch on December 20, 2022