-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
PERF: ArrowExtensionArray.to_numpy #49973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/arrays/arrow/array.py
Outdated
|
||
pa_type = self._data.type | ||
if pa.types.is_timestamp(pa_type) or pa.types.is_duration(pa_type): | ||
result = np.array(self.tolist(), dtype=np.object_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we generally not use self._data.to_numpy()
?: https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.to_numpy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've looked at it and I think it would only be usable for a subset of cases. The inability to specify dtype
and na_value
make it difficult for general usage in this interface. Also, need to be careful with things like timestamps as it will strip off the timezone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. Yeah where EA.to_numpy()
has more functionality (dtype and na_value) than pyarrow's to_numpy()
I agree we need to have out custom behavior.
But otherwise I would lean towards using self._data.to_numpy()
. IMO at this stage of the ArrowExtensionArray it's better to let pyarrow behavior pass through (and potentially improve) rather than modifying it's behavior and possibly diverging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with trying to limit divergence from pyarrow.
I was attempting to maintain status quo behavior with better perf here. Current default behavior of ArrowExtensionArray.to_numpy
is mostly consistent with other masked/nullable pandas dtypes. If its preferred to let the pyarrow behavior flow through the expected behavior in a number of tests would need to be changed (or xfail)
Here is an example where behavior differs between the current ArrowExtensionArray.to_numpy
and pyarrow's to_numpy:
import pandas as pd
data = [1, None]
pd.array(data, dtype="Int64").to_numpy() # -> array([1, <NA>], dtype=object)
pd.array(data, dtype="int64[pyarrow]").to_numpy() # -> array([1, <NA>], dtype=object)
pd.array(data, dtype="int64[pyarrow]")._data.to_numpy() # -> array([1.0, nan])
Let me know if you have thoughts or suggestions. Happy to modify or close this if you think it needs more consideration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll also note that np.array(pa_array)
goes through pyarrow's to_numpy method. pyarrow implements pa.Array.__array__
which is a small wrapper around the to_numpy method.
So these should be equivalent and both go through pa.Array.to_numpy
:
np.array(pa_array)
pa_array.to_numpy(zero_copy_only=False, writable=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. If we go with that change and get closer to pyarrow behavior, is it ok that (by default) to_numpy
will no longer represent missing values with pd.NA (and cast to object)? Thinking specifically about string[pyarrow]
which would behave differently than string[python]
if this change were to be made.
In [2]: pd.array(["foo", pd.NA], dtype="string[python]").to_numpy()
Out[2]: array(['foo', <NA>], dtype=object)
In [3]: pd.array(["foo", pd.NA], dtype="string[pyarrow]").to_numpy()
Out[3]: array(['foo', <NA>], dtype=object)
In [4]: pd.array(["foo", pd.NA], dtype="string[pyarrow]")._data.to_numpy()
Out[4]: array(['foo', None], dtype=object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just pushed the changes discussed above but left ArrowStringArray behavior as is with a TODO note. (see my comment directly above regarding divergence from string[python]
).
Agreed it would be nice if duration and timestamp followed the same default behavior. When I tried it, there were a number of failing tests due to loss of timezone, non-ns units, etc. I'd suggest we leave that for a follow-up.
Avoiding the cast to object further improves the timings in the OP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it ok that (by default) to_numpy will no longer represent missing values with pd.NA (and cast to object)?
Oh sorry, I think one aspect we want to make consistent is that missing values from to_numpy
get returned as pd.NA
(if na_value
isn't specified) so then object
dtype would be necessary i.e. we should diverge from pyarrow (unfortunately) when the data contains NA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with tackling the duration and timestamp stuff in a followup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. updated to use pd.NA by default for missing values.
pandas/tests/extension/test_arrow.py
Outdated
@@ -1406,3 +1406,20 @@ def test_astype_from_non_pyarrow(data): | |||
assert not isinstance(pd_array.dtype, ArrowDtype) | |||
assert isinstance(result.dtype, ArrowDtype) | |||
tm.assert_extension_array_equal(result, data) | |||
|
|||
|
|||
def test_to_numpy_with_defaults(data, request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Looks like request
isn't needed anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed it. thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment otherwise LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM can merge on green
doc/source/whatsnew/v2.0.0.rst
file if fixing a bug or adding a new feature.Moved and refactored
to_numpy
fromArrowStringArray
toArrowExtensionArray
to improve performance for other pyarrow types.This results in perf improvements for a number of ops that go through numpy: