
REF: use more explicit to_numpy(object) instead of astype(object) in EA implementation #45521


Closed


jorisvandenbossche
Member

Somewhat related to #24877 / #43699, but regardless of whether we want to stop astype returning numpy arrays or not, I think that when we know we want an object-dtype numpy array as the result, using to_numpy(object) makes that more explicit when reading the code.

@jorisvandenbossche jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jan 21, 2022
@jbrockmendel
Member

Can't speak to the general case, but DTA is more performant in astype

In [3]: dti = pd.date_range('2016-01-01', periods=10**5, freq='s', tz="US/Pacific")

In [4]: dta = dti._data

In [5]: %timeit dta.astype(object)
50.8 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit dta.to_numpy(object)
178 ms ± 6.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
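For anyone wanting to re-run this comparison outside IPython, here is a minimal sketch using only public API. It assumes `dti.array` is an acceptable stand-in for the private `dti._data`, and it uses a smaller `N_PERIODS` (a name introduced here, not from the thread) so it finishes quickly. Absolute timings will vary by machine and pandas version, so no expected numbers are given.

```python
import timeit

import pandas as pd

# Assumption: smaller than the 10**5 used above so the script runs quickly.
N_PERIODS = 10**4

dti = pd.date_range("2016-01-01", periods=N_PERIODS, freq="s", tz="US/Pacific")
dta = dti.array  # public accessor for the underlying DatetimeArray


def time_call(func, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


t_astype = time_call(lambda: dta.astype(object))
t_to_numpy = time_call(lambda: dta.to_numpy(object))

print(f"astype(object):   {t_astype * 1e3:.2f} ms")
print(f"to_numpy(object): {t_to_numpy * 1e3:.2f} ms")
```

Taking the minimum over several runs is a deliberate choice here: it reduces noise from other processes, which matters more than the mean when comparing two code paths.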

@jorisvandenbossche
Member Author

Good catch. That seems to be because in datetimelike __array__ we explicitly go through a list, which shouldn't be needed:

    def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
        # used for Timedelta/DatetimeArray, overwritten by PeriodArray
        if is_object_dtype(dtype):
            return np.array(list(self), dtype=object)

We can probably call astype there (or do the same as astype does). astype special-cases object dtype for datetime64 to directly call ints_to_pydatetime:

    if is_object_dtype(dtype):
        if self.dtype.kind == "M":
            # *much* faster than self._box_values
            #  for e.g. test_get_loc_tuple_monotonic_above_size_cutoff
            i8data = self.asi8.ravel()
            converted = ints_to_pydatetime(
                i8data,
                # error: "DatetimeLikeArrayMixin" has no attribute "tz"
                tz=self.tz,  # type: ignore[attr-defined]
                freq=self.freq,
                box="timestamp",
            )
            return converted.reshape(self.shape)

(which is what list(self) will also call, but then iteratively)
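To illustrate that the two paths should differ only in speed and not in result, here is a small sketch using public API only. It mirrors the `np.array(list(self), dtype=object)` line from `__array__` and compares it with `astype(object)`. The variable names are illustrative, and it assumes `dti.array` exposes the same DatetimeArray as `dti._data`.

```python
import numpy as np
import pandas as pd

dti = pd.date_range("2016-01-01", periods=1000, freq="s", tz="US/Pacific")
dta = dti.array  # public accessor for the underlying DatetimeArray

# Iterative path: box one Timestamp at a time while building a Python list,
# as the object-dtype branch of __array__ quoted above does.
via_list = np.array(list(dta), dtype=object)

# astype path: the conversion is delegated to astype's object-dtype branch.
via_astype = dta.astype(object)

# Both should be object-dtype arrays holding equal tz-aware Timestamps.
same_values = bool((via_list == via_astype).all())
print(via_list.dtype, via_astype.dtype, same_values)
```

If the values match, switching `__array__` to reuse the astype logic would be a pure performance change for object dtype.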

@jbrockmendel
Member

xref #44484 for dispatch logic

@jorisvandenbossche
Member Author

Yes, although the behaviour differences I note in that issue are for other dtypes, not object dtype (for object, it's only a matter of performance, as far as I see)

@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Feb 24, 2022
@simonjayhawkins
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Stale
3 participants