Skip to content

CLN: refactor block storage selection to use dtype_class._is_numeric #52413

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

ngoldbaum
Copy link
Contributor

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Numpy 1.25 will add a new _is_numeric attribute on dtype classes that is True for dtypes numpy (and pandas it turns out) considers to be numeric and False otherwise (see numpy/numpy#23190):

In [14]: for dtype in [np.int_, np.float_, np.complex_, np.uint32, np.bool_]:
    ...:     assert(type(np.dtype(dtype))._is_numeric), dtype
    ...: 

In [15]: for dtype in [np.datetime64, np.timedelta64, np.str_, np.bytes_, np.object_]:
    ...:     assert(not type(np.dtype(dtype))._is_numeric), dtype
    ...: 

Besides clarity and avoiding the cryptic dtype.kind codes, this change makes it possible for pandas to ingest numpy dtypes written using the currently experimental NEP 42 dtype API, in particular the new string dtype I'm working on (see #47884).

Note that this isn't sufficient to get arbitrary new dtypes working or my string dtype, more changes are coming to support that once it's more feasible to upstream the changes along with new tests in pandas, so this is more of a cleanup with an eye towards future new features.

@mroeschke mroeschke requested a review from jbrockmendel April 4, 2023 17:00
@mroeschke mroeschke added the Internals Related to non-user accessible pandas implementation label Apr 4, 2023
@jbrockmendel
Copy link
Member

Besides clarity and avoiding the cryptic dtype.kind codes

Will these checks be wrong or just not idiomatic? We use these checks in a ton of places (and are moving towards more rather than less)

How does this affect perf? We've gone through a few iterations with get_block_type trying to wring perf out of it.

@ngoldbaum
Copy link
Contributor Author

Will these checks be wrong or just not idiomatic?

They'll be wrong if someone tries to pass in a non-legacy numpy dtype (e.g. not one of the dtypes that currently ship with numpy or a custom dtype written using the legacy custom dtype API). In principle there will be dtypes shipping with numpy 2.0 next January or a future numpy version that will be non-legacy dtypes. non-legacy dtypes all have dtype.kind set to \0 right now, so unless the plan changes there needs to be a different way to distinguish between dtypes, and numpy is headed towards supporting comparing dtype classes (e.g. numpy/numpy#23358) and supporting a hierarchy of dtype types.

We use these checks in a ton of places (and are moving towards more rather than less)

If the idea is that pandas only wants to whitelist supported dtypes I could just abandon this and whitelist my stringdtype.

How does this affect perf? We've gone through a few iterations with get_block_type trying to wring perf out of it.

It's a teeny bit worse. Another thing I could do is refactor this to first check if dtype.kind is a currently supported dtype and only in cases where it's unknown to pandas rely on _is_numeric.

I'm also not sure how to handle the mypy failure, I'm pretty unfamiliar with writing python type annotations.

@jbrockmendel
Copy link
Member

For DatetimeLikeBlock we specifically want the existing dt64/td64 dtypes, so would want to exclude any custom dtypes. For everything else I think we can just merge ObjectBlock and NumericBlock into a single class and make some of this unnecessary.

Is there a convenient way of checking whether a dtype is standard vs custom?

@ngoldbaum
Copy link
Contributor Author

Is there a convenient way of checking whether a dtype is standard vs custom?

Yes, there is a type(dtype)._is_legacy which is True for legacy dtypes.

For everything else I think we can just merge ObjectBlock and NumericBlock into a single class and make some of this unnecessary.

Another wrinkle is that I’d like to use NumpyBlock for the native numpy variable-length string dtype I’m working on, but I guess we can merge that too and just let the _is_numeric attribute be set by the initializer.

@jbrockmendel
Copy link
Member

Another wrinkle is that I’d like to use NumpyBlock for the native numpy variable-length string dtype I’m working on, but I guess we can merge that too and just let the _is_numeric attribute be set by the initializer.

Yah let's do a refactor and get down to just NumpyBlock for all of these. That'll make the rest of this PR unnecessary right?

@ngoldbaum
Copy link
Contributor Author

Yup! Will try again with a followup soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants