CLN: refactor block storage selection to use dtype_class._is_numeric #52413

ngoldbaum · 2023-04-04T16:49:01Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

NumPy 1.25 will add a new _is_numeric attribute on dtype classes that is True for dtypes that NumPy (and, it turns out, pandas) considers numeric and False otherwise (see numpy/numpy#23190):

In [14]: for dtype in [np.int_, np.float_, np.complex_, np.uint32, np.bool_]:
    ...:     assert(type(np.dtype(dtype))._is_numeric), dtype
    ...: 

In [15]: for dtype in [np.datetime64, np.timedelta64, np.str_, np.bytes_, np.object_]:
    ...:     assert(not type(np.dtype(dtype))._is_numeric), dtype
    ...:

Besides clarity and avoiding the cryptic dtype.kind codes, this change makes it possible for pandas to ingest numpy dtypes written using the currently experimental NEP 42 dtype API, in particular the new string dtype I'm working on (see #47884).

Note that this isn't sufficient to get arbitrary new dtypes, or my string dtype, working. More changes are coming to support that once it's more feasible to upstream them along with new tests in pandas, so this is more of a cleanup with an eye towards future features.

jbrockmendel · 2023-04-04T17:24:11Z

> Besides clarity and avoiding the cryptic dtype.kind codes

Will these checks be wrong or just not idiomatic? We use these checks in a ton of places (and are moving towards more rather than less)

How does this affect perf? We've gone through a few iterations with get_block_type trying to wring perf out of it.

ngoldbaum · 2023-04-04T17:40:06Z

> Will these checks be wrong or just not idiomatic?

They'll be wrong if someone tries to pass in a non-legacy numpy dtype (i.e. one that is neither a dtype that currently ships with numpy nor a custom dtype written using the legacy custom dtype API). In principle, numpy 2.0 next January, or a later numpy version, will ship non-legacy dtypes. Non-legacy dtypes all have dtype.kind set to \0 right now, so unless the plan changes there needs to be a different way to distinguish between dtypes. numpy is headed towards supporting comparisons of dtype classes (e.g. numpy/numpy#23358) and a hierarchy of dtype types.

> We use these checks in a ton of places (and are moving towards more rather than less)

If the idea is that pandas only wants to whitelist supported dtypes I could just abandon this and whitelist my stringdtype.

> How does this affect perf? We've gone through a few iterations with get_block_type trying to wring perf out of it.

It's a teeny bit worse. Another thing I could do is refactor this to first check whether dtype.kind matches a currently supported dtype and rely on _is_numeric only when the kind is unknown to pandas.
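
A minimal sketch of that kind-first idea, not the code in this PR. The helper name `is_numeric_dtype` and the stand-in dtype classes are hypothetical, and the helper is duck-typed so the example runs without NumPy installed. It assumes, as discussed in this thread, that non-legacy dtypes report a `kind` of `\0` and that the dtype class may expose `_is_numeric`.

```python
# Kind codes NumPy uses for its built-in numeric dtypes:
# b=bool, i=signed int, u=unsigned int, f=float, c=complex.
_NUMERIC_KINDS = frozenset("biufc")
# Kind codes of built-in dtypes that are not numeric:
# M=datetime64, m=timedelta64, O=object, S=bytes, U=str, V=void.
_NON_NUMERIC_KINDS = frozenset("MmOSUV")


def is_numeric_dtype(dtype) -> bool:
    """Hypothetical helper: kind-code fast path, `_is_numeric` fallback."""
    kind = dtype.kind
    if kind in _NUMERIC_KINDS:
        return True
    if kind in _NON_NUMERIC_KINDS:
        return False
    # Unknown kind (e.g. "\0" for a non-legacy dtype): ask the dtype class.
    # Assumption: treat the dtype as non-numeric if the attribute is missing.
    return bool(getattr(type(dtype), "_is_numeric", False))


# Stand-in dtype objects (hypothetical, for illustration only).
class FakeLegacyFloat:
    kind = "f"


class FakeNewStyleString:
    kind = "\0"
    _is_numeric = False


class FakeNewStyleNumber:
    kind = "\0"
    _is_numeric = True


results = [
    is_numeric_dtype(FakeLegacyFloat()),
    is_numeric_dtype(FakeNewStyleString()),
    is_numeric_dtype(FakeNewStyleNumber()),
]
```

With this shape the existing dtypes never reach the attribute lookup, so the extra cost would only be paid for dtypes pandas does not already recognize.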

I'm also not sure how to handle the mypy failure; I'm pretty unfamiliar with writing Python type annotations.

jbrockmendel · 2023-04-05T15:13:24Z

For DatetimeLikeBlock we specifically want the existing dt64/td64 dtypes, so would want to exclude any custom dtypes. For everything else I think we can just merge ObjectBlock and NumericBlock into a single class and make some of this unnecessary.

Is there a convenient way of checking whether a dtype is standard vs custom?

ngoldbaum · 2023-04-05T15:16:32Z

> Is there a convenient way of checking whether a dtype is standard vs custom?

Yes, there is a type(dtype)._is_legacy which is True for legacy dtypes.

> For everything else I think we can just merge ObjectBlock and NumericBlock into a single class and make some of this unnecessary.

Another wrinkle is that I’d like to use NumpyBlock for the native numpy variable-length string dtype I’m working on, but I guess we can merge that too and just let the _is_numeric attribute be set by the initializer.
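
Roughly what I have in mind, as a sketch only. `NumpyBlockSketch` and the stand-in dtype classes are hypothetical and are not the pandas classes; the point is just that one class can record whether it is numeric at construction time instead of encoding that in a subclass. It assumes the dtype class may expose `_is_numeric` and falls back to the kind code when it does not.

```python
_NUMERIC_KINDS = frozenset("biufc")  # bool, int, uint, float, complex


class NumpyBlockSketch:
    """Hypothetical stand-in for a single merged block class."""

    def __init__(self, values, dtype):
        self.values = values
        self.dtype = dtype
        # Prefer the flag on the dtype class when it exists.
        flag = getattr(type(dtype), "_is_numeric", None)
        if flag is None:
            # Assumption: without the flag, fall back to the kind code.
            self.is_numeric = dtype.kind in _NUMERIC_KINDS
        else:
            self.is_numeric = bool(flag)


# Stand-in dtype objects (hypothetical, for illustration only).
class FakeFloatDType:
    kind = "f"
    _is_numeric = True


class FakeStringDType:
    kind = "\0"
    _is_numeric = False


class FakeOldFloatDType:
    kind = "f"  # no _is_numeric attribute on the class


float_block = NumpyBlockSketch([1.0, 2.0], FakeFloatDType())
string_block = NumpyBlockSketch(["a", "b"], FakeStringDType())
old_float_block = NumpyBlockSketch([3.0], FakeOldFloatDType())
```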

jbrockmendel · 2023-04-11T17:41:35Z

> Another wrinkle is that I’d like to use NumpyBlock for the native numpy variable-length string dtype I’m working on, but I guess we can merge that too and just let the _is_numeric attribute be set by the initializer.

Yah let's do a refactor and get down to just NumpyBlock for all of these. That'll make the rest of this PR unnecessary right?

ngoldbaum · 2023-04-11T17:49:23Z

Yup! Will try again with a followup soon.

CLN: refactor block storage selection to use dtype_class._is_numeric

2798cf1

mroeschke requested a review from jbrockmendel April 4, 2023 17:00

mroeschke added the Internals Related to non-user accessible pandas implementation label Apr 4, 2023

ngoldbaum closed this Apr 11, 2023

ngoldbaum mentioned this pull request Apr 20, 2023

CLN: unify NumpyBlock, ObjectBlock, and NumericBlock #52817

Merged
