REF/API: Stricter extension checking. #22031

TomAugspurger · 2018-07-23T19:41:56Z

Removes is_extension_array_dtype's handling of both arrays and dtypes.
Now it handles just arrays, and we provide is_extension_dtype for checking
whether a dtype is an extension dtype. It's the caller's responsibility to know
whether they have an array or dtype.

Closes #22021

Removes is_extension_array_dtype's handling of both arrays and dtypes. Now it handles just arrays, and we provide `is_extension_dtype` for checking whether a dtype is an extension dtype. It's the caller's responsibility to know whether they have an array or dtype. Closes pandas-dev#22021

pep8speaks · 2018-07-23T21:00:45Z

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 30, 2018 at 13:09 Hours UTC

jreback

why are you changing this? we do this for all other is_*

TomAugspurger · 2018-07-23T21:44:35Z

why are you changing this?

To fix #22021

we do this for all other is_*

Yeah, and that's why is_extension_array_dtype originally followed the same pattern. But I think it's not a good idea to pass values to pandas_dtype. All the call sites knew whether they had values or a dtype.

I can make similar changes for the other methods, but we have backwards compatibility issues there.

jorisvandenbossche · 2018-07-23T22:24:53Z

doc/source/whatsnew/v0.24.0.txt

@@ -323,7 +323,8 @@ ExtensionType Changes
 - Bug in :meth:`Series.get` for ``Series`` using ``ExtensionArray`` and integer index (:issue:`21257`)
 - :meth:`Series.combine()` works correctly with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`)
 - :meth:`Series.combine()` with scalar argument now works for any function type (:issue:`21248`)
-
+- Added :func:`pandas.api.types.is_extension_array_dtype` for testing whether an array is an ExtensionArray and :func:`pandas.api.types.is_extension_dtype` for testing whether a dtype is an ExtensionDtype (:issue:`22021`)


Wouldn't it be more logical to call this is_extension_array instead of is_extension_array_dtype? (I suppose the reason is backwards compatibility for at least one of the two use cases?)
Because I find the 'dtype' part of the name quite confusing.

codecov · 2018-07-23T23:20:46Z

Codecov Report

Merging #22031 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22031      +/-   ##
==========================================
- Coverage   92.07%   92.07%   -0.01%     
==========================================
  Files         170      170              
  Lines       50688    50684       -4     
==========================================
- Hits        46671    46666       -5     
- Misses       4017     4018       +1

Flag	Coverage Δ
#multiple	`90.48% <100%> (-0.01%)`	⬇️
#single	`42.3% <100%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/dtypes/dtypes.py	`96.05% <100%> (+0.02%)`	⬆️
pandas/core/dtypes/common.py	`94.87% <100%> (-0.34%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 26b3e7d...a97cec4. Read the comment docs.

jreback

I can't believe this didn't break anything. I am -1 on this though because this changes the definition of is_*_dtype. All of the routines work on an array correctly. Adding 2 different methods is very confusing as we on't do this for anything else and introduces a special case.

TomAugspurger · 2018-07-24T13:54:15Z

Wouldn't it be more logical to call this is_extension_array instead of is_extension_array_dtype? (I suppose the reason is backwards compatibility for at least one of the two use cases?)

Yes. These weren't in the public namespace earlier, so we don't have to worry about deprecating.

Adding 2 different methods is very confusing

How so? Let's step back to the original issue: pandas_dtype shouldn't pass data to np.dtype. Say I have an object array arr = [('f1', 'f2'), ('f2', 'f4')], as part of an operation that gets sliced to arr[0], and we end up calling pandas_dtype(subarr).

In [28]: subarr = [('f1', 'f2')]

In [29]: pd.api.types.pandas_dtype(subarr)
Out[29]: dtype([('f1', '<f2')])

Is that what you would expect?

we on't do this for anything else and introduces a special case.

As I said,

I can make similar changes for the other methods, but we have backwards compatibility issues there.

jorisvandenbossche · 2018-07-24T13:56:12Z

I can't believe this didn't break anything

Why? As Tom said, "all the call sites knew whether they had values or a dtype.", and there are not yet that many usages in the code base currently for this.
I personally like the fact we are explicit in the code about whether we are checking a dtype or an array-like.

Adding 2 different methods is very confusing as we on't do this for anything else and introduces a special case.

Note that we actually have two methods for some other types as well: we have is_categorical and is_categorical_dtype, is_datetimetz and is_datetime64tz_dtype (where the first act only on the array and not dtype), and is_sparse which is also for array-like.
On the other hand, is_period and is_interval then follow the other scheme where they are for the scalar objects, and the _dtype version for dtypes and array-likes.
Anyway, mainly noting that it is already not fully consistent. But I agree it would be nice to follow the same scheme for all check functions.

jreback · 2018-07-24T22:05:26Z

I think this is a really bad idea to change. We have had this convention for a LONG time esp on the basic types. this is throwing out everything.

TomAugspurger · 2018-07-24T23:54:57Z

Throwing out?

…

________________________________ From: Jeff Reback <[email protected]> Sent: Tuesday, July 24, 2018 5:05:31 PM To: pandas-dev/pandas Cc: Tom Augspurger; Mention Subject: Re: [pandas-dev/pandas] REF/API: Stricter extension checking. (#22031) I think this is a really bad idea to change. We have had this convention for a LONG time esp on the basic types. this is throwing out everything. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#22031 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABQHIlZxP53Zujbodat9MFHLx1yidcOqks5uJ5orgaJpZM4VblLO>.

TomAugspurger · 2018-07-25T12:13:58Z

Are you -1 on this @jreback? If so you'll need an alternative fix to #22021, as I'm not sure how to avoid that without passing data to np.dtype.

jreback · 2018-07-25T12:18:08Z

as I said I am -1 on this. This changes too much. This changes the meaning of the is_* for a single case, which is contrary to all of the existing code. I am not sure what the problem is, but fixing this way is not it.

TomAugspurger · 2018-07-25T12:19:03Z

That's unfortunate.

TomAugspurger · 2018-07-25T13:52:09Z

Sorry, I closed this in frustration with you @jreback. But I think it's a good change for the project, so I'll continue to state my case for it.

Comments like

I am not sure what the problem is

and

This changes the meaning of the is_* for a single case, which is contrary to all of the existing code.

are frustrating, since they've already been addressed. I explain the original problem in the issue (#22021 (comment)) and here (#22031 (comment)).

Joris explains in #22031 (comment), this follows the precedent laid out by categorical and datetimetz.

jorisvandenbossche · 2018-07-26T08:08:05Z

Wouldn't it be more logical to call this is_extension_array instead of is_extension_array_dtype? (I suppose the reason is backwards compatibility for at least one of the two use cases?)

Yes. These weren't in the public namespace earlier, so we don't have to worry about deprecating.

But since they are not yet in the public namespace, we can choose whichever best name we can think of? Or maybe I didn't understand your answer :-)

Anyway, I think we need to think a bit more on the naming scheme.
is_extension_dtype (for the dtype checker) sounds good. But is_extension_array_dtype for array-like checking sounds a bit strange to me (although in the end you check the dtype of the array ...). is_extension might be OK here, but for the other dtypes will conflict with the scalar checkers. And is_extension_array is maybe too specific in "array" as it checks for any array-like that has a dtype (series, index, extensionarray, ..)? Maybe "arraylike" (but doesn't read that nice)?

jorisvandenbossche · 2018-07-26T08:08:33Z

pandas/core/dtypes/common.py

+
+    Parameters
+    ----------
+    arr : object


array -> dtype

jorisvandenbossche · 2018-07-26T08:09:05Z

pandas/core/dtypes/common.py

-    return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))
+
+def is_extension_dtype(dtype):
+    """Check if a dtype object is a pandas extension dtype.


One question is whether we want to accept strings here as well?
Or keep it strictly for actual dtype objects?

That's a good point, I suspect we should accept strings that have been registered.

I'll see what I can do.

meant for the comment here
this is what pandas_dtype does

TomAugspurger · 2018-07-26T11:01:24Z

But since they are not yet in the public namespace, we can choose whichever best name we can think of? Or maybe I didn't understand your answer :-)

Sorry for the confusion. I think you understood my answer, but I didn't push the change :) I think is_extension_array for checking whether an array-like is an extension array is good.

jreback · 2018-07-26T11:06:13Z

this is exactly what pandas_dtype does

TomAugspurger · 2018-07-26T11:11:12Z

Yes, but pandas_dtype also potentially passes data to np.dtype in

pandas/pandas/core/dtypes/common.py

Line 2013 in 2c7c797

npdtype = np.dtype(dtype)

, which we need to avoid.

jorisvandenbossche · 2018-07-26T11:21:25Z

but pandas_dtype also potentially passes data to np.dtype in

Since you now splitted it in two functions, in the new is_extension_dtype, you can only use pandas_dtype when the input is a string, and then you should never run into those problems?

jorisvandenbossche · 2018-07-26T11:23:24Z

pandas/core/dtypes/common.py

-def is_extension_array_dtype(arr_or_dtype):
-    """Check if an object is a pandas extension array type.
+def is_extension_array(arr):
+    """Check if an array object is a pandas extension array type.


Can you clarify (maybe in extended summary paragraph or in parameters section) that this is "array-like" (also Series and Index, in practice anything that has a dtype property)

…y_dtype_no_data

…ray_dtype_no_data

TomAugspurger · 2018-07-26T13:09:42Z

Refactored.

No longer splitting is_extension_dtype and is_extension_array_dtype.
is_extension_array_dtype avoids calling pandas_dtype. There's no reason to go through all the stuff that pandas_dtype does, including np.dtype(object), just to test if it's an extension array or dtype
DatetimeTZDtype and PeriodDtype were previously registered as extension dtypes. But this mixed dtypes that satisfy the interface and our own internal ones. For now, we'll need two registries, but this should be reduced to one by the time 0.24.0 comes out.

TomAugspurger · 2018-07-26T13:12:22Z

pandas/core/dtypes/dtypes.py


-    @classmethod


This wasn't actually a classmethod.

jreback · 2018-07-26T13:18:09Z

@TomAugspurger looks a lot nicer!

TomAugspurger · 2018-07-26T13:21:38Z

I don't think so. pandas will still have the issue of

In [28]: subarr = [('f1', 'f2')]

In [29]: pd.api.types.pandas_dtype(subarr)
Out[29]: dtype([('f1', '<f2')])

this is just a workaround that avoids the problem by not calling pandas_dtype from within is_extension_array_dtype.

jreback · 2018-07-28T13:44:48Z

can you rebase

…y_dtype_no_data

jreback · 2018-07-31T13:22:52Z

thanks!!!

TomAugspurger added 2 commits July 23, 2018 14:41

Added to API test

b908a2f

Linting

34ff2cd

jreback requested changes Jul 23, 2018

View reviewed changes

jorisvandenbossche reviewed Jul 23, 2018

View reviewed changes

gfyoung added API Design Refactor Internal refactoring of code Build Library building on various platforms labels Jul 23, 2018

jreback requested changes Jul 23, 2018

View reviewed changes

Rename

ec31716

TomAugspurger closed this Jul 25, 2018

TomAugspurger reopened this Jul 25, 2018

jorisvandenbossche reviewed Jul 26, 2018

View reviewed changes

Rename is_extension_array

94318e1

jorisvandenbossche reviewed Jul 26, 2018

View reviewed changes

TomAugspurger added 5 commits July 26, 2018 07:46

Take 2

e92d75f

Merge remote-tracking branch 'upstream/master' into is_extension_arra…

7cfbebf

…y_dtype_no_data

revert

2731c70

Merge branch 'is_extension_array_dtype_no_data2' into is_extension_ar…

4d94a3b

…ray_dtype_no_data

Split registries

bc24cd7

TomAugspurger commented Jul 26, 2018

View reviewed changes

pandas/core/dtypes/dtypes.py

@classmethod

Copy link

Contributor Author

TomAugspurger Jul 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't actually a classmethod.

lint

08d1c40

Merge remote-tracking branch 'upstream/master' into is_extension_arra…

a97cec4

…y_dtype_no_data

jreback added this to the 0.24.0 milestone Jul 31, 2018

jreback approved these changes Jul 31, 2018

View reviewed changes

jreback merged commit 4f0392d into pandas-dev:master Jul 31, 2018

dberenbaum pushed a commit to dberenbaum/pandas that referenced this pull request Aug 3, 2018

REF/API: Stricter extension checking. (pandas-dev#22031)

cae328e

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

REF/API: Stricter extension checking. (pandas-dev#22031)

10586fd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REF/API: Stricter extension checking. #22031

REF/API: Stricter extension checking. #22031

TomAugspurger commented Jul 23, 2018

pep8speaks commented Jul 23, 2018 •

edited

Loading

jreback left a comment

TomAugspurger commented Jul 23, 2018

jorisvandenbossche Jul 23, 2018

codecov bot commented Jul 23, 2018 •

edited

Loading

jreback left a comment

TomAugspurger commented Jul 24, 2018

jorisvandenbossche commented Jul 24, 2018

jreback commented Jul 24, 2018

TomAugspurger commented Jul 24, 2018 via email

TomAugspurger commented Jul 25, 2018

jreback commented Jul 25, 2018 •

edited

Loading

TomAugspurger commented Jul 25, 2018

TomAugspurger commented Jul 25, 2018 •

edited

Loading

jorisvandenbossche commented Jul 26, 2018

jorisvandenbossche Jul 26, 2018

jorisvandenbossche Jul 26, 2018

TomAugspurger Jul 26, 2018

jreback Jul 26, 2018

TomAugspurger commented Jul 26, 2018

jreback commented Jul 26, 2018

TomAugspurger commented Jul 26, 2018

jorisvandenbossche commented Jul 26, 2018

jorisvandenbossche Jul 26, 2018

TomAugspurger commented Jul 26, 2018

TomAugspurger Jul 26, 2018

jreback commented Jul 26, 2018

TomAugspurger commented Jul 26, 2018 •

edited

Loading

jreback commented Jul 28, 2018

jreback commented Jul 31, 2018

REF/API: Stricter extension checking. #22031

REF/API: Stricter extension checking. #22031

Conversation

TomAugspurger commented Jul 23, 2018

pep8speaks commented Jul 23, 2018 • edited Loading

Comment last updated on July 30, 2018 at 13:09 Hours UTC

jreback left a comment

Choose a reason for hiding this comment

TomAugspurger commented Jul 23, 2018

jorisvandenbossche Jul 23, 2018

Choose a reason for hiding this comment

codecov bot commented Jul 23, 2018 • edited Loading

Codecov Report

jreback left a comment

Choose a reason for hiding this comment

TomAugspurger commented Jul 24, 2018

jorisvandenbossche commented Jul 24, 2018

jreback commented Jul 24, 2018

TomAugspurger commented Jul 24, 2018 via email

TomAugspurger commented Jul 25, 2018

jreback commented Jul 25, 2018 • edited Loading

TomAugspurger commented Jul 25, 2018

TomAugspurger commented Jul 25, 2018 • edited Loading

jorisvandenbossche commented Jul 26, 2018

jorisvandenbossche Jul 26, 2018

Choose a reason for hiding this comment

jorisvandenbossche Jul 26, 2018

Choose a reason for hiding this comment

TomAugspurger Jul 26, 2018

Choose a reason for hiding this comment

jreback Jul 26, 2018

Choose a reason for hiding this comment

TomAugspurger commented Jul 26, 2018

jreback commented Jul 26, 2018

TomAugspurger commented Jul 26, 2018

jorisvandenbossche commented Jul 26, 2018

jorisvandenbossche Jul 26, 2018

Choose a reason for hiding this comment

TomAugspurger commented Jul 26, 2018

TomAugspurger Jul 26, 2018

Choose a reason for hiding this comment

jreback commented Jul 26, 2018

TomAugspurger commented Jul 26, 2018 • edited Loading

jreback commented Jul 28, 2018

jreback commented Jul 31, 2018

pep8speaks commented Jul 23, 2018 •

edited

Loading

codecov bot commented Jul 23, 2018 •

edited

Loading

jreback commented Jul 25, 2018 •

edited

Loading

TomAugspurger commented Jul 25, 2018 •

edited

Loading

TomAugspurger commented Jul 26, 2018 •

edited

Loading