Skip to content

REF/API: Stricter extension checking. #22031

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

TomAugspurger
Copy link
Contributor

Removes is_extension_array_dtype's handling of both arrays and dtypes.
Now it handles just arrays, and we provide is_extension_dtype for checking
whether a dtype is an extension dtype. It's the caller's responsibility to know
whether they have an array or dtype.

Closes #22021

Removes is_extension_array_dtype's handling of both arrays and dtypes.
Now it handles just arrays, and we provide `is_extension_dtype` for checking
whether a dtype is an extension dtype. It's the caller's responsibility to know
whether they have an array or dtype.

Closes pandas-dev#22021
@pep8speaks
Copy link

pep8speaks commented Jul 23, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 30, 2018 at 13:09 Hours UTC

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you changing this? we do this for all other is_*

@TomAugspurger
Copy link
Contributor Author

why are you changing this?

To fix #22021

we do this for all other is_*

Yeah, and that's why is_extension_array_dtype originally followed the same pattern. But I think it's not a good idea to pass values to pandas_dtype. All the call sites knew whether they had values or a dtype.

I can make similar changes for the other methods, but we have backwards compatibility issues there.

@@ -323,7 +323,8 @@ ExtensionType Changes
- Bug in :meth:`Series.get` for ``Series`` using ``ExtensionArray`` and integer index (:issue:`21257`)
- :meth:`Series.combine()` works correctly with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`)
- :meth:`Series.combine()` with scalar argument now works for any function type (:issue:`21248`)
-
- Added :func:`pandas.api.types.is_extension_array_dtype` for testing whether an array is an ExtensionArray and :func:`pandas.api.types.is_extension_dtype` for testing whether a dtype is an ExtensionDtype (:issue:`22021`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be more logical to call this is_extension_array instead of is_extension_array_dtype? (I suppose the reason is backwards compatibility for at least one of the two use cases?)
Because I find the 'dtype' part of the name quite confusing.

@codecov
Copy link

codecov bot commented Jul 23, 2018

Codecov Report

Merging #22031 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22031      +/-   ##
==========================================
- Coverage   92.07%   92.07%   -0.01%     
==========================================
  Files         170      170              
  Lines       50688    50684       -4     
==========================================
- Hits        46671    46666       -5     
- Misses       4017     4018       +1
Flag Coverage Δ
#multiple 90.48% <100%> (-0.01%) ⬇️
#single 42.3% <100%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/dtypes.py 96.05% <100%> (+0.02%) ⬆️
pandas/core/dtypes/common.py 94.87% <100%> (-0.34%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 26b3e7d...a97cec4. Read the comment docs.

@gfyoung gfyoung added API Design Refactor Internal refactoring of code Build Library building on various platforms labels Jul 23, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't believe this didn't break anything. I am -1 on this though because this changes the definition of is_*_dtype. All of the routines work on an array correctly. Adding 2 different methods is very confusing as we on't do this for anything else and introduces a special case.

@TomAugspurger
Copy link
Contributor Author

Wouldn't it be more logical to call this is_extension_array instead of is_extension_array_dtype? (I suppose the reason is backwards compatibility for at least one of the two use cases?)

Yes. These weren't in the public namespace earlier, so we don't have to worry about deprecating.

Adding 2 different methods is very confusing

How so? Let's step back to the original issue: pandas_dtype shouldn't pass data to np.dtype. Say I have an object array arr = [('f1', 'f2'), ('f2', 'f4')], as part of an operation that gets sliced to arr[0], and we end up calling pandas_dtype(subarr).

In [28]: subarr = [('f1', 'f2')]

In [29]: pd.api.types.pandas_dtype(subarr)
Out[29]: dtype([('f1', '<f2')])

Is that what you would expect?

we on't do this for anything else and introduces a special case.

As I said,

I can make similar changes for the other methods, but we have backwards compatibility issues there.

@jorisvandenbossche
Copy link
Member

I can't believe this didn't break anything

Why? As Tom said, "all the call sites knew whether they had values or a dtype.", and there are not yet that many usages in the code base currently for this.
I personally like the fact we are explicit in the code about whether we are checking a dtype or an array-like.

Adding 2 different methods is very confusing as we on't do this for anything else and introduces a special case.

Note that we actually have two methods for some other types as well: we have is_categorical and is_categorical_dtype, is_datetimetz and is_datetime64tz_dtype (where the first act only on the array and not dtype), and is_sparse which is also for array-like.
On the other hand, is_period and is_interval then follow the other scheme where they are for the scalar objects, and the _dtype version for dtypes and array-likes.
Anyway, mainly noting that it is already not fully consistent. But I agree it would be nice to follow the same scheme for all check functions.

@jreback
Copy link
Contributor

jreback commented Jul 24, 2018

I think this is a really bad idea to change. We have had this convention for a LONG time esp on the basic types. this is throwing out everything.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jul 24, 2018 via email

@TomAugspurger
Copy link
Contributor Author

Are you -1 on this @jreback? If so you'll need an alternative fix to #22021, as I'm not sure how to avoid that without passing data to np.dtype.

@jreback
Copy link
Contributor

jreback commented Jul 25, 2018

as I said I am -1 on this. This changes too much. This changes the meaning of the is_* for a single case, which is contrary to all of the existing code. I am not sure what the problem is, but fixing this way is not it.

@TomAugspurger
Copy link
Contributor Author

That's unfortunate.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jul 25, 2018

Sorry, I closed this in frustration with you @jreback. But I think it's a good change for the project, so I'll continue to state my case for it.

Comments like

I am not sure what the problem is

and

This changes the meaning of the is_* for a single case, which is contrary to all of the existing code.

are frustrating, since they've already been addressed. I explain the original problem in the issue (#22021 (comment)) and here (#22031 (comment)).

Joris explains in #22031 (comment), this follows the precedent laid out by categorical and datetimetz.

@TomAugspurger TomAugspurger reopened this Jul 25, 2018
@jorisvandenbossche
Copy link
Member

Wouldn't it be more logical to call this is_extension_array instead of is_extension_array_dtype? (I suppose the reason is backwards compatibility for at least one of the two use cases?)

Yes. These weren't in the public namespace earlier, so we don't have to worry about deprecating.

But since they are not yet in the public namespace, we can choose whichever best name we can think of? Or maybe I didn't understand your answer :-)

Anyway, I think we need to think a bit more on the naming scheme.
is_extension_dtype (for the dtype checker) sounds good. But is_extension_array_dtype for array-like checking sounds a bit strange to me (although in the end you check the dtype of the array ...). is_extension might be OK here, but for the other dtypes will conflict with the scalar checkers. And is_extension_array is maybe too specific in "array" as it checks for any array-like that has a dtype (series, index, extensionarray, ..)? Maybe "arraylike" (but doesn't read that nice)?


Parameters
----------
arr : object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array -> dtype

return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))

def is_extension_dtype(dtype):
"""Check if a dtype object is a pandas extension dtype.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question is whether we want to accept strings here as well?
Or keep it strictly for actual dtype objects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, I suspect we should accept strings that have been registered.

I'll see what I can do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meant for the comment here
this is what pandas_dtype does

@TomAugspurger
Copy link
Contributor Author

But since they are not yet in the public namespace, we can choose whichever best name we can think of? Or maybe I didn't understand your answer :-)

Sorry for the confusion. I think you understood my answer, but I didn't push the change :) I think is_extension_array for checking whether an array-like is an extension array is good.

@jreback
Copy link
Contributor

jreback commented Jul 26, 2018

this is exactly what pandas_dtype does

@TomAugspurger
Copy link
Contributor Author

Yes, but pandas_dtype also potentially passes data to np.dtype in

npdtype = np.dtype(dtype)
, which we need to avoid.

@jorisvandenbossche
Copy link
Member

but pandas_dtype also potentially passes data to np.dtype in

Since you now splitted it in two functions, in the new is_extension_dtype, you can only use pandas_dtype when the input is a string, and then you should never run into those problems?

def is_extension_array_dtype(arr_or_dtype):
"""Check if an object is a pandas extension array type.
def is_extension_array(arr):
"""Check if an array object is a pandas extension array type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify (maybe in extended summary paragraph or in parameters section) that this is "array-like" (also Series and Index, in practice anything that has a dtype property)

@TomAugspurger
Copy link
Contributor Author

Refactored.

  1. No longer splitting is_extension_dtype and is_extension_array_dtype.
  2. is_extension_array_dtype avoids calling pandas_dtype. There's no reason to go through all the stuff that pandas_dtype does, including np.dtype(object), just to test if it's an extension array or dtype
  3. DatetimeTZDtype and PeriodDtype were previously registered as extension dtypes. But this mixed dtypes that satisfy the interface and our own internal ones. For now, we'll need two registries, but this should be reduced to one by the time 0.24.0 comes out.


@classmethod
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't actually a classmethod.

@jreback
Copy link
Contributor

jreback commented Jul 26, 2018

@TomAugspurger looks a lot nicer!

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jul 26, 2018

I don't think so. pandas will still have the issue of

In [28]: subarr = [('f1', 'f2')]

In [29]: pd.api.types.pandas_dtype(subarr)
Out[29]: dtype([('f1', '<f2')])

this is just a workaround that avoids the problem by not calling pandas_dtype from within is_extension_array_dtype.

@jreback
Copy link
Contributor

jreback commented Jul 28, 2018

can you rebase

@jreback jreback added this to the 0.24.0 milestone Jul 31, 2018
@jreback jreback merged commit 4f0392d into pandas-dev:master Jul 31, 2018
@jreback
Copy link
Contributor

jreback commented Jul 31, 2018

thanks!!!

dberenbaum pushed a commit to dberenbaum/pandas that referenced this pull request Aug 3, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Build Library building on various platforms Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants