Skip to content

ExtensionDtype should be hashable #22476

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue Aug 23, 2018 · 6 comments
Closed

ExtensionDtype should be hashable #22476

TomAugspurger opened this issue Aug 23, 2018 · 6 comments
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays.
Milestone

Comments

@TomAugspurger
Copy link
Contributor

Using this for sparse in #22325, it'd be good to have generally.

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Aug 23, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Aug 23, 2018
@TomAugspurger
Copy link
Contributor Author

I'll also use this for #22994.

It looks like our default ExtensionDtype.eq is pretty dangerous. It's basically wrong for any kind of parametrized type. I'll do some reading about how to handle this.

@TomAugspurger
Copy link
Contributor Author

So, a safe and low-tech option is to have EA authors declare the meaningful attributes of their ExtensionDtype. For PeriodDtype, this is freq. For CategorcialDtype its the categories and order.

class PeriodDtype(ExtensionDtype):
    _attributes = ['freq']

    def __eq__(self, other):
        if isinstance(other, compat.string_types):
            try:
                other = self._construct_from_string(other)
            except TypeError:
                return False
        if isinstance(other, type(self)):
            return all(
                getattr(self, attr) == getattr(other, attr)
                for attr in self._attributes
            )
        return False

    def __hash__(self):
        return hash(tuple(getattr(self, attr) for attr in self._attributes))

(in reality, __hash__ and __eq__ would go on the base ExtensionDtype class, not PeriodDtype.

That requires a couple things from all the _attributes.

  1. Their __eq__ returns a boolean
  2. They're hashable

CategoricalDtype.categories fails on both those counts, so the default implementation would work for it, but that's fine. I think that's a less common case.

The downside is a tiny bit of duplication. We could do fancier things with metaclasses probably, but that seems like overkill.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Oct 4, 2018
TomAugspurger added a commit that referenced this issue Oct 8, 2018
* API: ExtensionDtype Equality and Hashability

Closes #22476
tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018
* API: ExtensionDtype Equality and Hashability

Closes pandas-dev#22476
@h-vetinari
Copy link
Contributor

h-vetinari commented Nov 20, 2018

I guess I missed this somehow - my attempt at a SetArray (#22382) has been dormant for some time while other issues were more pressing. In any case, having hashability as a general requirement makes it impossible to implement an extension array for sets, unless there is a way to circumvent that requirement somehow.

While implementation is not around the corner, I'd like for this to be a possibility further down the line. In the same spirit, @jreback commented on my first attempt at sets (without using an extension array) in #21547 (comment):

EA is an extension array & a dtype, both of which are first class. That's how I'd like to see sets (and lists and dicts)

EDIT: If need be, I guess the backend could use an array of frozenset or even tuple, but that wouldn't be my first choice, as it makes implementing the methods much more cumbersome.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 20, 2018 via email

@h-vetinari
Copy link
Contributor

@TomAugspurger
Thanks for the quick response. This was my misunderstanding - I read "Pandas now requires that extension dtypes be hashable" in https://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.24.0.html#extensiontype-changes, and got the wrong idea.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 20, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

No branches or pull requests

2 participants