-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: ExtensionDtype Equality and Hashability #22996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @TomAugspurger! Thanks for submitting the PR.
|
cc @xhochy @jorisvandenbossche, @znicholls (people who seem to have implemented an extension array outside of pandas). This may subtly break your code if you have a parametrized ExtensionDtype and don't update to set |
Codecov Report
@@ Coverage Diff @@
## master #22996 +/- ##
==========================================
- Coverage 92.2% 92.19% -0.01%
==========================================
Files 169 169
Lines 50822 50831 +9
==========================================
+ Hits 46858 46866 +8
- Misses 3964 3965 +1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks sensible to me
return False | ||
|
||
def __hash__(self): | ||
return hash(tuple(getattr(self, attr) for attr in self._metadata)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this contain something more than only the attributes?
Because this would mean that two classes with the same attributes can give the same hash? (eg a fictional DatetimeDtype and PeriodDtype in case both have a freq
(and only a freq))
Also, the consequence is that each attribute needs to be hashable, and eg cannot be a list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the consequence is that each attribute needs to be hashable, and eg cannot be a list?
Ah, I see your note about that in the docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this would mean that two classes with the same attributes can give the same hash?
Yes, but I don't think that's a problem in practice. As I understand it, python expects that if two classes are equal, then they have the same hash. The reverse is not true. I think it's OK for two objects to have the same hash, but not be equal.
With the following
class A(ExtensionDtype):
_freq = ('foo',)
name = "A"
def __init__(self, foo):
self.foo = foo
class B(ExtensionDtype):
_freq = ('foo',)
name = 'B'
def __init__(self, foo):
self.foo = foo
We get
In [2]: a = A('foo')
In [3]: b = B('foo')
In [4]: hash(a) == hash(b)
Out[4]: True
In [5]: d = {a: 'a', b: 'b'}
In [6]: d
Out[6]:
{<pandas.core.dtypes.dtypes.A at 0x11e824080>: 'a',
<pandas.core.dtypes.dtypes.B at 0x11e80a828>: 'b'}
In [7]: d[a]
Out[7]: 'a'
In [8]: len({a, b})
Out[8]: 2
Also, the consequence is that each attribute needs to be hashable, and eg cannot be a list?
Yes, that's correct. I think that's OK. If people have non-hashable things that characterize their class, they'll need to implement their own __hash__
Apparently decimal contexts aren't hashable in python 2. Just going to skip. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
give things a perf check; these can often be expensive checks
What's the apprehension against a metaclass for ExtensionTypes? Couldn't we leverage introspection during class creation to do this automatically instead of requiring users to explicitly define that class attribute? |
The docstring mentions performance reasons. In principle we could indeed inspect the signature of the constructor (that's what eg sklearn does to determine what the 'params' of an Estimator are), but personally I am fine with requiring the user to list them explicitly (explicitness also has some value). |
Where do you have in mind? This PR has no change in behavior anywhere in pandas. All of our ExtensionDtypes, other than IntegerArray already defined a
I was mainly thinking unnecessary complexity. In the past, every time I've reached for a metaclass there's been a simpler solution. But as Joris mentions, |
well, run the asv suite. I have changed things like this in the pass and if its not exactly right it can kill perf. These checks are hit a lot if practically everything. |
Ok, I can check on Sunday. I’m away from the computer for a couple days.
But again, the only change in pandas would be to not raise an error when trying to hash IntegerDtype. The rest already override hash.
…________________________________
From: Jeff Reback <[email protected]>
Sent: Friday, October 5, 2018 7:06:49 AM
To: pandas-dev/pandas
Cc: Tom Augspurger; Mention
Subject: Re: [pandas-dev/pandas] API: ExtensionDtype Equality and Hashability (#22996)
Where do you have in mind? This PR has no change in behavior anywhere in pandas. All of our ExtensionDtypes, other than IntegerArray already defined a hash.
well, run the asv suite. I have changed things like this in the pass and if its not exactly right it can kill perf. These checks are hit a lot if practically everything.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#22996 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABQHIhDxShkStMZH_b71FvRbhwNN-pQbks5uh0tZgaJpZM4XI1XM>.
|
ok so prob nothing, but can't hurt to check. thanks. |
Just some noise it looks like
though I'll be sure to check the asv runner tomorrow. |
Merging tomorrow if no objections. |
if asv is like that then no objections |
* API: ExtensionDtype Equality and Hashability Closes pandas-dev#22476
Implements a default
__eq__
and__hash__
for ExtensionDtype. Adds a test ensure that they're defined.Do people have thoughts on
_metadata
? I've tried to document everywhere the importance of it for parametrized extension dtypes. If you fail to set it correctly and use the default__eq__
and__hash__
, you'll end up with bizarre behavior likePeriod('D') == Period("M")
being true, and having the same hash.A more heavyweight alternative is to do some metaclass trickery to inspect
ExtensionDtype.__init__
for parameters and set_metadata
based on that, but I tend to stay away from metaclasses unless they're absolutely necessary.Closes #22476