API: ExtensionDtype Equality and Hashability #22996

TomAugspurger · 2018-10-04T20:15:26Z

Implements a default __eq__ and __hash__ for ExtensionDtype. Adds a test ensure that they're defined.

Do people have thoughts on _metadata? I've tried to document everywhere the importance of it for parametrized extension dtypes. If you fail to set it correctly and use the default __eq__ and __hash__, you'll end up with bizarre behavior like Period('D') == Period("M") being true, and having the same hash.

A more heavyweight alternative is to do some metaclass trickery to inspect ExtensionDtype.__init__ for parameters and set _metadata based on that, but I tend to stay away from metaclasses unless they're absolutely necessary.

Closes #22476

Closes pandas-dev#22476

pep8speaks · 2018-10-04T20:15:32Z

Hello @TomAugspurger! Thanks for submitting the PR.

There are no PEP8 issues in the file pandas/core/dtypes/base.py !
There are no PEP8 issues in the file pandas/core/dtypes/dtypes.py !
There are no PEP8 issues in the file pandas/tests/extension/base/dtype.py !
There are no PEP8 issues in the file pandas/tests/extension/decimal/array.py !
There are no PEP8 issues in the file pandas/tests/extension/json/array.py !

TomAugspurger · 2018-10-04T20:15:43Z

cc @xhochy @jorisvandenbossche, @znicholls (people who seem to have implemented an extension array outside of pandas). This may subtly break your code if you have a parametrized ExtensionDtype and don't update to set _metadata.

codecov · 2018-10-04T20:46:14Z

Codecov Report

Merging #22996 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22996      +/-   ##
==========================================
- Coverage    92.2%   92.19%   -0.01%     
==========================================
  Files         169      169              
  Lines       50822    50831       +9     
==========================================
+ Hits        46858    46866       +8     
- Misses       3964     3965       +1

Flag	Coverage Δ
#multiple	`90.61% <100%> (-0.01%)`	⬇️
#single	`42.35% <42.85%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/dtypes/base.py	`100% <100%> (ø)`	⬆️
pandas/core/dtypes/dtypes.py	`95.54% <100%> (-0.02%)`	⬇️
pandas/util/testing.py	`86.18% <0%> (-0.05%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d430195...b7242f6. Read the comment docs.

jorisvandenbossche

This looks sensible to me

jorisvandenbossche · 2018-10-04T20:57:04Z

pandas/core/dtypes/base.py

+        return False
+
+    def __hash__(self):
+        return hash(tuple(getattr(self, attr) for attr in self._metadata))


shouldn't this contain something more than only the attributes?
Because this would mean that two classes with the same attributes can give the same hash? (eg a fictional DatetimeDtype and PeriodDtype in case both have a freq (and only a freq))

Also, the consequence is that each attribute needs to be hashable, and eg cannot be a list?

Also, the consequence is that each attribute needs to be hashable, and eg cannot be a list?

Ah, I see your note about that in the docs

Because this would mean that two classes with the same attributes can give the same hash?

Yes, but I don't think that's a problem in practice. As I understand it, python expects that if two classes are equal, then they have the same hash. The reverse is not true. I think it's OK for two objects to have the same hash, but not be equal.

With the following

class A(ExtensionDtype): _freq = ('foo',) name = "A" def __init__(self, foo): self.foo = foo class B(ExtensionDtype): _freq = ('foo',) name = 'B' def __init__(self, foo): self.foo = foo

We get

In [2]: a = A('foo') In [3]: b = B('foo') In [4]: hash(a) == hash(b) Out[4]: True In [5]: d = {a: 'a', b: 'b'} In [6]: d Out[6]: {<pandas.core.dtypes.dtypes.A at 0x11e824080>: 'a', <pandas.core.dtypes.dtypes.B at 0x11e80a828>: 'b'} In [7]: d[a] Out[7]: 'a' In [8]: len({a, b}) Out[8]: 2

Also, the consequence is that each attribute needs to be hashable, and eg cannot be a list?

Yes, that's correct. I think that's OK. If people have non-hashable things that characterize their class, they'll need to implement their own __hash__

TomAugspurger · 2018-10-04T21:44:35Z

Apparently decimal contexts aren't hashable in python 2. Just going to skip.

jreback

give things a perf check; these can often be expensive checks

WillAyd · 2018-10-05T03:12:47Z

What's the apprehension against a metaclass for ExtensionTypes? Couldn't we leverage introspection during class creation to do this automatically instead of requiring users to explicitly define that class attribute?

jorisvandenbossche · 2018-10-05T10:02:26Z

What's the apprehension against a metaclass for ExtensionTypes?

The docstring mentions performance reasons.

In principle we could indeed inspect the signature of the constructor (that's what eg sklearn does to determine what the 'params' of an Estimator are), but personally I am fine with requiring the user to list them explicitly (explicitness also has some value).

TomAugspurger · 2018-10-05T12:01:22Z

give things a perf check; these can often be expensive checks

Where do you have in mind? This PR has no change in behavior anywhere in pandas. All of our ExtensionDtypes, other than IntegerArray already defined a __hash__.

What's the apprehension against a metaclass for ExtensionTypes?

I was mainly thinking unnecessary complexity. In the past, every time I've reached for a metaclass there's been a simpler solution. But as Joris mentions, isinstance(foo, MetaClass) is quite a bit slower than for regular classes inheriting from object.

jreback · 2018-10-05T12:06:34Z

Where do you have in mind? This PR has no change in behavior anywhere in pandas. All of our ExtensionDtypes, other than IntegerArray already defined a hash.

well, run the asv suite. I have changed things like this in the pass and if its not exactly right it can kill perf. These checks are hit a lot if practically everything.

TomAugspurger · 2018-10-05T12:44:22Z

Ok, I can check on Sunday. I’m away from the computer for a couple days. But again, the only change in pandas would be to not raise an error when trying to hash IntegerDtype. The rest already override hash.

…

________________________________ From: Jeff Reback <[email protected]> Sent: Friday, October 5, 2018 7:06:49 AM To: pandas-dev/pandas Cc: Tom Augspurger; Mention Subject: Re: [pandas-dev/pandas] API: ExtensionDtype Equality and Hashability (#22996) Where do you have in mind? This PR has no change in behavior anywhere in pandas. All of our ExtensionDtypes, other than IntegerArray already defined a hash. well, run the asv suite. I have changed things like this in the pass and if its not exactly right it can kill perf. These checks are hit a lot if practically everything. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#22996 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABQHIhDxShkStMZH_b71FvRbhwNN-pQbks5uh0tZgaJpZM4XI1XM>.

jreback · 2018-10-06T15:10:55Z

But again, the only change in pandas would be to not raise an error when trying to hash IntegerDtype. The rest already override hash.

ok so prob nothing, but can't hurt to check. thanks.

TomAugspurger · 2018-10-08T15:30:54Z

Just some noise it looks like

+        1.89±0ms         2.42±0ms     1.28  rolling.Methods.time_rolling('DataFrame', 10, 'int', 'sum')
+        9.24±0ms         11.8±0ms     1.28  gil.ParallelRolling.time_rolling('skew')
+        3.16±0ms         4.02±0ms     1.27  offset.OffsetDatetimeIndexArithmetic.time_add_offset(<BusinessDay>)
+         663±0μs          824±0μs     1.24  timeseries.DatetimeIndex.time_to_time('dst')
+        56.3±0ms         69.1±0ms     1.23  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 0.5, 'higher')
+        3.76±0ms         4.61±0ms     1.22  stat_ops.FrameOps.time_op('sum', 'float', 0, False)
+        66.4±0ms         81.1±0ms     1.22  stat_ops.FrameOps.time_op('skew', 'float', 1, False)
+        1.74±0ms         2.11±0ms     1.21  index_object.Ops.time_divide('int')
+        3.38±0ms         4.08±0ms     1.21  indexing.Take.time_take('datetime')
+         556±0μs          672±0μs     1.21  series_methods.IsInForObjects.time_isin_nans
+        14.3±0ms         17.2±0ms     1.21  join_merge.Join.time_join_dataframe_index_shuffle_key_bigger_sort(True)
+         739±0μs          892±0μs     1.21  ctors.SeriesConstructors.time_series_constructor(<function SeriesConstructors.<lambda> at 0x11bd06d90>, False)
+        37.8±0ms         45.6±0ms     1.21  frame_ctor.FromDicts.time_list_of_dict
+         485±0μs          585±0μs     1.20  offset.ApplyIndex.time_apply_index(<YearBegin: month=1>)
+         243±0ms          293±0ms     1.20  join_merge.ConcatDataFrames.time_f_ordered(1, False)
+        7.54±0ms         9.02±0ms     1.20  categoricals.Rank.time_rank_int_cat_ordered
+        22.7±0ms         26.9±0ms     1.18  io.sql.ReadSQLTableDtypes.time_read_sql_table_column('datetime')
+         120±0ms          142±0ms     1.18  io.hdf.HDFStoreDataFrame.time_read_store_mixed
+        4.11±0ms         4.86±0ms     1.18  rolling.Methods.time_rolling('DataFrame', 10, 'float', 'kurt')
+         505±0μs          596±0μs     1.18  offset.ApplyIndex.time_apply_index(<QuarterBegin: startingMonth=3>)
+        1.06±0ms         1.25±0ms     1.18  indexing.CategoricalIndexIndexing.time_getitem_bool_array('non_monotonic')
+        8.69±0ms         10.2±0ms     1.18  frame_methods.Equals.time_frame_float_unequal
+         212±0ms          248±0ms     1.17  groupby.GroupByMethods.time_dtype_as_group('int', 'skew', 'transformation')
+        10.1±0ms         11.8±0ms     1.17  frame_methods.Apply.time_apply_lambda_mean
+        4.00±0ms         4.67±0ms     1.17  offset.OnOffset.time_on_offset(<CustomBusinessMonthEnd>)
+         680±0ms          791±0ms     1.16  io.stata.Stata.time_read_stata('tc')
+        4.02±0ms         4.65±0ms     1.16  offset.OffsetSeriesArithmetic.time_add_offset(<SemiMonthEnd: day_of_month=15>)
+        48.7±0ms         56.2±0ms     1.16  strings.Methods.time_center
+        9.23±0ms         10.5±0ms     1.14  frame_methods.ToString.time_to_string_floats
+         167±0ms          188±0ms     1.13  reindex.DropDuplicates.time_frame_drop_dups_int(False)
+         941±0μs         1.06±0ms     1.13  indexing.CategoricalIndexIndexing.time_getitem_list('monotonic_decr')
+         323±0ms          364±0ms     1.13  frame_methods.Duplicated.time_frame_duplicated
+         421±0ms          471±0ms     1.12  frame_methods.Duplicated.time_frame_duplicated_wide
+        7.80±0ms         8.71±0ms     1.12  groupby.Categories.time_groupby_nosort
+        3.68±0ms         4.10±0ms     1.12  sparse.Arithmetic.time_make_union(0.01, 0)
+        3.70±0ms         4.13±0ms     1.12  sparse.ArithmeticBlock.time_division(0)
+         355±0ms          396±0ms     1.11  gil.ParallelGroupbyMethods.time_loop(4, 'prod')
+         543±0ms          605±0ms     1.11  gil.ParallelGroupbyMethods.time_parallel(8, 'sum')
+        16.7±0ms         18.6±0ms     1.11  io.pickle.Pickle.time_write_pickle
+         1.03±0s          1.15±0s     1.11  groupby.GroupByMethods.time_dtype_as_group('float', 'mad', 'direct')
+        3.70±0ms         4.10±0ms     1.11  timeseries.ToDatetimeCache.time_dup_string_dates_and_format(True)
+        2.88±0ms         3.19±0ms     1.11  stat_ops.FrameOps.time_op('prod', 'float', 0, True)
+         585±0μs          647±0μs     1.11  offset.ApplyIndex.time_apply_index(<QuarterEnd: startingMonth=3>)
+        35.3±0ms         39.0±0ms     1.11  algorithms.Duplicated.time_duplicated_int('last')
+         140±0ms          154±0ms     1.10  sparse.Arithmetic.time_add(0.01, nan)
+         374±0ms          412±0ms     1.10  gil.ParallelGroupbyMethods.time_loop(4, 'sum')
+        26.5±0ms         29.2±0ms     1.10  strings.Methods.time_lstrip
+        47.7±0ms         52.4±0ms     1.10  rolling.Quantile.time_quantile('Series', 10, 'int', 0.5, 'higher')

though I'll be sure to check the asv runner tomorrow.

TomAugspurger · 2018-10-08T18:04:56Z

Merging tomorrow if no objections.

jreback · 2018-10-08T20:47:04Z

if asv is like that then no objections

* API: ExtensionDtype Equality and Hashability Closes pandas-dev#22476

API: ExtensionDtype Equality and Hashability

ed56aa3

Closes pandas-dev#22476

TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Oct 4, 2018

TomAugspurger added this to the 0.24.0 milestone Oct 4, 2018

issue note

463c0bd

jorisvandenbossche reviewed Oct 4, 2018

View reviewed changes

TomAugspurger mentioned this pull request Oct 4, 2018

BUG: Concat multiple different ExtensionArray types #22997

Merged

skip py2

b7242f6

jreback requested changes Oct 5, 2018

View reviewed changes

TomAugspurger merged commit f4fe6e3 into pandas-dev:master Oct 8, 2018

TomAugspurger deleted the hashable-ea branch October 8, 2018 21:21

znicholls mentioned this pull request Oct 9, 2018

Pandas update hgrecco/pint#713

Closed

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018

API: ExtensionDtype Equality and Hashability (pandas-dev#22996)

e6cc042

* API: ExtensionDtype Equality and Hashability Closes pandas-dev#22476

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: ExtensionDtype Equality and Hashability #22996

API: ExtensionDtype Equality and Hashability #22996

TomAugspurger commented Oct 4, 2018

pep8speaks commented Oct 4, 2018

TomAugspurger commented Oct 4, 2018

codecov bot commented Oct 4, 2018 •

edited

Loading

jorisvandenbossche left a comment

jorisvandenbossche Oct 4, 2018

jorisvandenbossche Oct 4, 2018

TomAugspurger Oct 4, 2018

TomAugspurger commented Oct 4, 2018

jreback left a comment

WillAyd commented Oct 5, 2018

jorisvandenbossche commented Oct 5, 2018

TomAugspurger commented Oct 5, 2018

jreback commented Oct 5, 2018

TomAugspurger commented Oct 5, 2018 via email

jreback commented Oct 6, 2018

TomAugspurger commented Oct 8, 2018

TomAugspurger commented Oct 8, 2018

jreback commented Oct 8, 2018

API: ExtensionDtype Equality and Hashability #22996

API: ExtensionDtype Equality and Hashability #22996

Conversation

TomAugspurger commented Oct 4, 2018

pep8speaks commented Oct 4, 2018

TomAugspurger commented Oct 4, 2018

codecov bot commented Oct 4, 2018 • edited Loading

Codecov Report

jorisvandenbossche left a comment

Choose a reason for hiding this comment

jorisvandenbossche Oct 4, 2018

Choose a reason for hiding this comment

jorisvandenbossche Oct 4, 2018

Choose a reason for hiding this comment

TomAugspurger Oct 4, 2018

Choose a reason for hiding this comment

TomAugspurger commented Oct 4, 2018

jreback left a comment

Choose a reason for hiding this comment

WillAyd commented Oct 5, 2018

jorisvandenbossche commented Oct 5, 2018

TomAugspurger commented Oct 5, 2018

jreback commented Oct 5, 2018

TomAugspurger commented Oct 5, 2018 via email

jreback commented Oct 6, 2018

TomAugspurger commented Oct 8, 2018

TomAugspurger commented Oct 8, 2018

jreback commented Oct 8, 2018

codecov bot commented Oct 4, 2018 •

edited

Loading