Skip to content

API: ExtensionDtype Equality and Hashability #22996

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 8, 2018

Conversation

TomAugspurger
Copy link
Contributor

Implements a default __eq__ and __hash__ for ExtensionDtype. Adds a test ensure that they're defined.

Do people have thoughts on _metadata? I've tried to document everywhere the importance of it for parametrized extension dtypes. If you fail to set it correctly and use the default __eq__ and __hash__, you'll end up with bizarre behavior like Period('D') == Period("M") being true, and having the same hash.

A more heavyweight alternative is to do some metaclass trickery to inspect ExtensionDtype.__init__ for parameters and set _metadata based on that, but I tend to stay away from metaclasses unless they're absolutely necessary.

Closes #22476

@pep8speaks
Copy link

Hello @TomAugspurger! Thanks for submitting the PR.

@TomAugspurger
Copy link
Contributor Author

cc @xhochy @jorisvandenbossche, @znicholls (people who seem to have implemented an extension array outside of pandas). This may subtly break your code if you have a parametrized ExtensionDtype and don't update to set _metadata.

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Oct 4, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Oct 4, 2018
@codecov
Copy link

codecov bot commented Oct 4, 2018

Codecov Report

Merging #22996 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22996      +/-   ##
==========================================
- Coverage    92.2%   92.19%   -0.01%     
==========================================
  Files         169      169              
  Lines       50822    50831       +9     
==========================================
+ Hits        46858    46866       +8     
- Misses       3964     3965       +1
Flag Coverage Δ
#multiple 90.61% <100%> (-0.01%) ⬇️
#single 42.35% <42.85%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/base.py 100% <100%> (ø) ⬆️
pandas/core/dtypes/dtypes.py 95.54% <100%> (-0.02%) ⬇️
pandas/util/testing.py 86.18% <0%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d430195...b7242f6. Read the comment docs.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks sensible to me

return False

def __hash__(self):
return hash(tuple(getattr(self, attr) for attr in self._metadata))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this contain something more than only the attributes?
Because this would mean that two classes with the same attributes can give the same hash? (eg a fictional DatetimeDtype and PeriodDtype in case both have a freq (and only a freq))

Also, the consequence is that each attribute needs to be hashable, and eg cannot be a list?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the consequence is that each attribute needs to be hashable, and eg cannot be a list?

Ah, I see your note about that in the docs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this would mean that two classes with the same attributes can give the same hash?

Yes, but I don't think that's a problem in practice. As I understand it, python expects that if two classes are equal, then they have the same hash. The reverse is not true. I think it's OK for two objects to have the same hash, but not be equal.

With the following

class A(ExtensionDtype):
    _freq = ('foo',)
    name = "A"
    def __init__(self, foo):
        self.foo = foo


class B(ExtensionDtype):
    _freq = ('foo',)
    name = 'B'

    def __init__(self, foo):
        self.foo = foo

We get

In [2]: a = A('foo')

In [3]: b = B('foo')

In [4]: hash(a) == hash(b)
Out[4]: True

In [5]: d = {a: 'a', b: 'b'}

In [6]: d
Out[6]:
{<pandas.core.dtypes.dtypes.A at 0x11e824080>: 'a',
 <pandas.core.dtypes.dtypes.B at 0x11e80a828>: 'b'}

In [7]: d[a]
Out[7]: 'a'

In [8]: len({a, b})
Out[8]: 2

Also, the consequence is that each attribute needs to be hashable, and eg cannot be a list?

Yes, that's correct. I think that's OK. If people have non-hashable things that characterize their class, they'll need to implement their own __hash__

@TomAugspurger
Copy link
Contributor Author

Apparently decimal contexts aren't hashable in python 2. Just going to skip.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

give things a perf check; these can often be expensive checks

@WillAyd
Copy link
Member

WillAyd commented Oct 5, 2018

What's the apprehension against a metaclass for ExtensionTypes? Couldn't we leverage introspection during class creation to do this automatically instead of requiring users to explicitly define that class attribute?

@jorisvandenbossche
Copy link
Member

What's the apprehension against a metaclass for ExtensionTypes?

The docstring mentions performance reasons.

In principle we could indeed inspect the signature of the constructor (that's what eg sklearn does to determine what the 'params' of an Estimator are), but personally I am fine with requiring the user to list them explicitly (explicitness also has some value).

@TomAugspurger
Copy link
Contributor Author

give things a perf check; these can often be expensive checks

Where do you have in mind? This PR has no change in behavior anywhere in pandas. All of our ExtensionDtypes, other than IntegerArray already defined a __hash__.

What's the apprehension against a metaclass for ExtensionTypes?

I was mainly thinking unnecessary complexity. In the past, every time I've reached for a metaclass there's been a simpler solution. But as Joris mentions, isinstance(foo, MetaClass) is quite a bit slower than for regular classes inheriting from object.

@jreback
Copy link
Contributor

jreback commented Oct 5, 2018

Where do you have in mind? This PR has no change in behavior anywhere in pandas. All of our ExtensionDtypes, other than IntegerArray already defined a hash.

well, run the asv suite. I have changed things like this in the pass and if its not exactly right it can kill perf. These checks are hit a lot if practically everything.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Oct 5, 2018 via email

@jreback
Copy link
Contributor

jreback commented Oct 6, 2018

But again, the only change in pandas would be to not raise an error when trying to hash IntegerDtype. The rest already override hash.

ok so prob nothing, but can't hurt to check. thanks.

@TomAugspurger
Copy link
Contributor Author

Just some noise it looks like

+        1.89±0ms         2.42±0ms     1.28  rolling.Methods.time_rolling('DataFrame', 10, 'int', 'sum')
+        9.24±0ms         11.8±0ms     1.28  gil.ParallelRolling.time_rolling('skew')
+        3.16±0ms         4.02±0ms     1.27  offset.OffsetDatetimeIndexArithmetic.time_add_offset(<BusinessDay>)
+         663±0μs          824±0μs     1.24  timeseries.DatetimeIndex.time_to_time('dst')
+        56.3±0ms         69.1±0ms     1.23  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 0.5, 'higher')
+        3.76±0ms         4.61±0ms     1.22  stat_ops.FrameOps.time_op('sum', 'float', 0, False)
+        66.4±0ms         81.1±0ms     1.22  stat_ops.FrameOps.time_op('skew', 'float', 1, False)
+        1.74±0ms         2.11±0ms     1.21  index_object.Ops.time_divide('int')
+        3.38±0ms         4.08±0ms     1.21  indexing.Take.time_take('datetime')
+         556±0μs          672±0μs     1.21  series_methods.IsInForObjects.time_isin_nans
+        14.3±0ms         17.2±0ms     1.21  join_merge.Join.time_join_dataframe_index_shuffle_key_bigger_sort(True)
+         739±0μs          892±0μs     1.21  ctors.SeriesConstructors.time_series_constructor(<function SeriesConstructors.<lambda> at 0x11bd06d90>, False)
+        37.8±0ms         45.6±0ms     1.21  frame_ctor.FromDicts.time_list_of_dict
+         485±0μs          585±0μs     1.20  offset.ApplyIndex.time_apply_index(<YearBegin: month=1>)
+         243±0ms          293±0ms     1.20  join_merge.ConcatDataFrames.time_f_ordered(1, False)
+        7.54±0ms         9.02±0ms     1.20  categoricals.Rank.time_rank_int_cat_ordered
+        22.7±0ms         26.9±0ms     1.18  io.sql.ReadSQLTableDtypes.time_read_sql_table_column('datetime')
+         120±0ms          142±0ms     1.18  io.hdf.HDFStoreDataFrame.time_read_store_mixed
+        4.11±0ms         4.86±0ms     1.18  rolling.Methods.time_rolling('DataFrame', 10, 'float', 'kurt')
+         505±0μs          596±0μs     1.18  offset.ApplyIndex.time_apply_index(<QuarterBegin: startingMonth=3>)
+        1.06±0ms         1.25±0ms     1.18  indexing.CategoricalIndexIndexing.time_getitem_bool_array('non_monotonic')
+        8.69±0ms         10.2±0ms     1.18  frame_methods.Equals.time_frame_float_unequal
+         212±0ms          248±0ms     1.17  groupby.GroupByMethods.time_dtype_as_group('int', 'skew', 'transformation')
+        10.1±0ms         11.8±0ms     1.17  frame_methods.Apply.time_apply_lambda_mean
+        4.00±0ms         4.67±0ms     1.17  offset.OnOffset.time_on_offset(<CustomBusinessMonthEnd>)
+         680±0ms          791±0ms     1.16  io.stata.Stata.time_read_stata('tc')
+        4.02±0ms         4.65±0ms     1.16  offset.OffsetSeriesArithmetic.time_add_offset(<SemiMonthEnd: day_of_month=15>)
+        48.7±0ms         56.2±0ms     1.16  strings.Methods.time_center
+        9.23±0ms         10.5±0ms     1.14  frame_methods.ToString.time_to_string_floats
+         167±0ms          188±0ms     1.13  reindex.DropDuplicates.time_frame_drop_dups_int(False)
+         941±0μs         1.06±0ms     1.13  indexing.CategoricalIndexIndexing.time_getitem_list('monotonic_decr')
+         323±0ms          364±0ms     1.13  frame_methods.Duplicated.time_frame_duplicated
+         421±0ms          471±0ms     1.12  frame_methods.Duplicated.time_frame_duplicated_wide
+        7.80±0ms         8.71±0ms     1.12  groupby.Categories.time_groupby_nosort
+        3.68±0ms         4.10±0ms     1.12  sparse.Arithmetic.time_make_union(0.01, 0)
+        3.70±0ms         4.13±0ms     1.12  sparse.ArithmeticBlock.time_division(0)
+         355±0ms          396±0ms     1.11  gil.ParallelGroupbyMethods.time_loop(4, 'prod')
+         543±0ms          605±0ms     1.11  gil.ParallelGroupbyMethods.time_parallel(8, 'sum')
+        16.7±0ms         18.6±0ms     1.11  io.pickle.Pickle.time_write_pickle
+         1.03±0s          1.15±0s     1.11  groupby.GroupByMethods.time_dtype_as_group('float', 'mad', 'direct')
+        3.70±0ms         4.10±0ms     1.11  timeseries.ToDatetimeCache.time_dup_string_dates_and_format(True)
+        2.88±0ms         3.19±0ms     1.11  stat_ops.FrameOps.time_op('prod', 'float', 0, True)
+         585±0μs          647±0μs     1.11  offset.ApplyIndex.time_apply_index(<QuarterEnd: startingMonth=3>)
+        35.3±0ms         39.0±0ms     1.11  algorithms.Duplicated.time_duplicated_int('last')
+         140±0ms          154±0ms     1.10  sparse.Arithmetic.time_add(0.01, nan)
+         374±0ms          412±0ms     1.10  gil.ParallelGroupbyMethods.time_loop(4, 'sum')
+        26.5±0ms         29.2±0ms     1.10  strings.Methods.time_lstrip
+        47.7±0ms         52.4±0ms     1.10  rolling.Quantile.time_quantile('Series', 10, 'int', 0.5, 'higher')

though I'll be sure to check the asv runner tomorrow.

@TomAugspurger
Copy link
Contributor Author

Merging tomorrow if no objections.

@jreback
Copy link
Contributor

jreback commented Oct 8, 2018

if asv is like that then no objections

@TomAugspurger TomAugspurger merged commit f4fe6e3 into pandas-dev:master Oct 8, 2018
@TomAugspurger TomAugspurger deleted the hashable-ea branch October 8, 2018 21:21
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
* API: ExtensionDtype Equality and Hashability

Closes pandas-dev#22476
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants