hash_pandas_object on ExtensionArray-backed Series fails with TypeError #23066
Slightly related, this is also an issue with datetime / datetimetz:
In [86]: ns = np.array([946706400000000000, 946792800000000000, 946879200000000000,
    ...:                946965600000000000])
    ...:
In [87]: a = pd.Series(pd.DatetimeIndex(ns))
In [88]: b = pd.Series(pd.DatetimeIndex(ns, tz='UTC'))
In [89]: pd.util.testing.assert_series_equal(pd.util.hash_pandas_object(a), pd.util.hash_pandas_object(b))
Hmm, maybe we just don't care about different dtypes having the same hashed values (which is perfectly fair):
In [92]: a = pd.Series(pd.Categorical([0, 0, 1, 2]))
In [93]: b = pd.Series([0, 0, 1, 2])
In [94]: pd.util.hash_pandas_object(a)
Out[94]:
0 3713087409444908179
1 7478705303072568462
2 3975671353655200382
3 3563156779521628949
dtype: uint64
In [95]: pd.util.hash_pandas_object(b)
Out[95]:
0 3713087409444908179
1 7478705303072568462
2 3975671353655200382
3 3563156779521628949
dtype: uint64
IIRC there is an issue about datetime hashing from a while ago.
Looks like #16372. So we have two issues
Let's focus on the first issue here. I think EAs need some kind of way to say what values are hashed. Performance seems too critical to just
Right now I'm leaning toward 2.
Also, what would be the reason not to use
Re-using _values_for_factorize should be fine.
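A minimal sketch of that idea, using a toy class rather than a real pandas ExtensionArray. `ToyPeriodArray` and `hash_via_factorize_values` are hypothetical names made up for illustration; only `_values_for_factorize` (a method of the ExtensionArray interface) and `pd.util.hash_array` come from pandas.

```python
import numpy as np
import pandas as pd


class ToyPeriodArray:
    """Hypothetical stand-in for an ExtensionArray; not a real pandas class."""

    def __init__(self, ordinals, freq):
        self._ordinals = np.asarray(ordinals, dtype="int64")
        self.freq = freq

    def _values_for_factorize(self):
        # Mirrors the ExtensionArray method: (ndarray of values, NA sentinel).
        return self._ordinals, np.iinfo("int64").min


def hash_via_factorize_values(arr):
    """Hash whatever ndarray the array already exposes for factorization."""
    values, _na_value = arr._values_for_factorize()
    return pd.util.hash_array(values)


daily = ToyPeriodArray([0, 1, 2], freq="D")
monthly = ToyPeriodArray([0, 1, 2], freq="M")
daily_hashes = hash_via_factorize_values(daily)
monthly_hashes = hash_via_factorize_values(monthly)
```

Because only the ordinals are hashed here, `daily_hashes` and `monthly_hashes` come out identical even though the two arrays carry different frequencies.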
It looks like
This is exposed as an API for other libraries (dask).
Options
_hash_values to the interface. But how do we prevent hash collisions between similar but different EAs? For example, the fastest hash for a PeriodArray would be to just hash the ordinals. But we wouldn't want the following two to hash identically (using my PeriodArray branch)
So we need to mix the dtype information in too.
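One way to picture "mixing the dtype information in" is sketched below. This is only an illustration under stated assumptions: `hash_with_dtype` is a hypothetical helper, the dtype is represented by a plain string name, and XOR is just one simple combining step, not necessarily what pandas does or should do. Only `pd.util.hash_array` is an existing pandas function.

```python
import numpy as np
import pandas as pd


def hash_with_dtype(values, dtype_name):
    """Hypothetical helper: hash the raw values, then mix in the dtype name."""
    value_hashes = pd.util.hash_array(np.asarray(values))
    # Hash the dtype name once with the same hashing routine.
    dtype_hash = pd.util.hash_array(np.array([dtype_name], dtype=object))[0]
    # XOR is one simple way to combine; an assumption, not pandas' own scheme.
    return value_hashes ^ dtype_hash


# Same ordinals, different (assumed) dtype names.
ordinals = np.array([0, 1, 2], dtype="int64")
daily = hash_with_dtype(ordinals, "period[D]")
monthly = hash_with_dtype(ordinals, "period[M]")
```

With distinct dtype names, the two results should no longer match element-wise, while the per-element cost stays close to hashing the ordinals alone.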