Previously, categorical values were hashed using only their codes. The
hash value therefore depended on the ordering of the categories rather
than on the values the series represented. This caused problems in
dask, where different partitions might have different categorical
mappings.
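
To make the problem concrete, here is a minimal pure-Python sketch of how integer codes depend on category order. The `encode` helper is hypothetical and only mimics the idea of categorical codes (each value's position in the category list); it is not the pandas implementation.

```python
def encode(values, categories):
    """Return integer codes: each value's position in `categories`.

    Hypothetical stand-in for how a categorical stores its data.
    """
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values]


values = ["a", "b", "a", "c"]

# Same values, but two partitions that happen to order the categories differently
codes_1 = encode(values, ["a", "b", "c"])
codes_2 = encode(values, ["c", "b", "a"])

print(codes_1)  # [0, 1, 0, 2]
print(codes_2)  # [2, 1, 2, 0]
```

Any hash computed from the codes alone would differ between the two partitions, even though they represent identical values.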
This PR makes the hashing depend on the values the categorical
represents, rather than on the codes. The categories are hashed first,
and then the codes are remapped to the hashed values. This is slightly
slower than before, because the categories now have to be hashed as
well, but it allows for more consistent hashing.
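
The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `hash_value` and `hash_categorical` are hypothetical names, `blake2b` stands in for whatever hash function the library actually uses, and the hash assigned to missing values is an arbitrary placeholder.

```python
import hashlib


def hash_value(value):
    """Deterministic 64-bit hash of a value's string form (stand-in hash)."""
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def hash_categorical(codes, categories, missing_hash=0):
    """Hash a categorical by value: hash each category once, then remap codes.

    A code of -1 is treated as a missing value (the pandas convention);
    `missing_hash` is an assumed placeholder for it.
    """
    category_hashes = [hash_value(cat) for cat in categories]
    return [category_hashes[c] if c >= 0 else missing_hash for c in codes]


# The same values ["a", "b", "a", "c"] under two different category orderings
hashes_1 = hash_categorical([0, 1, 0, 2], ["a", "b", "c"])
hashes_2 = hash_categorical([2, 1, 2, 0], ["c", "b", "a"])

print(hashes_1 == hashes_2)  # True
```

The extra cost is one hash per category rather than per row, which is why the slowdown is small when there are far fewer categories than rows.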
doc/source/whatsnew/v0.20.0.txt (+2 -1)
@@ -309,6 +309,7 @@ Bug Fixes
 - Bug in ``pd.read_csv()`` in which the ``dialect`` parameter was not being verified before processing (:issue:`14898`)
 - Bug in ``pd.read_fwf`` where the skiprows parameter was not being respected during column width inference (:issue:`11256`)
 - Bug in ``pd.read_csv()`` in which missing data was being improperly handled with ``usecols`` (:issue:`6710`)
+- Bug in ``pandas.tools.hashing.hash_pandas_object`` in which hashing of categoricals depended on the ordering of categories, instead of just their values.
 
 - Bug in ``DataFrame.loc`` with indexing a ``MultiIndex`` with a ``Series`` indexer (:issue:`14730`)
 
@@ -369,4 +370,4 @@ Bug Fixes
 - Bug in ``Series`` constructor when both ``copy=True`` and ``dtype`` arguments are provided (:issue:`15125`)
 - Bug in ``pd.read_csv()`` for the C engine where ``usecols`` were being indexed incorrectly with ``parse_dates`` (:issue:`14792`)
 
-- Bug in ``Series.dt.round`` inconsistent behaviour on NAT's with different arguments (:issue:`14940`)
+- Bug in ``Series.dt.round`` inconsistent behaviour on NAT's with different arguments (:issue:`14940`)