Previously categorical values were hashed using just their codes. This
meant that the hash value depended on the ordering of the categories,
rather than on the values the series represented. This caused problems
in dask, where different partitions might have different categorical
mappings. This PR makes the hashing dependent on the values the
categorical represents, rather than on the codes. The categories are
first hashed, and then the codes are remapped to the hashed values.
This is slightly slower than before (the categories now have to be
hashed as well, which was not needed previously), but allows for more
consistent hashing. Related to this work in dask:
dask/dask#1877.
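
The difference between the two strategies can be illustrated with a minimal, standard-library-only sketch. It is not the pandas implementation: `hash_value`, `hash_by_codes`, `hash_by_values`, and the `MISSING` sentinel are hypothetical names introduced here, and a categorical is modelled simply as a list of categories plus a list of integer codes (with `-1` assumed to mean a missing value).

```python
import hashlib

# Arbitrary sentinel hash for missing values (code -1); an assumption of this sketch.
MISSING = 2**64 - 1


def hash_value(value):
    """Deterministic 64-bit hash of one value (stand-in for a real array hasher)."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def hash_by_codes(categories, codes):
    """Old approach: hash the integer codes, ignoring what they stand for."""
    return [hash_value(code) for code in codes]


def hash_by_values(categories, codes):
    """New approach: hash each category once, then remap codes to those hashes."""
    hashed_categories = [hash_value(cat) for cat in categories]
    return [hashed_categories[code] if code >= 0 else MISSING for code in codes]


# Two encodings of the same values ["a", "b", "a"] with different category order,
# as might happen across partitions.
cats1, codes1 = ["a", "b"], [0, 1, 0]
cats2, codes2 = ["b", "a"], [1, 0, 1]

# Code-based hashes disagree although the values are identical.
assert hash_by_codes(cats1, codes1) != hash_by_codes(cats2, codes2)
# Value-based hashes agree regardless of category ordering.
assert hash_by_values(cats1, codes1) == hash_by_values(cats2, codes2)
```

The extra cost mentioned above corresponds to the `hashed_categories` step: each category is hashed once, after which the per-row work is only a lookup.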
Author: Jim Crist <[email protected]>
Closes pandas-dev#15143 from jcrist/categories_hash_consistently and squashes the following commits:
f1aea13 [Jim Crist] Address comments
7878c55 [Jim Crist] Categoricals hash consistently
doc/source/whatsnew/v0.20.0.txt
@@ -326,6 +326,7 @@ Bug Fixes
 - Bug in ``pd.read_csv()`` in which the ``dialect`` parameter was not being verified before processing (:issue:`14898`)
 - Bug in ``pd.read_fwf`` where the skiprows parameter was not being respected during column width inference (:issue:`11256`)
 - Bug in ``pd.read_csv()`` in which missing data was being improperly handled with ``usecols`` (:issue:`6710`)
+- Bug in ``pd.tools.hashing.hash_pandas_object()`` in which hashing of categoricals depended on the ordering of categories, instead of just their values. (:issue:`15143`)
 
 - Bug in ``DataFrame.loc`` with indexing a ``MultiIndex`` with a ``Series`` indexer (:issue:`14730`)