Commit e8ad6ad

Hacky workaround hashing mixed types
1 parent e966659 commit e8ad6ad

5 files changed: +42 -9 lines changed

doc/source/advanced.rst

+1 -1
@@ -654,7 +654,7 @@ setting the index of a ``DataFrame/Series`` with a ``category`` dtype would conv
 
 df = pd.DataFrame({'A': np.arange(6),
 'B': list('aabbca')})
-df['B'] = df['B'].astype('category', categories=list('cab'))
+df['B'] = df['B'].astype(pd.CategoricalDtype(list('cab')))
 df
 df.dtypes
 df.B.cat.categories
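
The hunk above swaps the keyword form of `astype('category', ...)` for an explicit `pd.CategoricalDtype`. As a minimal sketch of what the documented snippet leads up to (the `df2` name is introduced here for illustration and is not part of the commit), setting that column as the index should yield a `CategoricalIndex` that keeps the requested category order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(6), 'B': list('aabbca')})

# Categories are declared in the order c, a, b, independent of the data.
df['B'] = df['B'].astype(pd.CategoricalDtype(list('cab')))

# df2 is a hypothetical name: the frame re-indexed by the categorical column.
df2 = df.set_index('B')

print(type(df2.index).__name__)
print(list(df2.index.categories))
```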

doc/source/categorical.rst

+5 -5
@@ -239,7 +239,7 @@ It's also possible to pass in the categories in a specific order:
 
 .. ipython:: python
 
-s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
+s = pd.Series(list('babc')).astype(pd.CategoricalDtype(list('abcd')))
 s
 
 # categories
@@ -356,7 +356,7 @@ meaning and certain operations are possible. If the categorical is unordered, ``
 
 s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
 s.sort_values(inplace=True)
-s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
+s = pd.Series(["a","b","c","a"]).astype(pd.CategoricalDtype(ordered=True))
 s.sort_values(inplace=True)
 s
 s.min(), s.max()
@@ -456,9 +456,9 @@ categories or a categorical with any list-like object, will raise a TypeError.
 
 .. ipython:: python
 
-cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
-cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
-cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
+cat = pd.Series([1,2,3]).astype(pd.CategoricalDtype([3, 2, 1], ordered=True))
+cat_base = pd.Series([2,2,2]).astype(pd.CategoricalDtype([3, 2, 1], ordered=True))
+cat_base2 = pd.Series([2,2,2]).astype(pd.CategoricalDtype(ordered=True))
 
 cat
 cat_base
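
The last hunk above sets up three ordered categoricals for the comparison section of the docs. The sketch below shows how they behave when compared. The names `order`, `greater` and `mismatch_raises` are illustrative additions, not part of the commit. With the order 3 < 2 < 1, the value 1 ranks highest, and comparing against a categorical whose categories differ is expected to raise a `TypeError`, as the hunk header states.

```python
import pandas as pd

# One shared dtype object replaces the repeated keyword arguments.
order = pd.CategoricalDtype([3, 2, 1], ordered=True)

cat = pd.Series([1, 2, 3]).astype(order)
cat_base = pd.Series([2, 2, 2]).astype(order)
# No categories given: they are inferred from the data, so only [2].
cat_base2 = pd.Series([2, 2, 2]).astype(pd.CategoricalDtype(ordered=True))

# Element-wise comparison uses the category order, not numeric order.
greater = cat > cat_base

# Different categories on each side cannot be compared.
try:
    cat > cat_base2
    mismatch_raises = False
except TypeError:
    mismatch_raises = True

print(greater.tolist(), mismatch_raises)
```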

doc/source/merging.rst

+4 -3
@@ -831,7 +831,7 @@ The left frame.
 .. ipython:: python
 
 X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
-X = X.astype('category', categories=['foo', 'bar'])
+X = X.astype(pd.CategoricalDtype(categories=['foo', 'bar']))
 
 left = pd.DataFrame({'X': X,
 'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +842,9 @@ The right frame.
 
 .. ipython:: python
 
-right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
-'Z': [1, 2]})
+right = pd.DataFrame(
+{'X': pd.Series(['foo', 'bar'], dtype=pd.CategoricalDtype(['foo', 'bar'])),
+'Z': [1, 2]})
 right
 right.dtypes
 
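
The two frames in this hunk share a categorical key column. The following is a small deterministic sketch (fixed data instead of `np.random.choice`, and a hypothetical shared `key_dtype` variable) of what merging them is expected to look like. It assumes, as the pandas merging guide describes, that the key keeps its categorical dtype when both sides use identical categories.

```python
import pandas as pd

# key_dtype is an illustrative name; both frames reuse the same dtype object.
key_dtype = pd.CategoricalDtype(categories=['foo', 'bar'])

left = pd.DataFrame({
    'X': pd.Series(['foo', 'bar', 'foo'], dtype=key_dtype),
    'Y': ['one', 'two', 'three'],
})
right = pd.DataFrame({
    'X': pd.Series(['foo', 'bar'], dtype=key_dtype),
    'Z': [1, 2],
})

# Merge on the categorical key; every left row has a match on the right.
result = pd.merge(left, right, on='X', how='outer')
print(result.dtypes)
```

Passing `dtype=` to the `Series` constructor, as the new `right` frame does, avoids building an object-dtype series first and converting it afterwards.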

doc/source/whatsnew/v0.21.0.txt

+22 -0
@@ -22,6 +22,8 @@ Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations
 New features
 ~~~~~~~~~~~~
 
+- New user-facing :class:`CategoricalDtype` for specifying categorical independent
+of the data (:issue:`14711`, :issue:`15078`)
 - Support for `PEP 519 -- Adding a file system path protocol
 <https://www.python.org/dev/peps/pep-0519/>`_ on most readers and writers (:issue:`13823`)
 - Added ``__fspath__`` method to :class:`~pandas.HDFStore`, :class:`~pandas.ExcelFile`,
@@ -106,6 +108,26 @@ This does not permit that column to be accessed as an attribute:
 
 Both of these now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.
 
+.. _whatsnew_0210.enhancements.categorical_dtype:
+
+``CategoricalDtype`` for specifying categoricals
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`CategoricalDtype` has been added to the public API and expanded to
+include the ``categories`` and ``ordered`` attributes. A ``CategoricalDtype``
+can be used to specify the set of categories and orderedness of an array,
+independent of the data themselves. This can be useful, e.g., when converting
+string data to a ``Categorical``:
+
+.. ipython:: python
+
+s = pd.Series(['a', 'b', 'c', 'a']) # strings
+dtype = pd.CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
+s.astype(dtype)
+
+The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
+``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
+
 .. _whatsnew_0210.enhancements.other:
 
 Other Enhancements
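
The whatsnew entry added above makes two claims: a `CategoricalDtype` carries `categories` and `ordered` independently of the data, and `.dtype` on categorical objects returns such an instance. A minimal check of both, reusing the entry's own example (the `converted` name is an illustrative addition):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'])  # plain strings
dtype = pd.CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)

converted = s.astype(dtype)

# The dtype travels with the result, including the unused category 'd'.
print(isinstance(converted.dtype, pd.CategoricalDtype))
print(list(converted.dtype.categories), converted.dtype.ordered)

# The same holds for a bare Categorical and a CategoricalIndex.
print(isinstance(pd.Categorical(['a', 'b']).dtype, pd.CategoricalDtype))
print(isinstance(pd.CategoricalIndex(['a', 'b']).dtype, pd.CategoricalDtype))
```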

pandas/core/dtypes/dtypes.py

+10 -0
@@ -241,6 +241,16 @@ def _hash_categories(categories, ordered=True):
             categories = list(categories)  # breaks if a np.array of categories
             cat_array = hash_tuples(categories)
         else:
+            if categories.dtype == 'O':
+                types = [type(x) for x in categories]
+                if not len(set(types)) == 1:
+                    # TODO: hash_array doesn't handle mixed types. It casts
+                    # everything to a str first, which means we treat
+                    # {'1', '2'} the same as {'1', 2}
+                    # find a better solution
+                    cat_array = np.array([hash(x) for x in categories])
+                    hashed = hash((tuple(categories), ordered))
+                    return hashed
             cat_array = hash_array(np.asarray(categories), categorize=False)
         if ordered:
             cat_array = np.vstack([
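
The TODO in this hunk describes a collision: if every category is cast to `str` before hashing, `{'1', '2'}` and `{'1', 2}` look the same. The sketch below reproduces that reasoning in plain Python. Note that `str_cast_key` is a hypothetical stand-in for the casting behavior the comment describes, not pandas' `hash_array`, and `tuple_hash_key` only mirrors the fallback expression visible in the diff.

```python
def str_cast_key(categories):
    """Hypothetical stand-in: identify categories only by their str() form."""
    return tuple(str(x) for x in categories)


def tuple_hash_key(categories, ordered):
    """Mirrors the fallback in the hunk: builtin hash of (tuple, ordered)."""
    return hash((tuple(categories), ordered))


all_strings = ['1', '2']
mixed = ['1', 2]

# The check the hunk uses to detect mixed types in an object array.
mixed_types = len({type(x) for x in mixed}) != 1

# After casting to str, the two category lists are indistinguishable.
collides = str_cast_key(all_strings) == str_cast_key(mixed)

# The tuples themselves stay distinct, so hashing them keeps the type info.
distinct_inputs = (tuple(all_strings), True) != (tuple(mixed), True)

# Equal inputs hash equally within one process.
stable = tuple_hash_key(mixed, True) == tuple_hash_key(list(mixed), True)

print(mixed_types, collides, distinct_inputs, stable)
```

One caveat of this approach: the builtin `hash` of strings is salted per interpreter process unless `PYTHONHASHSEED` is fixed, so a value produced this way is only comparable within a single process.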
