Pandas 1.0 no longer handles numpy.str_s as categories #31499

Closed
flying-sheep opened this issue Jan 31, 2020 · 5 comments · Fixed by #31528
Labels: Categorical, Regression

@flying-sheep
Contributor

flying-sheep commented Jan 31, 2020

Code Sample

import numpy as np
import pandas as pd
pd.Categorical(['1', '0', '1'], [np.str_('0'), np.str_('1')])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/angerer/Dev/Python/venvs/env-pandas-1/lib/python3.8/site-packages/pandas/core/arrays/categorical.py", line 385, in __init__
    codes = _get_codes_for_values(values, dtype.categories)
  File "/home/angerer/Dev/Python/venvs/env-pandas-1/lib/python3.8/site-packages/pandas/core/arrays/categorical.py", line 2576, in _get_codes_for_values
    t.map_locations(cats)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1403, in pandas._libs.hashtable.StringHashTable.map_locations
TypeError: Expected unicode, got numpy.str_

Problem description

I know that having a list of numpy.str_s seems weird, but it easily happens when you use non-numpy algorithms on numpy arrays (e.g. natsort.natsorted in our case), or when you iterate over an array in a comprehension:

>>> np.array(['1', '0'])[0].__class__
<class 'numpy.str_'>
>>> [type(s) for s in np.array(['1', '0'])]
[<class 'numpy.str_'>, <class 'numpy.str_'>]

Expected Output

A normal pd.Categorical

Pandas version

pandas 1.0

@TomAugspurger
Contributor

This changed from 0.25.3?

Are you able to pin down what change caused it?

jorisvandenbossche added the Categorical and Regression labels Jan 31, 2020
jorisvandenbossche added this to the 1.0.1 milestone Jan 31, 2020
@jorisvandenbossche
Member

Yes, in 0.25.3 this worked.

At least, it gives a Categorical whose object-dtype categories hold those numpy strings. The Series constructor likewise preserves numpy strings and doesn't convert them to Python strings, so that seems to be the "expected" behaviour.

@jorisvandenbossche
Member

My guess is that it's related to #30419, which changed the get_c_string implementation (used in StringHashTable to get the C string from the string object). cc @jbrockmendel

@jbrockmendel
Member

Maybe, if we're lucky, it will be good enough to change L706 in hashtable_class_helper.pxi.in from v = get_c_string(val) to v = get_c_string(<str>val). But this is really a PITA, because the previous line is precisely a check for isinstance(val, str), which is True for np.str_ objects.
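To make that point concrete, here is a small sketch (not taken from the thread) of why a subclass check lets np.str_ through while an exact-type check would not. The helper name `describe` is hypothetical and exists only for illustration.

```python
import numpy as np


def describe(val):
    """Hypothetical helper: return (passes isinstance check, is exactly str)."""
    return isinstance(val, str), type(val) is str


# A plain Python string passes both checks.
print(describe("0"))           # (True, True)

# np.str_ subclasses str, so isinstance() accepts it,
# but its exact type is numpy.str_, not str.
print(describe(np.str_("0")))  # (True, False)
```

So code that requires an exact unicode object can still fail after an isinstance(val, str) guard has passed.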

@flying-sheep
Contributor Author

Yeah, it’s kinda shitty. I think implicit conversion would be better than a hard-to-interpret error here though.
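Until a fix lands, one possible user-side workaround is to do that conversion explicitly before building the Categorical. This is only a sketch under that assumption, not the fix adopted in pandas; `to_plain_str` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def to_plain_str(values):
    """Hypothetical helper: turn str subclasses (e.g. np.str_) into exact str."""
    return [str(v) if isinstance(v, str) else v for v in values]


# Categories as they might come out of a non-numpy algorithm run on an array.
cats = [np.str_("0"), np.str_("1")]

cat = pd.Categorical(["1", "0", "1"], categories=to_plain_str(cats))
print(cat.codes.tolist())  # [1, 0, 1]
```

Non-string values are passed through unchanged, so the helper only touches the case described in this issue.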
