ERR: Categoricals should not allow non-strings when an object dtype is passed #13919

Closed
hnykda opened this issue Aug 5, 2016 · 12 comments
Labels
Categorical (Categorical Data Type), Closing Candidate (May be closeable, needs more eyeballs), Enhancement, Error Reporting (Incorrect or improved errors from pandas), good first issue, IO HDF5 (read_hdf, HDFStore)

Comments

@hnykda

hnykda commented Aug 5, 2016

HDF Store (with table) doesn't support categories with mixed type inside a category

Even though it is possible to store category type using to_hdf with table, you can't do that when you have mixed types inside the category. It would be nice to mention this at least on docs.

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'a':[1, 'string']})
df['a'] = df['a'].astype('category')
df.to_hdf('play.hdf', 'main', format='table')

the error:

TypeError: Cannot serialize the column [values] because
its data contents are [mixed-integer] object dtype

Expected Output

No error, saved file.
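
For readers hitting the same error, here is a minimal workaround sketch. It assumes that turning every value in the column into a string is acceptable for your data; it is not something proposed in this thread, and it changes the values (1 becomes '1'). With string-only categories the column should then be writable in table format.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, "string"]})

# Assumption: stringifying the values is acceptable for this column.
# Afterwards the categories are all strings instead of a mix of int and str.
df["a"] = df["a"].astype(str).astype("category")

print(list(df["a"].cat.categories))

# Writing requires PyTables, so the call is left commented out here:
# df.to_hdf("play.hdf", key="main", format="table")
```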

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.6.4-1-ARCH
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: None
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: 1.4.1
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.5.1
matplotlib: 1.5.1
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
@jreback
Contributor

jreback commented Aug 5, 2016

this is in general a completely bad idea. categories are by-definition single dtypes (and not object), except for strings.

These should raise actually on creation.

jreback added the Error Reporting (Incorrect or improved errors from pandas) and Categorical (Categorical Data Type) labels Aug 5, 2016
jreback added this to the Next Major Release milestone Aug 5, 2016
@jreback
Contributor

jreback commented Aug 5, 2016

if you'd like to push a PR for error reporting in categorical creation that would be great.

jreback changed the title from "HDF Store (with table) doesn't support categories with mixed type inside a category" to "ERR: categories should not all non-strings when an object dtype is passed" Aug 5, 2016
jreback changed the title from "ERR: categories should not all non-strings when an object dtype is passed" to "ERR: Categoricals should not allow non-strings when an object dtype is passed" Aug 5, 2016
@hnykda
Author

hnykda commented Aug 5, 2016

Yeah, I agree with you that it should be raised on creation.

I had thousands of columns loaded from CSV and was converting them automatically to categories based on something like len(column.unique())>3: column.astype('category'). When a column contains e.g. np.nan and 'string', it leads to what I have reported.

Will think about doing that PR.

Thanks Jeff!
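
A bulk conversion like the one described above could skip columns whose values are of mixed types. The sketch below is only an illustration: `categorize_homogeneous` is a hypothetical helper, not part of pandas, and it relies on `pandas.api.types.infer_dtype` reporting mixed columns with a label that starts with "mixed".

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype


def categorize_homogeneous(df, columns):
    """Convert `columns` to category, skipping columns with mixed value types.

    Hypothetical helper for illustration; returns the converted copy and
    the list of column names that were left untouched.
    """
    out = df.copy()
    skipped = []
    for name in columns:
        # skipna=True ignores missing values when inferring the kind.
        kind = infer_dtype(out[name], skipna=True)
        if kind.startswith("mixed"):
            skipped.append(name)
        else:
            out[name] = out[name].astype("category")
    return out, skipped


df = pd.DataFrame({"ok": ["x", "y", np.nan], "bad": [1, "string", 2]})
converted, skipped = categorize_homogeneous(df, ["ok", "bad"])
print(skipped)
```

Whether skipping, raising, or stringifying is the right reaction depends on the data; the helper only shows where such a check could sit.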

@hnykda
Author

hnykda commented Aug 7, 2016

Sorry, I can't do it in the near future (say, a month). But I am marking it down and may come back to it. If anyone wants to help here, it's very welcome.

@wcwagner
Contributor

@jreback Can you clarify what you mean by saying "Categoricals should not allow non-strings when an object dtype is passed"?

My PR above takes this literally, which breaks many tests, particularly here. That list comprehension passes in arr to Categorical.from_array, which is often of type object, but all the individual values are homogeneous.

In my PR, should I just check if all the values are of the same type?

Thanks

@hnykda
Author

hnykda commented Aug 18, 2016

Hello @wcwagner. Thank you for taking a look at this.

I believe that it was more like the second option: individual values should be homogeneous (in terms of dtypes). So don't allow something like this: 1, '2', 3 (mixing int with string), while '1', '2', '3' (all str/objects) or 1, 2, 3 (all ints) are valid.

If I were to implement it extremely naively, I would do something like:

# Collect the distinct Python types among the categories
categories_types = {type(x) for x in categories}
if len(categories_types) > 1:
    raise ValueError('Categories must be all of the same type. They are %s' % categories_types)
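
A self-contained version of this naive idea, run on the three examples from the comment. The function name `check_same_type` is hypothetical, and comparing exact Python types is an assumption of the sketch: it would also reject mixes such as int with bool, or Python ints with NumPy integers, which may or may not be what is wanted.

```python
def check_same_type(categories):
    """Raise ValueError if the categories are not all of one Python type."""
    types = {type(x) for x in categories}
    if len(types) > 1:
        names = ", ".join(sorted(t.__name__ for t in types))
        raise ValueError(
            "Categories must all be of the same type; got: %s" % names
        )


check_same_type(["1", "2", "3"])  # all str: passes
check_same_type([1, 2, 3])        # all int: passes

try:
    check_same_type([1, "2", 3])  # int mixed with str: rejected
except ValueError as exc:
    message = str(exc)
    print(message)
```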

@jreback
Contributor

jreback commented Aug 18, 2016

@hnykda for object dtypes, we don't allow an inferred type other than string, unicode, or period (period will be removed later, pls add a TODO).

In [9]: pd.lib.infer_dtype([1, '2', 3.0])
Out[9]: 'mixed-integer'

In [10]: pd.lib.infer_dtype([pd.Timestamp('20130101'), 3.0])
Out[10]: 'mixed'

In [12]: pd.lib.infer_dtype(['foo', 'bar'])
Out[12]: 'string'

In [13]: pd.lib.infer_dtype([u'foo', u'bar'])
Out[13]: 'unicode'

In [16]: pd.lib.infer_dtype([pd.Period('2016','M')])
Out[16]: 'period'

note this should only be done on the categories as these are already coerced as much as possible.
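
The session above uses `pd.lib.infer_dtype`, which is not available in current pandas releases; the public equivalent is `pandas.api.types.infer_dtype`. The sketch below applies it to the categories only, as suggested. `categories_kind` is a hypothetical helper name, and the inputs reuse the example from the original report.

```python
import pandas as pd
from pandas.api.types import infer_dtype


def categories_kind(series):
    """Return the inferred kind of a categorical Series' categories."""
    return infer_dtype(series.cat.categories, skipna=True)


mixed = pd.Series([1, "string"]).astype("category")
strings = pd.Series(["foo", "bar"]).astype("category")

mixed_kind = categories_kind(mixed)
string_kind = categories_kind(strings)
print(mixed_kind, string_kind)
```

On Python 3 there is no separate 'unicode' result; text categories are reported as 'string'.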

@jorisvandenbossche
Member

Repeating what I said in the PR (#14047): personally, I don't think we should check this at Categorical construction, I would rather check for this in the hdf code itself.

@jreback
Contributor

jreback commented Sep 13, 2016

I don't think it's ever useful to support mixed dtypes inside a Categorical even if it's technically possible.

@jorisvandenbossche
Member

Given the comments on the PR, it's not technically impossible to disallow, but it would make the implementation (which is mixed with MultiIndex) more complex.
But, more fundamentally, I don't see any reason to disallow it, even if we could. I could for example imagine a case where you want to put custom objects (eg your own Interval objects) as categories, which is something we would (currently) identify as mixed.
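
To illustrate the custom-object case, here is a small sketch. `MyInterval` is a hypothetical user-defined class, not a pandas type. The point is only that `infer_dtype` has no specific label for such objects, so a check based on the inferred type would treat them like genuinely mixed data.

```python
import pandas as pd
from pandas.api.types import infer_dtype


class MyInterval:
    """Hypothetical user-defined interval; hashable by identity."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __repr__(self):
        return "MyInterval(%r, %r)" % (self.left, self.right)


a = MyInterval(0, 1)
b = MyInterval(1, 2)

# All values share one Python type, yet pandas reports the generic label.
kind = infer_dtype([a, b], skipna=True)

# An unordered Categorical can still be built from these objects.
cat = pd.Categorical([a, b, a])
print(kind, len(cat.categories))
```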

@jreback
Contributor

jreback commented Sep 14, 2016

well, we are not tested at all for mixed type categoricals. I think it's pretty reasonable to disallow them; makes them easier to deal with, more meaningful and pure.

@jbrockmendel
Member

I'm with @jorisvandenbossche on this one. pd.Categorical can accept pretty much anything that pd.Index can accept.

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
jbrockmendel added the Closing Candidate (May be closeable, needs more eyeballs) label Apr 20, 2023
7 participants