Stata read categoricals gives ValueError: Categorical categories must be unique #13923

Closed
ozak opened this issue Aug 6, 2016 · 13 comments
Labels: Error Reporting (Incorrect or improved errors from pandas), IO Stata (read_stata, to_stata)
Milestone

Comments


ozak commented Aug 6, 2016

I am trying to open some Stata files generated in IPUMS International, but I am getting a ValueError: Categorical categories must be unique. I opened the file in Stata and could not find a repeated category for the column I am trying to import. I had similar issues with other datasets from the same source, which seemed to be caused by missing values, but that does not seem to be the case here. Here's the link to the file I am trying to read.

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.read_stata('ipumsi_00014.dta', columns=['ethnicsn'])

Expected Output

df.shape = (1694761,1)

output of pd.show_versions()

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 18.2
Cython: 0.22.1
numpy: 1.11.1
scipy: 0.15.1
statsmodels: 0.6.1
xarray: None
IPython: 3.2.1
sphinx: None
patsy: 0.2.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 0.8.0
tables: None
numexpr: 2.4
matplotlib: 1.4.3
openpyxl: 2.1.3
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.4
lxml: 3.3.5
bs4: 4.3.2
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: 2.5.3 (dt dec pq3 ext)
jinja2: 2.7.3
boto: 2.34.0
pandas_datareader: None
Author

ozak commented Aug 6, 2016

Here's the link to a file with a column with only categoricals (non-repeated) and the same issue arises.

@jreback jreback added the IO Stata read_stata, to_stata label Aug 6, 2016
Contributor

jreback commented Aug 6, 2016

cc @bashtage
cc @kshedden

Contributor

bashtage commented Aug 8, 2016

The value labels for the data posted are:

{101: 'bainouk',
 102: 'badiaranke',
 ...
 111: 'diola',
 112: 'fulani',
 113: 'wolof',
 114: 'laobe',
 115: 'lebou',
 ...
 128: 'tandanke',
 129: 'toucouleur',
 130: 'wolof',
 131: 'khassonke',
 ...
 398: 'other countries',
 999: 'unknown'}

Note 130 and 113: wolof. This is the problem. I'm not sure what the correct behavior is here, since there are two numeric values that map to the same name. Categoricals don't understand this, since there must be a 1-to-1 mapping between the underlying numeric store and the labels.
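
The constraint can be reproduced without the Stata file. A minimal sketch, assuming a three-entry excerpt of the value labels above; the observations are made up for illustration:

```python
import pandas as pd

# Excerpt of the value labels listed above; 113 and 130 share a label.
value_labels = {112: "fulani", 113: "wolof", 130: "wolof"}

# Made-up observations, for illustration only.
observed = ["wolof", "fulani"]

try:
    # Using the labels as categories fails because "wolof" appears twice.
    pd.Categorical(observed, categories=list(value_labels.values()))
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

The failure comes from the list of categories alone, not from the observations, which is why the column looks clean when inspected in Stata.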

Contributor

bashtage commented Aug 8, 2016

I suppose the error could be trapped and a more meaningful error possibly with a report could be returned.

Author

ozak commented Aug 8, 2016

I thought it was having trouble due to possibly repeated numbering of the categories. Not sure why it should care if the label values (strings) are repeated, since I imagine the categories should work regardless of the value of the label value, no? Or am I missing something? I agree that a better error message and even a print out of the repeated categories would be very useful.
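
A report of the repeated categories could look roughly like the following. This is only a sketch: `find_repeated_labels` is a hypothetical helper, not part of pandas and not necessarily what the eventual error message does, and the dictionary is an excerpt of the labels quoted earlier in the thread.

```python
from collections import defaultdict


def find_repeated_labels(value_labels):
    """Return {label: [codes]} for every label attached to more than one code."""
    codes_by_label = defaultdict(list)
    for code, label in value_labels.items():
        codes_by_label[label].append(code)
    return {
        label: sorted(codes)
        for label, codes in codes_by_label.items()
        if len(codes) > 1
    }


# Excerpt of the value labels from the posted file.
value_labels = {112: "fulani", 113: "wolof", 114: "laobe", 130: "wolof"}
repeated = find_repeated_labels(value_labels)
print(repeated)  # {'wolof': [113, 130]}
```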

Thanks for the help!

Author

ozak commented Aug 8, 2016

Another possibility would be to allow Pandas to read the categories as strings.

Contributor

bashtage commented Aug 8, 2016

Stata stores value-labeled variables as a number plus a label. Your data has 2 numeric values that correspond to the same label. In pandas, each category label corresponds to exactly one underlying integer code, so there is no way for a categorical to have 2 values with the same label. Stata value labels are not equivalent to pandas categoricals, only close. This is a case where the difference matters.

Contributor

bashtage commented Aug 8, 2016

BTW, you can use convert_categoricals=False to read the data and return the integer values. You can also use StataReader to access the value labels. From these you can do anything you want with the data.
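
One thing you might do with the two pieces is turn the integer codes into plain strings. A minimal sketch, assuming the codes and labels are already in memory; the `codes` Series is a made-up stand-in for a column read with convert_categoricals=False, and the dictionary is an excerpt of the labels above:

```python
import pandas as pd

# Excerpt of the value labels; 113 and 130 both map to "wolof".
value_labels = {112: "fulani", 113: "wolof", 130: "wolof", 999: "unknown"}

# Made-up stand-in for the integer column returned by read_stata.
codes = pd.Series([113, 130, 112, 999, 113], name="ethnicsn")

# Map each integer code to its label, giving a column of strings.
labels = codes.map(value_labels)

# Optionally convert to a categorical; the two "wolof" codes collapse into one category.
as_category = labels.astype("category")

print(labels.tolist())
print(list(as_category.cat.categories))
```

Note that after the collapse, rows coded 113 and 130 can no longer be told apart, so keep the integer column if that distinction matters.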

Contributor

jreback commented Aug 8, 2016

so this looks like buggy Stata behavior? @bashtage

yeah I would prob raise here (if convert_categoricals is set), let the user figure it out. (I suppose you could just turn categorical conversion off and show a warning).

@jreback jreback added Error Reporting Incorrect or improved errors from pandas Difficulty Intermediate labels Aug 8, 2016
@jreback jreback added this to the Next Major Release milestone Aug 8, 2016
Author

ozak commented Aug 9, 2016

Thanks for the suggestions @bashtage ...do you have an example on how one would access those labels?

Contributor

bashtage commented Aug 9, 2016

It is not buggy Stata behavior so much as just different. In Stata, value labels are just labels and are not actually values. Pandas doesn't have the concept of a labeled Series, and so there is no way to map between the two perfectly.

Contributor

bashtage commented Aug 9, 2016

Thanks for the suggestions @bashtage ...do you have an example on how one would access those labels?

Using your short file:

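import pandas as pd

# Read the raw integer codes instead of converting to categoricals.
df = pd.read_stata('ipumsi_00014_ethn.dta', convert_categoricals=False)

# Open the file again just to pull out the value labels.
sr = pd.io.stata.StataReader('ipumsi_00014_ethn.dta')
vl = sr.value_labels()
sr.close()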

Then you can do whatever you need to with vl and df.
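
For example, if you still want a categorical column, one option is to make the repeated labels unique by appending the code. The sketch below assumes `vl` has the nested shape {label name: {code: label}}; the label name `ethnicsn_lbl`, the helper `to_unique_categorical`, and the small in-memory `vl` and `df` are all hypothetical stand-ins for what the file would give you.

```python
from collections import Counter

import pandas as pd


def to_unique_categorical(codes, labels):
    """Map integer codes to labels, suffixing repeated labels with their code."""
    counts = Counter(labels.values())
    unique_labels = {
        code: f"{label} ({code})" if counts[label] > 1 else label
        for code, label in labels.items()
    }
    # Codes without a label become NaN.
    return codes.map(unique_labels).astype("category")


# Hypothetical stand-ins for `vl` and `df` from the snippet above.
vl = {"ethnicsn_lbl": {112: "fulani", 113: "wolof", 130: "wolof"}}
df = pd.DataFrame({"ethnicsn": [113, 130, 112]})

df["ethnicsn_cat"] = to_unique_categorical(df["ethnicsn"], vl["ethnicsn_lbl"])
print(df["ethnicsn_cat"].tolist())
```

This keeps the 1-to-1 mapping that categoricals need while leaving the two wolof codes distinguishable.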

Author

ozak commented Aug 9, 2016

@bashtage Cool! Thanks!

@jreback jreback modified the milestones: 0.19.0, Next Major Release Aug 9, 2016
bashtage added a commit to bashtage/pandas that referenced this issue Aug 10, 2016
Improve the error message to be more explicit when attempting to read Stata
files containing repeated categories.

closes pandas-dev#13923

3 participants