Stata read categoricals gives ValueError: Categorical categories must be unique #13923
Comments
Here's the link to a file with a column with only categoricals (non-repeated) and the same issue arises. |
The value labels for the data posted a
Note 130 and 113: |
I suppose the error could be trapped and a more meaningful error, possibly with a report of the offending labels, could be raised instead. |
I thought it was having trouble because of possibly repeated numbering of the categories. I'm not sure why it should care whether the label values (strings) are repeated, since I imagine the categories should work regardless of what the label says, no? Or am I missing something? I agree that a better error message, and even a printout of the repeated categories, would be very useful. Thanks for the help! |
Another possibility would be to allow Pandas to read the categories as strings. |
Stata stores value labeled variables as labels and some number. Your data has 2 values that correspond to the same number. In pandas, a label is as good as its underlying integer data type, and so there is no way for a categorical to have 2 values with the same label. Stata value labels are not equivalent to pandas categoricals, only close. This is a case where the difference matters. |
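To make the constraint described above concrete, here is a minimal sketch of how pandas rejects repeated category labels. The label strings are invented for illustration and are not taken from the posted file.

```python
import pandas as pd

# Hypothetical labels: two different Stata values that share one label text.
labels = ["Other", "Other", "Unknown"]

message = None
try:
    # pandas validates the category list before looking at the data.
    pd.Categorical(["Other"], categories=labels)
except ValueError as exc:
    message = str(exc)
    print(message)
```

This is the same kind of failure the reader hits when it tries to build a categorical from value labels that are not unique.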
BTW, you can use |
So this looks like buggy Stata behavior? @bashtage yeah, I would probably raise here (if convert_categoricals is set) and let the user figure it out. (I suppose you could just turn categorical conversion off and show a warning.) |
Thanks for the suggestions @bashtage ...do you have an example on how one would access those labels? |
It is not buggy Stata behavior so much as just different. In Stata value |
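Using your short file:

import pandas as pd

# Read the raw integer values without converting them to categoricals
df = pd.read_stata('ipumsi_00014_ethn.dta', convert_categoricals=False)

# Open the file again with the reader to get at the value labels
sr = pd.io.stata.StataReader('ipumsi_00014_ethn.dta')
vl = sr.value_labels()
sr.close()

Then you can do whatever you need to with |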
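As one possible follow-up to the snippet above, the sketch below makes repeated labels unique by appending the underlying value, so they can be applied to the raw integer column by hand. The helper name `disambiguate_labels` and the example labels are hypothetical; it assumes each entry of `vl` is a plain `{value: label}` dictionary.

```python
from collections import Counter


def disambiguate_labels(labels):
    """Return a copy of {value: label} with repeated labels made unique.

    Labels that occur more than once get the underlying value appended,
    e.g. "Other" becomes "Other (113)". Unique labels are left unchanged.
    """
    counts = Counter(labels.values())
    return {
        value: f"{label} ({value})" if counts[label] > 1 else label
        for value, label in labels.items()
    }


# Hypothetical value labels; real ones would come from sr.value_labels().
example = {113: "Other", 130: "Other", 200: "Unknown"}
unique = disambiguate_labels(example)
print(unique)
```

With the `df` and `vl` from the thread, something like `df[col].map(disambiguate_labels(vl[name])).astype('category')` should then work, where `col` and `name` stand in for the column and its value-label name (both placeholders here).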
@bashtage Cool! Thanks! |
Improve the error message to be more explicit when attempting to read Stata files containing repeated categories. closes pandas-dev#13923
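For illustration only, a more explicit check along the lines discussed in the thread might look like the sketch below. This is not the code that went into pandas; the function names, the message wording, and the example labels are all assumptions.

```python
from collections import defaultdict


def repeated_labels(labels):
    """Map each repeated label to the sorted list of values that use it."""
    values_by_label = defaultdict(list)
    for value, label in labels.items():
        values_by_label[label].append(value)
    return {
        label: sorted(values)
        for label, values in values_by_label.items()
        if len(values) > 1
    }


def check_labels_unique(column, labels):
    """Raise a descriptive ValueError if any label is shared by several values."""
    repeats = repeated_labels(labels)
    if repeats:
        details = "; ".join(
            f"{label!r} is used by values {values}"
            for label, values in repeats.items()
        )
        raise ValueError(
            f"Value labels for column {column!r} are not unique: {details}. "
            "Consider reading the file with convert_categoricals=False."
        )


# Hypothetical labels in which values 113 and 130 share the same text.
example = {113: "Other", 130: "Other", 200: "Unknown"}
print(repeated_labels(example))
```

Reporting the values alongside the repeated label tells the user exactly which entries to inspect in Stata.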
I am trying to open some Stata files generated in IPUMS International, but I am getting a ValueError: Categorical categories must be unique. I opened the file in Stata and could not find a repeated category for the column I am trying to import. I had similar issues with other datasets from the same source, which seemed to be caused by missing values, but that does not seem to be the case here. Here's the link to the file I am trying to read.
Code Sample, a copy-pastable example if possible
Expected Output
output of pd.show_versions()