StataReader.variable_labels() does not read variable label correctly for stata datasets saved under Stata 13 using 'save' (but it can read datasets saved using 'saveold') #7816

Closed
shafiquejamal opened this issue Jul 22, 2014 · 14 comments · Fixed by #7818
Labels
Bug IO Stata read_stata, to_stata
Comments

@shafiquejamal

If I use StataReader to read a Stata dataset saved in Stata 13 using the save command, I can get the data but not the variable labels.

If, however, I use the saveold command in Stata 13, I am able to get the variable labels in Python3 using StataReader.variable_labels().

Can anyone suggest how to accommodate Stata 13? Thanks,

@jreback
Contributor

jreback commented Jul 22, 2014

docs are here: http://pandas.pydata.org/pandas-docs/stable/io.html#reading-from-stata-format

something like:

reader = pandas.io.stata.StataReader(file)

# labels
reader.variable_labels()

# data
reader.data(....)

look inside the pandas.io.stata.read_stata (and doc-string of StataReader)

@jreback
Contributor

jreback commented Jul 22, 2014

closing as a usage question

@jreback jreback closed this as completed Jul 22, 2014
@shafiquejamal
Author

Hello, I'm sorry if I wasn't clear earlier.

I did use the variable_labels() method of reader. But this does NOT work for Stata datasets saved in later versions of Stata (e.g. Stata 13) using the save command. (It DOES work if the dataset was saved in Stata 13 using the saveold command.)

Can you please re-open this issue? It is still not resolved (I am using the latest Pandas master branch). Thanks.

@jreback
Contributor

jreback commented Jul 22, 2014

ok, so this is a feature/bug request then? ok

@jreback jreback reopened this Jul 22, 2014
@jreback jreback added the Bug label Jul 22, 2014
@jreback jreback added this to the 0.15.1 milestone Jul 22, 2014
@jreback
Contributor

jreback commented Jul 22, 2014

cc @bashtage

@shafiquejamal
Author

Yes it is a bug/feature request. I guess Stata changed something in how they save data files, which means that the Stata reader needs to be updated to accommodate this change. Many thanks!

@bashtage
Contributor

@shafiquejamal It would be helpful if you could share a simple example .dta file that produces the problem, as well as a v12 one that works.

This looks like it is implemented in the v13 path - although it probably is buggy

@shafiquejamal
Author

Certainly. I have a couple of .dta files of about 450kb each that I can share (problem dataset HHRosterEducHealth_small_varwithnolabel_notsaveold and non-problem dataset HHRosterEducHealth_small_varwithnolabel_saveold).

I tried dragging them into this comment window, but I'm getting this error at the bottom of this comment window: "Unfortunately, we don't support that file type. Try again with a PNG, GIF, or JPG."

How can I share these .dta files with you? Thanks,

@jreback
Contributor

jreback commented Jul 22, 2014

@shafiquejamal put them up on a public dropbox / share site. I think you can do it via gist as well.

and post the link here.

@shafiquejamal
Author

Here is the dropbox link:

https://www.dropbox.com/sh/4r0fhspsiwpim5p/AACBaC-lu7TaNPLUQQgU_rt4a

So StataReader can handle the file ending in _saveold.dta (saved using an old Stata dataset format), not the file ending in _notsaveold.dta (saved using the newer Stata dataset format). Thanks.

@bashtage
Contributor

The bug, unfortunately, seems to be in Stata. Stata's dta file definition says that the offset to the start of this segment is stored as one of fourteen 8-byte values in the file's map section. Unfortunately, this value is 0 (0000 0000 0000 0000 in the file) in this file, and it is also 0 in a file I just saved from Stata 13.

The code appears to be a correct implementation of Stata's documented file format, so I'm not sure if this should be "fixed" (which would be to hack around Stata's problem).
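As a rough illustration of the check described above (not the pandas code itself), here is a minimal sketch that reads the fourteen 8-byte offsets from a byte buffer and lists which ones are zero. It assumes the offsets follow a literal `<map>` tag, that the file is little-endian, and that the variable-labels offset is the eighth entry (index 7); these are my reading of the format description, not something stated in this thread. The function name and the buffer are made up for the example.

```python
import struct

MAP_OPEN = b"<map>"   # assumption: offsets follow this literal tag
N_MAP_ENTRIES = 14    # fourteen 8-byte values, as described in the comment


def read_map_offsets(buf, byteorder="<"):
    """Return the 14 unsigned 64-bit offsets stored after the <map> tag.

    `byteorder` is a struct prefix: "<" little-endian (assumed default),
    ">" big-endian.
    """
    start = buf.find(MAP_OPEN)
    if start < 0:
        raise ValueError("no <map> tag found")
    start += len(MAP_OPEN)
    return list(struct.unpack_from(f"{byteorder}{N_MAP_ENTRIES}Q", buf, start))


# Synthetic buffer, NOT a real .dta file: entry 0 is the start of the file
# (legitimately 0) and entry 7 is zeroed to mimic the reported problem.
offsets = [0, 100, 200, 300, 400, 500, 600, 0, 800, 900, 1000, 1100, 1200, 1300]
fake = b"<header>...</header><map>" + struct.pack("<14Q", *offsets) + b"</map>"

parsed = read_map_offsets(fake)
zero_entries = [i for i, value in enumerate(parsed) if value == 0]
print(zero_entries)
```

On a real file, a zero at any index other than 0 would point to the kind of mismatch between documentation and implementation described here.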

@shafiquejamal
Author

Thanks for looking into this so quickly. I'll see about contacting folks at Stata to see whether they can fix their documentation, which would then justify modifying pandas.

To summarize then: the problem is that the offset (to the start of the segment in the dta file that defines the variable labels) should be 1, according to Stata's documentation (in help dta), but this offset is in fact 0 instead. Correct?

Many thanks,

@bashtage
Contributor

I have submitted a patch that works around the difference between the docs and the implementation. The required value is technically unnecessary since it can be computed from other values.
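To illustrate how such an offset could be derived from other values, here is a hedged sketch, not the patch itself. It assumes the value-label-names section immediately precedes the variable-labels section and stores one fixed-width 33-byte entry per variable, which is my understanding of the Stata 13 (format 117) layout. The function name and the test buffer are hypothetical.

```python
# Literal section tags; their byte lengths enter the computation.
VALUE_LABEL_NAMES_OPEN = b"<value_label_names>"
VALUE_LABEL_NAMES_CLOSE = b"</value_label_names>"
VARIABLE_LABELS_OPEN = b"<variable_labels>"

NAME_WIDTH = 33   # assumed bytes per value-label name in format 117
LABEL_WIDTH = 81  # assumed bytes per variable label; only pads the fake buffer


def variable_labels_tag_offset(stored, value_label_names_offset, nvar):
    """Offset of the <variable_labels> tag.

    Trust the stored map value when it is non-zero; otherwise skip over the
    preceding section: its opening tag, nvar fixed-width names, closing tag.
    """
    if stored != 0:
        return stored
    return (
        value_label_names_offset
        + len(VALUE_LABEL_NAMES_OPEN)
        + NAME_WIDTH * nvar
        + len(VALUE_LABEL_NAMES_CLOSE)
    )


# Synthetic fragment, NOT a real .dta file.
nvar = 3
prefix = b"x" * 50
buf = (
    prefix
    + VALUE_LABEL_NAMES_OPEN
    + b"\x00" * (NAME_WIDTH * nvar)
    + VALUE_LABEL_NAMES_CLOSE
    + VARIABLE_LABELS_OPEN
    + b"\x00" * (LABEL_WIDTH * nvar)
    + b"</variable_labels>"
)

computed = variable_labels_tag_offset(0, len(prefix), nvar)
found_tag = buf[computed:computed + len(VARIABLE_LABELS_OPEN)]
print(computed, found_tag)
```

Adding `len(VARIABLE_LABELS_OPEN)` to the result gives the position of the first label. Falling back only when the stored value is zero keeps the documented path in use if Stata later writes a correct offset.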

@jreback jreback modified the milestones: 0.15.0, 0.15.1 Jul 22, 2014
bashtage added a commit to bashtage/pandas that referenced this issue Jul 23, 2014
… files

Stata's implementation does not match the online dta file format description.
The solution used here is to directly compute the offset rather than reading
it from the dta file.  If Stata fixes their implementation, the original code
can be restored.
closes pandas-dev#7816
@shafiquejamal
Author

Thanks! It's working with my datasets. Cheers,