StataReader.variable_labels() does not read variable label correctly for stata datasets saved under Stata 13 using 'save' (but it can read datasets saved using 'saveold') #7816

shafiquejamal · 2014-07-22T02:59:39Z

If I use StataReader to read a Stata dataset saved in Stata 13 using the save command, I can get the data but not the variable labels.

If, however, I use the saveold command in Stata 13, I am able to get the variable labels in Python3 using StataReader.variable_labels().

Can anyone suggest how to accommodate Stata 13? Thanks,

jreback · 2014-07-22T12:03:00Z

docs are here: http://pandas.pydata.org/pandas-docs/stable/io.html#reading-from-stata-format

something like:

reader = pandas.io.stata.StataReader(file)

# labels
reader.variable_labels()

# data
reader.data(....)

look inside the pandas.io.stata.read_stata (and doc-string of StataReader)
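For readers who want something they can run end to end, here is a minimal sketch of the same pattern. It assumes a recent pandas release, where the data is read with `read()` and the reader can be used as a context manager. The file name, column names, and label text are made up for illustration, and the example file is written by pandas itself rather than by Stata 13, so it does not reproduce the problem reported in this issue.

```python
import os
import tempfile

import pandas as pd
from pandas.io.stata import StataReader


def read_labels_and_data(path):
    """Return (variable_labels, DataFrame) for a .dta file."""
    with StataReader(path) as reader:
        labels = reader.variable_labels()  # dict: variable name -> label
        frame = reader.read()              # the data itself
    return labels, frame


def make_example_file(path):
    """Write a tiny format-117 file with one labelled variable (illustrative data)."""
    df = pd.DataFrame({"hhid": [1, 2, 3], "age": [34, 29, 51]})
    df.to_stata(
        path,
        write_index=False,
        version=117,
        variable_labels={"age": "Age in years"},
    )


with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "example.dta")  # hypothetical file name
    make_example_file(example_path)
    labels, frame = read_labels_and_data(example_path)
    print(labels)
    print(frame)
```

Reading the labels before the data keeps both calls inside one open reader, so the file is only opened once.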

jreback · 2014-07-22T12:09:26Z

closing as a usage question

shafiquejamal · 2014-07-22T14:16:01Z

Hello, I'm sorry if I wasn't clear earlier.

I did use the variable_labels() method of reader. But this does NOT work for Stata datasets saved in later versions of Stata (e.g. Stata 13) using the save command. (It DOES work if the dataset was saved in Stata 13 using the saveold command.)

Can you please re-open this issue? It is still not resolved (I am using the latest Pandas master branch). Thanks.

jreback · 2014-07-22T14:17:36Z

ok, so this is a feature/bug request then? ok

jreback · 2014-07-22T14:17:56Z

cc @bashtage

shafiquejamal · 2014-07-22T14:21:36Z

Yes it is a bug/feature request. I guess Stata changed something in how they save data files, which means that the Stata reader needs to be updated to accommodate this change. Many thanks!

bashtage · 2014-07-22T14:22:56Z

@shafiquejamal It would be helpful if you could share a simple example .dta file which produces the problem, as well as a v12 one that works.

This looks like it is implemented in the v13 path - although it probably is buggy

shafiquejamal · 2014-07-22T14:30:19Z

Certainly. I have a couple of .dta files of about 450kb each that I can share (problem dataset HHRosterEducHealth_small_varwithnolabel_notsaveold and non-problem dataset HHRosterEducHealth_small_varwithnolabel_saveold).

I tried dragging them into this comment window, but I'm getting this error at the bottom of this comment window: "Unfortunately, we don't support that file type. Try again with a PNG, GIF, or JPG."

How can I share these .dta files with you? Thanks,

jreback · 2014-07-22T14:32:09Z

@shafiquejamal put them up on a public dropbox / share site. I think you can do it via gist as well.

and post the link here.

shafiquejamal · 2014-07-22T14:44:13Z

Here is the dropbox link:

https://www.dropbox.com/sh/4r0fhspsiwpim5p/AACBaC-lu7TaNPLUQQgU_rt4a

So StataReader can handle the file ending in _saveold.dta (saved using an old Stata dataset format), not the file ending in _notsaveold.dta (saved using the newer Stata dataset format). Thanks.

bashtage · 2014-07-22T16:58:17Z

The bug, unfortunately, seems to be in Stata. Stata's dta file definition says the offset to the start of this segment is stored as one of 14 8-byte values. Unfortunately, this value is 0 (0000 0000 0000 0000 in the file) in this file, and it is also 0 in a file I just saved from Stata 13.

The code appears to be a correct implementation of Stata's documented file format, so I'm not sure if this should be "fixed" (which would be to hack around Stata's problem).
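To make the symptom concrete, here is a small sketch of how a table of 14 8-byte offsets can be unpacked and checked for a zero entry. Everything in it is illustrative: the buffer is synthetic, the little-endian byte order is an assumption, and the slot used for the variable-label entry (`VARIABLE_LABELS_SLOT`) is a hypothetical position, not something stated in this thread.

```python
import struct

N_OFFSETS = 14            # number of 8-byte entries, as described above
VARIABLE_LABELS_SLOT = 7  # hypothetical position of the variable-label entry


def read_offsets(buf, byteorder="<"):
    """Unpack 14 unsigned 64-bit offsets from the start of buf."""
    return struct.unpack(f"{byteorder}{N_OFFSETS}Q", buf[: 8 * N_OFFSETS])


def label_offset_is_usable(offsets, slot=VARIABLE_LABELS_SLOT):
    """A stored offset of zero cannot point at the variable-label segment."""
    return offsets[slot] != 0


# Synthetic offset table: every entry is plausible except the label slot,
# which is all zero bytes, like the value seen in the problem file.
fake_values = [0, 100, 200, 300, 400, 500, 600, 0, 800, 900, 1000, 1100, 1200, 1300]
fake_table = struct.pack(f"<{N_OFFSETS}Q", *fake_values)

offsets = read_offsets(fake_table)
print(offsets[VARIABLE_LABELS_SLOT])
print(label_offset_is_usable(offsets))
```

A reader that seeks to a stored offset of 0 ends up back at the beginning of the file, which would explain why the data loads but the labels do not.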

shafiquejamal · 2014-07-22T17:22:00Z

Thanks for looking into this so quickly. I'll see about contacting folks at Stata to see whether they can fix their documentation, which would then justify modifying Pandas.

To summarize then: the problem is that the offset (to the start of the segment in the dta file that defines the variable labels) should be 1, according to Stata's documentation (in help dta), but this offset is in fact 0 instead. Correct?

Many thanks,

bashtage · 2014-07-22T17:26:01Z

I have submitted a patch that works around the difference between the docs and the implementation. The required value is technically unnecessary since it can be computed from other values.
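The idea of computing the offset instead of reading it can be sketched as follows. This is not the patch itself. It assumes that the variable-label segment directly follows a segment of fixed-width value-label names, and the tag names and field widths (`NAME_WIDTH`, `LABEL_WIDTH`) are assumptions about the file layout rather than details taken from this thread. The buffer is synthetic.

```python
OPEN_VLN = b"<value_label_names>"    # assumed opening tag of the preceding segment
CLOSE_VLN = b"</value_label_names>"  # assumed closing tag of the preceding segment
OPEN_VL = b"<variable_labels>"       # assumed opening tag of the label segment
NAME_WIDTH = 33                      # assumed bytes per value-label name
LABEL_WIDTH = 81                     # assumed bytes per variable label


def computed_label_offset(value_label_names_offset, nvar):
    """Offset of the first variable label, derived from the preceding segment."""
    return (
        value_label_names_offset
        + len(OPEN_VLN)
        + NAME_WIDTH * nvar
        + len(CLOSE_VLN)
        + len(OPEN_VL)
    )


def read_labels(buf, value_label_names_offset, nvar):
    """Read nvar null-padded labels starting at the computed offset."""
    start = computed_label_offset(value_label_names_offset, nvar)
    labels = []
    for i in range(nvar):
        raw = buf[start + i * LABEL_WIDTH : start + (i + 1) * LABEL_WIDTH]
        labels.append(raw.split(b"\0", 1)[0].decode("latin-1"))
    return labels


def pad(text, width):
    """Encode text and pad it with null bytes to a fixed width."""
    return text.encode("latin-1").ljust(width, b"\0")


# Build a synthetic buffer: 50 filler bytes, then the two segments.
prefix = b"x" * 50
names = ["", ""]
wanted = ["Household ID", "Age in years"]
buf = (
    prefix
    + OPEN_VLN + b"".join(pad(n, NAME_WIDTH) for n in names) + CLOSE_VLN
    + OPEN_VL + b"".join(pad(w, LABEL_WIDTH) for w in wanted) + b"</variable_labels>"
)

print(computed_label_offset(len(prefix), len(names)))
print(read_labels(buf, len(prefix), len(names)))
```

Because every entry in the preceding segment has a fixed width, the start of the labels follows from that segment's offset and the number of variables alone, which is why the stored value is technically redundant.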

… files Stata's implementation does not match the online dta file format description. The solution used here is to directly compute the offset rather than reading it from the dta file. If Stata fixes their implementation, the original code can be restored. closes pandas-dev#7816

shafiquejamal · 2014-07-23T13:36:22Z

Thanks! It's working with my datasets. Cheers,

jreback added Usage Question labels Jul 22, 2014

jreback closed this as completed Jul 22, 2014

jreback reopened this Jul 22, 2014

jreback added the Bug label Jul 22, 2014

jreback added this to the 0.15.1 milestone Jul 22, 2014

jreback removed the Usage Question label Jul 22, 2014

bashtage mentioned this issue Jul 22, 2014

BUG: Fixed failure in StataReader when reading variable labels in 117 #7818

Merged

jreback modified the milestones: 0.15.0, 0.15.1 Jul 22, 2014

jreback closed this as completed in #7818 Jul 23, 2014
