-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add support for reading value labels from 108-format and prior Stata dta files #58155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc @bashtage |
One of the type checks fails because in various places the value table codes are expected to be floats (dict[float, str]) whereas in fact they are stored as ints (dict[int, str]). I think there are two options to make this work:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments on what it here. Overall I think the cyclomatic complexity has gotten too high and I think it would be better to pull these up as two private methods, one for reading 108+ and the other for older formats. Would let you avoid a long if...else...
clause.
pandas/io/stata.py
Outdated
slength = self._path_or_buf.read(4) | ||
if not slength: | ||
break # end of value label table (format < 117) | ||
if self._format_version <= 108: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be == 108 since you ahve >= 108 above and <= 108 means ==108.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this should be equivalent (and clearer) - I was using the cut-off versions for label name lengths for reference. I'll make the suggested change.
pandas/io/stata.py
Outdated
break # end of value label table (format < 117) | ||
if self._format_version <= 117: | ||
labname = self._decode(self._path_or_buf.read(33)) | ||
if self._format_version >= 108: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something is off with the logic here and a few lines above with the other check. This should only be read if < 117 it looks like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The content of if self._format_version >= 108:
are almost identical to the original code (though the change in indentation makes this less obvious in the diff). All that I changed within the if section is to add the 9 character read when version is 108. The new code is mostly in the else statement.
pandas/io/stata.py
Outdated
labname = self._decode(self._path_or_buf.read(33)) | ||
if self._format_version >= 108: | ||
slength = self._path_or_buf.read(4) | ||
if not slength: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this should be if self._format_version < 117 and not slength:
so that it only checks on 108 <= x < 117
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't looked at the slength bit previously, but it could probably be a bit clearer. The if self._format_version >= 117:
lines should mean that for a well-formed file the loop breaks before if not slength:
becomes true, however for a case where </val
is missing or in the wrong place the while loop would not exit if I added < 117
to the is not slength
condition. Basically, as I understand it, this is allowing the loop to end if an end-of-file is reached. In versions < 117 this is how the end of the value label section is indicated, but in 117 and greater this should be indicted with the string </value_labels>.
As for typechecking, you can probably need to replace |
It seems to be complaining about a mismatch with lines like |
I did wonder about splitting it into functions, but didn't want to change too many lines. I am happy to have a go at implementing this. |
…inator) label names and uses the newer value label layout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor chenages - nearly there.
if self._format_version >= 117: | ||
self._path_or_buf.read(6) # </lbl> | ||
|
||
def _read_old_value_labels(self) -> None: | ||
while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small docstring (one-liner) here and above indicating the versions that this targets would help readability without having to dig deeper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea - I have now added this.
pandas/io/stata.py
Outdated
# Don't read twice | ||
return | ||
|
||
if self._format_version >= 117: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this logic move to the helper functions? Seems cleaner to me to move the cursor in the code that does the actual reading. I think the other lines about checking if already can stay here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I was trying to avoid duplicating calculations for the seek location between the two function, but that does make sense (maybe long term the calculations could be moved to _read_old_header
and stored in self._seek_value_labels
which would then match _read_new_header
?). I have now made the suggested change here too.
pandas/io/stata.py
Outdated
assert self._dtype is not None | ||
offset = self._nobs * self._dtype.itemsize | ||
self._path_or_buf.seek(self._data_location + offset) | ||
|
||
self._value_labels_read = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps move these to after the block that does the read. Make a bit more sense there now that this is short.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I have now moved this down.
pandas/io/stata.py
Outdated
@@ -1580,13 +1580,13 @@ def _read_value_labels(self) -> None: | |||
# Don't read twice | |||
return | |||
|
|||
self._value_labels_read = True | |||
self._value_label_dict: dict[str, dict[int, str]] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps one last one. Should this be moved to the __init__
? I prefer to declare all attributes there since it avoids late addition of attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now moved this up as suggested. I put it in the # State variables for the file
section as this seemed the closest fit, but can shift it around it you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @cmjcharlton |
…tata dta files (pandas-dev#58155) * ENH: Add support for reading value labels from 108-format and prior Stata dta files * Add type hints for value label dictionary * Apply changes suggested by pylint * Clarify that only the 108 format has both 8 character (plus null terminator) label names and uses the newer value label layout * Split function for reading value labels into newer and older format versions * Remove duplicate line * Update type hints for value label dictionary keys to match read content * Indicate versions each value label helper function applies to via docstrings * Seek to value table location within version specific helper functions * Wait until value labels are read before setting flag * Move value label dictionary initialisation to class __init__
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.This change extends support for reading value labels to Stata 108-format (Stata 6) and earlier dta files.