Skip to content

ENH: Add support for reading value labels from 108-format and prior Stata dta files #58155

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Apr 9, 2024

Conversation

cmjcharlton
Copy link
Contributor

This change extends support for reading value labels to Stata 108-format (Stata 6) and earlier dta files.

@jbrockmendel
Copy link
Member

cc @bashtage

@cmjcharlton
Copy link
Contributor Author

One of the type checks fails because in various places the value table codes are expected to be floats (dict[float, str]) whereas in fact they are stored as ints (dict[int, str]). I think there are two options to make this work:

  1. Cast the codes to float when read and update the newly added type hints to match elsewhere, but this would break anything that relies on the existing behaviour
  2. Change the existing type hints to expect int keys, but this might not match the intent of the original author

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on what it here. Overall I think the cyclomatic complexity has gotten too high and I think it would be better to pull these up as two private methods, one for reading 108+ and the other for older formats. Would let you avoid a long if...else... clause.

slength = self._path_or_buf.read(4)
if not slength:
break # end of value label table (format < 117)
if self._format_version <= 108:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be == 108 since you ahve >= 108 above and <= 108 means ==108.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should be equivalent (and clearer) - I was using the cut-off versions for label name lengths for reference. I'll make the suggested change.

break # end of value label table (format < 117)
if self._format_version <= 117:
labname = self._decode(self._path_or_buf.read(33))
if self._format_version >= 108:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is off with the logic here and a few lines above with the other check. This should only be read if < 117 it looks like.

Copy link
Contributor Author

@cmjcharlton cmjcharlton Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content of if self._format_version >= 108: are almost identical to the original code (though the change in indentation makes this less obvious in the diff). All that I changed within the if section is to add the 9 character read when version is 108. The new code is mostly in the else statement.

labname = self._decode(self._path_or_buf.read(33))
if self._format_version >= 108:
slength = self._path_or_buf.read(4)
if not slength:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be if self._format_version < 117 and not slength: so that it only checks on 108 <= x < 117

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't looked at the slength bit previously, but it could probably be a bit clearer. The if self._format_version >= 117: lines should mean that for a well-formed file the loop breaks before if not slength: becomes true, however for a case where </val is missing or in the wrong place the while loop would not exit if I added < 117 to the is not slength condition. Basically, as I understand it, this is allowing the loop to end if an end-of-file is reached. In versions < 117 this is how the end of the value label section is indicated, but in 117 and greater this should be indicted with the string </value_labels>.

@bashtage
Copy link
Contributor

bashtage commented Apr 5, 2024

As for typechecking, you can probably need to replace self._value_label_dict[labname][val[i]] with self._value_label_dict[labname][int(val[i])] since it will be hard to convince the type checker that val is a 1-d array of int which would be needed to let it know that val[i] was an int.

@mroeschke mroeschke added the IO Stata read_stata, to_stata label Apr 5, 2024
@cmjcharlton
Copy link
Contributor Author

As for typechecking, you can probably need to replace self._value_label_dict[labname][val[i]] with self._value_label_dict[labname][int(val[i])] since it will be hard to convince the type checker that val is a 1-d array of int which would be needed to let it know that val[i] was an int.

It seems to be complaining about a mismatch with lines like def value_labels(self) -> dict[str, dict[float, str]]: elsewhere in the file. I think I could make it happy by changing the type hints I added to match this, but then this wouldn't match the actual content of the data.

@cmjcharlton
Copy link
Contributor Author

Some comments on what it here. Overall I think the cyclomatic complexity has gotten too high and I think it would be better to pull these up as two private methods, one for reading 108+ and the other for older formats. Would let you avoid a long if...else... clause.

I did wonder about splitting it into functions, but didn't want to change too many lines. I am happy to have a go at implementing this.

@cmjcharlton cmjcharlton requested a review from bashtage April 9, 2024 09:43
Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor chenages - nearly there.

if self._format_version >= 117:
self._path_or_buf.read(6) # </lbl>

def _read_old_value_labels(self) -> None:
while True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small docstring (one-liner) here and above indicating the versions that this targets would help readability without having to dig deeper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea - I have now added this.

# Don't read twice
return

if self._format_version >= 117:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this logic move to the helper functions? Seems cleaner to me to move the cursor in the code that does the actual reading. I think the other lines about checking if already can stay here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was trying to avoid duplicating calculations for the seek location between the two function, but that does make sense (maybe long term the calculations could be moved to _read_old_header and stored in self._seek_value_labels which would then match _read_new_header?). I have now made the suggested change here too.

assert self._dtype is not None
offset = self._nobs * self._dtype.itemsize
self._path_or_buf.seek(self._data_location + offset)

self._value_labels_read = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps move these to after the block that does the read. Make a bit more sense there now that this is short.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I have now moved this down.

@@ -1580,13 +1580,13 @@ def _read_value_labels(self) -> None:
# Don't read twice
return

self._value_labels_read = True
self._value_label_dict: dict[str, dict[int, str]] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps one last one. Should this be moved to the __init__? I prefer to declare all attributes there since it avoids late addition of attributes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now moved this up as suggested. I put it in the # State variables for the file section as this seemed the closest fit, but can shift it around it you like.

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mroeschke mroeschke added this to the 3.0 milestone Apr 9, 2024
@mroeschke mroeschke merged commit 583026b into pandas-dev:main Apr 9, 2024
46 checks passed
@mroeschke
Copy link
Member

Thanks @cmjcharlton

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
…tata dta files (pandas-dev#58155)

* ENH: Add support for reading value labels from 108-format and prior Stata dta files

* Add type hints for value label dictionary

* Apply changes suggested by pylint

* Clarify that only the 108 format has both 8 character (plus null terminator) label names and uses the newer value label layout

* Split function for reading value labels into newer and older format versions

* Remove duplicate line

* Update type hints for value label dictionary keys to match read content

* Indicate versions each value label helper function applies to via docstrings

* Seek to value table location within version specific helper functions

* Wait until value labels are read before setting flag

* Move value label dictionary initialisation to class __init__
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Stata read_stata, to_stata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Support reading value labels for Stata formats 108 (Stata 6) and earlier
4 participants