ENH: Add support for reading value labels from 108-format and prior Stata dta files #58155

cmjcharlton · 2024-04-05T13:23:31Z

closes ENH: Support reading value labels for Stata formats 108 (Stata 6) and earlier #58154
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This change extends support for reading value labels to Stata 108-format (Stata 6) and earlier dta files.

jbrockmendel · 2024-04-05T15:58:42Z

cmjcharlton · 2024-04-05T16:07:30Z

One of the type checks fails because in various places the value table codes are expected to be floats (dict[float, str]) whereas in fact they are stored as ints (dict[int, str]). I think there are two options to make this work:

Cast the codes to float when read and update the newly added type hints to match elsewhere, but this would break anything that relies on the existing behaviour
Change the existing type hints to expect int keys, but this might not match the intent of the original author
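
To make the trade-off between the two options concrete, here is a minimal sketch. It is not pandas code and the label data is made up. At runtime a dictionary keyed by int codes compares equal to one keyed by the matching floats, because equal numbers hash alike, so the disagreement is mostly about static annotations. Type checkers generally treat dict as invariant in its key type, which is why dict[int, str] is rejected where dict[float, str] is declared.

```python
# Hypothetical value label table, keyed by int codes as they are read.
labels_int: dict[int, str] = {1: "yes", 2: "no"}

# Option 1: cast the codes to float so the data matches dict[float, str].
labels_float: dict[float, str] = {
    float(code): text for code, text in labels_int.items()
}

# Equal ints and floats hash alike, so the two tables compare equal
# and an int key still finds its entry in the float-keyed table.
print(labels_int == labels_float)
print(labels_float[1])

# The visible difference is the key type that callers get back.
print([type(key).__name__ for key in labels_float])
```

The last line is where option 1 could change behaviour for existing users: anything that formats or type-checks the keys would see floats instead of ints.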

bashtage

Some comments on what is here. Overall I think the cyclomatic complexity has gotten too high, and I think it would be better to pull these out into two private methods, one for reading 108+ and the other for older formats. That would let you avoid a long if...else... clause.

bashtage · 2024-04-05T16:32:53Z

pandas/io/stata.py

+                slength = self._path_or_buf.read(4)
+                if not slength:
+                    break  # end of value label table (format < 117)
+                if self._format_version <= 108:


This should be == 108, since you have >= 108 above and <= 108 therefore means == 108.

Yes, this should be equivalent (and clearer) - I was using the cut-off versions for label name lengths for reference. I'll make the suggested change.

bashtage · 2024-04-05T16:40:27Z

pandas/io/stata.py

-                break  # end of value label table (format < 117)
-            if self._format_version <= 117:
-                labname = self._decode(self._path_or_buf.read(33))
+            if self._format_version >= 108:


Something is off with the logic here and a few lines above with the other check. This should only be read if < 117 it looks like.

The contents of the if self._format_version >= 108: block are almost identical to the original code (though the change in indentation makes this less obvious in the diff). All that I changed within the if section is to add the 9-character read when the version is 108. The new code is mostly in the else statement.

bashtage · 2024-04-05T16:41:58Z

pandas/io/stata.py

-                labname = self._decode(self._path_or_buf.read(33))
+            if self._format_version >= 108:
+                slength = self._path_or_buf.read(4)
+                if not slength:


Perhaps this should be if self._format_version < 117 and not slength: so that it only checks on 108 <= x < 117

I hadn't looked at the slength bit previously, but it could probably be a bit clearer. The if self._format_version >= 117: lines should mean that, for a well-formed file, the loop breaks before if not slength: becomes true. However, in a case where </val is missing or in the wrong place, the while loop would not exit if I added < 117 to the if not slength condition. Basically, as I understand it, this allows the loop to end if end-of-file is reached. In versions < 117 this is how the end of the value label section is indicated, but in 117 and greater it should be indicated with the string </value_labels>.
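
The end-of-file termination described here can be sketched in isolation. The record layout below (a 4-byte little-endian length followed by that many bytes) is a toy assumption for illustration, not the dta value label layout, and read_records is a hypothetical name. The point is only that read() returns empty bytes at end-of-file, so an unconditional if not slength: check always gives the loop a way out.

```python
import io
import struct


def read_records(buf: io.BytesIO) -> list[bytes]:
    """Read length-prefixed records until end-of-file (toy layout)."""
    records = []
    while True:
        slength = buf.read(4)
        if not slength:
            break  # read() returned b"": end-of-file, so stop looping
        (n,) = struct.unpack("<i", slength)  # 4-byte little-endian length
        records.append(buf.read(n))
    return records


payload = struct.pack("<i", 3) + b"abc" + struct.pack("<i", 2) + b"de"
print(read_records(io.BytesIO(payload)))
```

If the check were guarded by a version condition, a file missing its explicit end marker would keep unpacking empty reads instead of stopping.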

bashtage · 2024-04-05T16:49:40Z

As for typechecking, you probably need to replace self._value_label_dict[labname][val[i]] with self._value_label_dict[labname][int(val[i])], since it will be hard to convince the type checker that val is a 1-d array of int, which would be needed to let it know that val[i] is an int.

cmjcharlton · 2024-04-05T17:41:12Z

As for typechecking, you probably need to replace self._value_label_dict[labname][val[i]] with self._value_label_dict[labname][int(val[i])], since it will be hard to convince the type checker that val is a 1-d array of int, which would be needed to let it know that val[i] is an int.

It seems to be complaining about a mismatch with lines like def value_labels(self) -> dict[str, dict[float, str]]: elsewhere in the file. I think I could make it happy by changing the type hints I added to match this, but then this wouldn't match the actual content of the data.

cmjcharlton · 2024-04-05T17:43:27Z

Some comments on what is here. Overall I think the cyclomatic complexity has gotten too high, and I think it would be better to pull these out into two private methods, one for reading 108+ and the other for older formats. That would let you avoid a long if...else... clause.

I did wonder about splitting it into functions, but didn't want to change too many lines. I am happy to have a go at implementing this.
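
A rough sketch of the suggested split follows, using a toy class rather than the real reader. Only _read_value_labels, _read_old_value_labels, _format_version and _value_labels_read appear in this thread; LabelReader, _read_new_value_labels and the calls list are hypothetical names added for illustration, and the helper bodies are placeholders.

```python
class LabelReader:
    """Toy stand-in for the reader; only the dispatch logic is sketched."""

    def __init__(self, format_version: int) -> None:
        self._format_version = format_version
        self._value_labels_read = False
        self.calls: list[str] = []  # records which helper ran (illustration only)

    def _read_value_labels(self) -> None:
        if self._value_labels_read:
            return  # don't read twice
        if self._format_version >= 108:
            self._read_new_value_labels()
        else:
            self._read_old_value_labels()
        self._value_labels_read = True

    def _read_new_value_labels(self) -> None:
        """Placeholder for reading labels in format versions 108 and later."""
        self.calls.append("new")

    def _read_old_value_labels(self) -> None:
        """Placeholder for reading labels in format versions before 108."""
        self.calls.append("old")


reader = LabelReader(105)
reader._read_value_labels()
reader._read_value_labels()  # second call returns early
print(reader.calls)
```

Each helper then contains a single straight-line loop, and the version test lives in one place.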

bashtage

A few minor changes - nearly there.

bashtage · 2024-04-09T10:03:21Z

pandas/io/stata.py

            if self._format_version >= 117:
                self._path_or_buf.read(6)  # </lbl>
+
+    def _read_old_value_labels(self) -> None:
+        while True:


Small docstring (one-liner) here and above indicating the versions that this targets would help readability without having to dig deeper.

Good idea - I have now added this.

bashtage · 2024-04-09T10:04:51Z

pandas/io/stata.py

+            # Don't read twice
+            return
+
+        if self._format_version >= 117:


Should this logic move to the helper functions? It seems cleaner to me to move the cursor in the code that does the actual reading. I think the other lines, which check whether the labels have already been read, can stay here.

I think I was trying to avoid duplicating the calculations for the seek location between the two functions, but that does make sense (maybe long term the calculations could be moved to _read_old_header and stored in self._seek_value_labels, which would then match _read_new_header?). I have now made the suggested change here too.

bashtage · 2024-04-09T10:05:24Z

pandas/io/stata.py

+            assert self._dtype is not None
+            offset = self._nobs * self._dtype.itemsize
+            self._path_or_buf.seek(self._data_location + offset)
+
        self._value_labels_read = True


Perhaps move these to after the block that does the read. It makes a bit more sense there now that this is short.

I agree, I have now moved this down.
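
The seek in the diff above skips the whole data block in one step: the number of observations times the width of one row, added to where the data starts. Here is a self-contained sketch of that arithmetic on a made-up byte layout, with struct.calcsize standing in for self._dtype.itemsize. The header size, row format and LABELS marker are all assumptions for illustration.

```python
import io
import struct

# Made-up layout: a 10-byte header, then nobs fixed-width rows, then a
# marker standing in for whatever follows the data block.
row_format = "<hd"  # one int16 and one float64 per row, no padding
itemsize = struct.calcsize(row_format)  # plays the role of the dtype itemsize
nobs = 3
data_location = 10

rows = b"".join(struct.pack(row_format, i, i / 2) for i in range(nobs))
buf = io.BytesIO(b"H" * data_location + rows + b"LABELS")

offset = nobs * itemsize
buf.seek(data_location + offset)  # jump past the data block
print(buf.read())
```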

bashtage · 2024-04-09T11:20:34Z

pandas/io/stata.py

@@ -1580,13 +1580,13 @@ def _read_value_labels(self) -> None:
            # Don't read twice
            return

-        self._value_labels_read = True
        self._value_label_dict: dict[str, dict[int, str]] = {}


Perhaps one last one. Should this be moved to the __init__? I prefer to declare all attributes there since it avoids late addition of attributes.

I have now moved this up as suggested. I put it in the # State variables for the file section as this seemed the closest fit, but I can shift it around if you like.
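
For readers unfamiliar with the concern about late attributes, a small sketch of the difference. LateReader and EagerReader are hypothetical toy classes, not pandas classes; only the _value_label_dict name and its annotation come from the diff.

```python
class LateReader:
    def read(self) -> None:
        # The attribute first appears here, so it is missing until read() runs.
        self._value_label_dict: dict[str, dict[int, str]] = {}


class EagerReader:
    def __init__(self) -> None:
        # Declared up front: present even if nothing has been read yet.
        self._value_label_dict: dict[str, dict[int, str]] = {}


print(hasattr(LateReader(), "_value_label_dict"))
print(hasattr(EagerReader(), "_value_label_dict"))
```

With the eager form, code that touches the dictionary before any read sees an empty mapping rather than raising AttributeError.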

bashtage

LGTM

mroeschke · 2024-04-09T16:55:46Z

Thanks @cmjcharlton

…tata dta files (pandas-dev#58155)

* ENH: Add support for reading value labels from 108-format and prior Stata dta files
* Add type hints for value label dictionary
* Apply changes suggested by pylint
* Clarify that only the 108 format has both 8 character (plus null terminator) label names and uses the newer value label layout
* Split function for reading value labels into newer and older format versions
* Remove duplicate line
* Update type hints for value label dictionary keys to match read content
* Indicate versions each value label helper function applies to via docstrings
* Seek to value table location within version specific helper functions
* Wait until value labels are read before setting flag
* Move value label dictionary initialisation to class __init__

cmjcharlton added 3 commits April 5, 2024 14:11

ENH: Add support for reading value labels from 108-format and prior S…

41d1f22

…tata dta files

Add type hints for value label dictionary

3feed75

Apply changes suggested by pylint

be8aac5

bashtage requested changes Apr 5, 2024

mroeschke added the IO Stata read_stata, to_stata label Apr 5, 2024

cmjcharlton added 6 commits April 5, 2024 19:11

Clarify that only the 108 format has both 8 character (plus null term…

dd14736

…inator) label names and uses the newer value label layout

Split function for reading value labels into newer and older format v…

c2836bf

…ersions

Merge remote-tracking branch 'upstream/main' into stata-old-valuelabels

3f2acb3

Remove duplicate line

2310022

Update type hints for value label dictionary keys to match read content

bf8620c

Merge remote-tracking branch 'upstream/main' into stata-old-valuelabels

36acb33

cmjcharlton requested a review from bashtage April 9, 2024 09:43

bashtage requested changes Apr 9, 2024

cmjcharlton added 3 commits April 9, 2024 11:31

Indicate versions each value label helper function applies to via doc…

af7d5e4

…strings

Seek to value table location within version specific helper functions

b0dc320

Wait until value labels are read before setting flag

792d10c

bashtage reviewed Apr 9, 2024

Move value label dictionary initialisation to class __init__

445fbaf

bashtage approved these changes Apr 9, 2024

mroeschke approved these changes Apr 9, 2024

mroeschke added this to the 3.0 milestone Apr 9, 2024

mroeschke merged commit 583026b into pandas-dev:main Apr 9, 2024
46 checks passed
