ENH: Add support for reading value labels from 108-format and prior Stata dta files #58155

Merged (13 commits) on Apr 9, 2024
Changes from 3 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -33,6 +33,7 @@ Other enhancements
- :meth:`Styler.set_tooltips` provides alternative method to storing tooltips by using title attribute of td elements. (:issue:`56981`)
- Allow dictionaries to be passed to :meth:`pandas.Series.str.replace` via ``pat`` parameter (:issue:`51748`)
- Support passing a :class:`Series` input to :func:`json_normalize` that retains the :class:`Series` :class:`Index` (:issue:`51452`)
- Support reading value labels from Stata 108-format (Stata 6) and earlier files (:issue:`58154`)
- Users can globally disable any ``PerformanceWarning`` by setting the option ``mode.performance_warnings`` to ``False`` (:issue:`56920`)
- :meth:`Styler.format_index_names` can now be used to format the index and column names (:issue:`48936` and :issue:`47489`)
-
83 changes: 50 additions & 33 deletions pandas/io/stata.py
@@ -1507,11 +1507,6 @@ def _read_value_labels(self) -> None:
if self._value_labels_read:
# Don't read twice
return
if self._format_version <= 108:
# Value labels are not supported in version 108 and earlier.
self._value_labels_read = True
self._value_label_dict: dict[str, dict[float, str]] = {}
return

if self._format_version >= 117:
Contributor:

Should this logic move to the helper functions? It seems cleaner to me to move the cursor in the code that does the actual reading. I think the other lines, which check whether the labels have already been read, can stay here.

Contributor Author:

I think I was trying to avoid duplicating the seek-location calculation between the two functions, but that does make sense. (Maybe long term the calculation could be moved to `_read_old_header` and stored in `self._seek_value_labels`, which would then match `_read_new_header`?) I have now made the suggested change here too.
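
The seek calculation discussed here can be sketched as follows. This is a minimal illustration, assuming (as the diff context suggests) that in pre-117 files the value label table starts right after the data block; `value_labels_offset` and its parameter names are hypothetical, not pandas API.

```python
def value_labels_offset(data_location: int, nobs: int, row_size: int) -> int:
    """Byte offset of the value label table in a pre-117 file (sketch).

    data_location: offset of the first data row (hypothetical parameter name)
    nobs:          number of observations (rows)
    row_size:      size of one row in bytes (the dtype's itemsize)
    """
    # Skip over the whole data block: nobs rows of row_size bytes each.
    return data_location + nobs * row_size
```

Computing this once in the header-reading code and storing the result would let the reading helper seek without repeating the arithmetic, which is the long-term option the comment raises.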

self._path_or_buf.seek(self._seek_value_labels)
@@ -1521,42 +1516,64 @@ def _read_value_labels(self) -> None:
self._path_or_buf.seek(self._data_location + offset)

self._value_labels_read = True
Contributor:

Perhaps move these to after the block that does the read. It makes a bit more sense there now that this is short.

Contributor Author:

I agree, I have now moved this down.

self._value_label_dict = {}
self._value_label_dict: dict[str, dict[int, str]] = {}

while True:
if self._format_version >= 117:
if self._path_or_buf.read(5) == b"</val": # <lbl>
break # end of value label table

slength = self._path_or_buf.read(4)
if not slength:
break # end of value label table (format < 117)
if self._format_version <= 117:
labname = self._decode(self._path_or_buf.read(33))
if self._format_version >= 108:
Contributor:

Something is off with the logic here and with the other check a few lines above. It looks like this should only be read if the format version is < 117.

Contributor Author (@cmjcharlton, Apr 5, 2024):

The contents of the `if self._format_version >= 108:` block are almost identical to the original code (though the change in indentation makes this less obvious in the diff). All that I changed within the `if` section is to add the 9-character read when the version is 108. The new code is mostly in the `else` statement.

slength = self._path_or_buf.read(4)
if not slength:
Contributor:

Perhaps this should be `if self._format_version < 117 and not slength:` so that it only checks on 108 <= x < 117.

Contributor Author:

I hadn't looked at the `slength` bit previously, but it could probably be a bit clearer. The `if self._format_version >= 117:` lines should mean that, for a well-formed file, the loop breaks before `if not slength:` becomes true. However, if `</val` is missing or in the wrong place, the `while` loop would not exit if I added `< 117` to the `if not slength` condition. Basically, as I understand it, this allows the loop to end if end-of-file is reached. In versions < 117 this is how the end of the value label section is indicated, but in 117 and greater it should be indicated by the string `</value_labels>`.
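
The two termination conditions described above can be sketched in isolation. This is an illustrative stand-in that mirrors the checks in the diff, not the pandas implementation; `at_end_of_value_labels` is a hypothetical helper name.

```python
import io


def at_end_of_value_labels(buf: io.BytesIO, format_version: int) -> bool:
    """Return True if the value label table has ended at the current position."""
    if format_version >= 117:
        # Well-formed 117+ files close the table with </value_labels>;
        # otherwise these 5 bytes are the <lbl> tag that opens a record.
        if buf.read(5) == b"</val":
            return True
    # Every record starts with a 4-byte length field. An empty read means
    # end-of-file, which ends the table in pre-117 files and also stops the
    # loop for a malformed 117+ file whose closing tag is missing.
    slength = buf.read(4)
    return not slength
```

Keeping the end-of-file check unconditional is the design choice defended in the comment: restricting it to versions below 117 would leave no exit for a 117+ file that lacks the closing tag.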

break # end of value label table (format < 117)
if self._format_version <= 108:
Contributor:

This should be `== 108`, since you have `>= 108` above, and `<= 108` combined with that means `== 108`.

Contributor Author:

Yes, this should be equivalent (and clearer) - I was using the cut-off versions for label name lengths for reference. I'll make the suggested change.
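
For reference, the label-name field widths that this branch selects between can be written as a small lookup. This is a sketch based only on the reads shown in the diff (9, 33, and 129 bytes); `label_name_width` is a hypothetical helper, and it covers only the `>= 108` branch.

```python
def label_name_width(format_version: int) -> int:
    """Width in bytes of the label-name field for format versions >= 108."""
    if format_version < 108:
        # Older files use a different record layout (handled in the else branch).
        raise ValueError("layout differs for format versions below 108")
    if format_version == 108:
        return 9
    if format_version <= 117:
        return 33
    return 129
```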

labname = self._decode(self._path_or_buf.read(9))
elif self._format_version <= 117:
labname = self._decode(self._path_or_buf.read(33))
else:
labname = self._decode(self._path_or_buf.read(129))
self._path_or_buf.read(3) # padding

n = self._read_uint32()
txtlen = self._read_uint32()
off = np.frombuffer(
self._path_or_buf.read(4 * n), dtype=f"{self._byteorder}i4", count=n
)
val = np.frombuffer(
self._path_or_buf.read(4 * n), dtype=f"{self._byteorder}i4", count=n
)
ii = np.argsort(off)
off = off[ii]
val = val[ii]
txt = self._path_or_buf.read(txtlen)
self._value_label_dict[labname] = {}
for i in range(n):
end = off[i + 1] if i < n - 1 else txtlen
self._value_label_dict[labname][val[i]] = self._decode(
txt[off[i] : end]
)
if self._format_version >= 117:
self._path_or_buf.read(6) # </lbl>
else:
labname = self._decode(self._path_or_buf.read(129))
self._path_or_buf.read(3) # padding
if not self._path_or_buf.read(2):
# end-of-file may have been reached, if so stop here
break

n = self._read_uint32()
txtlen = self._read_uint32()
off = np.frombuffer(
self._path_or_buf.read(4 * n), dtype=f"{self._byteorder}i4", count=n
)
val = np.frombuffer(
self._path_or_buf.read(4 * n), dtype=f"{self._byteorder}i4", count=n
)
ii = np.argsort(off)
off = off[ii]
val = val[ii]
txt = self._path_or_buf.read(txtlen)
self._value_label_dict[labname] = {}
for i in range(n):
end = off[i + 1] if i < n - 1 else txtlen
self._value_label_dict[labname][val[i]] = self._decode(
txt[off[i] : end]
# otherwise back up and read again, taking byteorder into account
self._path_or_buf.seek(-2, os.SEEK_CUR)
n = self._read_uint16()
labname = self._decode(self._path_or_buf.read(9))
self._path_or_buf.read(1) # padding
codes = np.frombuffer(
self._path_or_buf.read(2 * n), dtype=f"{self._byteorder}i2", count=n
)
if self._format_version >= 117:
self._path_or_buf.read(6) # </lbl>
self._value_label_dict[labname] = {}
for i in range(n):
self._value_label_dict[labname][codes[i]] = self._decode(
self._path_or_buf.read(8)
)

self._value_labels_read = True
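
The record layout read by the new `else` branch (format versions below 108) can be sketched with the standard library alone. The layout is taken from the reads in the diff: a 16-bit count, a 9-byte name, 1 byte of padding, `n` 16-bit codes, then `n` fixed 8-byte labels. The helper names, the `latin-1` decoding, and the null-terminated string handling are assumptions made for illustration.

```python
import io
import struct


def _cstr(raw: bytes) -> str:
    # Assumption: strings are null-terminated within their fixed-width field.
    return raw.split(b"\x00", 1)[0].decode("latin-1")


def read_old_value_label(buf, byteorder="<"):
    """Read one pre-108 value label record, or return None at end-of-file."""
    header = buf.read(2)
    if len(header) < 2:
        return None  # end-of-file: no more records
    (n,) = struct.unpack(f"{byteorder}H", header)  # number of entries
    name = _cstr(buf.read(9))
    buf.read(1)  # padding
    codes = struct.unpack(f"{byteorder}{n}h", buf.read(2 * n))
    labels = [_cstr(buf.read(8)) for _ in range(n)]  # 8 bytes per label
    return name, dict(zip(codes, labels))


# Build a tiny synthetic record and parse it back.
record = (
    struct.pack("<H", 2)
    + b"yesno".ljust(9, b"\x00")
    + b"\x00"
    + struct.pack("<2h", 0, 1)
    + b"no".ljust(8, b"\x00")
    + b"yes".ljust(8, b"\x00")
)
print(read_old_value_label(io.BytesIO(record)))  # ('yesno', {0: 'no', 1: 'yes'})
```

Unlike the diff, which peeks at two bytes and seeks back before calling `_read_uint16`, this sketch unpacks the count directly from the bytes it already read.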

def _read_strls(self) -> None:
@@ -1729,7 +1746,7 @@ def read(
i, _stata_elapsed_date_to_datetime_vec(data.iloc[:, i], fmt)
)

if convert_categoricals and self._format_version > 108:
if convert_categoricals:
data = self._do_convert_categoricals(
data, self._value_label_dict, self._lbllist, order_categoricals
)
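
Looking back at `_read_value_labels`, the offset-based label table used by format versions 108 and later can be sketched without NumPy. The field order (`n`, `txtlen`, offsets, values, text) follows the reads in the diff; the sketch swaps `np.frombuffer`/`np.argsort` for `struct` and `sorted`, and the function name and `latin-1` decoding are assumptions.

```python
import io
import struct


def read_label_table(buf, byteorder="<"):
    """Parse the body of one value label record (format >= 108) into a dict."""
    n, txtlen = struct.unpack(f"{byteorder}II", buf.read(8))
    off = struct.unpack(f"{byteorder}{n}i", buf.read(4 * n))  # text offsets
    val = struct.unpack(f"{byteorder}{n}i", buf.read(4 * n))  # integer codes
    txt = buf.read(txtlen)

    # Sort entries by offset so each label ends where the next one starts.
    pairs = sorted(zip(off, val))
    labels = {}
    for i, (start, value) in enumerate(pairs):
        end = pairs[i + 1][0] if i < n - 1 else txtlen
        labels[value] = txt[start:end].split(b"\x00", 1)[0].decode("latin-1")
    return labels
```

Sorting by offset matters because the entries are not guaranteed to be stored in text order; each label's end is inferred from the next offset rather than stored explicitly.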
Binary file added pandas/tests/io/data/stata/stata4_105.dta
Binary file not shown.
Binary file added pandas/tests/io/data/stata/stata4_108.dta
Binary file not shown.
Binary file added pandas/tests/io/data/stata/stata4_111.dta
Binary file not shown.
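
The numeric suffix on these test files is the dta format version. As a rough sketch of how that version can be identified from the first bytes of a file: older formats store the version number in the first byte, while 117 and later begin with an XML-like header containing a `<release>` tag. `dta_format_version` is a hypothetical helper, not part of pandas.

```python
import re


def dta_format_version(head: bytes) -> int:
    """Guess the dta format version from the first bytes of a file (sketch)."""
    if not head:
        raise ValueError("empty input")
    if head.startswith(b"<stata_dta>"):
        # Newer files: <stata_dta><header><release>117</release>...
        match = re.search(rb"<release>(\d+)</release>", head)
        if match is None:
            raise ValueError("release tag not found")
        return int(match.group(1))
    # Older files: the first byte is the format version itself.
    return head[0]
```

Passing the first few dozen bytes of a file (read in binary mode) is enough for either case.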
48 changes: 47 additions & 1 deletion pandas/tests/io/test_stata.py
@@ -225,7 +225,7 @@ def test_read_dta3(self, file, datapath):
tm.assert_frame_equal(parsed, expected)

@pytest.mark.parametrize(
"file", ["stata4_113", "stata4_114", "stata4_115", "stata4_117"]
"file", ["stata4_111", "stata4_113", "stata4_114", "stata4_115", "stata4_117"]
)
def test_read_dta4(self, file, datapath):
file = datapath("io", "data", "stata", f"{file}.dta")
@@ -270,6 +270,52 @@ def test_read_dta4(self, file, datapath):
# stata doesn't save .category metadata
tm.assert_frame_equal(parsed, expected)

@pytest.mark.parametrize("file", ["stata4_105", "stata4_108"])
def test_readold_dta4(self, file, datapath):
# This test is the same as test_read_dta4 above except that the columns
# had to be renamed to match the restrictions in older file format
file = datapath("io", "data", "stata", f"{file}.dta")
parsed = self.read_dta(file)

expected = DataFrame.from_records(
[
["one", "ten", "one", "one", "one"],
["two", "nine", "two", "two", "two"],
["three", "eight", "three", "three", "three"],
["four", "seven", 4, "four", "four"],
["five", "six", 5, np.nan, "five"],
["six", "five", 6, np.nan, "six"],
["seven", "four", 7, np.nan, "seven"],
["eight", "three", 8, np.nan, "eight"],
["nine", "two", 9, np.nan, "nine"],
["ten", "one", "ten", np.nan, "ten"],
],
columns=[
"fulllab",
"fulllab2",
"incmplab",
"misslab",
"floatlab",
],
)

# these are all categoricals
for col in expected:
orig = expected[col].copy()

categories = np.asarray(expected["fulllab"][orig.notna()])
if col == "incmplab":
categories = orig

cat = orig.astype("category")._values
cat = cat.set_categories(categories, ordered=True)
cat.categories.rename(None, inplace=True)

expected[col] = cat

# stata doesn't save .category metadata
tm.assert_frame_equal(parsed, expected)

# File containing strls
def test_read_dta12(self, datapath):
parsed_117 = self.read_dta(datapath("io", "data", "stata", "stata12_117.dta"))