Commit 35156dc

bashtag authored and jreback committed
ENH: Improve explanation when erroring on dta files (#25968)
Improve the explanation when value labels are repeated in Stata dta files. Add suggested methods to work around the issue using the low-level interface. Closes #25772.
1 parent 013f4b4 commit 35156dc

3 files changed: +25 -7 lines changed

doc/source/whatsnew/v0.25.0.rst

+1
@@ -357,6 +357,7 @@ I/O
 - Bug in :meth:`DataFrame.to_html` where header numbers would ignore display options when rounding (:issue:`17280`)
 - Bug in :func:`read_hdf` not properly closing store after a ``KeyError`` is raised (:issue:`25766`)
 - Bug in ``read_csv`` which would not raise ``ValueError`` if a column index in ``usecols`` was out of bounds (:issue:`25623`)
+- Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (:issue:`25772`)
 - Improved :meth:`pandas.read_stata` and :class:`pandas.io.stata.StataReader` to read incorrectly formatted 118 format files saved by Stata (:issue:`25960`)
 
 Plotting

pandas/io/stata.py

+13 -4
@@ -1719,10 +1719,19 @@ def _do_convert_categoricals(self, data, value_label_dict, lbllist,
                     vc = Series(categories).value_counts()
                     repeats = list(vc.index[vc > 1])
                     repeats = '-' * 80 + '\n' + '\n'.join(repeats)
-                    raise ValueError('Value labels for column {col} are not '
-                                     'unique. The repeated labels are:\n'
-                                     '{repeats}'
-                                     .format(col=col, repeats=repeats))
+                    # GH 25772
+                    msg = """
+Value labels for column {col} are not unique. These cannot be converted to
+pandas categoricals.
+
+Either read the file with `convert_categoricals` set to False or use the
+low level interface in `StataReader` to separately read the values and the
+value_labels.
+
+The repeated labels are:
+{repeats}
+"""
+                    raise ValueError(msg.format(col=col, repeats=repeats))
                 # TODO: is the next line needed above in the data(...) method?
                 cat_data = Series(cat_data, index=data.index)
                 cat_converted_data.append((col, cat_data))
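
The check in this hunk and the second workaround named in the new message can be sketched without pandas. This is a minimal illustration only, not the library's code: `collections.Counter` stands in for `Series(categories).value_counts()`, the helper names `find_repeated_labels` and `apply_value_labels` are hypothetical, and the sample codes and labels are made up. The assumption is that the raw codes were read with `convert_categoricals` set to False and that the code-to-label mapping for one column was obtained separately, as the message suggests.

```python
from collections import Counter

# Hypothetical sample data: raw integer codes as stored in the file, and a
# code -> label mapping in which two codes share the label "wolof".
codes = [1, 2, 3, 1]
labels = {1: "wolof", 2: "wolof", 3: "serer"}


def find_repeated_labels(value_labels):
    """Return the labels attached to more than one code, in first-seen order."""
    counts = Counter(value_labels.values())
    return [label for label, n in counts.items() if n > 1]


def apply_value_labels(values, value_labels):
    """Map codes to labels by hand; codes without a label pass through unchanged."""
    return [value_labels.get(v, v) for v in values]


repeats = find_repeated_labels(labels)
if repeats:
    # Categories must be unique, so fall back to plain strings instead of
    # attempting a categorical conversion.
    decoded = apply_value_labels(codes, labels)
    print("repeated labels:", repeats)
    print("decoded values:", decoded)
```

Because the result is a plain list of strings rather than a categorical, two codes can share the label `wolof` without raising.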

pandas/tests/io/test_stata.py

+11 -3
@@ -1311,9 +1311,17 @@ def test_unsupported_datetype(self):
                 original.to_stata(path)
 
     def test_repeated_column_labels(self):
-        # GH 13923
-        msg = (r"Value labels for column ethnicsn are not unique\. The"
-               r" repeated labels are:\n-+\nwolof")
+        # GH 13923, 25772
+        msg = """
+Value labels for column ethnicsn are not unique. These cannot be converted to
+pandas categoricals.
+
+Either read the file with `convert_categoricals` set to False or use the
+low level interface in `StataReader` to separately read the values and the
+value_labels.
+
+The repeated labels are:\n-+\nwolof
+"""
         with pytest.raises(ValueError, match=msg):
             read_stata(self.dta23, convert_categoricals=True)
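
`pytest.raises(..., match=msg)` tests the pattern against the string form of the exception with `re.search`, so `msg` is a regular expression rather than a literal string. The small sketch below uses a shortened stand-in for the real message (the variable names are illustrative, not taken from the test suite) to show why the `-+` in the last line of `msg` still matches the rule of 80 dashes.

```python
import re

# Same layout as the message in the diff: a rule of 80 dashes, then the labels.
repeats = "-" * 80 + "\n" + "\n".join(["wolof"])
error_text = "\nThe repeated labels are:\n{repeats}\n".format(repeats=repeats)

# "-+" matches the dashed rule whatever its length. In a regular (non-raw)
# string, "\n" is already a literal newline by the time re sees the pattern.
pattern = "The repeated labels are:\n-+\nwolof"

found = re.search(pattern, error_text) is not None
print(found)
```

The unescaped `.` characters in the new `msg` match any character, including a literal period, so the pattern still matches, though less strictly than the escaped `\.` in the old one.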