Commit 479a82a

ENH: Improve explanation when erroring on dta files
Improve the explanation when value labels are repeated in Stata dta files. Add suggested methods to work around the issue using the low-level interface. Closes pandas-dev#25772.
1 parent 70773d9 commit 479a82a

3 files changed, +25 -7 lines changed
doc/source/whatsnew/v0.25.0.rst

+1
@@ -355,6 +355,7 @@ I/O
 - Improved performance in :meth:`pandas.read_stata` and :class:`pandas.io.stata.StataReader` when converting columns that have missing values (:issue:`25772`)
 - Bug in :func:`read_hdf` not properly closing store after a ``KeyError`` is raised (:issue:`25766`)
 - Bug in ``read_csv`` which would not raise ``ValueError`` if a column index in ``usecols`` was out of bounds (:issue:`25623`)
+- Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (:issue:`25772`)
 - Improved :meth:`pandas.read_stata` and :class:`pandas.io.stata.StataReader` to read incorrectly formatted 118 format files saved by Stata (:issue:`25960`)

 Plotting

pandas/io/stata.py

+13-4
@@ -1719,10 +1719,19 @@ def _do_convert_categoricals(self, data, value_label_dict, lbllist,
                     vc = Series(categories).value_counts()
                     repeats = list(vc.index[vc > 1])
                     repeats = '-' * 80 + '\n' + '\n'.join(repeats)
-                    raise ValueError('Value labels for column {col} are not '
-                                     'unique. The repeated labels are:\n'
-                                     '{repeats}'
-                                     .format(col=col, repeats=repeats))
+                    # GH 25772
+                    msg = """
+Value labels for column {col} are not unique. These cannot be converted to
+pandas categoricals.
+
+Either read the file with `convert_categoricals` set to False or use the
+low level interface in `StataReader` to separately read the values and the
+value_labels.
+
+The repeated labels are:
+{repeats}
+"""
+                    raise ValueError(msg.format(col=col, repeats=repeats))
                 # TODO: is the next line needed above in the data(...) method?
                 cat_data = Series(cat_data, index=data.index)
                 cat_converted_data.append((col, cat_data))
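
The two workarounds named in the new error message could look roughly like the sketch below. It is not part of the commit. The temporary file is an assumption made so the example is self-contained: pandas does not write repeated value labels itself, so the file only stands in for a problematic one, and the column name `ethnicsn` is borrowed from the test for illustration.

```python
import os
import tempfile

import pandas as pd
from pandas.io.stata import StataReader

# Stand-in data: one labelled (categorical) column. A real problem file would
# carry repeated value labels; this one does not, it only shows the calls.
df = pd.DataFrame({"ethnicsn": pd.Categorical(["wolof", "serer", "wolof"])})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta")
    df.to_stata(path, write_index=False)

    # Workaround 1: skip the categorical conversion and keep the integer codes.
    raw = pd.read_stata(path, convert_categoricals=False)

    # Workaround 2: use StataReader to read the values and the value labels
    # separately.
    with StataReader(path) as reader:
        values = reader.read(convert_categoricals=False)
        labels = reader.value_labels()  # {label name: {code: label}}

# Apply the labels by hand. The result is a plain column of strings rather
# than a categorical, so repeated labels do not raise here.
labelled = values["ethnicsn"].map(labels["ethnicsn"])
```

Mapping by hand trades the categorical dtype for a column of strings; whether that is acceptable depends on how the labels are used downstream.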

pandas/tests/io/test_stata.py

+11-3
@@ -1311,9 +1311,17 @@ def test_unsupported_datetype(self):
                 original.to_stata(path)

     def test_repeated_column_labels(self):
-        # GH 13923
-        msg = (r"Value labels for column ethnicsn are not unique\. The"
-               r" repeated labels are:\n-+\nwolof")
+        # GH 13923, 25772
+        msg = """
+Value labels for column ethnicsn are not unique. These cannot be converted to
+pandas categoricals.
+
+Either read the file with `convert_categoricals` set to False or use the
+low level interface in `StataReader` to separately read the values and the
+value_labels.
+
+The repeated labels are:\n-+\nwolof
+"""
         with pytest.raises(ValueError, match=msg):
             read_stata(self.dta23, convert_categoricals=True)

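
`pytest.raises(..., match=msg)` treats `msg` as a regular expression and applies `re.search` to the exception text, which is why the multi-line string in the test can use `-+` for the dashed rule. The sketch below, not part of the commit, rebuilds the message the same way `stata.py` does and checks the test pattern against it with the standard library only. The names `template`, `pattern` and `matched` are illustrative.

```python
import re

# Message template as introduced in pandas/io/stata.py by this commit.
template = """
Value labels for column {col} are not unique. These cannot be converted to
pandas categoricals.

Either read the file with `convert_categoricals` set to False or use the
low level interface in `StataReader` to separately read the values and the
value_labels.

The repeated labels are:
{repeats}
"""
repeats = "-" * 80 + "\n" + "\n".join(["wolof"])
message = template.format(col="ethnicsn", repeats=repeats)

# Pattern as written in the test. In a non-raw string "\n" is a real newline,
# "-+" matches the 80 dashes, and each "." matches any single character.
pattern = """
Value labels for column ethnicsn are not unique. These cannot be converted to
pandas categoricals.

Either read the file with `convert_categoricals` set to False or use the
low level interface in `StataReader` to separately read the values and the
value_labels.

The repeated labels are:\n-+\nwolof
"""

matched = re.search(pattern, message) is not None
```

Because the dots are unescaped, the pattern is slightly more permissive than the literal text, which is harmless for this check.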
