Commit 53ad235

ENH: Improve explanation when erroring on dta files
Improve the explanation when value labels are repeated in Stata dta files. Add suggested methods to work around the issue using the low level interface. Closes pandas-dev#25772
1 parent 4814a28 commit 53ad235

3 files changed: +25 -7 lines changed

doc/source/whatsnew/v0.25.0.rst

+1
@@ -355,6 +355,7 @@ I/O
 - Improved performance in :meth:`pandas.read_stata` and :class:`pandas.io.stata.StataReader` when converting columns that have missing values (:issue:`25772`)
 - Bug in :func:`read_hdf` not properly closing store after a ``KeyError`` is raised (:issue:`25766`)
 - Bug in ``read_csv`` which would not raise ``ValueError`` if a column index in ``usecols`` was out of bounds (:issue:`25623`)
+- Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (:issue:`25772`)

 Plotting
 ^^^^^^^^

pandas/io/stata.py

+13-4
@@ -1715,10 +1715,19 @@ def _do_convert_categoricals(self, data, value_label_dict, lbllist,
 vc = Series(categories).value_counts()
 repeats = list(vc.index[vc > 1])
 repeats = '-' * 80 + '\n' + '\n'.join(repeats)
-raise ValueError('Value labels for column {col} are not '
-                 'unique. The repeated labels are:\n'
-                 '{repeats}'
-                 .format(col=col, repeats=repeats))
+# GH 25772
+msg = """
+Value labels for column {col} are not unique. These cannot be converted to
+pandas categoricals.
+
+Either read the file with `convert_categoricals` set to False or use the
+low level interface in `StataReader` to separately read the values and the
+value_labels.
+
+The repeated labels are:
+{repeats}
+"""
+raise ValueError(msg.format(col=col, repeats=repeats))
 # TODO: is the next line needed above in the data(...) method?
 cat_data = Series(cat_data, index=data.index)
 cat_converted_data.append((col, cat_data))
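
The two workarounds named in the new error message could look roughly like the sketch below. It is not part of the commit: the helper names `repeated_labels`, `read_values_and_labels`, and `attach_labels` are hypothetical, and the sketch assumes that the dictionary returned by `StataReader.value_labels()` is keyed by the label-set name, which need not be the same as the column name. `repeated_labels` reuses the `value_counts()` check from the hunk above.

```python
import pandas as pd
from pandas.io.stata import StataReader


def repeated_labels(mapping):
    """Return the labels that are assigned to more than one value."""
    counts = pd.Series(list(mapping.values())).value_counts()
    return sorted(counts.index[counts > 1])


def read_values_and_labels(path):
    """Read the raw codes and the value-label mappings separately.

    Hypothetical helper: it skips the categorical conversion that raises
    when labels are repeated.
    """
    with StataReader(path) as reader:
        data = reader.read(convert_categoricals=False)
        labels = reader.value_labels()  # {label_set_name: {value: label}}
    return data, labels


def attach_labels(codes, mapping):
    """Map codes to labels as plain objects; unmapped codes are kept."""
    return codes.map(lambda code: mapping.get(code, code))
```

If only the raw codes are needed, `pd.read_stata(path, convert_categoricals=False)` is the shorter route. Mapping to plain objects rather than to a `Categorical` is what lets two codes share one label.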

pandas/tests/io/test_stata.py

+11-3
@@ -1309,9 +1309,17 @@ def test_unsupported_datetype(self):
     original.to_stata(path)

 def test_repeated_column_labels(self):
-    # GH 13923
-    msg = (r"Value labels for column ethnicsn are not unique\. The"
-           r" repeated labels are:\n-+\nwolof")
+    # GH 13923, 25772
+    msg = """
+Value labels for column ethnicsn are not unique. These cannot be converted to
+pandas categoricals.
+
+Either read the file with `convert_categoricals` set to False or use the
+low level interface in `StataReader` to separately read the values and the
+value_labels.
+
+The repeated labels are:\n-+\nwolof
+"""
     with pytest.raises(ValueError, match=msg):
         read_stata(self.dta23, convert_categoricals=True)
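
The `match` argument of `pytest.raises` is applied to the exception text with `re.search`, so the expected message in the test is a regular expression: `-+` stands for the 80-dash rule and the unescaped `.` characters match any character. The sketch below replays that check with the standard library only. `TEMPLATE` and `PATTERN` are copied from the diff; the `repeats` value is an assumed example built the same way as in `stata.py`.

```python
import re

# Message template added in pandas/io/stata.py (copied from the diff).
TEMPLATE = """
Value labels for column {col} are not unique. These cannot be converted to
pandas categoricals.

Either read the file with `convert_categoricals` set to False or use the
low level interface in `StataReader` to separately read the values and the
value_labels.

The repeated labels are:
{repeats}
"""

# Pattern used in the test. It is not a raw string, so each \n is a real
# newline and -+ is a regex quantifier over the dash rule.
PATTERN = """
Value labels for column ethnicsn are not unique. These cannot be converted to
pandas categoricals.

Either read the file with `convert_categoricals` set to False or use the
low level interface in `StataReader` to separately read the values and the
value_labels.

The repeated labels are:\n-+\nwolof
"""

# Assumed example input: a single repeated label, "wolof".
repeats = "-" * 80 + "\n" + "\n".join(["wolof"])
message = TEMPLATE.format(col="ethnicsn", repeats=repeats)

# pytest.raises(..., match=PATTERN) performs this search on str(exception).
matched = re.search(PATTERN, message) is not None
```

Because the pattern is searched rather than compared for equality, any change to the wording in `stata.py` has to be mirrored in the test or the search fails.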
