Commit 1e7243f

ENH: Improve explanation when erroring on dta files
Improve the explanation given when value labels are repeated in Stata dta files. Add suggested methods to work around the issue using the low-level interface. Closes pandas-dev#25772
1 parent f90f4aa commit 1e7243f

3 files changed: +14 -6 lines changed


doc/source/whatsnew/v0.25.0.rst

+1
@@ -355,6 +355,7 @@ I/O
 - Improved performance in :meth:`pandas.read_stata` and :class:`pandas.io.stata.StataReader` when converting columns that have missing values (:issue:`25772`)
 - Bug in :func:`read_hdf` not properly closing store after a ``KeyError`` is raised (:issue:`25766`)
 - Bug in ``read_csv`` which would not raise ``ValueError`` if a column index in ``usecols`` was out of bounds (:issue:`25623`)
+- Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (:issue:`25772`)
 
 Plotting
 ^^^^^^^^
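
The two work-arounds this entry refers to can be sketched roughly as follows. This is a minimal illustration, not code from the commit: the helper names `read_raw_codes` and `read_values_and_labels` are hypothetical, and it assumes a pandas version in which `StataReader` can be used as a context manager.

```python
import pandas as pd
from pandas.io.stata import StataReader


def read_raw_codes(path):
    """Work-around 1: skip categorical conversion and keep the raw codes."""
    return pd.read_stata(path, convert_categoricals=False)


def read_values_and_labels(path):
    """Work-around 2: read values and value labels separately.

    Returns the unconverted DataFrame and a dict that maps each
    value-label name to a {code: label} dict.
    """
    with StataReader(path) as reader:
        values = reader.read(convert_categoricals=False)
        labels = reader.value_labels()
    return values, labels
```

With the values and labels in hand separately, the caller can decide how to disambiguate the repeated labels before building a categorical.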

pandas/io/stata.py

+12 -4
@@ -1715,10 +1715,18 @@ def _do_convert_categoricals(self, data, value_label_dict, lbllist,
                     vc = Series(categories).value_counts()
                     repeats = list(vc.index[vc > 1])
                     repeats = '-' * 80 + '\n' + '\n'.join(repeats)
-                    raise ValueError('Value labels for column {col} are not '
-                                     'unique. The repeated labels are:\n'
-                                     '{repeats}'
-                                     .format(col=col, repeats=repeats))
+                    msg = """
+Value labels for column {col} are not unique. These cannot be converted to
+pandas categoricals.
+
+Either read the file with `convert_categoricals` set to False or use the
+low level interface in `StataReader` to separately read the values and the
+value_labels.
+
+The repeated labels are:
+{repeats}
+"""
+                    raise ValueError(msg.format(col=col, repeats=repeats))
                 # TODO: is the next line needed above in the data(...) method?
                 cat_data = Series(cat_data, index=data.index)
                 cat_converted_data.append((col, cat_data))
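
The hunk above finds repeated labels with `Series.value_counts()` and joins them under an 80-character rule. A standard-library sketch of the same idea, using `collections.Counter` in place of pandas, might look like this. The function names are hypothetical, and the `str()` conversion is an assumption added so that `join` also accepts non-string categories.

```python
from collections import Counter


def find_repeated_labels(categories):
    """Return the labels that occur more than once, most frequent first."""
    counts = Counter(str(c) for c in categories)
    return [label for label, n in counts.most_common() if n > 1]


def format_repeats(repeats, width=80):
    """Mirror the commit's layout: a dashed rule, then one label per line."""
    return "-" * width + "\n" + "\n".join(repeats)


labels = ["wolof", "serer", "wolof", "other"]
repeats = find_repeated_labels(labels)
report = format_repeats(repeats)
```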

pandas/tests/io/test_stata.py

+1 -2
@@ -1310,8 +1310,7 @@ def test_unsupported_datetype(self):
 
     def test_repeated_column_labels(self):
         # GH 13923
-        msg = (r"Value labels for column ethnicsn are not unique\. The"
-               r" repeated labels are:\n-+\nwolof")
+        msg = r"The repeated labels are:\n-+\nwolof"
         with pytest.raises(ValueError, match=msg):
             read_stata(self.dta23, convert_categoricals=True)
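
Why the test pattern had to shrink can be checked without pandas. `pytest.raises(..., match=...)` applies `re.search` to the exception text, and in the new message the sentence "These cannot be converted..." now follows "not unique.", so the old pattern no longer matches. The sketch below copies the message template from the diff; the variable names are illustrative only.

```python
import re

# Message template as introduced by this commit.
MSG_TEMPLATE = """
Value labels for column {col} are not unique. These cannot be converted to
pandas categoricals.

Either read the file with `convert_categoricals` set to False or use the
low level interface in `StataReader` to separately read the values and the
value_labels.

The repeated labels are:
{repeats}
"""

repeats = "-" * 80 + "\n" + "wolof"
message = MSG_TEMPLATE.format(col="ethnicsn", repeats=repeats)

old_pattern = (r"Value labels for column ethnicsn are not unique\. The"
               r" repeated labels are:\n-+\nwolof")
new_pattern = r"The repeated labels are:\n-+\nwolof"

old_matches = re.search(old_pattern, message) is not None
new_matches = re.search(new_pattern, message) is not None
```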
