Commit 257ac88

bashtage authored and jreback committed
ENH: Improve error message for repeated Stata categories
closes #13923

Author: Kevin Sheppard <[email protected]>

Closes #13949 from bashtage/categorical-error-message and squashes the following commits:

0880d02 [Kevin Sheppard] ENH: Improve error message for repeated Stata categories
1 parent 0b2f1f4 commit 257ac88

4 files changed: +17 -1 lines changed

doc/source/whatsnew/v0.19.0.txt

+1
@@ -394,6 +394,7 @@ Other enhancements
 - ``Series.append`` now supports the ``ignore_index`` option (:issue:`13677`)
 - ``.to_stata()`` and ``StataWriter`` can now write variable labels to Stata dta files using a dictionary to make column names to labels (:issue:`13535`, :issue:`13536`)
 - ``.to_stata()`` and ``StataWriter`` will automatically convert ``datetime64[ns]`` columns to Stata format ``%tc``, rather than raising a ``ValueError`` (:issue:`12259`)
+- ``read_stata()`` and ``StataReader`` raise with a more explicit error message when reading Stata files with repeated value labels when ``convert_categoricals=True`` (:issue:`13923`)
 - ``DataFrame.style`` will now render sparsified MultiIndexes (:issue:`11655`)
 - ``DataFrame.style`` will now show column level names (e.g. ``DataFrame.columns.names``) (:issue:`13775`)
 - ``DataFrame`` has gained support to re-order the columns based on the values
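
For callers, the practical effect of this entry is that a failed categorical conversion now says which labels collide. A minimal sketch of one way to react to that error follows. It assumes the caller is happy to fall back to raw codes, and it takes the reader as a parameter so the example runs without pandas; `read_with_fallback`, `fake_reader`, the column name `race` and the path `survey.dta` are all hypothetical. In real use one would pass `pandas.read_stata` as `reader`.

```python
def read_with_fallback(path, reader):
    """Try a categorical read first; on ValueError, retry without it.

    `reader` is any callable accepting the ``convert_categoricals`` keyword,
    as ``pandas.read_stata`` does. Injecting it is an assumption of this
    sketch, made only so the example is self-contained.
    """
    try:
        return reader(path, convert_categoricals=True)
    except ValueError as exc:
        # After this change the message names the column and the repeated labels.
        print("categorical conversion failed: {0}".format(exc))
        return reader(path, convert_categoricals=False)


def fake_reader(path, convert_categoricals=True):
    """Hypothetical stand-in that fails only when categoricals are requested."""
    if convert_categoricals:
        raise ValueError("Value labels for column race are not unique.")
    return {"path": path, "categoricals": False}


result = read_with_fallback("survey.dta", fake_reader)
```

Catching `ValueError` this broadly also swallows unrelated value errors from the reader, so a stricter variant might inspect the message before retrying.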

pandas/io/stata.py

+9 -1
@@ -1649,7 +1649,15 @@ def _do_convert_categoricals(self, data, value_label_dict, lbllist,
                         categories.append(value_label_dict[label][category])
                     else:
                         categories.append(category) # Partially labeled
-                cat_data.categories = categories
+                try:
+                    cat_data.categories = categories
+                except ValueError:
+                    vc = Series(categories).value_counts()
+                    repeats = list(vc.index[vc > 1])
+                    repeats = '\n' + '-' * 80 + '\n'.join(repeats)
+                    msg = 'Value labels for column {0} are not unique. The ' \
+                          'repeated labels are:\n{1}'.format(col, repeats)
+                    raise ValueError(msg)
                 # TODO: is the next line needed above in the data(...) method?
                 cat_data = Series(cat_data, index=data.index)
                 cat_converted_data.append((col, cat_data))
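
The duplicate detection in this hunk can be tried outside pandas. The sketch below is a standalone approximation, not the library code: it swaps `Series.value_counts()` for `collections.Counter`, so repeats come out in first-seen order rather than by frequency. The helper names, the column name `ethnic`, and every label other than `wolof` are hypothetical. The default branch reuses the formatting expression from the diff, which shows that the 80-dash rule runs straight into the first label because `'\n'.join` only puts newlines between labels; the `rule_on_own_line` flag is a variation that is not part of the commit.

```python
from collections import Counter


def repeated_labels(categories):
    """Return the labels that occur more than once, in first-seen order."""
    counts = Counter(categories)
    return [label for label, n in counts.items() if n > 1]


def build_message(col, categories, rule_on_own_line=False):
    """Build an error message listing the repeated labels for one column."""
    repeats = repeated_labels(categories)
    if rule_on_own_line:
        # Variation (assumption, not in the commit): newline after the rule.
        body = '-' * 80 + '\n' + '\n'.join(repeats)
    else:
        # Same expression as the diff: the rule is followed directly by
        # the first repeated label, with no newline in between.
        body = '\n' + '-' * 80 + '\n'.join(repeats)
    return ('Value labels for column {0} are not unique. The '
            'repeated labels are:\n{1}'.format(col, body))


labels = ['wolof', 'serer', 'wolof', 'pulaar', 'serer', 'diola']
print(build_message('ethnic', labels, rule_on_own_line=True))
```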

pandas/io/tests/data/stata15.dta

3.11 KB
Binary file not shown.

pandas/io/tests/test_stata.py

+7
@@ -80,6 +80,7 @@ def setUp(self):
         self.dta21_117 = os.path.join(self.dirpath, 'stata12_117.dta')

         self.dta22_118 = os.path.join(self.dirpath, 'stata14_118.dta')
+        self.dta23 = os.path.join(self.dirpath, 'stata15.dta')

     def read_dta(self, file):
         # Legacy default reader configuration
@@ -1212,6 +1213,12 @@ def test_unsupported_datetype(self):
             with tm.ensure_clean() as path:
                 original.to_stata(path)

+    def test_repeated_column_labels(self):
+        # GH 13923
+        with tm.assertRaises(ValueError) as cm:
+            read_stata(self.dta23, convert_categoricals=True)
+            tm.assertTrue('wolof' in cm.exception)
+
 if __name__ == '__main__':
     nose.runmodule(argv=[__file__, '-vvs', '-x', '--pdb', '--pdb-failure'],
                    exit=False)
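
When a test asserts on an error message, two details are easy to get wrong: a check placed after the raising call inside the `with` block never runs, and a membership test against the exception object is not the same as one against its text. The sketch below shows the pattern with the standard library's `unittest` rather than `pandas.util.testing`; `load_labels` is a hypothetical stand-in for a reader that rejects repeated labels, and the label `serer` is made up for the example.

```python
import unittest


def load_labels(labels):
    """Hypothetical stand-in for a reader that rejects repeated labels."""
    dupes = sorted({x for x in labels if labels.count(x) > 1})
    if dupes:
        raise ValueError('repeated labels are:\n' + '\n'.join(dupes))
    return labels


class TestRepeatedLabels(unittest.TestCase):
    def test_message_names_the_label(self):
        with self.assertRaises(ValueError) as cm:
            load_labels(['wolof', 'serer', 'wolof'])
        # Checked after the block exits, and against the message text.
        self.assertIn('wolof', str(cm.exception))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRepeatedLabels)
outcome = unittest.TextTestRunner().run(suite)
```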
