
Commit ef7636f

BUG, ENH: Add support for parsing duplicate columns

Deduplicates the 'names' parameter by default if there are duplicate names. Also raises when 'mangle_dupe_cols' is False to prevent data overwrite. Closes gh-7160. Closes gh-9424.

1 parent afde718 commit ef7636f

File tree

7 files changed (+149, -38 lines)

doc/source/io.rst (+40, -1)

@@ -120,7 +120,8 @@ header : int or list of ints, default ``'infer'``
     rather than the first line of the file.
 names : array-like, default ``None``
     List of column names to use. If file contains no header row, then you should
-    explicitly pass ``header=None``.
+    explicitly pass ``header=None``. Duplicates in this list are not allowed unless
+    ``mangle_dupe_cols=True``, which is the default.
 index_col : int or sequence or ``False``, default ``None``
     Column to use as the row labels of the DataFrame. If a sequence is given, a
     MultiIndex is used. If you have a malformed file with delimiters at the end of

@@ -139,6 +140,8 @@ prefix : str, default ``None``
     Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
 mangle_dupe_cols : boolean, default ``True``
     Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'.
+    Passing in False will cause data to be overwritten if there are duplicate
+    names in the columns.
 
 General Parsing Configuration
 +++++++++++++++++++++++++++++

@@ -432,6 +435,42 @@ If the header is in a row other than the first, pass the row number to
    data = 'skip this skip it\na,b,c\n1,2,3\n4,5,6\n7,8,9'
    pd.read_csv(StringIO(data), header=1)
 
+.. _io.dupe_names:
+
+Duplicate names parsing
+'''''''''''''''''''''''
+
+If the file or header contains duplicate names, pandas by default will deduplicate
+these names so as to prevent data overwrite:
+
+.. ipython:: python
+
+   data = 'a,b,a\n0,1,2\n3,4,5'
+   pd.read_csv(StringIO(data))
+
+There is no more duplicate data because ``mangle_dupe_cols=True`` by default, which modifies
+a series of duplicate columns 'X'...'X' to become 'X.0'...'X.N'. If
+``mangle_dupe_cols=False``, duplicate data can arise:
+
+.. code-block:: python
+
+   In [2]: data = 'a,b,a\n0,1,2\n3,4,5'
+   In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
+   Out[3]:
+      a  b  a
+   0  2  1  2
+   1  5  4  5
+
+To prevent users from encountering this problem with duplicate data, a ``ValueError``
+exception is raised if ``mangle_dupe_cols != True``:
+
+.. code-block:: python
+
+   In [2]: data = 'a,b,a\n0,1,2\n3,4,5'
+   In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
+   ...
+   ValueError: Setting mangle_dupe_cols=False is not supported yet
+
 .. _io.usecols:
 
 Filtering columns (``usecols``)
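The documented deduplication can be tried directly. A minimal sketch (note that, as the commit's own tests show, the suffixing actually starts at '.1' for the second occurrence; in later pandas releases the `mangle_dupe_cols` keyword was removed and deduplication is always on):

```python
from io import StringIO

import pandas as pd

# A header with a repeated name: the second 'a' is renamed 'a.1',
# so neither column's data is silently overwritten.
data = 'a,b,a\n0,1,2\n3,4,5'
df = pd.read_csv(StringIO(data))

print(df.columns.tolist())  # ['a', 'b', 'a.1']
print(df['a'].tolist())     # [0, 3]
print(df['a.1'].tolist())   # [2, 5]
```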

doc/source/whatsnew/v0.18.2.txt (+27)

@@ -19,10 +19,37 @@ Highlights include:
 New features
 ~~~~~~~~~~~~
 
+.. _whatsnew_0182.enhancements.read_csv_dupe_col_names_support:
+
+``pd.read_csv`` has improved support for duplicate column names
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:ref:`Duplicate column names <io.dupe_names>` are now supported in ``pd.read_csv()`` whether
+they are in the file or passed in as the ``names`` parameter (:issue:`7160`, :issue:`9424`)
+
+.. ipython:: python
+
+   data = '0,1,2\n3,4,5'
+   names = ['a', 'b', 'a']
+
+Previous behaviour:
+
+.. code-block:: ipython
+
+   In [2]: pd.read_csv(StringIO(data), names=names)
+   Out[2]:
+      a  b  a
+   0  2  1  2
+   1  5  4  5
+
+The first 'a' column contains the same data as the second 'a' column, when it should have
+contained the array ``[0, 3]``.
+
+New behaviour:
+
+.. ipython:: python
+
+   In [2]: pd.read_csv(StringIO(data), names=names)
 
 .. _whatsnew_0182.enhancements.other:

pandas/io/parsers.py (+47, -11)

@@ -73,7 +73,8 @@
     rather than the first line of the file.
 names : array-like, default None
     List of column names to use. If file contains no header row, then you
-    should explicitly pass header=None
+    should explicitly pass header=None. Duplicates in this list are not
+    allowed unless mangle_dupe_cols=True, which is the default.
 index_col : int or sequence or False, default None
     Column to use as the row labels of the DataFrame. If a sequence is given, a
     MultiIndex is used. If you have a malformed file with delimiters at the end

@@ -91,7 +92,9 @@
 prefix : str, default None
     Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
 mangle_dupe_cols : boolean, default True
-    Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
+    Duplicate columns will be specified as 'X.0'...'X.N', rather than
+    'X'...'X'. Passing in False will cause data to be overwritten if there
+    are duplicate names in the columns.
 dtype : Type name or dict of column -> type, default None
     Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
     (Unsupported with engine='python'). Use `str` or `object` to preserve and

@@ -655,7 +658,14 @@ def _get_options_with_defaults(self, engine):
         options = {}
 
         for argname, default in compat.iteritems(_parser_defaults):
-            options[argname] = kwds.get(argname, default)
+            value = kwds.get(argname, default)
+
+            # see gh-12935
+            if argname == 'mangle_dupe_cols' and not value:
+                raise ValueError('Setting mangle_dupe_cols=False is '
+                                 'not supported yet')
+            else:
+                options[argname] = value
 
         for argname, default in compat.iteritems(_c_parser_defaults):
             if argname in kwds:

@@ -899,6 +909,7 @@ def __init__(self, kwds):
         self.true_values = kwds.get('true_values')
         self.false_values = kwds.get('false_values')
         self.tupleize_cols = kwds.get('tupleize_cols', False)
+        self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True)
         self.infer_datetime_format = kwds.pop('infer_datetime_format', False)
 
         self._date_conv = _make_date_converter(

@@ -1012,6 +1023,26 @@ def tostr(x):
 
         return names, index_names, col_names, passed_names
 
+    def _maybe_dedup_names(self, names):
+        # see gh-7160 and gh-9424: this helps to provide
+        # immediate alleviation of the duplicate names
+        # issue and appears to be satisfactory to users,
+        # but ultimately, not needing to butcher the names
+        # would be nice!
+        if self.mangle_dupe_cols:
+            names = list(names)  # so we can index
+            counts = {}
+
+            for i, col in enumerate(names):
+                cur_count = counts.get(col, 0)
+
+                if cur_count > 0:
+                    names[i] = '%s.%d' % (col, cur_count)
+
+                counts[col] = cur_count + 1
+
+        return names
+
     def _maybe_make_multi_index_columns(self, columns, col_names=None):
         # possibly create a column mi here
         if (not self.tupleize_cols and len(columns) and

@@ -1314,10 +1345,11 @@ def read(self, nrows=None):
         except StopIteration:
             if self._first_chunk:
                 self._first_chunk = False
+                names = self._maybe_dedup_names(self.orig_names)
 
                 index, columns, col_dict = _get_empty_meta(
-                    self.orig_names, self.index_col,
-                    self.index_names, dtype=self.kwds.get('dtype'))
+                    names, self.index_col, self.index_names,
+                    dtype=self.kwds.get('dtype'))
 
                 if self.usecols is not None:
                     columns = self._filter_usecols(columns)

@@ -1361,6 +1393,8 @@ def read(self, nrows=None):
             if self.usecols is not None:
                 names = self._filter_usecols(names)
 
+            names = self._maybe_dedup_names(names)
+
             # rename dict keys
             data = sorted(data.items())
             data = dict((k, v) for k, (i, v) in zip(names, data))

@@ -1373,6 +1407,7 @@ def read(self, nrows=None):
 
             # ugh, mutation
             names = list(self.orig_names)
+            names = self._maybe_dedup_names(names)
 
             if self.usecols is not None:
                 names = self._filter_usecols(names)

@@ -1567,7 +1602,6 @@ def __init__(self, f, **kwds):
         self.skipinitialspace = kwds['skipinitialspace']
         self.lineterminator = kwds['lineterminator']
         self.quoting = kwds['quoting']
-        self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True)
         self.usecols = _validate_usecols_arg(kwds['usecols'])
         self.skip_blank_lines = kwds['skip_blank_lines']

@@ -1756,8 +1790,8 @@ def read(self, rows=None):
         columns = list(self.orig_names)
         if not len(content):  # pragma: no cover
             # DataFrame with the right metadata, even though it's length 0
-            return _get_empty_meta(self.orig_names,
-                                   self.index_col,
+            names = self._maybe_dedup_names(self.orig_names)
+            return _get_empty_meta(names, self.index_col,
                                    self.index_names)
 
         # handle new style for names in index

@@ -1770,26 +1804,28 @@ def read(self, rows=None):
         alldata = self._rows_to_cols(content)
         data = self._exclude_implicit_index(alldata)
 
-        columns, data = self._do_date_conversions(self.columns, data)
+        columns = self._maybe_dedup_names(self.columns)
+        columns, data = self._do_date_conversions(columns, data)
 
         data = self._convert_data(data)
         index, columns = self._make_index(data, alldata, columns, indexnamerow)
 
         return index, columns, data
 
     def _exclude_implicit_index(self, alldata):
+        names = self._maybe_dedup_names(self.orig_names)
 
         if self._implicit_index:
            excl_indices = self.index_col
 
            data = {}
            offset = 0
-           for i, col in enumerate(self.orig_names):
+           for i, col in enumerate(names):
                while i + offset in excl_indices:
                    offset += 1
                data[col] = alldata[i + offset]
         else:
-            data = dict((k, v) for k, v in zip(self.orig_names, alldata))
+            data = dict((k, v) for k, v in zip(names, alldata))
 
         return data
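The renaming logic in the new `_maybe_dedup_names` helper can be restated as a standalone function (the free-function name `maybe_dedup_names` is illustrative, not a pandas API): the first occurrence of a name is kept as-is, and each later occurrence is suffixed with '.1', '.2', and so on.

```python
def maybe_dedup_names(names, mangle_dupe_cols=True):
    # Standalone sketch of the patch's _maybe_dedup_names: track how many
    # times each name has been seen; rename repeats as 'name.1', 'name.2', ...
    if mangle_dupe_cols:
        names = list(names)  # copy, so we can assign by index
        counts = {}

        for i, col in enumerate(names):
            cur_count = counts.get(col, 0)

            if cur_count > 0:
                names[i] = '%s.%d' % (col, cur_count)

            counts[col] = cur_count + 1

    return names


print(maybe_dedup_names(['A', 'A', 'B', 'B', 'B']))
# ['A', 'A.1', 'B', 'B.1', 'B.2']
```

One known limitation, shared with the patch: if the input already contains a literal 'A.1' next to two 'A' columns, the mangled name collides with it; the helper does not guard against that case.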

pandas/io/tests/parser/c_parser_only.py (+9, -14)

@@ -293,23 +293,18 @@ def test_empty_with_mangled_column_pass_dtype_by_indexes(self):
             {'one': np.empty(0, dtype='u1'), 'one.1': np.empty(0, dtype='f')})
         tm.assert_frame_equal(result, expected, check_index_type=False)
 
-    def test_empty_with_dup_column_pass_dtype_by_names(self):
-        data = 'one,one'
-        result = self.read_csv(
-            StringIO(data), mangle_dupe_cols=False, dtype={'one': 'u1'})
-        expected = pd.concat([Series([], name='one', dtype='u1')] * 2, axis=1)
-        tm.assert_frame_equal(result, expected, check_index_type=False)
-
     def test_empty_with_dup_column_pass_dtype_by_indexes(self):
-        # FIXME in gh-9424
-        raise nose.SkipTest(
-            "gh-9424; known failure read_csv with duplicate columns")
+        # see gh-9424
+        expected = pd.concat([Series([], name='one', dtype='u1'),
+                              Series([], name='one.1', dtype='f')], axis=1)
 
         data = 'one,one'
-        result = self.read_csv(
-            StringIO(data), mangle_dupe_cols=False, dtype={0: 'u1', 1: 'f'})
-        expected = pd.concat([Series([], name='one', dtype='u1'),
-                              Series([], name='one', dtype='f')], axis=1)
+        result = self.read_csv(StringIO(data), dtype={0: 'u1', 1: 'f'})
+        tm.assert_frame_equal(result, expected, check_index_type=False)
+
+        data = ''
+        result = self.read_csv(StringIO(data), names=['one', 'one'],
+                               dtype={0: 'u1', 1: 'f'})
         tm.assert_frame_equal(result, expected, check_index_type=False)
 
     def test_usecols_dtypes(self):
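The dtype-with-duplicate-columns behavior these tests exercise can be reproduced on non-empty data. A minimal sketch (keying the `dtype` mapping by the mangled names, as the neighboring `pass_dtype_by_names` test does; the tests above key by position instead):

```python
from io import StringIO

import numpy as np
import pandas as pd

# Duplicate header names are mangled before dtypes are applied, so a
# dtype mapping can target each copy via its mangled name.
data = 'one,one\n1,2\n3,4'
df = pd.read_csv(StringIO(data), dtype={'one': np.uint8, 'one.1': np.float64})

print(df.columns.tolist())  # ['one', 'one.1']
print(df.dtypes.tolist())   # [dtype('uint8'), dtype('float64')]
```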

pandas/io/tests/parser/common.py (+16, -5)

@@ -243,6 +243,8 @@ def test_unnamed_columns(self):
                          'Unnamed: 4'])
 
     def test_duplicate_columns(self):
+        # TODO: add test for condition 'mangle_dupe_cols=False'
+        # once it is actually supported (gh-12935)
         data = """A,A,B,B,B
 1,2,3,4,5
 6,7,8,9,10

@@ -256,11 +258,6 @@ def test_duplicate_columns(self):
         self.assertEqual(list(df.columns),
                          ['A', 'A.1', 'B', 'B.1', 'B.2'])
 
-        df = getattr(self, method)(StringIO(data), sep=',',
-                                   mangle_dupe_cols=False)
-        self.assertEqual(list(df.columns),
-                         ['A', 'A', 'B', 'B', 'B'])
-
         df = getattr(self, method)(StringIO(data), sep=',',
                                    mangle_dupe_cols=True)
         self.assertEqual(list(df.columns),

@@ -1281,3 +1278,17 @@ def test_euro_decimal_format(self):
         self.assertEqual(df2['Number1'].dtype, float)
         self.assertEqual(df2['Number2'].dtype, float)
         self.assertEqual(df2['Number3'].dtype, float)
+
+    def test_read_duplicate_names(self):
+        # See gh-7160
+        data = "a,b,a\n0,1,2\n3,4,5"
+        df = self.read_csv(StringIO(data))
+        expected = DataFrame([[0, 1, 2], [3, 4, 5]],
+                             columns=['a', 'b', 'a.1'])
+        tm.assert_frame_equal(df, expected)
+
+        data = "0,1,2\n3,4,5"
+        df = self.read_csv(StringIO(data), names=["a", "b", "a"])
+        expected = DataFrame([[0, 1, 2], [3, 4, 5]],
+                             columns=['a', 'b', 'a.1'])
+        tm.assert_frame_equal(df, expected)

pandas/io/tests/parser/test_parsers.py (-7)

@@ -84,13 +84,6 @@ def read_table(self, *args, **kwds):
 
 
 class TestPythonParser(BaseParser, PythonParserTests, tm.TestCase):
-    """
-    Class for Python parser testing. Unless specifically stated
-    as a PythonParser-specific issue, the goal is to eventually move
-    as many of these tests into ParserTests as soon as the C parser
-    can accept further specific arguments when parsing.
-    """
-
     engine = 'python'
     float_precision_choices = [None]
pandas/io/tests/parser/test_unsupported.py (+10)

@@ -20,6 +20,16 @@
 
 
 class TestUnsupportedFeatures(tm.TestCase):
+    def test_mangle_dupe_cols_false(self):
+        # see gh-12935
+        data = 'a b c\n1 2 3'
+        msg = 'is not supported'
+
+        for engine in ('c', 'python'):
+            with tm.assertRaisesRegexp(ValueError, msg):
+                read_csv(StringIO(data), engine=engine,
+                         mangle_dupe_cols=False)
+
     def test_c_engine(self):
         # see gh-6607
         data = 'a b c\n1 2 3'
data = 'a b c\n1 2 3'
