BUG, ENH: Add support for parsing duplicate columns #12935

Closed
41 changes: 40 additions & 1 deletion doc/source/io.rst
@@ -120,7 +120,8 @@ header : int or list of ints, default ``'infer'``
rather than the first line of the file.
names : array-like, default ``None``
List of column names to use. If file contains no header row, then you should
explicitly pass ``header=None``.
explicitly pass ``header=None``. Duplicates in this list are not allowed unless
``mangle_dupe_cols=True``, which is the default.
index_col : int or sequence or ``False``, default ``None``
Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end of
@@ -139,6 +140,8 @@ prefix : str, default ``None``
Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
mangle_dupe_cols : boolean, default ``True``
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'.
Passing in False will cause data to be overwritten if there are duplicate
names in the columns.

General Parsing Configuration
+++++++++++++++++++++++++++++
@@ -432,6 +435,42 @@ If the header is in a row other than the first, pass the row number to
data = 'skip this skip it\na,b,c\n1,2,3\n4,5,6\n7,8,9'
pd.read_csv(StringIO(data), header=1)

.. _io.dupe_names:

Duplicate names parsing
'''''''''''''''''''''''

If the file or header contains duplicate names, pandas by default will deduplicate
these names so as to prevent data overwrite:

.. ipython :: python

data = 'a,b,a\n0,1,2\n3,4,5'
pd.read_csv(StringIO(data))

There is no more duplicate data because ``mangle_dupe_cols=True`` by default, which modifies
a series of duplicate columns 'X'...'X' to become 'X.0'...'X.N'. If ``mangle_dupe_cols
=False``, duplicate data can arise:

.. code-block :: python

In [2]: data = 'a,b,a\n0,1,2\n3,4,5'
In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
Out[3]:
a b a
0 2 1 2
1 5 4 5

To prevent users from encountering this problem with duplicate data, a ``ValueError``
exception is raised if ``mangle_dupe_cols != True``:

.. code-block :: python

In [2]: data = 'a,b,a\n0,1,2\n3,4,5'
In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
...
ValueError: Setting mangle_dupe_cols=False is not supported yet
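
The renaming described in this section can be sketched in plain Python. This is a minimal standalone version of the `_maybe_dedup_names` logic added in this PR; the function name `dedup_names` is hypothetical and is not part of the pandas API. Judging by the tests in this PR, the first occurrence keeps its original name and only later repeats receive a numeric suffix.

```python
def dedup_names(names):
    """Return a copy of `names` in which repeats are suffixed '.1', '.2', ...

    Hypothetical helper for illustration only; it mirrors the loop in
    `_maybe_dedup_names` from this PR but is not a pandas function.
    """
    names = list(names)  # copy so we can assign by index
    counts = {}

    for i, col in enumerate(names):
        cur_count = counts.get(col, 0)

        # Only the second and later occurrences are renamed.
        if cur_count > 0:
            names[i] = '%s.%d' % (col, cur_count)

        counts[col] = cur_count + 1

    return names


print(dedup_names(['a', 'b', 'a']))            # ['a', 'b', 'a.1']
print(dedup_names(['A', 'A', 'B', 'B', 'B']))  # ['A', 'A.1', 'B', 'B.1', 'B.2']
```

Because the counter is keyed on the original name, the sketch does not check whether a generated name such as `a.1` collides with a column that is literally called `a.1`.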

.. _io.usecols:

Filtering columns (``usecols``)
27 changes: 27 additions & 0 deletions doc/source/whatsnew/v0.18.2.txt
@@ -19,10 +19,37 @@ Highlights include:
New features
~~~~~~~~~~~~

.. _whatsnew_0182.enhancements.read_csv_dupe_col_names_support:

``pd.read_csv`` has improved support for duplicate column names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:ref:`Duplicate column names <io.dupe_names>` are now supported in ``pd.read_csv()`` whether
they are in the file or passed in as the ``names`` parameter (:issue:`7160`, :issue:`9424`)

.. ipython :: python

data = '0,1,2\n3,4,5'
names = ['a', 'b', 'a']

Previous behaviour:

.. code-block:: ipython

In [2]: pd.read_csv(StringIO(data), names=names)
Out[2]:
a b a
0 2 1 2
1 5 4 5

The first 'a' column contains the same data as the second 'a' column, when it should have
Contributor

put a :ref: to the new doc-section you made

Member Author

Added some sort of :ref:...not 100% sure if it's right though.

contained the array ``[0, 3]``.
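
One plausible reading of the previous behaviour, based on the `dict(zip(names, ...))` pattern visible in `parsers.py` in this diff, is that column data was stored in a dict keyed by column name, so a repeated name silently replaced the earlier column. The following is a minimal sketch of that effect in plain Python, not the actual pandas code path; all variable names are illustrative.

```python
names = ['a', 'b', 'a']
# Column-wise data for the rows "0,1,2" and "3,4,5".
columns = [[0, 3], [1, 4], [2, 5]]

# Keying by name: the second 'a' overwrites the first.
by_name = dict(zip(names, columns))
looked_up = [by_name[name] for name in names]
print(looked_up)  # [[2, 5], [1, 4], [2, 5]]

# With unique (mangled) names, every column survives.
unique_names = ['a', 'b', 'a.1']
by_unique = dict(zip(unique_names, columns))
restored = [by_unique[name] for name in unique_names]
print(restored)   # [[0, 3], [1, 4], [2, 5]]
```

The first result matches the "previous behaviour" output above, where both `a` columns hold `[2, 5]`.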

New behaviour:

.. ipython :: python

In [2]: pd.read_csv(StringIO(data), names=names)
Contributor

when you use an ipython block, you don't write out results, just the code, eg.

pd.read_csv(StringIO(data), names=names)

Member Author

Ah, right. Done.


.. _whatsnew_0182.enhancements.other:

58 changes: 47 additions & 11 deletions pandas/io/parsers.py
@@ -73,7 +73,8 @@
rather than the first line of the file.
names : array-like, default None
List of column names to use. If file contains no header row, then you
should explicitly pass header=None
should explicitly pass header=None. Duplicates in this list are not
allowed unless mangle_dupe_cols=True, which is the default.
index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end
@@ -91,7 +92,9 @@
prefix : str, default None
Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
Duplicate columns will be specified as 'X.0'...'X.N', rather than
'X'...'X'. Passing in False will cause data to be overwritten if there
Contributor

maybe add something to the names part of the doc-string as well (e.g. saying if you have dupes then pay attention to mangle_dupe_cols)

Member Author

Fair enough. Done.

are duplicate names in the columns.
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
(Unsupported with engine='python'). Use `str` or `object` to preserve and
@@ -655,7 +658,14 @@ def _get_options_with_defaults(self, engine):
options = {}

for argname, default in compat.iteritems(_parser_defaults):
options[argname] = kwds.get(argname, default)
value = kwds.get(argname, default)

# see gh-12935
if argname == 'mangle_dupe_cols' and not value:
raise ValueError('Setting mangle_dupe_cols=False is '
'not supported yet')
else:
options[argname] = value

for argname, default in compat.iteritems(_c_parser_defaults):
if argname in kwds:
@@ -899,6 +909,7 @@ def __init__(self, kwds):
self.true_values = kwds.get('true_values')
self.false_values = kwds.get('false_values')
self.tupleize_cols = kwds.get('tupleize_cols', False)
self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True)
self.infer_datetime_format = kwds.pop('infer_datetime_format', False)

self._date_conv = _make_date_converter(
@@ -1012,6 +1023,26 @@ def tostr(x):

return names, index_names, col_names, passed_names

def _maybe_dedup_names(self, names):
# see gh-7160 and gh-9424: this helps to provide
# immediate alleviation of the duplicate names
# issue and appears to be satisfactory to users,
# but ultimately, not needing to butcher the names
# would be nice!
if self.mangle_dupe_cols:
names = list(names) # so we can index
counts = {}

for i, col in enumerate(names):
cur_count = counts.get(col, 0)

if cur_count > 0:
names[i] = '%s.%d' % (col, cur_count)

counts[col] = cur_count + 1
Contributor

For future, defaultdict(int) is a more pythonic way of implementing this

Member Author

More pythonic? Hmm...I figured unless you were going to use functionality in defaultdict that isn't in dict, using defaultdict is slight overkill. But that is just my opinion.

Contributor

You are using the functionality - you wouldn't need cur_count = counts.get(col, 0) since counts[col] = cur_count + 1 does this for you if it's a defaultdict

Member Author

Sorry, what I should have said was "unless you were going to use functionality in defaultdict that can't be easily emulated in dict, using defaultdict is slight overkill"
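
For reference, the two approaches discussed in this thread can be compared side by side. This is a minimal sketch under the assumption that only the counting logic differs; both function names are hypothetical and neither is the code that was merged.

```python
from collections import defaultdict


def dedup_with_get(names):
    """Variant from the PR: plain dict with an explicit default via .get()."""
    names = list(names)
    counts = {}
    for i, col in enumerate(names):
        cur_count = counts.get(col, 0)
        if cur_count > 0:
            names[i] = '%s.%d' % (col, cur_count)
        counts[col] = cur_count + 1
    return names


def dedup_with_defaultdict(names):
    """Variant suggested in review: defaultdict(int) supplies the 0 default."""
    names = list(names)
    counts = defaultdict(int)
    for i, col in enumerate(names):
        # Reading counts[col] inserts 0 for a name not seen before.
        if counts[col] > 0:
            names[i] = '%s.%d' % (col, counts[col])
        counts[col] += 1
    return names


sample = ['A', 'A', 'B', 'B', 'B']
print(dedup_with_get(sample))          # ['A', 'A.1', 'B', 'B.1', 'B.2']
print(dedup_with_defaultdict(sample))  # ['A', 'A.1', 'B', 'B.1', 'B.2']
```

Both variants produce the same names; the difference is stylistic, which is consistent with the disagreement above being one of preference rather than correctness.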


return names

def _maybe_make_multi_index_columns(self, columns, col_names=None):
# possibly create a column mi here
if (not self.tupleize_cols and len(columns) and
@@ -1314,10 +1345,11 @@ def read(self, nrows=None):
except StopIteration:
if self._first_chunk:
self._first_chunk = False
names = self._maybe_dedup_names(self.orig_names)

index, columns, col_dict = _get_empty_meta(
self.orig_names, self.index_col,
self.index_names, dtype=self.kwds.get('dtype'))
names, self.index_col, self.index_names,
dtype=self.kwds.get('dtype'))

if self.usecols is not None:
columns = self._filter_usecols(columns)
@@ -1361,6 +1393,8 @@ def read(self, nrows=None):
if self.usecols is not None:
names = self._filter_usecols(names)

names = self._maybe_dedup_names(names)

# rename dict keys
data = sorted(data.items())
data = dict((k, v) for k, (i, v) in zip(names, data))
@@ -1373,6 +1407,7 @@ def read(self, nrows=None):

# ugh, mutation
names = list(self.orig_names)
names = self._maybe_dedup_names(names)

if self.usecols is not None:
names = self._filter_usecols(names)
@@ -1567,7 +1602,6 @@ def __init__(self, f, **kwds):
self.skipinitialspace = kwds['skipinitialspace']
self.lineterminator = kwds['lineterminator']
self.quoting = kwds['quoting']
self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True)
self.usecols = _validate_usecols_arg(kwds['usecols'])
self.skip_blank_lines = kwds['skip_blank_lines']

@@ -1756,8 +1790,8 @@ def read(self, rows=None):
columns = list(self.orig_names)
if not len(content): # pragma: no cover
# DataFrame with the right metadata, even though it's length 0
return _get_empty_meta(self.orig_names,
self.index_col,
names = self._maybe_dedup_names(self.orig_names)
return _get_empty_meta(names, self.index_col,
Contributor

are we testing with empties as well?

Member Author

Yes, that's why I'm closing #9424. There were skipped tests added in a previous PR that I am now no longer skipping.

self.index_names)

# handle new style for names in index
@@ -1770,26 +1804,28 @@ def read(self, rows=None):
alldata = self._rows_to_cols(content)
data = self._exclude_implicit_index(alldata)

columns, data = self._do_date_conversions(self.columns, data)
columns = self._maybe_dedup_names(self.columns)
columns, data = self._do_date_conversions(columns, data)

data = self._convert_data(data)
index, columns = self._make_index(data, alldata, columns, indexnamerow)

return index, columns, data

def _exclude_implicit_index(self, alldata):
names = self._maybe_dedup_names(self.orig_names)

if self._implicit_index:
excl_indices = self.index_col

data = {}
offset = 0
for i, col in enumerate(self.orig_names):
for i, col in enumerate(names):
while i + offset in excl_indices:
offset += 1
data[col] = alldata[i + offset]
else:
data = dict((k, v) for k, v in zip(self.orig_names, alldata))
data = dict((k, v) for k, v in zip(names, alldata))

return data

23 changes: 9 additions & 14 deletions pandas/io/tests/parser/c_parser_only.py
@@ -293,23 +293,18 @@ def test_empty_with_mangled_column_pass_dtype_by_indexes(self):
{'one': np.empty(0, dtype='u1'), 'one.1': np.empty(0, dtype='f')})
tm.assert_frame_equal(result, expected, check_index_type=False)

def test_empty_with_dup_column_pass_dtype_by_names(self):
data = 'one,one'
result = self.read_csv(
StringIO(data), mangle_dupe_cols=False, dtype={'one': 'u1'})
expected = pd.concat([Series([], name='one', dtype='u1')] * 2, axis=1)
tm.assert_frame_equal(result, expected, check_index_type=False)

def test_empty_with_dup_column_pass_dtype_by_indexes(self):
# FIXME in gh-9424
raise nose.SkipTest(
"gh-9424; known failure read_csv with duplicate columns")
# see gh-9424
expected = pd.concat([Series([], name='one', dtype='u1'),
Series([], name='one.1', dtype='f')], axis=1)

data = 'one,one'
result = self.read_csv(
StringIO(data), mangle_dupe_cols=False, dtype={0: 'u1', 1: 'f'})
expected = pd.concat([Series([], name='one', dtype='u1'),
Series([], name='one', dtype='f')], axis=1)
result = self.read_csv(StringIO(data), dtype={0: 'u1', 1: 'f'})
tm.assert_frame_equal(result, expected, check_index_type=False)

data = ''
result = self.read_csv(StringIO(data), names=['one', 'one'],
dtype={0: 'u1', 1: 'f'})
tm.assert_frame_equal(result, expected, check_index_type=False)

def test_usecols_dtypes(self):
21 changes: 16 additions & 5 deletions pandas/io/tests/parser/common.py
@@ -243,6 +243,8 @@ def test_unnamed_columns(self):
'Unnamed: 4'])

def test_duplicate_columns(self):
# TODO: add test for condition 'mangle_dupe_cols=False'
# once it is actually supported (gh-12935)
Contributor

Do we have any other tests that touch mangle_dupe_cols? (I'll suppose not.) If we do, move them near here.

Member Author

AFAICT, all of them are in common.py. I couldn't find others.

data = """A,A,B,B,B
1,2,3,4,5
6,7,8,9,10
@@ -256,11 +258,6 @@ def test_duplicate_columns(self):
self.assertEqual(list(df.columns),
['A', 'A.1', 'B', 'B.1', 'B.2'])

df = getattr(self, method)(StringIO(data), sep=',',
mangle_dupe_cols=False)
self.assertEqual(list(df.columns),
['A', 'A', 'B', 'B', 'B'])

df = getattr(self, method)(StringIO(data), sep=',',
mangle_dupe_cols=True)
self.assertEqual(list(df.columns),
@@ -1281,3 +1278,17 @@ def test_euro_decimal_format(self):
self.assertEqual(df2['Number1'].dtype, float)
self.assertEqual(df2['Number2'].dtype, float)
self.assertEqual(df2['Number3'].dtype, float)

def test_read_duplicate_names(self):
# See gh-7160
data = "a,b,a\n0,1,2\n3,4,5"
df = self.read_csv(StringIO(data))
Contributor

Where is the test for when mangle_dupe_cols=False and you have dupe names? It raises ValueError, right?

Member Author

@gfyoung May 18, 2016

For future reference: question addressed in discussion below

expected = DataFrame([[0, 1, 2], [3, 4, 5]],
columns=['a', 'b', 'a.1'])
tm.assert_frame_equal(df, expected)

data = "0,1,2\n3,4,5"
df = self.read_csv(StringIO(data), names=["a", "b", "a"])
expected = DataFrame([[0, 1, 2], [3, 4, 5]],
columns=['a', 'b', 'a.1'])
tm.assert_frame_equal(df, expected)
7 changes: 0 additions & 7 deletions pandas/io/tests/parser/test_parsers.py
@@ -84,13 +84,6 @@ def read_table(self, *args, **kwds):


class TestPythonParser(BaseParser, PythonParserTests, tm.TestCase):
"""
Class for Python parser testing. Unless specifically stated
as a PythonParser-specific issue, the goal is to eventually move
as many of these tests into ParserTests as soon as the C parser
can accept further specific arguments when parsing.
"""

engine = 'python'
float_precision_choices = [None]

10 changes: 10 additions & 0 deletions pandas/io/tests/parser/test_unsupported.py
@@ -20,6 +20,16 @@


class TestUnsupportedFeatures(tm.TestCase):
def test_mangle_dupe_cols_false(self):
# see gh-12935
data = 'a b c\n1 2 3'
msg = 'is not supported'

for engine in ('c', 'python'):
with tm.assertRaisesRegexp(ValueError, msg):
Contributor

Oh, it's here. But does this depend on whether names is passed and/or are dupes?

Member Author

No. As of now, it is an unsupported feature because it fails with a duplicate header (or names) when you set mangle_dupe_cols=False, so it is treated like any other unsupported feature in parsers.py: we just don't allow it, period.
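
To illustrate the point made in this reply, here is a minimal sketch of the check added to `_get_options_with_defaults` in this PR, reduced to a standalone function. The `PARSER_DEFAULTS` dict and the function name are hypothetical stand-ins for illustration; only the error message and the condition are taken from the diff.

```python
# Hypothetical, heavily reduced stand-in for the parser defaults table.
PARSER_DEFAULTS = {'mangle_dupe_cols': True, 'names': None}


def get_options_with_defaults(kwds):
    """Fill in defaults and reject mangle_dupe_cols=False unconditionally."""
    options = {}

    for argname, default in PARSER_DEFAULTS.items():
        value = kwds.get(argname, default)

        # The check looks only at the flag, never at the names themselves.
        if argname == 'mangle_dupe_cols' and not value:
            raise ValueError('Setting mangle_dupe_cols=False is '
                             'not supported yet')

        options[argname] = value

    return options


for kwds in ({'mangle_dupe_cols': False},
             {'mangle_dupe_cols': False, 'names': ['a', 'b', 'c']},
             {'mangle_dupe_cols': False, 'names': ['a', 'b', 'a']}):
    try:
        get_options_with_defaults(kwds)
    except ValueError as err:
        print('rejected:', err)
```

All three calls are rejected, whether or not `names` is passed and whether or not it contains duplicates, which is the behaviour the test in `test_unsupported.py` relies on.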

read_csv(StringIO(data), engine=engine,
mangle_dupe_cols=False)

def test_c_engine(self):
# see gh-6607
data = 'a b c\n1 2 3'