BUG, ENH: Add support for parsing duplicate columns #12935
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,10 +19,37 @@ Highlights include: | |
New features | ||
~~~~~~~~~~~~ | ||
|
||
.. _whatsnew_0182.enhancements.read_csv_dupe_col_names_support: | ||
|
||
``pd.read_csv`` has improved support for duplicate column names | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
:ref:`Duplicate column names <io.dupe_names>` are now supported in ``pd.read_csv()`` whether | ||
they are in the file or passed in as the ``names`` parameter (:issue:`7160`, :issue:`9424`) | ||
|
||
.. ipython :: python | ||
|
||
data = '0,1,2\n3,4,5' | ||
names = ['a', 'b', 'a'] | ||
|
||
Previous behaviour: | ||
|
||
.. code-block:: ipython | ||
|
||
In [2]: pd.read_csv(StringIO(data), names=names) | ||
Out[2]: | ||
a b a | ||
0 2 1 2 | ||
1 5 4 5 | ||
|
||
The first 'a' column contains the same data as the second 'a' column, when it should have | ||
contained the array ``[0, 3]``. | ||
|
||
New behaviour: | ||
|
||
.. ipython :: python | ||
|
||
In [2]: pd.read_csv(StringIO(data), names=names) | ||
Review comment: When you use an ipython block, you don't write out results, just the code, e.g.
Reply: Ah, right. Done.
||
|
||
.. _whatsnew_0182.enhancements.other: | ||
|
||
|
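The "previous behaviour" in the whatsnew entry above is consistent with column data being stored in a dict keyed by column name, which the parser diff in this PR does via `dict(... zip(names, alldata))`. The following is a minimal sketch of that effect, assuming plain Python lists in place of parsed arrays; the variable names are illustrative and are not pandas internals.

```python
# Simplified model of the parsed columns for the data '0,1,2\n3,4,5'.
columns = [[0, 3], [1, 4], [2, 5]]

# Duplicate names collapse to one dict key, so the last 'a' column wins.
names = ['a', 'b', 'a']
by_name = dict(zip(names, columns))
old_frame = [by_name[name] for name in names]

# Unique (mangled) names keep the data of every column.
mangled = ['a', 'b', 'a.1']
by_mangled = dict(zip(mangled, columns))
new_frame = [by_mangled[name] for name in mangled]

print(old_frame)  # [[2, 5], [1, 4], [2, 5]]
print(new_frame)  # [[0, 3], [1, 4], [2, 5]]
```

The first result reproduces the table shown under "Previous behaviour": both 'a' columns hold `[2, 5]`, and the array `[0, 3]` is lost.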
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -73,7 +73,8 @@ | |
rather than the first line of the file. | ||
names : array-like, default None | ||
List of column names to use. If file contains no header row, then you | ||
should explicitly pass header=None | ||
should explicitly pass header=None. Duplicates in this list are not | ||
allowed unless mangle_dupe_cols=True, which is the default. | ||
index_col : int or sequence or False, default None | ||
Column to use as the row labels of the DataFrame. If a sequence is given, a | ||
MultiIndex is used. If you have a malformed file with delimiters at the end | ||
|
@@ -91,7 +92,9 @@ | |
prefix : str, default None | ||
Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ... | ||
mangle_dupe_cols : boolean, default True | ||
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X' | ||
Duplicate columns will be specified as 'X.0'...'X.N', rather than | ||
'X'...'X'. Passing in False will cause data to be overwritten if there | ||
Review comment: maybe add something to the
Reply: Fair enough. Done.
||
are duplicate names in the columns. | ||
dtype : Type name or dict of column -> type, default None | ||
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} | ||
(Unsupported with engine='python'). Use `str` or `object` to preserve and | ||
|
@@ -655,7 +658,14 @@ def _get_options_with_defaults(self, engine): | |
options = {} | ||
|
||
for argname, default in compat.iteritems(_parser_defaults): | ||
options[argname] = kwds.get(argname, default) | ||
value = kwds.get(argname, default) | ||
|
||
# see gh-12935 | ||
if argname == 'mangle_dupe_cols' and not value: | ||
raise ValueError('Setting mangle_dupe_cols=False is ' | ||
'not supported yet') | ||
else: | ||
options[argname] = value | ||
|
||
for argname, default in compat.iteritems(_c_parser_defaults): | ||
if argname in kwds: | ||
|
@@ -899,6 +909,7 @@ def __init__(self, kwds): | |
self.true_values = kwds.get('true_values') | ||
self.false_values = kwds.get('false_values') | ||
self.tupleize_cols = kwds.get('tupleize_cols', False) | ||
self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True) | ||
self.infer_datetime_format = kwds.pop('infer_datetime_format', False) | ||
|
||
self._date_conv = _make_date_converter( | ||
|
@@ -1012,6 +1023,26 @@ def tostr(x): | |
|
||
return names, index_names, col_names, passed_names | ||
|
||
def _maybe_dedup_names(self, names): | ||
# see gh-7160 and gh-9424: this helps to provide | ||
# immediate alleviation of the duplicate names | ||
# issue and appears to be satisfactory to users, | ||
# but ultimately, not needing to butcher the names | ||
# would be nice! | ||
if self.mangle_dupe_cols: | ||
names = list(names) # so we can index | ||
counts = {} | ||
|
||
for i, col in enumerate(names): | ||
cur_count = counts.get(col, 0) | ||
|
||
if cur_count > 0: | ||
names[i] = '%s.%d' % (col, cur_count) | ||
|
||
counts[col] = cur_count + 1 | ||
Review comment: For future,
Reply: More pythonic? Hmm...I figured unless you were going to use functionality in
Review comment: You are using the functionality - you wouldn't need
Reply: Sorry, what I should have said was "unless you were going to use functionality in
||
|
||
return names | ||
|
||
def _maybe_make_multi_index_columns(self, columns, col_names=None): | ||
# possibly create a column mi here | ||
if (not self.tupleize_cols and len(columns) and | ||
|
@@ -1314,10 +1345,11 @@ def read(self, nrows=None): | |
except StopIteration: | ||
if self._first_chunk: | ||
self._first_chunk = False | ||
names = self._maybe_dedup_names(self.orig_names) | ||
|
||
index, columns, col_dict = _get_empty_meta( | ||
self.orig_names, self.index_col, | ||
self.index_names, dtype=self.kwds.get('dtype')) | ||
names, self.index_col, self.index_names, | ||
dtype=self.kwds.get('dtype')) | ||
|
||
if self.usecols is not None: | ||
columns = self._filter_usecols(columns) | ||
|
@@ -1361,6 +1393,8 @@ def read(self, nrows=None): | |
if self.usecols is not None: | ||
names = self._filter_usecols(names) | ||
|
||
names = self._maybe_dedup_names(names) | ||
|
||
# rename dict keys | ||
data = sorted(data.items()) | ||
data = dict((k, v) for k, (i, v) in zip(names, data)) | ||
|
@@ -1373,6 +1407,7 @@ def read(self, nrows=None): | |
|
||
# ugh, mutation | ||
names = list(self.orig_names) | ||
names = self._maybe_dedup_names(names) | ||
|
||
if self.usecols is not None: | ||
names = self._filter_usecols(names) | ||
|
@@ -1567,7 +1602,6 @@ def __init__(self, f, **kwds): | |
self.skipinitialspace = kwds['skipinitialspace'] | ||
self.lineterminator = kwds['lineterminator'] | ||
self.quoting = kwds['quoting'] | ||
self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True) | ||
self.usecols = _validate_usecols_arg(kwds['usecols']) | ||
self.skip_blank_lines = kwds['skip_blank_lines'] | ||
|
||
|
@@ -1756,8 +1790,8 @@ def read(self, rows=None): | |
columns = list(self.orig_names) | ||
if not len(content): # pragma: no cover | ||
# DataFrame with the right metadata, even though it's length 0 | ||
return _get_empty_meta(self.orig_names, | ||
self.index_col, | ||
names = self._maybe_dedup_names(self.orig_names) | ||
return _get_empty_meta(names, self.index_col, | ||
Review comment: Are we testing with empties as well?
Reply: Yes, that's why I'm closing #9424. There were skipped tests added in a previous PR that I am now no longer skipping.
||
self.index_names) | ||
|
||
# handle new style for names in index | ||
|
@@ -1770,26 +1804,28 @@ def read(self, rows=None): | |
alldata = self._rows_to_cols(content) | ||
data = self._exclude_implicit_index(alldata) | ||
|
||
columns, data = self._do_date_conversions(self.columns, data) | ||
columns = self._maybe_dedup_names(self.columns) | ||
columns, data = self._do_date_conversions(columns, data) | ||
|
||
data = self._convert_data(data) | ||
index, columns = self._make_index(data, alldata, columns, indexnamerow) | ||
|
||
return index, columns, data | ||
|
||
def _exclude_implicit_index(self, alldata): | ||
names = self._maybe_dedup_names(self.orig_names) | ||
|
||
if self._implicit_index: | ||
excl_indices = self.index_col | ||
|
||
data = {} | ||
offset = 0 | ||
for i, col in enumerate(self.orig_names): | ||
for i, col in enumerate(names): | ||
while i + offset in excl_indices: | ||
offset += 1 | ||
data[col] = alldata[i + offset] | ||
else: | ||
data = dict((k, v) for k, v in zip(self.orig_names, alldata)) | ||
data = dict((k, v) for k, v in zip(names, alldata)) | ||
|
||
return data | ||
|
||
|
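The renaming logic added in `_maybe_dedup_names` can be tried outside the parser. Below is a standalone sketch of the same counting approach; the free function `dedup_names` and its `mangle_dupe_cols` argument are a hypothetical adaptation for illustration (in the PR the logic is a method that reads `self.mangle_dupe_cols`).

```python
def dedup_names(names, mangle_dupe_cols=True):
    """Return a copy of names with repeated entries renamed 'X.1', 'X.2', ..."""
    names = list(names)  # copy, so the caller's sequence is left untouched
    if not mangle_dupe_cols:
        return names

    counts = {}  # how many times each original name has been seen so far
    for i, col in enumerate(names):
        cur_count = counts.get(col, 0)
        if cur_count > 0:
            # Second and later occurrences get a numeric suffix.
            names[i] = '%s.%d' % (col, cur_count)
        counts[col] = cur_count + 1
    return names


print(dedup_names(['A', 'A', 'B', 'B', 'B']))
# ['A', 'A.1', 'B', 'B.1', 'B.2']
```

The first occurrence keeps its original name, which matches the columns asserted in `test_duplicate_columns`. As written, this sketch does not check whether a mangled name collides with a name already present in the list.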
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -243,6 +243,8 @@ def test_unnamed_columns(self): | |
'Unnamed: 4']) | ||
|
||
def test_duplicate_columns(self): | ||
# TODO: add test for condition 'mangle_dupe_cols=False' | ||
# once it is actually supported (gh-12935) | ||
Review comment: Do we have any other tests that touch mangle_dupe_cols? (I'll suppose not.) If we do, move them near here.
Reply: AFAICT, all of them are in
||
data = """A,A,B,B,B | ||
1,2,3,4,5 | ||
6,7,8,9,10 | ||
|
@@ -256,11 +258,6 @@ def test_duplicate_columns(self): | |
self.assertEqual(list(df.columns), | ||
['A', 'A.1', 'B', 'B.1', 'B.2']) | ||
|
||
df = getattr(self, method)(StringIO(data), sep=',', | ||
mangle_dupe_cols=False) | ||
self.assertEqual(list(df.columns), | ||
['A', 'A', 'B', 'B', 'B']) | ||
|
||
df = getattr(self, method)(StringIO(data), sep=',', | ||
mangle_dupe_cols=True) | ||
self.assertEqual(list(df.columns), | ||
|
@@ -1281,3 +1278,17 @@ def test_euro_decimal_format(self): | |
self.assertEqual(df2['Number1'].dtype, float) | ||
self.assertEqual(df2['Number2'].dtype, float) | ||
self.assertEqual(df2['Number3'].dtype, float) | ||
|
||
def test_read_duplicate_names(self): | ||
# See gh-7160 | ||
data = "a,b,a\n0,1,2\n3,4,5" | ||
df = self.read_csv(StringIO(data)) | ||
Review comment: where is test when
Reply: For future reference: question addressed in discussion below.
||
expected = DataFrame([[0, 1, 2], [3, 4, 5]], | ||
columns=['a', 'b', 'a.1']) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
data = "0,1,2\n3,4,5" | ||
df = self.read_csv(StringIO(data), names=["a", "b", "a"]) | ||
expected = DataFrame([[0, 1, 2], [3, 4, 5]], | ||
columns=['a', 'b', 'a.1']) | ||
tm.assert_frame_equal(df, expected) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,6 +20,16 @@ | |
|
||
|
||
class TestUnsupportedFeatures(tm.TestCase): | ||
def test_mangle_dupe_cols_false(self): | ||
# see gh-12935 | ||
data = 'a b c\n1 2 3' | ||
msg = 'is not supported' | ||
|
||
for engine in ('c', 'python'): | ||
with tm.assertRaisesRegexp(ValueError, msg): | ||
Review comment: Oh, it's here. But does this depend on whether
Reply: No. As of now, it is an unsupported feature because it fails with a duplicate header (or
||
read_csv(StringIO(data), engine=engine, | ||
mangle_dupe_cols=False) | ||
|
||
def test_c_engine(self): | ||
# see gh-6607 | ||
data = 'a b c\n1 2 3' | ||
|
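The guard exercised by `test_mangle_dupe_cols_false` can be sketched without pandas. This is a simplified model of the check added to `_get_options_with_defaults`; the `PARSER_DEFAULTS` dict below is a hypothetical two-entry stand-in for the real `_parser_defaults`, and the function is a free function rather than a method.

```python
# Hypothetical subset of parser defaults, for illustration only.
PARSER_DEFAULTS = {'mangle_dupe_cols': True, 'delimiter': None}


def get_options_with_defaults(kwds):
    """Merge user keywords with defaults, rejecting mangle_dupe_cols=False."""
    options = {}
    for argname, default in PARSER_DEFAULTS.items():
        value = kwds.get(argname, default)

        # see gh-12935: a falsy value is rejected up front, whether or
        # not the data actually contains duplicate column names.
        if argname == 'mangle_dupe_cols' and not value:
            raise ValueError('Setting mangle_dupe_cols=False is '
                             'not supported yet')
        options[argname] = value
    return options


print(get_options_with_defaults({'delimiter': ','}))

try:
    get_options_with_defaults({'mangle_dupe_cols': False})
except ValueError as err:
    print(err)  # Setting mangle_dupe_cols=False is not supported yet
```

Because the check runs while the options are being resolved, the error is raised before any data is read, which is why the test data `'a b c\n1 2 3'` needs no duplicate names.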
Review comment: put a :ref: to the new doc-section you made
Reply: Added some sort of :ref: ...not 100% sure if it's right though.