
BUG: MultiIndex mangling during parsing (#18062) #18094


Merged: 1 commit merged into pandas-dev:master on Nov 6, 2017

Conversation

@WillAyd (Member) commented Nov 3, 2017

@gfyoung added the "IO CSV" (read_csv, to_csv) and "MultiIndex" labels Nov 3, 2017
@@ -1106,6 +1106,11 @@ def _is_index_col(col):
    return col is not None and col is not False


def _is_potential_multi_index(columns):
@gfyoung (Member) commented Nov 3, 2017:

Add a docstring to this function.
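
For reference, a minimal sketch of what the helper plus docstring might look like (an assumption based on how the function is used here, not necessarily the exact code that was merged):

def _is_potential_multi_index(columns):
    """
    Check whether or not the `columns` parameter
    could be converted into a MultiIndex.

    Parameters
    ----------
    columns : array-like
        Object which may or may not be convertible into a MultiIndex.

    Returns
    -------
    boolean : Whether or not columns could become a MultiIndex.
    """
    # MultiIndex here is pandas' MultiIndex, already imported at the top of parsers.py
    return (len(columns)
            and not isinstance(columns, MultiIndex)
            and all(isinstance(c, tuple) for c in columns))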

columns=MultiIndex.from_tuples(
    [('A', 'one'), ('A', 'one.1'),
     ('A', 'one.2'), ('B', 'two')]))
tm.assert_frame_equal(df, expected)
@gfyoung (Member) commented Nov 3, 2017:

Add a test for the following case:

data = """A,A,A,B\none,one,one.1,two\n0,40,34,0.1"""

Let's make sure your deduping is robust.
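
A sketch of how such a test could look, written stand-alone here for clarity. The expected tuples assume the second ('A', 'one') is mangled to ('A', 'one.1'), which then collides with the existing ('A', 'one.1') and becomes ('A', 'one.1.1'), matching the diff shown further below:

from io import StringIO

import pandas as pd
import pandas.util.testing as tm
from pandas import DataFrame, MultiIndex

data = "A,A,A,B\none,one,one.1,two\n0,40,34,0.1"
df = pd.read_csv(StringIO(data), header=[0, 1])
expected = DataFrame([[0, 40, 34, 0.1]],
                     columns=MultiIndex.from_tuples(
                         [('A', 'one'), ('A', 'one.1'),
                          ('A', 'one.1.1'), ('B', 'two')]))
tm.assert_frame_equal(df, expected)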

@@ -89,6 +89,7 @@ Bug Fixes

- Bug in ``pd.read_msgpack()`` with a non existent file is passed in Python 2 (:issue:`15296`)
- Bug in ``DataFrame.groupby`` where key as tuple in a ``MultiIndex`` were interpreted as a list of keys (:issue:`17979`)
- Bug in ``pd.read_csv`` where a multi-index with duplicate columns was not being mangled appropriately (:issue: `18062`)
A reviewer (Member) commented:

Can you make the following changes:

  • ``pd.read_csv`` -> :func:`read_csv`
  • "multi-index" -> ``MultiIndex``
  • Remove the space between :issue: and the issue number
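
Applying those suggestions, the entry would read roughly as follows (a sketch of the reworded line, not necessarily the exact final wording):

- Bug in :func:`read_csv` where a ``MultiIndex`` with duplicate columns was not being mangled appropriately (:issue:`18062`)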

@codecov

codecov bot commented Nov 3, 2017

Codecov Report

Merging #18094 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18094      +/-   ##
==========================================
+ Coverage   91.25%   91.26%   +<.01%     
==========================================
  Files         163      163              
  Lines       50120    50125       +5     
==========================================
+ Hits        45737    45745       +8     
+ Misses       4383     4380       -3
Flag Coverage Δ
#multiple 89.07% <100%> (+0.02%) ⬆️
#single 40.32% <57.14%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.53% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 27bbea7...5fdf1e3. Read the comment docs.

@codecov

codecov bot commented Nov 3, 2017

Codecov Report

Merging #18094 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18094      +/-   ##
==========================================
- Coverage   91.27%   91.26%   -0.02%     
==========================================
  Files         163      163              
  Lines       50123    50128       +5     
==========================================
- Hits        45752    45749       -3     
- Misses       4371     4379       +8
Flag Coverage Δ
#multiple 89.07% <100%> (ø) ⬆️
#single 40.32% <57.14%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.53% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/core/indexes/datetimes.py 95.5% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f11353...41eea13. Read the comment docs.

df = self.read_csv(StringIO(data), header=[0, 1])
expected = DataFrame([[0, 40, 34, 0.1]],
                     columns=MultiIndex.from_tuples(
                         [('A', 'one'), ('A', 'one.1'),
@jreback (Contributor) commented:

why is this not:

In [5]: MultiIndex.from_tuples([('A', np.nan), ('A', 'one'),('A', 'one'), ('B', 'two')]).values
Out[5]: array([('A', nan), ('A', 'one'), ('A', 'one'), ('B', 'two')], dtype=object)

@WillAyd (Member, Author) replied:

Hey Jeff - I'm not sure I follow. Why would you expect there to be a NaN with those values? Or are you asking to add another test case that includes NaN values?

@jreback (Contributor):

well the first value is missing, yes?

@WillAyd (Member, Author):

Shouldn't be. Assuming

data = """A,A,A,B\none,one,one.1,two\n0,40, 34,0.1"""

Then the first row has 3 A's and a B. The second row has 2 one's, 1 one.1 and a two in the data, so both rows essentially have 4 columns and no missing data
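
In other words (a small illustration, not code from the PR), the two header rows pair up into four complete tuples with nothing missing; the duplicate ('A', 'one') entries are what the new mangling logic then dedupes:

header_row_0 = ['A', 'A', 'A', 'B']
header_row_1 = ['one', 'one', 'one.1', 'two']
print(list(zip(header_row_0, header_row_1)))
# [('A', 'one'), ('A', 'one'), ('A', 'one.1'), ('B', 'two')]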

@jreback (Contributor):

hahha, was reading '\none' as none. ok then.

columns=MultiIndex.from_tuples(
    [('A', 'one'), ('A', 'one.1'),
     ('A', 'one.1.1'), ('B', 'two')]))
tm.assert_frame_equal(df, expected)
@gfyoung (Member) commented Nov 4, 2017:

Very nice! Looking at your code, can we add one more test case here:

data = """A,A,A,B,B\none,one,one.1,two,two\n0,40, 34,0.1,0.1"""
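
Following the same deduping pattern, a sketch of that additional case (reusing the imports from the earlier sketch; the whitespace before 34 is dropped here, and the duplicated ('B', 'two') is assumed to become ('B', 'two.1')):

data = "A,A,A,B,B\none,one,one.1,two,two\n0,40,34,0.1,0.1"
df = pd.read_csv(StringIO(data), header=[0, 1])
expected = DataFrame([[0, 40, 34, 0.1, 0.1]],
                     columns=MultiIndex.from_tuples(
                         [('A', 'one'), ('A', 'one.1'),
                          ('A', 'one.1.1'), ('B', 'two'),
                          ('B', 'two.1')]))
tm.assert_frame_equal(df, expected)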

@jreback added this to the 0.22.0 milestone Nov 6, 2017
@jreback (Contributor) commented Nov 6, 2017

@gfyoung merge if you are ok with this. lgtm.

@gfyoung merged commit 980f650 into pandas-dev:master Nov 6, 2017
@gfyoung (Member) commented Nov 6, 2017

Thanks @WillAyd !

@WillAyd deleted the multi-mangle branch November 6, 2017 22:09
watercrossing pushed a commit to watercrossing/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Labels: IO CSV (read_csv, to_csv), MultiIndex
Projects: None yet
Successfully merging this pull request may close these issues: Reading a CSV with duplicated MultiRow columns
4 participants