
BUG: MultiIndex mangling during parsing (#18062) #18094


Merged: 1 commit merged into pandas-dev:master on Nov 6, 2017

Conversation

@WillAyd (Member) commented Nov 3, 2017

@gfyoung added the "IO CSV" (read_csv, to_csv) and "MultiIndex" labels Nov 3, 2017
@@ -1106,6 +1106,11 @@ def _is_index_col(col):
    return col is not None and col is not False


def _is_potential_multi_index(columns):
@gfyoung (Member) commented Nov 3, 2017:

Add a docstring to this function.
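
For reference, a minimal sketch of what the helper plus docstring might look like (an assumption based on how the function is used here, not necessarily the exact code that was merged):

def _is_potential_multi_index(columns):
    """
    Check whether or not the `columns` parameter
    could be converted into a MultiIndex.

    Parameters
    ----------
    columns : array-like
        Object which may or may not be convertible into a MultiIndex.

    Returns
    -------
    boolean : Whether or not columns could become a MultiIndex.
    """
    # MultiIndex here is pandas' MultiIndex, already imported at the top of parsers.py
    return (len(columns)
            and not isinstance(columns, MultiIndex)
            and all(isinstance(c, tuple) for c in columns))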

columns=MultiIndex.from_tuples(
    [('A', 'one'), ('A', 'one.1'),
     ('A', 'one.2'), ('B', 'two')]))
tm.assert_frame_equal(df, expected)
@gfyoung (Member) commented Nov 3, 2017:

Add a test for the following case:

data = """A,A,A,B\none,one,one.1,two\n0,40,34,0.1"""

Let's make sure your deduping is robust.
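
A sketch of how such a test could look, written stand-alone here for clarity. The expected tuples assume the second ('A', 'one') is mangled to ('A', 'one.1'), which then collides with the existing ('A', 'one.1') and becomes ('A', 'one.1.1'), matching the diff shown further below:

from io import StringIO

import pandas as pd
import pandas.util.testing as tm
from pandas import DataFrame, MultiIndex

data = "A,A,A,B\none,one,one.1,two\n0,40,34,0.1"
df = pd.read_csv(StringIO(data), header=[0, 1])
expected = DataFrame([[0, 40, 34, 0.1]],
                     columns=MultiIndex.from_tuples(
                         [('A', 'one'), ('A', 'one.1'),
                          ('A', 'one.1.1'), ('B', 'two')]))
tm.assert_frame_equal(df, expected)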

@@ -89,6 +89,7 @@ Bug Fixes

- Bug in ``pd.read_msgpack()`` with a non existent file is passed in Python 2 (:issue:`15296`)
- Bug in ``DataFrame.groupby`` where key as tuple in a ``MultiIndex`` were interpreted as a list of keys (:issue:`17979`)
- Bug in ``pd.read_csv`` where a multi-index with duplicate columns was not being mangled appropriately (:issue: `18062`)
A reviewer (Member) commented:

Can you make the following changes:

  • ``pd.read_csv`` -> :func:`read_csv`
  • "multi-index" -> ``MultiIndex``
  • Remove the space between :issue: and the issue number
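
Applying those suggestions, the entry would read roughly as follows (a sketch of the reworded line, not necessarily the exact final wording):

- Bug in :func:`read_csv` where a ``MultiIndex`` with duplicate columns was not being mangled appropriately (:issue:`18062`)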

@codecov

codecov bot commented Nov 3, 2017

Codecov Report

Merging #18094 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18094      +/-   ##
==========================================
+ Coverage   91.25%   91.26%   +<.01%     
==========================================
  Files         163      163              
  Lines       50120    50125       +5     
==========================================
+ Hits        45737    45745       +8     
+ Misses       4383     4380       -3
Flag Coverage Δ
#multiple 89.07% <100%> (+0.02%) ⬆️
#single 40.32% <57.14%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.53% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 27bbea7...5fdf1e3. Read the comment docs.

@codecov

codecov bot commented Nov 3, 2017

Codecov Report

Merging #18094 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18094      +/-   ##
==========================================
- Coverage   91.27%   91.26%   -0.02%     
==========================================
  Files         163      163              
  Lines       50123    50128       +5     
==========================================
- Hits        45752    45749       -3     
- Misses       4371     4379       +8
Flag Coverage Δ
#multiple 89.07% <100%> (ø) ⬆️
#single 40.32% <57.14%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.53% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/core/indexes/datetimes.py 95.5% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f11353...41eea13. Read the comment docs.

df = self.read_csv(StringIO(data), header=[0, 1])
expected = DataFrame([[0, 40, 34, 0.1]],
                     columns=MultiIndex.from_tuples(
                         [('A', 'one'), ('A', 'one.1'),
@jreback (Contributor) commented:

why is this not:

In [5]: MultiIndex.from_tuples([('A', np.nan), ('A', 'one'),('A', 'one'), ('B', 'two')]).values
Out[5]: array([('A', nan), ('A', 'one'), ('A', 'one'), ('B', 'two')], dtype=object)

@WillAyd (Member, Author) replied:

Hey Jeff - I'm not sure I follow. Why would you expect there to be a NaN with those values? Or are you asking to add another test case that includes NaN values?

@jreback (Contributor):

well the first value is missing, yes?

@WillAyd (Member, Author):

Shouldn't be. Assuming

data = """A,A,A,B\none,one,one.1,two\n0,40, 34,0.1"""

Then the first row has 3 A's and a B. The second row has 2 one's, 1 one.1 and a two in the data, so both rows essentially have 4 columns and no missing data
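
In other words (a small illustration, not code from the PR), the two header rows pair up into four complete tuples with nothing missing; the duplicate ('A', 'one') entries are what the new mangling logic then dedupes:

header_row_0 = ['A', 'A', 'A', 'B']
header_row_1 = ['one', 'one', 'one.1', 'two']
print(list(zip(header_row_0, header_row_1)))
# [('A', 'one'), ('A', 'one'), ('A', 'one.1'), ('B', 'two')]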

@jreback (Contributor):

hahha, was reading '\none' as none. ok then.

columns=MultiIndex.from_tuples(
    [('A', 'one'), ('A', 'one.1'),
     ('A', 'one.1.1'), ('B', 'two')]))
tm.assert_frame_equal(df, expected)
@gfyoung (Member) commented Nov 4, 2017:

Very nice! Looking at your code, can we add one more test case here:

data = """A,A,A,B,B\none,one,one.1,two,two\n0,40, 34,0.1,0.1"""
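
Following the same deduping pattern, a sketch of that additional case (reusing the imports from the earlier sketch; the whitespace before 34 is dropped here, and the duplicated ('B', 'two') is assumed to become ('B', 'two.1')):

data = "A,A,A,B,B\none,one,one.1,two,two\n0,40,34,0.1,0.1"
df = pd.read_csv(StringIO(data), header=[0, 1])
expected = DataFrame([[0, 40, 34, 0.1, 0.1]],
                     columns=MultiIndex.from_tuples(
                         [('A', 'one'), ('A', 'one.1'),
                          ('A', 'one.1.1'), ('B', 'two'),
                          ('B', 'two.1')]))
tm.assert_frame_equal(df, expected)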

@jreback added this to the 0.22.0 milestone Nov 6, 2017
@jreback (Contributor) commented Nov 6, 2017

@gfyoung merge if you are ok with this. lgtm.

@gfyoung merged commit 980f650 into pandas-dev:master Nov 6, 2017
@gfyoung (Member) commented Nov 6, 2017

Thanks @WillAyd !

@WillAyd deleted the multi-mangle branch November 6, 2017 22:09
watercrossing pushed a commit to watercrossing/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Labels: IO CSV (read_csv, to_csv), MultiIndex
Projects: None yet
Successfully merging this pull request may close these issues: Reading a CSV with duplicated MultiRow columns
4 participants