-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: MultiIndex mangling during parsing (#18062) #18094
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -1106,6 +1106,11 @@ def _is_index_col(col): | |||
return col is not None and col is not False | |||
|
|||
|
|||
def _is_potential_multi_index(columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a docstring to this function.
columns=MultiIndex.from_tuples( | ||
[('A', 'one'), ('A', 'one.1'), | ||
('A', 'one.2'), ('B', 'two')])) | ||
tm.assert_frame_equal(df, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test for the following case:
data = """A,A,A,B\none,one,one.1,two\n0,40,34,0.1"""
Let's make sure your deduping is robust.
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -89,6 +89,7 @@ Bug Fixes | |||
|
|||
- Bug in ``pd.read_msgpack()`` with a non existent file is passed in Python 2 (:issue:`15296`) | |||
- Bug in ``DataFrame.groupby`` where key as tuple in a ``MultiIndex`` were interpreted as a list of keys (:issue:`17979`) | |||
- Bug in ``pd.read_csv`` where a multi-index with duplicate columns was not being mangled appropriately (:issue: `18062`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make the following changes:
``pd.read_csv``
->:func:`read_csv`
- "multi-index" ->
``MultiIndex``
- Remove the space between
:issue:
and the issue number
Codecov Report
@@ Coverage Diff @@
## master #18094 +/- ##
==========================================
+ Coverage 91.25% 91.26% +<.01%
==========================================
Files 163 163
Lines 50120 50125 +5
==========================================
+ Hits 45737 45745 +8
+ Misses 4383 4380 -3
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #18094 +/- ##
==========================================
+ Coverage 91.25% 91.26% +<.01%
==========================================
Files 163 163
Lines 50120 50125 +5
==========================================
+ Hits 45737 45745 +8
+ Misses 4383 4380 -3
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #18094 +/- ##
==========================================
- Coverage 91.27% 91.26% -0.02%
==========================================
Files 163 163
Lines 50123 50128 +5
==========================================
- Hits 45752 45749 -3
- Misses 4371 4379 +8
Continue to review full report at Codecov.
|
pandas/tests/io/parser/header.py
Outdated
df = self.read_csv(StringIO(data), header=[0, 1]) | ||
expected = DataFrame([[0, 40, 34, 0.1]], | ||
columns=MultiIndex.from_tuples( | ||
[('A', 'one'), ('A', 'one.1'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this not:
In [5]: MultiIndex.from_tuples([('A', np.nan), ('A', 'one'),('A', 'one'), ('B', 'two')]).values
Out[5]: array([('A', nan), ('A', 'one'), ('A', 'one'), ('B', 'two')], dtype=object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Jeff - I'm not sure I follow. Why would you expect there to be a NaN with those values? Or are you asking to add another test case that includes NaN values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well the first value is missing, yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't be. Assuming
data = """A,A,A,B\none,one,one.1,two\n0,40, 34,0.1"""
Then the first row has 3 A's and a B. The second row has 2 one's, 1 one.1 and a two in the data, so both rows essentially have 4 columns and no missing data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hahha, was reading '\none' as none. ok then.
columns=MultiIndex.from_tuples( | ||
[('A', 'one'), ('A', 'one.1'), | ||
('A', 'one.1.1'), ('B', 'two')])) | ||
tm.assert_frame_equal(df, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! Looking at your code, can we add one more test case here:
data = """A,A,A,B,B\none,one,one.1,two,two\n0,40, 34,0.1,0.1"""
@gfyoung merge if you are ok with this. lgtm. |
Thanks @WillAyd ! |
git diff upstream/master -u -- "*.py" | flake8 --diff