-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: MultiIndex not dropping nan level and invalid code value #26408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26408 +/- ##
==========================================
- Coverage 91.69% 91.68% -0.01%
==========================================
Files 174 174
Lines 50743 50749 +6
==========================================
+ Hits 46527 46530 +3
- Misses 4216 4219 +3
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26408 +/- ##
==========================================
+ Coverage 91.69% 91.84% +0.14%
==========================================
Files 174 174
Lines 50743 50658 -85
==========================================
- Hits 46527 46525 -2
+ Misses 4216 4133 -83
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -387,8 +387,8 @@ MultiIndex | |||
^^^^^^^^^^ | |||
|
|||
- Bug in which incorrect exception raised by :class:`Timedelta` when testing the membership of :class:`MultiIndex` (:issue:`24570`) | |||
- | |||
- | |||
- Bug in having code value < -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make something like
Bug in :class:`Multindex` construction from
levelsand
codesthat would incorrectly allows
NaNlevels
Should we reserve negative codes for use by pandas, like we do for Categorical? |
yes |
|
||
|
||
def test_na_levels(): | ||
tm.assert_index_equal( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment with the issue number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue number added
MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]], | ||
codes=[[0, -1, 1, 2, 3, 4]]), | ||
MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]], | ||
codes=[[-1, -1, -1, -1, 3, 4]])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use
result=
expected=
tm.assert_index_equal(result, expected)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test cases refactored
pandas/core/indexes/multi.py
Outdated
return result | ||
|
||
@classmethod | ||
def _reassign_na_codes(cls, level, code): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this _validate_codes, pls add a doc-string.
This should be an instance method (and not a classmethod)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
method level changed and moved into validation method
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -387,8 +387,8 @@ MultiIndex | |||
^^^^^^^^^^ | |||
|
|||
- Bug in which incorrect exception raised by :class:`Timedelta` when testing the membership of :class:`MultiIndex` (:issue:`24570`) | |||
- | |||
- | |||
- Bug in :class:`Multindex` construction from levels and codes that would incorrectly allows code values < -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a single note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
notes merged
pandas/core/indexes/multi.py
Outdated
@@ -243,10 +243,23 @@ def __new__(cls, levels=None, codes=None, sortorder=None, names=None, | |||
|
|||
if verify_integrity: | |||
result._verify_integrity() | |||
|
|||
codes = [cls._reassign_na_codes(level, code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should actually be IN _verify_integrity
MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]], | ||
codes=[[-1, -1, -1, -1, 3, 4]])) | ||
tm.assert_index_equal( | ||
MultiIndex(levels=[[np.nan, 's', pd.NaT, 128, None]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should also verify if .set_levels and/or .set_codes is called. Can you add tests for each.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new test cases added
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -387,8 +387,7 @@ MultiIndex | |||
^^^^^^^^^^ | |||
|
|||
- Bug in which incorrect exception raised by :class:`Timedelta` when testing the membership of :class:`MultiIndex` (:issue:`24570`) | |||
- | |||
- | |||
- Bug in :class:`Multindex` construction from levels and codes that would incorrectly allows code values < -1 or NaN levels (:issue:`19387`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May need to move this under the "API breaking changes" section, as previously we did allow codes < 0?.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the whatsnew entry to an API breaking change
@@ -67,6 +67,7 @@ def test_constructor_mismatched_codes_levels(idx): | |||
length_error = (r"On level 0, code max \(3\) >= length of level \(1\)\." | |||
" NOTE: this index is in an inconsistent state") | |||
label_error = r"Unequal code lengths: \[4, 2\]" | |||
code_value_error = (r"On level 0, code value \(-2\) < -1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need the outer parentheses here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -387,8 +387,7 @@ MultiIndex | |||
^^^^^^^^^^ | |||
|
|||
- Bug in which incorrect exception raised by :class:`Timedelta` when testing the membership of :class:`MultiIndex` (:issue:`24570`) | |||
- | |||
- | |||
- Bug in :class:`Multindex` construction from levels and codes that would incorrectly allows code values < -1 or NaN levels (:issue:`19387`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah sure
pandas/core/indexes/multi.py
Outdated
return result | ||
|
||
def _validate_codes(cls, level, code): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this accept self, and add doc-string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed and doc-string added
pandas/core/indexes/multi.py
Outdated
" level (%d). NOTE: this index is in an" | ||
" inconsistent state" % (i, level_codes.max(), | ||
len(level))) | ||
raise ValueError("On level {level}, code max ({max_code})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can do like this
msg = "On level...........".format(....)
raise ValueError(msg)
(and if needed make the message nicely formatted over multiple lines, with parens e.g.
msg = ("........"
".......".format)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatting updated
pandas/core/indexes/multi.py
Outdated
@@ -696,8 +713,9 @@ def _set_codes(self, codes, level=None, copy=False, validate=True, | |||
|
|||
if verify_integrity: | |||
self._verify_integrity(codes=new_codes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you change _verfiy_integrity to return codes, this then becomes
if verify_integrity:
new_codes = self._verify_integrity(codes=new_codes)
self._codes = new_codes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be missing it, but I don't see an ASV benchmark for MultiIndex with user-supplied codes. Do we suspect this will have a (major) performance impact? Should we add an ASV for it?
doc/source/whatsnew/v0.25.0.rst
Outdated
.. _whatsnew_0250.api_breaking.multi_indexing: | ||
|
||
|
||
MultiIndexing contracted from levels and codes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think "contracted" is the right word. Maybe "Changed allowed codes in MultiIndex"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the typo in whatsnew, I think I meant constructed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, some minor comments.
can you run the asv suite and see if any regressions
pandas/core/indexes/multi.py
Outdated
return result | ||
|
||
def _validate_codes(self, level, code): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you try to type annotate level and code here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about type annotation, could you please point me to any example in the code base?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure we have started generic codes in pandas._typing (see how they are used); this can be a followup
pandas/core/indexes/multi.py
Outdated
|
||
Parameters | ||
---------- | ||
code : optional list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code : list, optional
(same with level)
pandas/core/indexes/multi.py
Outdated
code : optional list | ||
Code to reassign. | ||
level : optional list | ||
Level to check for Nan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nan -> missing value (NaN, NaT, None)
pandas/core/indexes/multi.py
Outdated
codes = [self._validate_codes(level, code) | ||
for level, code in zip(levels, codes)] | ||
new_codes = FrozenList( | ||
_ensure_frozen(level_codes, lev, copy=False)._shallow_copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need the _shallow_copy()? the constructor (of frozen) should do this
@jreback before after ratio
|
pandas/core/indexes/multi.py
Outdated
return result | ||
|
||
def _validate_codes(self, level, code): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure we have started generic codes in pandas._typing (see how they are used); this can be a followup
pandas/core/indexes/multi.py
Outdated
|
||
Parameters | ||
---------- | ||
code : list, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are not optional
pandas/core/indexes/multi.py
Outdated
if verify_integrity: | ||
self._verify_integrity(codes=new_codes) | ||
|
||
new_codes = self._verify_integrity(codes=new_codes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this only be called if verfiy_integrity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, changed
@@ -1760,9 +1799,10 @@ def __setstate__(self, state): | |||
|
|||
self._set_levels([Index(x) for x in levels], validate=False) | |||
self._set_codes(codes) | |||
new_codes = self._verify_integrity() | |||
self._set_codes(new_codes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we end up calling verify_integrity twice? here (and in _set_codes)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that the set_codes are called with default verify_integrity=False, verification would not be called twice
@jiangyue12392 on the asv's so looks like somewhat slower, can you see if we are always checking codes? (I think we are checking twice sometims, and shouldn't check unless verify_integrity=True). in the tests you may need to test with verfiy_integrity =True as well as the default (False) |
how’s perf now? (you can just show some before / after with a sample benchmark that showed up before) |
@jreback |
great will look soon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small test comment, otherwise lgtm. ping on green.
[[np.nan, 's', pd.NaT, 128, None]]) | ||
tm.assert_index_equal(result, expected) | ||
|
||
result = MultiIndex( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you replace the actual OP tests as well (e.g. the .dropna()); I don't think they are very different that what you have, but they also attempt to .dropna()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropna tests added, and green now.
thanks @jiangyue12392 this was very nice! a non-trivial change and were very responsive! keep em coming! |
git diff upstream/master -u -- "*.py" | flake8 --diff
Please take note that another bug is also fixed in this PR:
When the MultiIndex is constructed with code value < -1, a segmentation fault would be resulted in subsequent operations e.g. putting the MultiIndex in a DataFrame
Input:
A segmentation fault can be resulted by these commands