-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix reindexing with multi-indexed DataFrames #30766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix reindexing with multi-indexed DataFrames #30766
Conversation
Hello @ChrisRobo! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-04-08 05:17:02 UTC |
the formatting decisions on these lines are auto-generated output of |
@@ -183,6 +183,155 @@ def test_get_indexer(): | |||
idx1.get_indexer(idx2) | |||
|
|||
|
|||
def test_get_indexer_and_fill(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in all of these test cases, they illustrate the behavior up until 0.23.0. The assumption of this PR is that, in the absence of documentation about these changes to MultiIndex behavior in 0.23.0 (and that this PR does not break any existing tests), that those changes are bugs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don’t think the reindeximg result is valid here
pls show your original example at the top of the PR
include the case of reindex without pad/backfill
show the same example with an Index (factorize the tuples);
convince us why these should be different as that is what you are proposing (i think)
thanks for the quick feedback here! I included the original example from the issue (with both what I saw and what I expected to see) and also added the case of no filling (not included in the issue because I don't view that as a bug). Apologies for the ignorance (this is my first PR here), but what specifically are you looking for with tuple factorization? I'm having trouble constructing an index of tuples which is not itself a MultiIndex. My argument for why this is a bug is that backfilling/padding should respect python tuple ordering properties (which it appears to have done until version 0.23.0), e.g.
Happy to add more examples or any kind of additional explanation or validation that you'd like to help make this clearer. Thanks again for the responsiveness |
here's a more concise version of the example from the GH issue which still illustrates the problem, I think. I believe that since (0, 0) < (0, 1) < (0, 2) under python tuple ordering rules, that reindexing with (0, 1) with setup: >>> df = pd.DataFrame({'a': [0, 0], 'b': [0, 2], 'c': ['A', 'B']}).set_index(['a', 'b'])
>>> df
c
a b
0 0 A
2 B
>>>
>>> mi_2 = pd.MultiIndex.from_product([[0], [1]])
>>> mi_2
MultiIndex([(0, 1)],
) in both cases, without a >>> df.reindex(mi_2)
c
0 1 NaN filling with >>> df.reindex(mi_2, method='bfill')
c
0 1 A I believe it should be: >>> df.reindex(mi_2, method='bfill')
c
0 1 B similarly, as is, with >>> df.reindex(mi_2, method='pad')
c
0 1 NaN whereas I believe it should be >>> df.reindex(mi_2, method='pad')
c
0 1 A hopefully this is easier to read |
@ChrisRobo can you update your example to show a few values, instead of a single one. some should have values even in the reindex and some should not so as to see the effect of the reindex. |
sure (making a new comment in the interest of preserving the history of this thread) here's an example which I think is a bit more comprehensive. setup: >>> df = pd.DataFrame({
... 'a': [1, 1, 1, 4, 4, 4],
... 'b': [1, 5, 9, 1, 5, 9],
... 'c': ['A', 'B', 'C', 'D', 'E', 'F'],
... })
>>> df = df.set_index(['a', 'b'])
>>> df
c
a b
1 1 A
5 B
9 C
4 1 D
5 E
9 F
>>>
>>> mi_2 = pd.MultiIndex.from_tuples([
... (1, 0),
... (1, 5),
... (1, 7),
... (3, 2),
... (4, 9),
... (4, 11),
... (5, 2),
... ])
>>> mi_2
MultiIndex([(1, 0),
(1, 5),
(1, 7),
(3, 2),
(4, 9),
(4, 11),
(5, 2)],
)
>>> current behavior w/o a filling method (no proposed change): >>> df.reindex(mi_2)
c
1 0 NaN
5 B
7 NaN
3 2 NaN
4 9 F
11 NaN
5 2 NaN current behavior w/ >>> df.reindex(mi_2, method='backfill')
c
1 0 A
5 F
7 A
3 2 A
4 9 A
11 D
5 2 B
>>>
>>>
>>> df.reindex(mi_2, method='pad')
c
1 0 NaN
5 F
7 NaN
3 2 NaN
4 9 NaN
11 C
5 2 B whereas I think this is correct: >>> df.reindex(mi_2, method='backfill')
c
1 0 A
5 B
7 C
3 2 D
4 9 F
11 NaN
5 2 NaN
>>>
>>> df.reindex(mi_2, method='pad')
c
1 0 NaN
5 B
7 B
3 2 C
4 9 F
11 F this covers the case of values which are too large/too small (for backfilling/padding, respectively), cases which exist in the original and new index (i.e. (1, 5) and (4, 11)), values in the new index to which nothing is filled (i.e. (4, 5, E)), and backfilling/padding cases (i.e. those which are Let me know if there's a case which this is missing but which you think it should cover! |
ca6fbef
to
5df2152
Compare
@jreback @jbrockmendel when you get a chance, would one or both of you mind taking another look at this? I'm happy to add more examples or test coverage if you'd like |
will look. thought this was going to conflict with a recently merged refactor of MI indexing. pls merge master anyhow. |
ca13ff3
to
f5521e8
Compare
thanks! :). just rebased on top of master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a very complicated implementation which is adding quite a bit of code. It seems to me that you can simply re-order the codes to avoid the carrying operation entirely, no? this is similar to recoding the values due to a new factorization.
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -172,6 +172,7 @@ MultiIndex | |||
index=[["a", "a", "b", "b"], [1, 2, 1, 2]]) | |||
# Rows are now ordered as the requested keys | |||
df.loc[(['b', 'a'], [2, 1]), :] | |||
- Bug in :meth:`MultiIndex.get_indexer` incorrectly handling use of pad and backfill options (:issue:`29896`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make a separate sub section showing the before and the new results; yes its a bug fix but the effect is non-obvious. you can use the OP example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, let me know if the formatting is OK by you!
f5521e8
to
c29c2bd
Compare
@jreback @jbrockmendel thanks for the suggestions, sorry for the late reply here. I basically re-wrote this using what (I think is) a much simpler implementation, i.e. using something like an updated factorization of the MultiIndexes' tuples and passing that to the relevant filling method. I also added types wherever I could in cython code and made additions to the tests as requested. |
pandas/core/indexes/multi.py
Outdated
@@ -2351,7 +2351,9 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None): | |||
raise NotImplementedError( | |||
"tolerance not implemented yet for MultiIndex" | |||
) | |||
indexer = self._engine.get_indexer(target, method, limit) | |||
indexer = self._engine.__class__.get_indexer_and_fill( | |||
self.values, target, method=method, limit=limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it would be preferable (per @jbrockmendel's point about keeping Index
objects out of the cython code) to just pass the MultiIndex object's tuples directly to the cython code
# 1: 2.0 | ||
# 2: 5.0 | ||
# 3: 5.8 | ||
df = pd.DataFrame({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as before but added 3 0-level values for df
's (multi)index
69dd92b
to
7836cc9
Compare
rebased on top of master, perf (in terms of times) for relevant ASV benchmarks seems the same as before against master:
against 0.22.0 (last version where the correct functionality was in place):
and no problem re: reviews, thanks for taking the time to look at this! |
thanks @ChrisRobo great effort! and thanks for sticking with it. we are getting lots of PRs now-a-days and hard to give good review on them all. |
my pleasure, and thanks @jreback and @jbrockmendel for the time reviewing! |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Addresses an issue which appears to have existed since 0.23.0 where bugs in the
get_indexer()
method for theMultiIndex
class cause incorrect reindexing behavior for multi-indexed DataFrames.motivating, example, from (the issue)
as expected, without a
method
value:using
method="backfill"
, it is:but should (IMHO) be:
similarly, using
method="pad"
, it is:but should (IMHO) be: