-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Preserve key order when using loc on MultiIndex DataFrame #29190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Preserve key order when using loc on MultiIndex DataFrame #29190
Conversation
can you merge master and ping on green for another look |
b5b7be0
to
0f30bea
Compare
@jreback Of course. |
columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]], | ||
) | ||
|
||
res = df.loc[["b", "a"], :] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you test selection of columns as well? Looks like you have the data for it already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test for df.loc[:, ["Colorado", "Ohio"]]
and df.loc[:, (["Colorado", "Ohio"], ["Red", "Green"])]`.
Also, I add thisdf.loc[(slice(None), [2, 1]), :]
, testing if the result was
Ohio Colorado
Green Red Green
a 2 3 4 5
b 2 9 10 11
a 1 0 1 2
b 1 6 7 8
but maybe a preferable result is this
Ohio Colorado
Green Red Green
a 2 3 4 5
1 0 1 2
b 2 9 10 11
1 6 7 8
This would be harder to obtain given the current implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is getting quite difficult to parse, i proposed some simplifications.
pandas/core/indexes/multi.py
Outdated
else: | ||
# no matches we are done | ||
return Int64Index([])._ndarray_values | ||
|
||
elif com.is_null_slice(k): | ||
# empty slice | ||
indexer = _update_indexer(None, indexer=indexer) | ||
# index is given to conserve the order of this level | ||
indexer = _update_indexer(Int64Index(np.arange(n)), indexer=indexer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove this statement entirely, it actually does't do anything (and neither does the original)
it just yields the indexer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part define the comportement when having something like
df.loc[(slice(None), [2, 1]), :]
(see my comment over it.
Other tests actually expect this to give the following results that keep the first level sorted exactly as existing, and the order of the subsequent level is then ignored.
One of this is test TestMultiIndexSlicers.test_per_axis_per_level_doc_examples
in tests/indexings.
Keeping this behavior will lead to this kind of result:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
I think the expected comportment can be one of this two:
# Ignoring the None slice order entirely
Ohio Colorado
Green Red Green
a 2 3 4 5
b 2 9 10 11
a 1 0 1 2
b 1 6 7 8
# Keeping the None slice order by code (and not by absolute position)
Ohio Colorado
Green Red Green
a 2 3 4 5
1 0 1 2
b 2 9 10 11
1 6 7 8
The first proposition is simpler to implement (ignore None slice as proposed).
Can you make a judgment call @jreback ?
can you merge master and will look again |
I just realized you actually had 2 impls, sorry (#28933) I think this one is more concise and easier to folow as it sorts in-line. which one do you prefer? are the tests the same? (this doesn't look as updated). |
Test are the same, but not in sync yet. I will update this one in priority now. For pro and cons on each impl, this one is more concise, sort in place, but may lack flexibility to handle more complex sort scheme (if ever needed). The other add sorting as a complete separate step, so maybe have a better separation of concern at the cost of added complexity. |
Testing return order of MultiIndex.loc MultiIndex.loc try to return the result in the same order as the key given.
From issue pandas-dev#22797. Loc did not respect order for MultiIndex Dataframe. It does for single index. When possible, the returned row now respect the given order for keys.
We need to order the slice(None) to comply with test from pandas/tests/indexing/multiindex/test_slice.py test_per_axis_per_level_doc_examples
50baa65
to
43df881
Compare
reading the other impl, i agree its simpler to grok. so changing my thoughts here. let's close this one and match tests on the other. |
This is another implementation to solve the issue described in #22797.
The other PR is #28933. When the other PR keep the main code identical and resort the indexer at the end, this implementation try to resolve the problem in the main loop of the original code.
The solution also as consequences on the performance when the key order is sorted as the index, so it add a flag to only keep the order if needed.
Benchmark
Benchmarks that have improved:
Benchmarks that have got worse:
Checklist
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff