BUG: Preserve key order when using loc on MultiIndex DataFrame #29190

nrebena · 2019-10-23T18:31:44Z

This is another implementation to solve the issue described in #22797.

The other PR is #28933. When the other PR keep the main code identical and resort the indexer at the end, this implementation try to resolve the problem in the main loop of the original code.
The solution also as consequences on the performance when the key order is sorted as the index, so it add a flag to only keep the order if needed.

Benchmark

Benchmarks that have improved:

   before           after         ratio
 [e623f0fb]       [b5b7be06]
 <stash~5>        <multiindex_loc_order_issue_22797>

2.50±0.01ms         2.35±0ms     0.94  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'int')

2.55±0.01ms         2.40±0ms     0.94  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'int')

  767±0.5ms          707±2ms     0.92  groupby.Transform.time_transform_lambda_max

 2.74±0.2μs       2.49±0.2μs     0.91  index_cached_properties.IndexCache.time_inferred_type('Float64Index')

 2.64±0.1μs       2.51±0.2μs     0.95  index_cached_properties.IndexCache.time_is_all_dates('Float64Index')

   517±10μs         491±10μs     0.95  index_cached_properties.IndexCache.time_is_monotonic('MultiIndex')

    671±6μs          605±6μs     0.90  index_cached_properties.IndexCache.time_is_monotonic_decreasing('MultiIndex')

   514±10μs         489±10μs     0.95  index_cached_properties.IndexCache.time_is_monotonic_increasing('MultiIndex')

 6.61±0.3μs         5.68±1μs     0.86  index_cached_properties.IndexCache.time_shape('TimedeltaIndex')

 3.87±0.7μs       2.84±0.2μs     0.73  index_cached_properties.IndexCache.time_values('TimedeltaIndex')

2.87±0.03μs      2.72±0.01μs     0.95  indexing.GetItemSingleColumn.time_frame_getitem_single_column_label

1.19±0.02ms         787±20μs     0.66  indexing.MultiIndexing.time_frame_ix

1.11±0.02ms         722±20μs     0.65  indexing.MultiIndexing.time_series_ix

3.54±0.01ms      3.37±0.01ms     0.95  indexing.NonNumericSeriesIndexing.time_getitem_list_like('string', 'unique_monotonic_inc')

    644±2μs          568±1μs     0.88  multiindex_object.Duplicates.time_remove_unused_levels

    746±8μs          189±2μs     0.25  multiindex_object.GetLoc.time_large_get_loc

  169±0.6ms        102±0.4ms     0.61  multiindex_object.GetLoc.time_large_get_loc_warm

 94.2±0.2μs       23.8±0.3μs     0.25  multiindex_object.GetLoc.time_med_get_loc

 90.1±0.3ms       23.1±0.3ms     0.26  multiindex_object.GetLoc.time_med_get_loc_warm

 87.1±0.2ms       22.8±0.2ms     0.26  multiindex_object.GetLoc.time_small_get_loc_warm

 88.9±0.2μs       23.1±0.2μs     0.26  multiindex_object.GetLoc.time_string_get_loc

  700±0.8μs        550±0.6μs     0.79  multiindex_object.Values.time_datetime_level_values_sliced

1.65±0.03ms      1.21±0.02ms     0.73  reindex.LevelAlign.time_align_level

1.65±0.04ms      1.22±0.01ms     0.74  reindex.LevelAlign.time_reindex_level

 91.7±0.1ms       86.8±0.1ms     0.95  reshape.Crosstab.time_crosstab_normalize_margins

 25.5±0.4ms       24.0±0.4ms     0.94  reshape.PivotTable.time_pivot_table

 51.6±0.6ms       48.0±0.2ms     0.93  reshape.PivotTable.time_pivot_table_agg

19.7±0.05ms       18.4±0.2ms     0.93  reshape.PivotTable.time_pivot_table_categorical

14.2±0.03ms      13.1±0.04ms     0.92  reshape.PivotTable.time_pivot_table_categorical_observed

  107±0.8ms       94.0±0.9ms     0.88  reshape.PivotTable.time_pivot_table_margins

6.02±0.02ms      5.47±0.02ms     0.91  reshape.SimpleReshape.time_stack

2.46±0.02ms      2.20±0.02ms     0.89  reshape.SimpleReshape.time_unstack

   2.30±0ms      1.73±0.01ms     0.75  reshape.SparseIndex.time_unstack

 18.9±0.1ms      18.0±0.03ms     0.95  rolling.Pairwise.time_pairwise(10, 'cov', True)

   2.60±0ms         2.33±0ms     0.90  sparse.FromCoo.time_sparse_series_from_coo

 41.3±0.2ms       39.2±0.2ms     0.95  sparse.ToCoo.time_sparse_series_to_coo

    3.98±0s       3.75±0.01s     0.94  stat_ops.Correlation.time_corr_wide_nans('spearman', False)

 3.99±0.01s       3.74±0.01s     0.94  stat_ops.Correlation.time_corr_wide_nans('spearman', True)

40.9±0.07ms       36.8±0.1ms     0.90  stat_ops.FrameMultiIndexOps.time_op(0, 'kurt')

  195±0.3ms        165±0.3ms     0.85  stat_ops.FrameMultiIndexOps.time_op(1, 'kurt')

   87.7±1ms         78.1±1ms     0.89  stat_ops.FrameMultiIndexOps.time_op(1, 'skew')

    710±1ms          628±3ms     0.88  stat_ops.FrameMultiIndexOps.time_op([0, 1], 'skew')

 10.7±0.1ms       9.66±0.3ms     0.90  stat_ops.SeriesMultiIndexOps.time_op(0, 'kurt')

10.6±0.05ms       9.15±0.2ms     0.86  stat_ops.SeriesMultiIndexOps.time_op(0, 'skew')

 4.55±0.1ms      4.01±0.08ms     0.88  stat_ops.SeriesMultiIndexOps.time_op(0, 'std')

4.15±0.06ms       3.86±0.1ms     0.93  stat_ops.SeriesMultiIndexOps.time_op(0, 'var')

 49.4±0.4ms       43.3±0.2ms     0.88  stat_ops.SeriesMultiIndexOps.time_op(1, 'kurt')

  128±0.3ms        112±0.3ms     0.88  stat_ops.SeriesMultiIndexOps.time_op(1, 'mad')

 48.0±0.3ms       40.3±0.4ms     0.84  stat_ops.SeriesMultiIndexOps.time_op(1, 'skew')

4.55±0.08ms       3.97±0.1ms     0.87  stat_ops.SeriesMultiIndexOps.time_op(1, 'std')

 4.14±0.1ms      3.81±0.09ms     0.92  stat_ops.SeriesMultiIndexOps.time_op(1, 'var')

  434±0.5ms        357±0.4ms     0.82  stat_ops.SeriesMultiIndexOps.time_op([0, 1], 'kurt')

    1.20±0s          1.04±0s     0.86  stat_ops.SeriesMultiIndexOps.time_op([0, 1], 'mad')

  422±0.7ms          339±1ms     0.80  stat_ops.SeriesMultiIndexOps.time_op([0, 1], 'skew')

 48.6±0.2ms      40.5±0.07ms     0.83  timeseries.DatetimeIndex.time_to_pydatetime('repeated')

  288±0.9ms        271±0.8ms     0.94  timeseries.DatetimeIndex.time_to_pydatetime('tz_aware')

 48.3±0.1ms       40.6±0.1ms     0.84  timeseries.DatetimeIndex.time_to_pydatetime('tz_naive')

Benchmarks that have got worse:

   before           after         ratio
 [e623f0fb]       [b5b7be06]
 <stash~5>        <multiindex_loc_order_issue_22797>

  149±0.5ms        170±0.6ms     1.15  categoricals.Rank.time_rank_string

 7.23±0.2ms       8.38±0.3ms     1.16  categoricals.Rank.time_rank_string_cat

 9.81±0.1ms       10.4±0.1ms     1.06  categoricals.Rank.time_rank_string_cat_ordered

2.45±0.01ms      2.65±0.09ms     1.08  frame_methods.Iteration.time_items_cached

  227±0.3ms          239±4ms     1.06  groupby.GroupByMethods.time_dtype_as_field('datetime', 'unique', 'transformation')

  148±0.3ms        157±0.3ms     1.06  groupby.GroupByMethods.time_dtype_as_field('float', 'unique', 'direct')

  148±0.5ms        157±0.8ms     1.06  groupby.GroupByMethods.time_dtype_as_field('float', 'unique', 'transformation')

  144±0.5ms          153±3ms     1.06  groupby.GroupByMethods.time_dtype_as_field('int', 'unique', 'direct')

  143±0.4ms          155±5ms     1.08  groupby.GroupByMethods.time_dtype_as_field('int', 'unique', 'transformation')

    489±3μs          541±4μs     1.11  groupby.GroupByMethods.time_dtype_as_field('object', 'first', 'direct')

    488±2μs          546±3μs     1.12  groupby.GroupByMethods.time_dtype_as_field('object', 'first', 'transformation')

    481±2μs         547±20μs     1.14  groupby.GroupByMethods.time_dtype_as_field('object', 'last', 'direct')

    480±2μs          536±6μs     1.12  groupby.GroupByMethods.time_dtype_as_field('object', 'last', 'transformation')

  173±0.2ms        183±0.5ms     1.06  groupby.GroupByMethods.time_dtype_as_field('object', 'unique', 'direct')

  173±0.5ms        183±0.4ms     1.06  groupby.GroupByMethods.time_dtype_as_field('object', 'unique', 'transformation')

  334±0.3ms          358±3ms     1.07  groupby.GroupByMethods.time_dtype_as_group('datetime', 'unique', 'direct')

  334±0.3ms          356±2ms     1.07  groupby.GroupByMethods.time_dtype_as_group('datetime', 'unique', 'transformation')

  330±0.4ms          352±2ms     1.07  groupby.GroupByMethods.time_dtype_as_group('float', 'unique', 'direct')

  331±0.7ms        353±0.6ms     1.07  groupby.GroupByMethods.time_dtype_as_group('float', 'unique', 'transformation')

  212±0.3ms        224±0.8ms     1.06  groupby.GroupByMethods.time_dtype_as_group('int', 'unique', 'direct')

  211±0.2ms        224±0.7ms     1.06  groupby.GroupByMethods.time_dtype_as_group('int', 'unique', 'transformation')

    766±3μs         1.67±0ms     2.19  groupby.GroupByMethods.time_dtype_as_group('object', 'unique', 'direct')

  766±0.7μs         1.68±0ms     2.19  groupby.GroupByMethods.time_dtype_as_group('object', 'unique', 'transformation')

4.96±0.05μs       5.42±0.1μs     1.09  index_cached_properties.IndexCache.time_engine('DatetimeIndex')

4.25±0.05μs      4.46±0.05μs     1.05  index_cached_properties.IndexCache.time_engine('PeriodIndex')

 2.92±0.1μs       3.13±0.2μs     1.07  index_cached_properties.IndexCache.time_inferred_type('MultiIndex')

   988±30ns      1.07±0.05μs     1.08  index_cached_properties.IndexCache.time_inferred_type('RangeIndex')

 2.38±0.1μs       2.74±0.5μs     1.15  index_cached_properties.IndexCache.time_inferred_type('TimedeltaIndex')

 2.70±0.2μs       3.05±0.5μs     1.13  index_cached_properties.IndexCache.time_is_all_dates('UInt64Index')

1.41±0.02μs      1.51±0.04μs     1.07  index_cached_properties.IndexCache.time_is_monotonic_increasing('Int64Index')

   941±20ns      1.01±0.02μs     1.07  index_cached_properties.IndexCache.time_is_unique('RangeIndex')

 3.70±0.1μs       3.93±0.1μs     1.06  index_cached_properties.IndexCache.time_shape('UInt64Index')

 2.88±0.2μs       3.05±0.2μs     1.06  index_cached_properties.IndexCache.time_values('UInt64Index')

5.43±0.08μs      6.39±0.02μs     1.18  index_object.Indexing.time_get_loc('Int')

5.44±0.06μs       6.45±0.2μs     1.19  index_object.Indexing.time_get_loc_sorted('Int')

2.76±0.06ms         3.01±0ms     1.09  indexing.IntervalIndexing.time_getitem_scalar

3.09±0.06ms      3.34±0.02ms     1.08  indexing.IntervalIndexing.time_loc_list

2.78±0.05ms         3.03±0ms     1.09  indexing.IntervalIndexing.time_loc_scalar

5.83±0.02ms      16.6±0.03ms     2.85  indexing.MultiIndexing.time_index_slice

2.45±0.02ms      2.59±0.02ms     1.06  inference.NumericInferOps.time_divide(<class 'numpy.uint8'>)

9.78±0.08ms      10.4±0.07ms     1.06  io.csv.ParseDateComparison.time_read_csv_dayfirst(False)

    789±2ms          849±4ms     1.08  io.json.ReadJSON.time_read_json('index', 'datetime')

  840±0.6ms          923±3ms     1.10  io.json.ReadJSON.time_read_json('index', 'int')

49.2±0.05ms       52.1±0.3ms     1.06  io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('0.0')

 68.1±0.4ms       72.3±0.5ms     1.06  io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('2Q2005')

   1.64±0μs       1.86±0.1μs     1.14  period.PeriodProperties.time_property('M', 'daysinmonth')

1.68±0.01μs      1.80±0.03μs     1.07  period.PeriodProperties.time_property('M', 'is_leap_year')

1.68±0.02μs      1.77±0.02μs     1.05  period.PeriodProperties.time_property('min', 'is_leap_year')

  739±0.7ms          1.41±0s     1.91  stat_ops.FrameMultiIndexOps.time_op([0, 1], 'kurt')

   764±10ns          810±2ns     1.06  timestamp.TimestampProperties.time_freqstr(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, None)

   763±10ns        812±0.9ns     1.07  timestamp.TimestampProperties.time_freqstr(<UTC>, None)

   763±10ns          812±1ns     1.06  timestamp.TimestampProperties.time_freqstr(None, None)

   762±10ns          811±2ns     1.06  timestamp.TimestampProperties.time_freqstr(tzutc(), None)

Checklist

closes loc() does not swap two rows in multi-index pandas dataframe #22797
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback · 2019-11-13T18:21:31Z

can you merge master and ping on green for another look

nrebena · 2019-11-14T09:43:58Z

@jreback Of course.

WillAyd · 2019-12-09T18:23:03Z

pandas/tests/indexes/multi/test_indexing.py

+        columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
+    )
+
+    res = df.loc[["b", "a"], :]


Can you test selection of columns as well? Looks like you have the data for it already

Added a test for df.loc[:, ["Colorado", "Ohio"]] and df.loc[:, (["Colorado", "Ohio"], ["Red", "Green"])]`.

Also, I add thisdf.loc[(slice(None), [2, 1]), :], testing if the result was

Ohio Colorado Green Red Green a 2 3 4 5 b 2 9 10 11 a 1 0 1 2 b 1 6 7 8

but maybe a preferable result is this

Ohio Colorado Green Red Green a 2 3 4 5 1 0 1 2 b 2 9 10 11 1 6 7 8

This would be harder to obtain given the current implementation.

jreback

this is getting quite difficult to parse, i proposed some simplifications.

pandas/core/indexes/multi.py

jreback · 2019-12-15T22:08:31Z

pandas/core/indexes/multi.py

                else:
                    # no matches we are done
                    return Int64Index([])._ndarray_values

            elif com.is_null_slice(k):
                # empty slice
-                indexer = _update_indexer(None, indexer=indexer)
+                # index is given to conserve the order of this level
+                indexer = _update_indexer(Int64Index(np.arange(n)), indexer=indexer)


I think you can remove this statement entirely, it actually does't do anything (and neither does the original)
it just yields the indexer

This part define the comportement when having something like
df.loc[(slice(None), [2, 1]), :] (see my comment over it.
Other tests actually expect this to give the following results that keep the first level sorted exactly as existing, and the order of the subsequent level is then ignored.
One of this is test TestMultiIndexSlicers.test_per_axis_per_level_doc_examples in tests/indexings.
Keeping this behavior will lead to this kind of result:

Ohio Colorado Green Red Green a 1 0 1 2 2 3 4 5 b 1 6 7 8 2 9 10 11

I think the expected comportment can be one of this two:

# Ignoring the None slice order entirely Ohio Colorado Green Red Green a 2 3 4 5 b 2 9 10 11 a 1 0 1 2 b 1 6 7 8 # Keeping the None slice order by code (and not by absolute position) Ohio Colorado Green Red Green a 2 3 4 5 1 0 1 2 b 2 9 10 11 1 6 7 8

The first proposition is simpler to implement (ignore None slice as proposed).

Can you make a judgment call @jreback ?

pandas/core/indexes/multi.py

pep8speaks · 2019-12-16T19:15:57Z

Hello @nrebena! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-11 10:43:08 UTC

jreback · 2019-12-27T19:50:41Z

can you merge master and will look again

jreback · 2020-01-04T23:54:20Z

I just realized you actually had 2 impls, sorry (#28933)

I think this one is more concise and easier to folow as it sorts in-line. which one do you prefer?

are the tests the same? (this doesn't look as updated).

nrebena · 2020-01-11T10:22:12Z

Test are the same, but not in sync yet.

I will update this one in priority now.

For pro and cons on each impl, this one is more concise, sort in place, but may lack flexibility to handle more complex sort scheme (if ever needed). The other add sorting as a complete separate step, so maybe have a better separation of concern at the cost of added complexity.

Testing return order of MultiIndex.loc MultiIndex.loc try to return the result in the same order as the key given.

From issue pandas-dev#22797. Loc did not respect order for MultiIndex Dataframe. It does for single index. When possible, the returned row now respect the given order for keys.

We need to order the slice(None) to comply with test from pandas/tests/indexing/multiindex/test_slice.py test_per_axis_per_level_doc_examples

jreback · 2020-01-18T19:37:53Z

I just realized you actually had 2 impls, sorry (#28933)

I think this one is more concise and easier to folow as it sorts in-line. which one do you prefer?

are the tests the same? (this doesn't look as updated).

reading the other impl, i agree its simpler to grok. so changing my thoughts here. let's close this one and match tests on the other.

gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Oct 24, 2019

nrebena mentioned this pull request Nov 9, 2019

BUG: Preserve key order when using loc on MultiIndex DataFrame #28933

Merged

5 tasks

nrebena force-pushed the multiindex_loc_order_issue_22797 branch from b5b7be0 to 0f30bea Compare November 14, 2019 08:30

WillAyd requested changes Dec 9, 2019

View reviewed changes

jreback requested changes Dec 15, 2019

View reviewed changes

nrebena requested a review from jreback December 16, 2019 20:41

nrebena added 9 commits January 11, 2020 11:27

TST: Test for issue pandas-dev#22797

4bd7e62

Testing return order of MultiIndex.loc MultiIndex.loc try to return the result in the same order as the key given.

BUG: loc() return in order for MultiIndex Dataframe

694c769

From issue pandas-dev#22797. Loc did not respect order for MultiIndex Dataframe. It does for single index. When possible, the returned row now respect the given order for keys.

PERF: Improve performance for sorted label input

14372a5

TST: Selection of columns and slice(None)

a4a5034

FIX: Ensure indexer is sort when needed.

854b5f3

Revert slice(None) test

4ea8570

We need to order the slice(None) to comply with test from pandas/tests/indexing/multiindex/test_slice.py test_per_axis_per_level_doc_examples

CLN: Remove _update_indexer function

07133b0

CLN: Fix a pep8 issue

0c9c37a

TST: Move test to test_multilevel.py

43df881

nrebena force-pushed the multiindex_loc_order_issue_22797 branch from 50baa65 to 43df881 Compare January 11, 2020 10:42

jreback closed this Jan 18, 2020

nrebena deleted the multiindex_loc_order_issue_22797 branch March 16, 2020 19:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Preserve key order when using loc on MultiIndex DataFrame #29190

BUG: Preserve key order when using loc on MultiIndex DataFrame #29190

nrebena commented Oct 23, 2019 •

edited

Loading

jreback commented Nov 13, 2019

nrebena commented Nov 14, 2019

WillAyd Dec 9, 2019

nrebena Dec 15, 2019

jreback left a comment

jreback Dec 15, 2019

nrebena Dec 16, 2019

pep8speaks commented Dec 16, 2019 •

edited

Loading

jreback commented Dec 27, 2019

jreback commented Jan 4, 2020 •

edited

Loading

nrebena commented Jan 11, 2020

jreback commented Jan 18, 2020

BUG: Preserve key order when using loc on MultiIndex DataFrame #29190

BUG: Preserve key order when using loc on MultiIndex DataFrame #29190

Conversation

nrebena commented Oct 23, 2019 • edited Loading

Benchmark

Checklist

jreback commented Nov 13, 2019

nrebena commented Nov 14, 2019

WillAyd Dec 9, 2019

Choose a reason for hiding this comment

nrebena Dec 15, 2019

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

jreback Dec 15, 2019

Choose a reason for hiding this comment

nrebena Dec 16, 2019

Choose a reason for hiding this comment

pep8speaks commented Dec 16, 2019 • edited Loading

Comment last updated at 2020-01-11 10:43:08 UTC

jreback commented Dec 27, 2019

jreback commented Jan 4, 2020 • edited Loading

nrebena commented Jan 11, 2020

jreback commented Jan 18, 2020

nrebena commented Oct 23, 2019 •

edited

Loading

pep8speaks commented Dec 16, 2019 •

edited

Loading

jreback commented Jan 4, 2020 •

edited

Loading