Skip to content

BUG: Preserve key order when using loc on MultiIndex DataFrame #29190

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

nrebena
Copy link
Contributor

@nrebena nrebena commented Oct 23, 2019

This is another implementation to solve the issue described in #22797.

The other PR is #28933. When the other PR keep the main code identical and resort the indexer at the end, this implementation try to resolve the problem in the main loop of the original code.
The solution also as consequences on the performance when the key order is sorted as the index, so it add a flag to only keep the order if needed.

Benchmark

Benchmarks that have improved:

   before           after         ratio
 [e623f0fb]       [b5b7be06]
 <stash~5>        <multiindex_loc_order_issue_22797>
  • 2.50±0.01ms         2.35±0ms     0.94  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'int')
    
  • 2.55±0.01ms         2.40±0ms     0.94  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'int')
    
  •   767±0.5ms          707±2ms     0.92  groupby.Transform.time_transform_lambda_max
    
  •  2.74±0.2μs       2.49±0.2μs     0.91  index_cached_properties.IndexCache.time_inferred_type('Float64Index')
    
  •  2.64±0.1μs       2.51±0.2μs     0.95  index_cached_properties.IndexCache.time_is_all_dates('Float64Index')
    
  •    517±10μs         491±10μs     0.95  index_cached_properties.IndexCache.time_is_monotonic('MultiIndex')
    
  •     671±6μs          605±6μs     0.90  index_cached_properties.IndexCache.time_is_monotonic_decreasing('MultiIndex')
    
  •    514±10μs         489±10μs     0.95  index_cached_properties.IndexCache.time_is_monotonic_increasing('MultiIndex')
    
  •  6.61±0.3μs         5.68±1μs     0.86  index_cached_properties.IndexCache.time_shape('TimedeltaIndex')
    
  •  3.87±0.7μs       2.84±0.2μs     0.73  index_cached_properties.IndexCache.time_values('TimedeltaIndex')
    
  • 2.87±0.03μs      2.72±0.01μs     0.95  indexing.GetItemSingleColumn.time_frame_getitem_single_column_label
    
  • 1.19±0.02ms         787±20μs     0.66  indexing.MultiIndexing.time_frame_ix
    
  • 1.11±0.02ms         722±20μs     0.65  indexing.MultiIndexing.time_series_ix
    
  • 3.54±0.01ms      3.37±0.01ms     0.95  indexing.NonNumericSeriesIndexing.time_getitem_list_like('string', 'unique_monotonic_inc')
    
  •     644±2μs          568±1μs     0.88  multiindex_object.Duplicates.time_remove_unused_levels
    
  •     746±8μs          189±2μs     0.25  multiindex_object.GetLoc.time_large_get_loc
    
  •   169±0.6ms        102±0.4ms     0.61  multiindex_object.GetLoc.time_large_get_loc_warm
    
  •  94.2±0.2μs       23.8±0.3μs     0.25  multiindex_object.GetLoc.time_med_get_loc
    
  •  90.1±0.3ms       23.1±0.3ms     0.26  multiindex_object.GetLoc.time_med_get_loc_warm
    
  •  87.1±0.2ms       22.8±0.2ms     0.26  multiindex_object.GetLoc.time_small_get_loc_warm
    
  •  88.9±0.2μs       23.1±0.2μs     0.26  multiindex_object.GetLoc.time_string_get_loc
    
  •   700±0.8μs        550±0.6μs     0.79  multiindex_object.Values.time_datetime_level_values_sliced
    
  • 1.65±0.03ms      1.21±0.02ms     0.73  reindex.LevelAlign.time_align_level
    
  • 1.65±0.04ms      1.22±0.01ms     0.74  reindex.LevelAlign.time_reindex_level
    
  •  91.7±0.1ms       86.8±0.1ms     0.95  reshape.Crosstab.time_crosstab_normalize_margins
    
  •  25.5±0.4ms       24.0±0.4ms     0.94  reshape.PivotTable.time_pivot_table
    
  •  51.6±0.6ms       48.0±0.2ms     0.93  reshape.PivotTable.time_pivot_table_agg
    
  • 19.7±0.05ms       18.4±0.2ms     0.93  reshape.PivotTable.time_pivot_table_categorical
    
  • 14.2±0.03ms      13.1±0.04ms     0.92  reshape.PivotTable.time_pivot_table_categorical_observed
    
  •   107±0.8ms       94.0±0.9ms     0.88  reshape.PivotTable.time_pivot_table_margins
    
  • 6.02±0.02ms      5.47±0.02ms     0.91  reshape.SimpleReshape.time_stack
    
  • 2.46±0.02ms      2.20±0.02ms     0.89  reshape.SimpleReshape.time_unstack
    
  •    2.30±0ms      1.73±0.01ms     0.75  reshape.SparseIndex.time_unstack
    
  •  18.9±0.1ms      18.0±0.03ms     0.95  rolling.Pairwise.time_pairwise(10, 'cov', True)
    
  •    2.60±0ms         2.33±0ms     0.90  sparse.FromCoo.time_sparse_series_from_coo
    
  •  41.3±0.2ms       39.2±0.2ms     0.95  sparse.ToCoo.time_sparse_series_to_coo
    
  •     3.98±0s       3.75±0.01s     0.94  stat_ops.Correlation.time_corr_wide_nans('spearman', False)
    
  •  3.99±0.01s       3.74±0.01s     0.94  stat_ops.Correlation.time_corr_wide_nans('spearman', True)
    
  • 40.9±0.07ms       36.8±0.1ms     0.90  stat_ops.FrameMultiIndexOps.time_op(0, 'kurt')
    
  •   195±0.3ms        165±0.3ms     0.85  stat_ops.FrameMultiIndexOps.time_op(1, 'kurt')
    
  •    87.7±1ms         78.1±1ms     0.89  stat_ops.FrameMultiIndexOps.time_op(1, 'skew')
    
  •     710±1ms          628±3ms     0.88  stat_ops.FrameMultiIndexOps.time_op([0, 1], 'skew')
    
  •  10.7±0.1ms       9.66±0.3ms     0.90  stat_ops.SeriesMultiIndexOps.time_op(0, 'kurt')
    
  • 10.6±0.05ms       9.15±0.2ms     0.86  stat_ops.SeriesMultiIndexOps.time_op(0, 'skew')
    
  •  4.55±0.1ms      4.01±0.08ms     0.88  stat_ops.SeriesMultiIndexOps.time_op(0, 'std')
    
  • 4.15±0.06ms       3.86±0.1ms     0.93  stat_ops.SeriesMultiIndexOps.time_op(0, 'var')
    
  •  49.4±0.4ms       43.3±0.2ms     0.88  stat_ops.SeriesMultiIndexOps.time_op(1, 'kurt')
    
  •   128±0.3ms        112±0.3ms     0.88  stat_ops.SeriesMultiIndexOps.time_op(1, 'mad')
    
  •  48.0±0.3ms       40.3±0.4ms     0.84  stat_ops.SeriesMultiIndexOps.time_op(1, 'skew')
    
  • 4.55±0.08ms       3.97±0.1ms     0.87  stat_ops.SeriesMultiIndexOps.time_op(1, 'std')
    
  •  4.14±0.1ms      3.81±0.09ms     0.92  stat_ops.SeriesMultiIndexOps.time_op(1, 'var')
    
  •   434±0.5ms        357±0.4ms     0.82  stat_ops.SeriesMultiIndexOps.time_op([0, 1], 'kurt')
    
  •     1.20±0s          1.04±0s     0.86  stat_ops.SeriesMultiIndexOps.time_op([0, 1], 'mad')
    
  •   422±0.7ms          339±1ms     0.80  stat_ops.SeriesMultiIndexOps.time_op([0, 1], 'skew')
    
  •  48.6±0.2ms      40.5±0.07ms     0.83  timeseries.DatetimeIndex.time_to_pydatetime('repeated')
    
  •   288±0.9ms        271±0.8ms     0.94  timeseries.DatetimeIndex.time_to_pydatetime('tz_aware')
    
  •  48.3±0.1ms       40.6±0.1ms     0.84  timeseries.DatetimeIndex.time_to_pydatetime('tz_naive')
    

Benchmarks that have got worse:

   before           after         ratio
 [e623f0fb]       [b5b7be06]
 <stash~5>        <multiindex_loc_order_issue_22797>
  •   149±0.5ms        170±0.6ms     1.15  categoricals.Rank.time_rank_string
    
  •  7.23±0.2ms       8.38±0.3ms     1.16  categoricals.Rank.time_rank_string_cat
    
  •  9.81±0.1ms       10.4±0.1ms     1.06  categoricals.Rank.time_rank_string_cat_ordered
    
  • 2.45±0.01ms      2.65±0.09ms     1.08  frame_methods.Iteration.time_items_cached
    
  •   227±0.3ms          239±4ms     1.06  groupby.GroupByMethods.time_dtype_as_field('datetime', 'unique', 'transformation')
    
  •   148±0.3ms        157±0.3ms     1.06  groupby.GroupByMethods.time_dtype_as_field('float', 'unique', 'direct')
    
  •   148±0.5ms        157±0.8ms     1.06  groupby.GroupByMethods.time_dtype_as_field('float', 'unique', 'transformation')
    
  •   144±0.5ms          153±3ms     1.06  groupby.GroupByMethods.time_dtype_as_field('int', 'unique', 'direct')
    
  •   143±0.4ms          155±5ms     1.08  groupby.GroupByMethods.time_dtype_as_field('int', 'unique', 'transformation')
    
  •     489±3μs          541±4μs     1.11  groupby.GroupByMethods.time_dtype_as_field('object', 'first', 'direct')
    
  •     488±2μs          546±3μs     1.12  groupby.GroupByMethods.time_dtype_as_field('object', 'first', 'transformation')
    
  •     481±2μs         547±20μs     1.14  groupby.GroupByMethods.time_dtype_as_field('object', 'last', 'direct')
    
  •     480±2μs          536±6μs     1.12  groupby.GroupByMethods.time_dtype_as_field('object', 'last', 'transformation')
    
  •   173±0.2ms        183±0.5ms     1.06  groupby.GroupByMethods.time_dtype_as_field('object', 'unique', 'direct')
    
  •   173±0.5ms        183±0.4ms     1.06  groupby.GroupByMethods.time_dtype_as_field('object', 'unique', 'transformation')
    
  •   334±0.3ms          358±3ms     1.07  groupby.GroupByMethods.time_dtype_as_group('datetime', 'unique', 'direct')
    
  •   334±0.3ms          356±2ms     1.07  groupby.GroupByMethods.time_dtype_as_group('datetime', 'unique', 'transformation')
    
  •   330±0.4ms          352±2ms     1.07  groupby.GroupByMethods.time_dtype_as_group('float', 'unique', 'direct')
    
  •   331±0.7ms        353±0.6ms     1.07  groupby.GroupByMethods.time_dtype_as_group('float', 'unique', 'transformation')
    
  •   212±0.3ms        224±0.8ms     1.06  groupby.GroupByMethods.time_dtype_as_group('int', 'unique', 'direct')
    
  •   211±0.2ms        224±0.7ms     1.06  groupby.GroupByMethods.time_dtype_as_group('int', 'unique', 'transformation')
    
  •     766±3μs         1.67±0ms     2.19  groupby.GroupByMethods.time_dtype_as_group('object', 'unique', 'direct')
    
  •   766±0.7μs         1.68±0ms     2.19  groupby.GroupByMethods.time_dtype_as_group('object', 'unique', 'transformation')
    
  • 4.96±0.05μs       5.42±0.1μs     1.09  index_cached_properties.IndexCache.time_engine('DatetimeIndex')
    
  • 4.25±0.05μs      4.46±0.05μs     1.05  index_cached_properties.IndexCache.time_engine('PeriodIndex')
    
  •  2.92±0.1μs       3.13±0.2μs     1.07  index_cached_properties.IndexCache.time_inferred_type('MultiIndex')
    
  •    988±30ns      1.07±0.05μs     1.08  index_cached_properties.IndexCache.time_inferred_type('RangeIndex')
    
  •  2.38±0.1μs       2.74±0.5μs     1.15  index_cached_properties.IndexCache.time_inferred_type('TimedeltaIndex')
    
  •  2.70±0.2μs       3.05±0.5μs     1.13  index_cached_properties.IndexCache.time_is_all_dates('UInt64Index')
    
  • 1.41±0.02μs      1.51±0.04μs     1.07  index_cached_properties.IndexCache.time_is_monotonic_increasing('Int64Index')
    
  •    941±20ns      1.01±0.02μs     1.07  index_cached_properties.IndexCache.time_is_unique('RangeIndex')
    
  •  3.70±0.1μs       3.93±0.1μs     1.06  index_cached_properties.IndexCache.time_shape('UInt64Index')
    
  •  2.88±0.2μs       3.05±0.2μs     1.06  index_cached_properties.IndexCache.time_values('UInt64Index')
    
  • 5.43±0.08μs      6.39±0.02μs     1.18  index_object.Indexing.time_get_loc('Int')
    
  • 5.44±0.06μs       6.45±0.2μs     1.19  index_object.Indexing.time_get_loc_sorted('Int')
    
  • 2.76±0.06ms         3.01±0ms     1.09  indexing.IntervalIndexing.time_getitem_scalar
    
  • 3.09±0.06ms      3.34±0.02ms     1.08  indexing.IntervalIndexing.time_loc_list
    
  • 2.78±0.05ms         3.03±0ms     1.09  indexing.IntervalIndexing.time_loc_scalar
    
  • 5.83±0.02ms      16.6±0.03ms     2.85  indexing.MultiIndexing.time_index_slice
    
  • 2.45±0.02ms      2.59±0.02ms     1.06  inference.NumericInferOps.time_divide(<class 'numpy.uint8'>)
    
  • 9.78±0.08ms      10.4±0.07ms     1.06  io.csv.ParseDateComparison.time_read_csv_dayfirst(False)
    
  •     789±2ms          849±4ms     1.08  io.json.ReadJSON.time_read_json('index', 'datetime')
    
  •   840±0.6ms          923±3ms     1.10  io.json.ReadJSON.time_read_json('index', 'int')
    
  • 49.2±0.05ms       52.1±0.3ms     1.06  io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('0.0')
    
  •  68.1±0.4ms       72.3±0.5ms     1.06  io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('2Q2005')
    
  •    1.64±0μs       1.86±0.1μs     1.14  period.PeriodProperties.time_property('M', 'daysinmonth')
    
  • 1.68±0.01μs      1.80±0.03μs     1.07  period.PeriodProperties.time_property('M', 'is_leap_year')
    
  • 1.68±0.02μs      1.77±0.02μs     1.05  period.PeriodProperties.time_property('min', 'is_leap_year')
    
  •   739±0.7ms          1.41±0s     1.91  stat_ops.FrameMultiIndexOps.time_op([0, 1], 'kurt')
    
  •    764±10ns          810±2ns     1.06  timestamp.TimestampProperties.time_freqstr(<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>, None)
    
  •    763±10ns        812±0.9ns     1.07  timestamp.TimestampProperties.time_freqstr(<UTC>, None)
    
  •    763±10ns          812±1ns     1.06  timestamp.TimestampProperties.time_freqstr(None, None)
    
  •    762±10ns          811±2ns     1.06  timestamp.TimestampProperties.time_freqstr(tzutc(), None)
    

Checklist

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Oct 24, 2019
@jreback
Copy link
Contributor

jreback commented Nov 13, 2019

can you merge master and ping on green for another look

@nrebena nrebena force-pushed the multiindex_loc_order_issue_22797 branch from b5b7be0 to 0f30bea Compare November 14, 2019 08:30
@nrebena
Copy link
Contributor Author

nrebena commented Nov 14, 2019

@jreback Of course.

columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

res = df.loc[["b", "a"], :]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test selection of columns as well? Looks like you have the data for it already

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test for df.loc[:, ["Colorado", "Ohio"]] and df.loc[:, (["Colorado", "Ohio"], ["Red", "Green"])]`.

Also, I add thisdf.loc[(slice(None), [2, 1]), :], testing if the result was

     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
b 2     9  10       11
a 1     0   1        2
b 1     6   7        8

but maybe a preferable result is this

     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
  1     0   1        2
b 2     9  10       11
  1     6   7        8

This would be harder to obtain given the current implementation.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is getting quite difficult to parse, i proposed some simplifications.

else:
# no matches we are done
return Int64Index([])._ndarray_values

elif com.is_null_slice(k):
# empty slice
indexer = _update_indexer(None, indexer=indexer)
# index is given to conserve the order of this level
indexer = _update_indexer(Int64Index(np.arange(n)), indexer=indexer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove this statement entirely, it actually does't do anything (and neither does the original)
it just yields the indexer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part define the comportement when having something like
df.loc[(slice(None), [2, 1]), :] (see my comment over it.
Other tests actually expect this to give the following results that keep the first level sorted exactly as existing, and the order of the subsequent level is then ignored.
One of this is test TestMultiIndexSlicers.test_per_axis_per_level_doc_examples in tests/indexings.
Keeping this behavior will lead to this kind of result:

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

I think the expected comportment can be one of this two:

# Ignoring the None slice order entirely
     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
b 2     9  10       11
a 1     0   1        2
b 1     6   7        8

# Keeping the None slice order by code (and not by absolute position)
     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
  1     0   1        2
b 2     9  10       11
  1     6   7        8

The first proposition is simpler to implement (ignore None slice as proposed).

Can you make a judgment call @jreback ?

@pep8speaks
Copy link

pep8speaks commented Dec 16, 2019

Hello @nrebena! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-11 10:43:08 UTC

@nrebena nrebena requested a review from jreback December 16, 2019 20:41
@jreback
Copy link
Contributor

jreback commented Dec 27, 2019

can you merge master and will look again

@jreback
Copy link
Contributor

jreback commented Jan 4, 2020

I just realized you actually had 2 impls, sorry (#28933)

I think this one is more concise and easier to folow as it sorts in-line. which one do you prefer?

are the tests the same? (this doesn't look as updated).

@nrebena
Copy link
Contributor Author

nrebena commented Jan 11, 2020

Test are the same, but not in sync yet.

I will update this one in priority now.

For pro and cons on each impl, this one is more concise, sort in place, but may lack flexibility to handle more complex sort scheme (if ever needed). The other add sorting as a complete separate step, so maybe have a better separation of concern at the cost of added complexity.

Testing return order of MultiIndex.loc

MultiIndex.loc try to return the result in the same order as the key
given.
From issue pandas-dev#22797. Loc did not respect order for MultiIndex Dataframe.
It does for single index.

When possible, the returned row now respect the given order for keys.
We need to order the slice(None) to comply with test from
pandas/tests/indexing/multiindex/test_slice.py
test_per_axis_per_level_doc_examples
@nrebena nrebena force-pushed the multiindex_loc_order_issue_22797 branch from 50baa65 to 43df881 Compare January 11, 2020 10:42
@jreback
Copy link
Contributor

jreback commented Jan 18, 2020

I just realized you actually had 2 impls, sorry (#28933)

I think this one is more concise and easier to folow as it sorts in-line. which one do you prefer?

are the tests the same? (this doesn't look as updated).

reading the other impl, i agree its simpler to grok. so changing my thoughts here. let's close this one and match tests on the other.

@jreback jreback closed this Jan 18, 2020
@nrebena nrebena deleted the multiindex_loc_order_issue_22797 branch March 16, 2020 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

loc() does not swap two rows in multi-index pandas dataframe
5 participants