Commit f478e4f

BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels
closes pandas-dev#15622
closes pandas-dev#15687
closes pandas-dev#14015
closes pandas-dev#13431

Author: Jeff Reback <[email protected]>

Closes pandas-dev#15694 from jreback/sort3 and squashes the following commits:

bd17d2b [Jeff Reback] rename sort_index_montonic -> _sort_index_monotonic
31097fc [Jeff Reback] add doc-strings, rename sort_monotonic -> sort_levels_monotonic
48249ab [Jeff Reback] add doc example
527c3a6 [Jeff Reback] simpler algo for remove_used_levels
520c9c1 [Jeff Reback] versionadded tags
f2ddc9c [Jeff Reback] replace _reconstruct with: sort_monotonic, and remove_unused_levels (public)
3c4ca22 [Jeff Reback] add degenerate test case
269cb3b [Jeff Reback] small doc updates
b234bdb [Jeff Reback] support for removing unused levels (internally)
7be8941 [Jeff Reback] incorrectly raising KeyError rather than UnsortedIndexError, caught by doc-example
47c67d6 [Jeff Reback] BUG: construct MultiIndex identically from levels/labels when concatting
1 parent 3b53202 commit f478e4f

15 files changed, +593 -57 lines changed

asv_bench/benchmarks/timeseries.py

+4-1
@@ -292,7 +292,10 @@ def setup(self):
         self.rng3 = date_range(start='1/1/2000', periods=1500000, freq='S')
         self.ts3 = Series(1, index=self.rng3)

-    def time_sort_index(self):
+    def time_sort_index_monotonic(self):
+        self.ts2.sort_index()
+
+    def time_sort_index_non_monotonic(self):
         self.ts.sort_index()

     def time_timeseries_slice_minutely(self):

doc/source/advanced.rst

+34-29
@@ -136,7 +136,7 @@ can find yourself working with hierarchically-indexed data without creating a
 may wish to generate your own ``MultiIndex`` when preparing the data set.

 Note that how the index is displayed by be controlled using the
-``multi_sparse`` option in ``pandas.set_printoptions``:
+``multi_sparse`` option in ``pandas.set_options()``:

 .. ipython:: python

@@ -175,35 +175,40 @@ completely analogous way to selecting a column in a regular DataFrame:
 See :ref:`Cross-section with hierarchical index <advanced.xs>` for how to select
 on a deeper level.

-.. note::
+.. _advanced.shown_levels:
+
+Defined Levels
+~~~~~~~~~~~~~~
+
+The repr of a ``MultiIndex`` shows ALL the defined levels of an index, even
+if the they are not actually used. When slicing an index, you may notice this.
+For example:

-   The repr of a ``MultiIndex`` shows ALL the defined levels of an index, even
-   if the they are not actually used. When slicing an index, you may notice this.
-   For example:
+.. ipython:: python

-   .. ipython:: python
+   # original multi-index
+   df.columns

-      # original multi-index
-      df.columns
+   # sliced
+   df[['foo','qux']].columns

-      # sliced
-      df[['foo','qux']].columns
+This is done to avoid a recomputation of the levels in order to make slicing
+highly performant. If you want to see the actual used levels.

-   This is done to avoid a recomputation of the levels in order to make slicing
-   highly performant. If you want to see the actual used levels.
+.. ipython:: python

-   .. ipython:: python
+   df[['foo','qux']].columns.values

-      df[['foo','qux']].columns.values
+   # for a specific level
+   df[['foo','qux']].columns.get_level_values(0)

-      # for a specific level
-      df[['foo','qux']].columns.get_level_values(0)
+To reconstruct the multiindex with only the used levels

-   To reconstruct the multiindex with only the used levels
+.. versionadded:: 0.20.0

-   .. ipython:: python
+.. ipython:: python

-      pd.MultiIndex.from_tuples(df[['foo','qux']].columns.values)
+   df[['foo','qux']].columns.remove_unused_levels()

 Data alignment and using ``reindex``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -288,7 +293,7 @@ As usual, **both sides** of the slicers are included as this is label indexing.

 .. code-block:: python

-   df.loc[(slice('A1','A3'),.....),:]
+   df.loc[(slice('A1','A3'),.....), :]

 rather than this:

@@ -317,51 +322,51 @@ Basic multi-index slicing using slices, lists, and labels.

 .. ipython:: python

-   dfmi.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]
+   dfmi.loc[(slice('A1','A3'), slice(None), ['C1', 'C3']), :]

 You can use a ``pd.IndexSlice`` to have a more natural syntax using ``:`` rather than using ``slice(None)``

 .. ipython:: python

    idx = pd.IndexSlice
-   dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']]
+   dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]

 It is possible to perform quite complicated selections using this method on multiple
 axes at the same time.

 .. ipython:: python

-   dfmi.loc['A1',(slice(None),'foo')]
-   dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']]
+   dfmi.loc['A1', (slice(None), 'foo')]
+   dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]

 Using a boolean indexer you can provide selection related to the *values*.

 .. ipython:: python

-   mask = dfmi[('a','foo')]>200
-   dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]
+   mask = dfmi[('a', 'foo')] > 200
+   dfmi.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]

 You can also specify the ``axis`` argument to ``.loc`` to interpret the passed
 slicers on a single axis.

 .. ipython:: python

-   dfmi.loc(axis=0)[:,:,['C1','C3']]
+   dfmi.loc(axis=0)[:, :, ['C1', 'C3']]

 Furthermore you can *set* the values using these methods

 .. ipython:: python

    df2 = dfmi.copy()
-   df2.loc(axis=0)[:,:,['C1','C3']] = -10
+   df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10
    df2

 You can use a right-hand-side of an alignable object as well.

 .. ipython:: python

    df2 = dfmi.copy()
-   df2.loc[idx[:,:,['C1','C3']],:] = df2*1000
+   df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000
    df2

 .. _advanced.xs:
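
Note on the "Defined Levels" documentation change above: a MultiIndex keeps, for each level, a list of distinct values plus integer codes that point into that list, so slicing can leave level values that no row refers to any more. The following is a minimal pure-Python sketch of what removing unused levels means under that representation. It is an illustration only and not the pandas implementation; the function name `remove_unused_levels_sketch` and the list-of-lists layout are assumptions made for this example, and -1 is used as the code for a missing value.

```python
def remove_unused_levels_sketch(levels, codes):
    """Drop level values that no code refers to and renumber the codes.

    Hypothetical helper for illustration; not part of pandas.
    """
    new_levels, new_codes = [], []
    for level_values, level_codes in zip(levels, codes):
        # Positions that are actually referenced; -1 marks a missing value.
        used = sorted({c for c in level_codes if c >= 0})
        # Map each old position to its position in the compacted level.
        remap = {old: new for new, old in enumerate(used)}
        new_levels.append([level_values[i] for i in used])
        new_codes.append([remap[c] if c >= 0 else -1 for c in level_codes])
    return new_levels, new_codes


# After slicing down to 'foo' and 'qux', 'bar' and 'baz' are still defined
# in the first level but no longer referenced by any code.
levels = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
codes = [[2, 2, 3, 3], [0, 1, 0, 1]]

new_levels, new_codes = remove_unused_levels_sketch(levels, codes)
print(new_levels)
print(new_codes)
```

The values each row resolves to are unchanged; only the bookkeeping shrinks, which is why the operation can be skipped during slicing and done on request.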

doc/source/api.rst

+1
@@ -1432,6 +1432,7 @@ MultiIndex Components
    MultiIndex.droplevel
    MultiIndex.swaplevel
    MultiIndex.reorder_levels
+   MultiIndex.remove_unused_levels

 .. _api.datetimeindex:


doc/source/whatsnew/v0.20.0.txt

+69-1
@@ -366,6 +366,8 @@ Other Enhancements
 - ``pandas.io.json.json_normalize()`` with an empty ``list`` will return an empty ``DataFrame`` (:issue:`15534`)
 - ``pandas.io.json.json_normalize()`` has gained a ``sep`` option that accepts ``str`` to separate joined fields; the default is ".", which is backward compatible. (:issue:`14883`)
 - ``pd.read_csv()`` will now raise a ``csv.Error`` error whenever an end-of-file character is encountered in the middle of a data row (:issue:`15913`)
+- A new function has been added to a ``MultiIndex`` to facilitate :ref:`Removing Unused Levels <advanced.shown_levels>`. (:issue:`15694`)
+- :func:`MultiIndex.remove_unused_levels` has been added to facilitate :ref:`removing unused levels <advanced.shown_levels>`. (:issue:`15694`)


 .. _ISO 8601 duration: https://en.wikipedia.org/wiki/ISO_8601#Durations
@@ -714,6 +716,72 @@ If indicated, a deprecation warning will be issued if you reference that module.
    "pandas._hash", "pandas.tools.libhash", ""
    "pandas._window", "pandas.core.libwindow", ""

+.. _whatsnew_0200.api_breaking.sort_index:
+
+DataFrame.sort_index changes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In certain cases, calling ``.sort_index()`` on a MultiIndexed DataFrame would return the *same* DataFrame without seeming to sort.
+This would happen with a ``lexsorted``, but non-monotonic levels. (:issue:`15622`, :issue:`15687`, :issue:`14015`, :issue:`13431`)
+
+This is UNCHANGED between versions, but showing for illustration purposes:
+
+.. ipython:: python
+
+   df = DataFrame(np.arange(6), columns=['value'], index=MultiIndex.from_product([list('BA'), range(3)]))
+   df
+
+.. ipython:: python
+
+   df.index.is_lexsorted()
+   df.index.is_monotonic
+
+Sorting works as expected
+
+.. ipython:: python
+
+   df.sort_index()
+
+.. ipython:: python
+
+   df.sort_index().index.is_lexsorted()
+   df.sort_index().index.is_monotonic
+
+However, this example, which has a non-monotonic 2nd level,
+doesn't behave as desired.
+
+.. ipython:: python
+   df = pd.DataFrame(
+      {'value': [1, 2, 3, 4]},
+      index=pd.MultiIndex(levels=[['a', 'b'], ['bb', 'aa']],
+                          labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))
+
+Previous Behavior:
+
+.. ipython:: python
+
+   In [11]: df.sort_index()
+   Out[11]:
+         value
+   a bb      1
+     aa      2
+   b bb      3
+     aa      4
+
+   In [14]: df.sort_index().index.is_lexsorted()
+   Out[14]: True
+
+   In [15]: df.sort_index().index.is_monotonic
+   Out[15]: False
+
+New Behavior:
+
+.. ipython:: python
+
+   df.sort_index()
+   df.sort_index().index.is_lexsorted()
+   df.sort_index().index.is_monotonic
+

 .. _whatsnew_0200.api_breaking.groupby_describe:

@@ -965,7 +1033,7 @@ Performance Improvements
 - Improve performance of ``pd.core.groupby.GroupBy.apply`` when the applied
   function used the ``.name`` attribute of the group DataFrame (:issue:`15062`).
 - Improved performance of ``iloc`` indexing with a list or array (:issue:`15504`).
-
+- Improved performance of ``Series.sort_index()`` with a monotonic index (:issue:`15694`)

 .. _whatsnew_0200.bug_fixes:

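
Note on the whatsnew example above: the "lexsorted but non-monotonic levels" case can be reproduced without pandas. The sketch below uses the same levels and labels as the whatsnew entry and shows why sorting rows by their integer codes alone leaves the rows in place, and how sorting each level's values first and remapping the codes gives the expected order. It is a simplified illustration under stated assumptions, not the pandas code: `sort_levels_monotonic_sketch` and `lexsort_rows` are hypothetical names, and missing codes are passed through as -1.

```python
def sort_levels_monotonic_sketch(levels, codes):
    """Sort each level's values and remap the codes to match.

    Hypothetical helper for illustration; not part of pandas.
    """
    new_levels, new_codes = [], []
    for level_values, level_codes in zip(levels, codes):
        # order[new_position] == old_position
        order = sorted(range(len(level_values)), key=lambda i: level_values[i])
        remap = {old: new for new, old in enumerate(order)}
        new_levels.append([level_values[i] for i in order])
        # Codes outside the level (such as -1 for missing) stay as -1.
        new_codes.append([remap.get(c, -1) for c in level_codes])
    return new_levels, new_codes


def lexsort_rows(codes):
    """Return row positions ordered by their codes, first level first."""
    n_rows = len(codes[0])
    return sorted(range(n_rows), key=lambda r: tuple(level[r] for level in codes))


levels = [['a', 'b'], ['bb', 'aa']]   # second level is not sorted
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]  # codes themselves are already in order
values = [1, 2, 3, 4]

# Sorting on the raw codes changes nothing, because code 0 means 'bb' here.
naive_order = lexsort_rows(codes)
print([values[i] for i in naive_order])

# Sorting the level values first makes code order agree with value order.
fixed_levels, fixed_codes = sort_levels_monotonic_sketch(levels, codes)
fixed_order = lexsort_rows(fixed_codes)
print(fixed_levels)
print([values[i] for i in fixed_order])
```

With the levels sorted, 'aa' rows come before 'bb' rows inside each first-level group, which matches the "New Behavior" described in the whatsnew entry.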

pandas/core/frame.py

+10-9
@@ -3322,6 +3322,10 @@ def trans(v):
     def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
                    kind='quicksort', na_position='last', sort_remaining=True,
                    by=None):
+
+        # TODO: this can be combined with Series.sort_index impl as
+        # almost identical
+
         inplace = validate_bool_kwarg(inplace, 'inplace')
         # 10726
         if by is not None:
@@ -3335,8 +3339,7 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
         axis = self._get_axis_number(axis)
         labels = self._get_axis(axis)

-        # sort by the index
-        if level is not None:
+        if level:

             new_axis, indexer = labels.sortlevel(level, ascending=ascending,
                                                  sort_remaining=sort_remaining)
@@ -3346,17 +3349,14 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False,

             # make sure that the axis is lexsorted to start
             # if not we need to reconstruct to get the correct indexer
-            if not labels.is_lexsorted():
-                labels = MultiIndex.from_tuples(labels.values)
-
+            labels = labels._sort_levels_monotonic()
             indexer = lexsort_indexer(labels.labels, orders=ascending,
                                       na_position=na_position)
         else:
             from pandas.core.sorting import nargsort

-            # GH11080 - Check monotonic-ness before sort an index
-            # if monotonic (already sorted), return None or copy() according
-            # to 'inplace'
+            # Check monotonic-ness before sort an index
+            # GH11080
             if ((ascending and labels.is_monotonic_increasing) or
                     (not ascending and labels.is_monotonic_decreasing)):
                 if inplace:
@@ -3367,8 +3367,9 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
             indexer = nargsort(labels, kind=kind, ascending=ascending,
                                na_position=na_position)

+        baxis = self._get_block_manager_axis(axis)
         new_data = self._data.take(indexer,
-                                   axis=self._get_block_manager_axis(axis),
+                                   axis=baxis,
                                    convert=False, verify=False)

         if inplace:
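
Note on the guard change in this file: the hunk above replaces `if level is not None:` with `if level:`. In plain Python these two checks disagree for falsy values such as `0` or an empty list, and `0` is a value a caller might pass to mean the first level by position. The snippet below only demonstrates the language-level difference; the two function names are hypothetical stand-ins for the old and new guards and nothing here calls pandas.

```python
def guard_not_none(level):
    """Stand-in for the old check: `if level is not None:`."""
    return level is not None


def guard_truthy(level):
    """Stand-in for the new check: `if level:`."""
    return bool(level)


# Compare which inputs would enter the `sortlevel` branch under each guard.
for level in (None, 0, 1, 'first', [], [0]):
    print(repr(level), guard_not_none(level), guard_truthy(level))
```

The two guards agree for `None`, non-zero integers, non-empty strings and non-empty lists; they differ for `0` and `[]`.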

pandas/core/groupby.py

+8-1
@@ -1882,6 +1882,13 @@ def get_group_levels(self):
         'ohlc': lambda *args: ['open', 'high', 'low', 'close']
     }

+    def _is_builtin_func(self, arg):
+        """
+        if we define an builtin function for this argument, return it,
+        otherwise return the arg
+        """
+        return SelectionMixin._builtin_table.get(arg, arg)
+
     def _get_cython_function(self, kind, how, values, is_numeric):

         dtype_str = values.dtype.name
@@ -2107,7 +2114,7 @@ def _aggregate_series_fast(self, obj, func):
         # avoids object / Series creation overhead
         dummy = obj._get_values(slice(None, 0)).to_dense()
         indexer = get_group_index_sorter(group_index, ngroups)
-        obj = obj.take(indexer, convert=False)
+        obj = obj.take(indexer, convert=False).to_dense()
         group_index = algorithms.take_nd(
             group_index, indexer, allow_fill=False)
         grouper = lib.SeriesGrouper(obj, func, group_index, ngroups,

pandas/core/reshape.py

+2-7
@@ -22,8 +22,8 @@
 from pandas.sparse.libsparse import IntIndex

 from pandas.core.categorical import Categorical, _factorize_from_iterable
-from pandas.core.sorting import (get_group_index, compress_group_index,
-                                 decons_obs_group_ids)
+from pandas.core.sorting import (get_group_index, get_compressed_ids,
+                                 compress_group_index, decons_obs_group_ids)

 import pandas.core.algorithms as algos
 from pandas._libs import algos as _algos, reshape as _reshape
@@ -496,11 +496,6 @@ def _unstack_frame(obj, level, fill_value=None):
         return unstacker.get_result()


-def get_compressed_ids(labels, sizes):
-    ids = get_group_index(labels, sizes, sort=True, xnull=False)
-    return compress_group_index(ids, sort=True)
-
-
 def stack(frame, level=-1, dropna=True):
     """
     Convert DataFrame to Series with multi-level Index. Columns become the

pandas/core/series.py

+16-2
@@ -1751,17 +1751,31 @@ def _try_kind_sort(arr):
     def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
                    kind='quicksort', na_position='last', sort_remaining=True):

+        # TODO: this can be combined with DataFrame.sort_index impl as
+        # almost identical
         inplace = validate_bool_kwarg(inplace, 'inplace')
         axis = self._get_axis_number(axis)
         index = self.index
-        if level is not None:
+
+        if level:
             new_index, indexer = index.sortlevel(level, ascending=ascending,
                                                  sort_remaining=sort_remaining)
         elif isinstance(index, MultiIndex):
             from pandas.core.sorting import lexsort_indexer
-            indexer = lexsort_indexer(index.labels, orders=ascending)
+            labels = index._sort_levels_monotonic()
+            indexer = lexsort_indexer(labels.labels, orders=ascending)
         else:
             from pandas.core.sorting import nargsort
+
+            # Check monotonic-ness before sort an index
+            # GH11080
+            if ((ascending and index.is_monotonic_increasing) or
+                    (not ascending and index.is_monotonic_decreasing)):
+                if inplace:
+                    return
+                else:
+                    return self.copy()
+
             indexer = nargsort(index, kind=kind, ascending=ascending,
                                na_position=na_position)

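
Note on the monotonic check added to `Series.sort_index` above: the idea is to test in a single pass whether the index is already in the requested order and, if so, skip the sort entirely. The sketch below shows that short-circuit on plain Python lists. It is an assumption-laden illustration rather than the pandas code path: `sort_index_sketch` and the two `is_monotonic_*` helpers are hypothetical names, and the function always returns new lists instead of supporting `inplace`.

```python
def is_monotonic_increasing(seq):
    """True if every element is >= the one before it."""
    return all(a <= b for a, b in zip(seq, seq[1:]))


def is_monotonic_decreasing(seq):
    """True if every element is <= the one before it."""
    return all(a >= b for a, b in zip(seq, seq[1:]))


def sort_index_sketch(index, values, ascending=True):
    """Sort values by index, skipping the sort when already ordered.

    Hypothetical helper for illustration; not part of pandas.
    """
    if ((ascending and is_monotonic_increasing(index)) or
            (not ascending and is_monotonic_decreasing(index))):
        # Already in the requested order: a linear scan replaced the sort.
        return list(index), list(values)

    order = sorted(range(len(index)), key=lambda i: index[i],
                   reverse=not ascending)
    return [index[i] for i in order], [values[i] for i in order]


print(sort_index_sketch([3, 1, 2], ['c', 'a', 'b']))
print(sort_index_sketch([1, 2, 3], ['a', 'b', 'c']))
print(sort_index_sketch([1, 2, 3], ['a', 'b', 'c'], ascending=False))
```

The check costs one linear pass, which is cheaper than a full sort when the index is already ordered; this is the case the new `time_sort_index_monotonic` benchmark exercises.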
