sort_index not sorting when multi-index made by different categorical types #24271

wiso · 2018-12-13T16:00:55Z

Code Sample, a copy-pastable example if possible

This is the shortest code I found that reproduces the problem:

import pandas as pd
from pandas.api.types import CategoricalDtype
number_type = CategoricalDtype(['one', 'two', 'three', 'four', 'five'], ordered=True)
day_type = CategoricalDtype(['monday', 'tuesday', 'wednesday', 'thursday', 'friday'], ordered=True)

dd = pd.DataFrame([('two', 'tuesday', 'one', 'up', 10),
                   ('one', 'wednesday', 'two', 'up', 20),
                   ('five', 'monday', 'two', 'up', 1),
                   ('three', 'tuesday', 'three', 'up', 1),
                   ('four', 'monday', 'one', 'up', 2),
                   ('one', 'friday', 'one', 'up', 2),
                   
                   ('two', 'tuesday', 'one', 'down', 10),
                   ('one', 'wednesday', 'two', 'down', 20),
                   ('five', 'monday', 'two', 'down', 1),
                   ('three', 'tuesday', 'three', 'down', 1),
                   ('four', 'monday', 'one', 'down', 2),
                   ('one', 'friday', 'one', 'down', 2),
                  ])
dd = dd.set_index([0, 1, 2, 3])
dd = dd.unstack(3)[4]


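# Cast each existing level to an ordered categorical dtype.
# `level` is passed by keyword because newer pandas releases require it.
dd.index = dd.index.set_levels(dd.index.levels[0].astype(number_type), level=0)
dd.index = dd.index.set_levels(dd.index.levels[1].astype(day_type), level=1)
dd.index = dd.index.set_levels(dd.index.levels[2].astype(number_type), level=2)

print(dd.sort_index())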

Problem description

The DataFrame is not sorted. I get:

0     1         2              
five  monday    two       1   1
four  monday    one       2   2
one   friday    one       2   2
      wednesday two      20  20
three tuesday   three     1   1
two   tuesday   one      10  10

which is exactly what you get without any sorting.

If all the index levels have the same categorical type it seems to work.

It works if I reset the index:

print(dd.reset_index().set_index([0, 1, 2]).sort_index())
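
The reset-and-set-index round trip above rebuilds the index from the level values. A minimal sketch of the same idea without going through columns is shown below. The helper name `rebuild_with_categories` and the small Series are hypothetical and not part of the original report; the sketch assumes that a MultiIndex built from `pd.Categorical` arrays takes its codes from the category order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def rebuild_with_categories(index, dtypes):
    """Return a new MultiIndex rebuilt from values (hypothetical helper).

    `dtypes` maps a level position to the CategoricalDtype wanted for it.
    Levels not listed in `dtypes` are passed through unchanged.
    """
    arrays = []
    for pos in range(index.nlevels):
        values = index.get_level_values(pos)
        if pos in dtypes:
            # Build the categorical from the values themselves, so the
            # resulting codes follow the category order of the dtype.
            values = pd.Categorical(values, dtype=dtypes[pos])
        arrays.append(values)
    return pd.MultiIndex.from_arrays(arrays, names=index.names)


number_type = CategoricalDtype(["one", "two", "three", "four", "five"], ordered=True)
day_type = CategoricalDtype(
    ["monday", "tuesday", "wednesday", "thursday", "friday"], ordered=True
)

# Illustrative data only: a subset of the index tuples from the report.
s = pd.Series(
    [1, 2, 3, 4],
    index=pd.MultiIndex.from_tuples(
        [("two", "tuesday"), ("one", "wednesday"), ("five", "monday"), ("one", "friday")]
    ),
)
s.index = rebuild_with_categories(s.index, {0: number_type, 1: day_type})
result = s.sort_index()
print(result)
```

With this construction, `sort_index()` should order the rows by category position (one, two, ..., five) rather than alphabetically.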

Expected Output

0     1         2              
one   wednesday two      20  20
      friday    one       2   2
two   tuesday   one      10  10
three tuesday   three     1   1
four  monday    one       2   2
five  monday    two       1   1

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.16-300.fc29.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.23.4
pytest: 4.0.0
pip: 18.1
setuptools: 40.5.0
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.6
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


mfenner1 · 2019-04-12T14:03:08Z

It seems like this might be the same issue. I have this MWE with pandas 0.23.4:

import pandas as pd
import numpy as np

df = pd.DataFrame({'group':['A']*6 + ['B']*6,
                   'dose':['high', 'med', 'low']*4,
                   'outcomes':np.arange(12.0)})

df.dose = pd.Categorical(df.dose, 
                         categories=['low', 'med', 'high'], 
                         ordered=True)

# dose is sorted low, med, high (works as expected)
# df.groupby('dose')['outcomes'].mean() 

df.groupby('group')['dose'].value_counts().sort_index(level=0, 
                                                       sort_remaining=True)

Output:

group  dose
A      high    2
       low     2
       med     2
B      high    2
       low     2
       med     2
Name: dose, dtype: int64

With or without the sort_index call, the inner level (dose) of the MultiIndex for the value counts appears to be sorting in lexicographic order, not in the defined pd.Categorical order. Somewhere along the way, either the values lose their Categorical-ness or it isn't interpreted within the MultiIndex.
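
One way to check the first hypothesis (whether a level is still categorical) is to inspect the dtype of each level. The sketch below uses a hand-built MultiIndex rather than the `value_counts` result, because what that result contains depends on the pandas version; the helper name `level_dtypes` is hypothetical.

```python
import pandas as pd


def level_dtypes(index):
    """Return a (name, dtype) pair for each level of a MultiIndex."""
    return [(name, level.dtype) for name, level in zip(index.names, index.levels)]


# Illustrative index: an ordered categorical inner level, as in the MWE above.
dose = pd.Categorical(
    ["high", "low", "med"], categories=["low", "med", "high"], ordered=True
)
idx = pd.MultiIndex.from_arrays([["A", "A", "A"], dose], names=["group", "dose"])

info = level_dtypes(idx)
print(info)
```

Applying the same helper to the index of the `value_counts` result would show whether the `dose` level is still an ordered categorical at that point.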

mroeschke · 2020-05-05T05:20:27Z

Looks like this is fixed on master. Could use a test.

In [51]: df.groupby('group')['dose'].value_counts().sort_index(level=0,
    ...:                                                        sort_remaining=True)
Out[51]:
group  dose
A      low     2
       med     2
       high    2
B      low     2
       med     2
       high    2
Name: dose, dtype: int64

In [52]: pd.__version__
Out[52]: '1.1.0.dev0+1466.ga3477c769.dirty'

quangngd · 2020-06-21T05:52:28Z

pd.__version__

1.1.0.dev0+1901.gaaa9cd03f

Setting the dtype and then setting the index gives the expected output:

dd = pd.DataFrame([
    ("five", "monday", "two", 1),
    ("four", "monday", "one", 1),
    ("one", "friday", "one", 1),
    ("one", "wednesday", "two", 1),
    ("three", "tuesday", "three", 1),
    ("two", "tuesday", "one", 1),
])
dd[0] = dd[0].astype(number_type)
dd[1] = dd[1].astype(day_type)
dd[2] = dd[2].astype(number_type)
dd.set_index([0,1,2]).sort_index()

                       3
0     1         2       
one   wednesday two    1
      friday    one    1
two   tuesday   one    1
three tuesday   three  1
four  monday    one    1
five  monday    two    1

Conversely, setting the index and then setting the dtype gives the described "bug":

dd = pd.DataFrame([
    ("five", "monday", "two", 1),
    ("four", "monday", "one", 1),
    ("one", "friday", "one", 1),
    ("one", "wednesday", "two", 1),
    ("three", "tuesday", "three", 1),
    ("two", "tuesday", "one", 1),
])
dd = dd.set_index([0, 1, 2])
# Cast the existing levels in place; `level` is passed by keyword.
dd.index = dd.index.set_levels(dd.index.levels[0].astype(number_type), level=0)
dd.index = dd.index.set_levels(dd.index.levels[1].astype(day_type), level=1)
dd.index = dd.index.set_levels(dd.index.levels[2].astype(number_type), level=2)
dd.sort_index()

                       3
0     1         2       
five  monday    two    1
four  monday    one    1
one   friday    one    1
      wednesday two    1
three tuesday   three  1
two   tuesday   one    1

This is because, in the latter case, we only modify the levels, not the codes. I don't know whether there is a proper way to handle this use case yet. Maybe it's in df.reindex, but I couldn't work it out.
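
The levels-versus-codes point can be seen on a small index. The sketch below is illustrative and not from the original thread: it casts one level with `set_levels`, as in the report, and compares the level values and codes before and after. The names `mi` and `cast` are made up for the example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

number_type = CategoricalDtype(["one", "two", "three"], ordered=True)

# A plain MultiIndex: pandas stores the unique level values plus integer codes.
mi = pd.MultiIndex.from_arrays([["two", "one", "three"], ["a", "b", "c"]])
before_levels = list(mi.levels[0])
before_codes = list(mi.codes[0])

# Same pattern as in the report: cast the existing level object.
# Only the level's dtype changes; the stored codes are not recomputed.
cast = mi.set_levels(mi.levels[0].astype(number_type), level=0)
after_levels = list(cast.levels[0])
after_codes = list(cast.codes[0])

print("before:", before_levels, before_codes)
print("after: ", after_levels, after_codes)
```

The level values keep the order they had before the cast, and the codes still point into that order, so any operation that works from the codes never sees the order defined by the categories.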

mroeschke added the Bug, Indexing (Related to indexing on series/frames, not to indexes themselves), and Categorical (Categorical Data Type) labels Jan 13, 2019

toobaz added the Index (Related to the Index class or subclasses) and MultiIndex labels and removed the Indexing (Related to indexing on series/frames, not to indexes themselves) label Jun 29, 2019

mroeschke added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the Bug, Categorical (Categorical Data Type), Index (Related to the Index class or subclasses), and MultiIndex labels May 5, 2020

MarcoGorelli mentioned this issue Feb 23, 2021 in #39986, "sort_index not sorting when multi-index made by different categorical types" (Merged, 4 tasks)

jreback added this to the 1.3 milestone Feb 25, 2021

jreback added the Groupby, MultiIndex, and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Feb 25, 2021

jreback closed this as completed in #39986 Mar 15, 2021
