Skip to content

sort_index not sorting when multi-index made by different categorical types #24271

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wiso opened this issue Dec 13, 2018 · 3 comments · Fixed by #39986
Closed

sort_index not sorting when multi-index made by different categorical types #24271

wiso opened this issue Dec 13, 2018 · 3 comments · Fixed by #39986
Labels
good first issue Groupby MultiIndex Needs Tests Unit test(s) needed to prevent regressions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@wiso
Copy link

wiso commented Dec 13, 2018

Code Sample, a copy-pastable example if possible

This is the shorter code to reproduce the problem

import pandas as pd
from pandas.api.types import CategoricalDtype
number_type = CategoricalDtype(['one', 'two', 'three', 'four', 'five'], ordered=True)
day_type = CategoricalDtype(['monday', 'tuesday', 'wednesday', 'thursday', 'friday'], ordered=True)

dd = pd.DataFrame([('two', 'tuesday', 'one', 'up', 10),
                   ('one', 'wednesday', 'two', 'up', 20),
                   ('five', 'monday', 'two', 'up', 1),
                   ('three', 'tuesday', 'three', 'up', 1),
                   ('four', 'monday', 'one', 'up', 2),
                   ('one', 'friday', 'one', 'up', 2),
                   
                   ('two', 'tuesday', 'one', 'down', 10),
                   ('one', 'wednesday', 'two', 'down', 20),
                   ('five', 'monday', 'two', 'down', 1),
                   ('three', 'tuesday', 'three', 'down', 1),
                   ('four', 'monday', 'one', 'down', 2),
                   ('one', 'friday', 'one', 'down', 2),
                  ])
dd = dd.set_index([0, 1, 2, 3])
dd = dd.unstack(3)[4]


dd.index = dd.index.set_levels(dd.index.levels[0].astype(number_type), 0)
dd.index = dd.index.set_levels(dd.index.levels[1].astype(day_type), 1)
dd.index = dd.index.set_levels(dd.index.levels[2].astype(number_type), 2)

print (dd.sort_index())

Problem description

The dataframe is not sorted. I get

0     1         2              
five  monday    two       1   1
four  monday    one       2   2
one   friday    one       2   2
      wednesday two      20  20
three tuesday   three     1   1
two   tuesday   one      10  10

which is exactly what you get without any sorting.

If all the index levels have the same categorical type it seems to work.

It works if I reset the index:

print(dd.reset_index().set_index([0, 1, 2]).sort_index())

Expected Output

0     1         2              
one   wednesday two      20  20
      friday    one       2   2
two   tuesday   one      10  10
three tuesday   three     1   1
four  monday    one       2   2
five  monday    two       1   1

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.16-300.fc29.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.23.4
pytest: 4.0.0
pip: 18.1
setuptools: 40.5.0
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.6
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@mroeschke mroeschke added Bug Indexing Related to indexing on series/frames, not to indexes themselves Categorical Categorical Data Type labels Jan 13, 2019
@mfenner1
Copy link

It seems like this might be the same issue. I have this MWE with pandas-0.23.4

import pandas as pd
import numpy as np

df = pd.DataFrame({'group':['A']*6 + ['B']*6,
                   'dose':['high', 'med', 'low']*4,
                   'outcomes':np.arange(12.0)})

df.dose = pd.Categorical(df.dose, 
                         categories=['low', 'med', 'high'], 
                         ordered=True)

# dose is sorted low, med, high (works as expected)
# df.groupby('dose')['outcomes'].mean() 

df.groupby('group')['dose'].value_counts().sort_index(level=0, 
                                                       sort_remaining=True)

Output:

group  dose
A      high    2
       low     2
       med     2
B      high    2
       low     2
       med     2
Name: dose, dtype: int64

With or without the sort_index call, the inner level (dose) of the MultiIndex for the value counts appears to be sorting in lexicographic order, not in the defined pd.Categorical order. Somewhere along the way, either the values lose their Categorical-ness or it isn't interpreted within the MultiIndex.

@toobaz toobaz added Index Related to the Index class or subclasses MultiIndex and removed Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 29, 2019
@mroeschke
Copy link
Member

Looks like this is fixed on master. Could use a test

In [51]: df.groupby('group')['dose'].value_counts().sort_index(level=0,
    ...:                                                        sort_remaining=True)
Out[51]:
group  dose
A      low     2
       med     2
       high    2
B      low     2
       med     2
       high    2
Name: dose, dtype: int64

In [52]: pd.__version__
Out[52]: '1.1.0.dev0+1466.ga3477c769.dirty'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Categorical Categorical Data Type Index Related to the Index class or subclasses MultiIndex labels May 5, 2020
@quangngd
Copy link
Contributor

quangngd commented Jun 21, 2020

pd.__version__

1.1.0.dev0+1901.gaaa9cd03f

Setting type then setting index gives expected output:

dd = pd.DataFrame([
    ("five", "monday", "two", 1),
    ("four", "monday", "one", 1),
    ("one", "friday", "one", 1),
    ("one", "wednesday", "two", 1),
    ("three", "tuesday", "three", 1),
    ("two", "tuesday", "one", 1),
])
dd[0] = dd[0].astype(number_type)
dd[1] = dd[1].astype(day_type)
dd[2] = dd[2].astype(number_type)
dd.set_index([0,1,2]).sort_index()
3
0     1         2       
one   wednesday two    1
      friday    one    1
two   tuesday   one    1
three tuesday   three  1
four  monday    one    1
five  monday    two    1

Conversely, setting index then setting dtype gives the described "bug":

dd = pd.DataFrame([
    ("five", "monday", "two", 1),
    ("four", "monday", "one", 1),
    ("one", "friday", "one", 1),
    ("one", "wednesday", "two", 1),
    ("three", "tuesday", "three", 1),
    ("two", "tuesday", "one", 1),
])
dd = dd.set_index([0, 1, 2])
dd.index = dd.index.set_levels(dd.index.levels[0].astype(number_type), 0)
dd.index = dd.index.set_levels(dd.index.levels[1].astype(day_type), 1)
dd.index = dd.index.set_levels(dd.index.levels[2].astype(number_type), 2)
3
0     1         2       
five  monday    two    1
four  monday    one    1
one   friday    one    1
      wednesday two    1
three tuesday   three  1
two   tuesday   one    1

This is because in the latter case, we only modify the levels but not the codes. Don't know if we have proper way for this usecase yet. Maybe it's in df.reindex but i couldnt work it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Groupby MultiIndex Needs Tests Unit test(s) needed to prevent regressions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants