Skip to content

Bug: groupby with sort=False creates buggy MultiIndex #32259

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
xin-jin opened this issue Feb 26, 2020 · 8 comments · Fixed by #38701
Closed

Bug: groupby with sort=False creates buggy MultiIndex #32259

xin-jin opened this issue Feb 26, 2020 · 8 comments · Fixed by #38701
Labels
Deprecate Functionality to remove in pandas Groupby MultiIndex
Milestone

Comments

@xin-jin
Copy link

xin-jin commented Feb 26, 2020

d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-13'])
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()

The index of t is certainly not sorted, but t.index.is_lexsorted() returns True.

Another more subtle example is

d = [3,4,10,0,1,2,5,3]
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b})
t = df.groupby(['d', 'a'], sort=False).mean()

This time the lexsort flag is correct. However, calling sortlevel will not sort the new MultiIndex correctly, that is, t.index.sortlevel(['d', 'a'])[0] returns

MultiIndex([( 3, 0),
            ( 3, 7),
            ( 4, 1),
            (10, 2),
            ( 0, 3),
            ( 1, 4),
            ( 2, 5),
            ( 5, 6)],
           names=['d', 'a'])

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-72-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.3.5
hypothesis : 5.5.4
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: 0.8.1
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.45.1

@MarcoGorelli
Copy link
Member

take

@MarcoGorelli MarcoGorelli removed their assignment Feb 26, 2020
@MarcoGorelli
Copy link
Member

(just some notes, will come back to this)

is_lexsorted assumes the order of .codes reflects the order of the elements in the index. But .codes is actually determined by

                codes, uniques = algorithms.factorize(self.grouper, sort=self.sort)

in pandas/core/groupby/grouper.py, where self.sort may be False

@xin-jin
Copy link
Author

xin-jin commented Feb 26, 2020

I found that stack/unstack has a similar bug:

d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-14'])
a = np.arange(len(d))
b = np.random.rand(len(d))
df = pd.DataFrame({'d': d, 'a': a, 'b': b}).set_index(['d', 'a']).take([3,2,4,1,0,5,6,7])

If you do df.unstack('d').stack().index, the resulting index is unsorted on level 'd', but its .is_lexsorted() returns True.

I see that stack also relies on algorithms.factorize, which probably leads to the same issue.

@MarcoGorelli
Copy link
Member

If you do df.unstack('d').stack().index, the resulting index is unsorted on level 'd', but its .is_lexsorted() returns True.

In this case, isn't the current result correct though?

In [1]: import pandas as pd; import numpy as np                                                                                                                                                                  

In [2]: d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-14']) 
   ...: a = np.arange(len(d)) 
   ...: b = np.random.rand(len(d)) 
   ...: df = pd.DataFrame({'d': d, 'a': a, 'b': b}).set_index(['d', 'a']).take([3,2,4,1,0,5,6,7])                                                                                                                

In [3]: df.unstack('d').stack().index                                                                                                                                                                            
Out[3]: 
MultiIndex([(0, '2020-11-02'),
            (1, '2019-01-02'),
            (2, '2020-01-02'),
            (3, '2020-02-04'),
            (4, '2020-11-03'),
            (5, '2019-11-03'),
            (6, '2019-11-13'),
            (7, '2019-11-14')],
           names=['a', 'd'])

The index is sorted by its first argument, and so it is correct to say that it's lexically sorted

@xin-jin
Copy link
Author

xin-jin commented Feb 27, 2020

If you do df.unstack('d').stack().index, the resulting index is unsorted on level 'd', but its .is_lexsorted() returns True.

In this case, isn't the current result correct though?

In [1]: import pandas as pd; import numpy as np                                                                                                                                                                  

In [2]: d = pd.to_datetime(['2020-11-02', '2019-01-02', '2020-01-02', '2020-02-04', '2020-11-03', '2019-11-03', '2019-11-13', '2019-11-14']) 
   ...: a = np.arange(len(d)) 
   ...: b = np.random.rand(len(d)) 
   ...: df = pd.DataFrame({'d': d, 'a': a, 'b': b}).set_index(['d', 'a']).take([3,2,4,1,0,5,6,7])                                                                                                                

In [3]: df.unstack('d').stack().index                                                                                                                                                                            
Out[3]: 
MultiIndex([(0, '2020-11-02'),
            (1, '2019-01-02'),
            (2, '2020-01-02'),
            (3, '2020-02-04'),
            (4, '2020-11-03'),
            (5, '2019-11-03'),
            (6, '2019-11-13'),
            (7, '2019-11-14')],
           names=['a', 'd'])

The index is sorted by its first argument, and so it is correct to say that it's lexically sorted

Oh you are right. I was confused a bit. Yes, this is actually correctly sorted.

@MarcoGorelli
Copy link
Member

Decision has been to deprecate is_lexsorted as a public method - users should use is_monotonic_increasing

@jorisvandenbossche
Copy link
Member

Decision has been to deprecate is_lexsorted as a public method - users should use is_monotonic_increasing

Where has this been discussed?

@MarcoGorelli
Copy link
Member

Hey @jorisvandenbossche - it was in call when we had the sprint on the 31st of August

@jreback jreback added this to the 1.3 milestone Jan 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment