groupby/quantile breaks #28307

Closed
daishi opened this issue Sep 5, 2019 · 4 comments
Labels
Needs Info: Clarification about behavior needed to assess issue

Comments

@daishi

daishi commented Sep 5, 2019

Code Sample, a copy-pastable example if possible

In [34]: pd.DataFrame({'x': [0, 1, 2], 'y': [0, 1, 2]}).groupby('x')['y'].quantile([0.5])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-34-287422cf6b27> in <module>
----> 1 pd.DataFrame({'x': [0, 1, 2], 'y': [0, 1, 2]}).groupby('x')['y'].quantile([0.5])

~/.venv-3.7.2/p4p_backend-1ulE3AqG/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in quantile(self, q, interpolation)
   1951             indices = np.concatenate(arrays)
   1952             assert len(indices) == len(result)
-> 1953             return result.take(indices)
   1954 
   1955     @Substitution(name="groupby")

~/.venv-3.7.2/p4p_backend-1ulE3AqG/lib/python3.7/site-packages/pandas/core/series.py in take(self, indices, axis, is_copy, **kwargs)
   4430 
   4431         indices = ensure_platform_int(indices)
-> 4432         new_index = self.index.take(indices)
   4433 
   4434         if is_categorical_dtype(self):

~/.venv-3.7.2/p4p_backend-1ulE3AqG/lib/python3.7/site-packages/pandas/core/indexes/multi.py in take(self, indices, axis, allow_fill, fill_value, **kwargs)
   2030             allow_fill=allow_fill,
   2031             fill_value=fill_value,
-> 2032             na_value=-1,
   2033         )
   2034         return MultiIndex(

~/.venv-3.7.2/p4p_backend-1ulE3AqG/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _assert_take_fillable(self, values, indices, allow_fill, fill_value, na_value)
   2058                 taken = masked
   2059         else:
-> 2060             taken = [lab.take(indices) for lab in self.codes]
   2061         return taken
   2062 

~/.venv-3.7.2/p4p_backend-1ulE3AqG/lib/python3.7/site-packages/pandas/core/indexes/multi.py in <listcomp>(.0)
   2058                 taken = masked
   2059         else:
-> 2060             taken = [lab.take(indices) for lab in self.codes]
   2061         return taken
   2062 

IndexError: index 3 is out of bounds for size 3

Problem description

The call above raises an IndexError on pandas 0.25.1. The same call worked in pandas 0.24.2.

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.0.0-27-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.1
numpy            : 1.17.1
pytz             : 2019.2
dateutil         : 2.8.0
pip              : 19.2.3
setuptools       : 41.2.0
Cython           : None
pytest           : 5.1.2
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : 1.2.0
lxml.etree       : 4.4.1
html5lib         : None
pymysql          : None
psycopg2         : 2.8.3 (dt dec pq3 ext lo64)
jinja2           : 2.10.1
IPython          : 7.5.0
pandas_datareader: None
bs4              : 4.8.0
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.4.1
matplotlib       : 3.1.1
numexpr          : None
odfpy            : None
openpyxl         : 2.6.3
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.3.1
sqlalchemy       : 1.3.8
tables           : None
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : 1.2.0
@WillAyd
Member

WillAyd commented Sep 5, 2019

Can you try on master? I believe this has already been fixed.

WillAyd added the "Needs Info" label Sep 6, 2019
@josesho

josesho commented Sep 6, 2019

I think I'm seeing the same issue here: #28312.

Basically, if the GroupBy column has more than 2 categories, the quantile code breaks.
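
To make the error message concrete: with three groups and one quantile the result has three rows, so any positional index of 3 or more fails in `take`. The sketch below is an illustration only, not the pandas source. It assumes the per-quantile results are stacked quantile by quantile and then reordered so that each group's quantiles sit together; `group_major_order` is a hypothetical helper name.

```python
import numpy as np

def group_major_order(n_groups, n_quantiles):
    """Positions that reorder a quantile-major result into group-major order.

    Hypothetical helper for illustration; not taken from pandas.
    """
    # Row r of the grid holds the positions of quantile r across all groups.
    grid = np.arange(n_groups * n_quantiles).reshape(n_quantiles, n_groups)
    # Transposing and flattening walks the grid group by group instead.
    return grid.T.ravel()

values = np.array([0.0, 1.0, 2.0])    # 3 groups x 1 quantile, as in the report
indices = group_major_order(3, 1)
assert indices.max() < len(values)    # every position must be in bounds
reordered = values.take(indices)

try:
    values.take([0, 1, 3])            # position 3 does not exist in 3 rows
except IndexError as exc:
    print("IndexError:", exc)
```

Any index array passed to `take` has to stay below `n_groups * n_quantiles`; the traceback above shows that this was not the case for three groups.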

@TomAugspurger
Contributor

Fixed by #28113, I think. That's included in 0.25.2, which will be out in a few weeks, or you can use master.
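
Until a release with the fix is installed, one possible workaround is to bypass GroupBy.quantile and call Series.quantile on each group separately. This is a minimal sketch, not something proposed in this thread; `groupby_quantiles` is a hypothetical helper name, and looping over groups in Python will be slower than the built-in path when there are many groups.

```python
import pandas as pd

def groupby_quantiles(df, by, column, q):
    """Per-group quantiles via Series.quantile, one group at a time.

    Hypothetical helper; `by` is a single column name and `q` a list of quantiles.
    """
    # Iterating a grouped column yields (group key, Series of that group's values).
    pieces = {key: grp.quantile(q) for key, grp in df.groupby(by)[column]}
    # Concatenating a dict puts the group key on the outer index level
    # and the quantile on the inner level.
    out = pd.concat(pieces)
    out.index.names = [by, None]
    return out

df = pd.DataFrame({'x': [0, 1, 2], 'y': [0, 1, 2]})
result = groupby_quantiles(df, 'x', 'y', [0.5])
print(result)
```

The result is indexed by group and then by quantile, the same layout as the output daishi posts below for master.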

@daishi
Author

daishi commented Sep 6, 2019

Confirming fixed on master:

In [4]: import pandas as pd

In [5]: pd.DataFrame({'x': [0, 1, 2], 'y': [0, 1, 2]}).groupby('x')['y'].quantile([0.5])
Out[5]: 
x     
0  0.5    0.0
1  0.5    1.0
2  0.5    2.0
Name: y, dtype: float64

In [6]: pd.__version__
Out[6]: '0.25.0+296.g2d65e38f5'

FWIW, as a side note, it took me a little while to figure out how to build and run off of master. Also, I didn't initially anticipate the build time required for the native libraries. Happy to help, but maybe a link giving a quick outline of the process when suggesting "try on master" would help reduce friction for casual bug reporters like myself?
