Indexing on series where index is output from pd.cut #27437
Comments
This looks to work on master now. Could use a test |
Is this still open? Also want to confirm that this should go in |
take |
@djgray780 are you still working on this? |
take |
Is this issue still open? If so, I would like to contribute! |
@ryuusama09 No one is assigned to the issue so I think you can go ahead and work on it. |
take |
take |
Just started looking into this issue (as I see it's still open). As far as I can see, the issue still exists. The behaviour using
>>> import pandas as pd
>>> data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))
>>> grouped = data.groupby("quintiles")["values"]
>>> samples = grouped.agg(lambda x: list(x.sample(3)))
>>> samples
quintiles
(-0.019, 3.8] [3, 2, 1]
(3.8, 7.6] [6, 4, 7]
(7.6, 11.4] [11, 8, 9]
(11.4, 15.2] [13, 14, 12]
(15.2, 19.0] [18, 19, 16]
>>> samples[0]
[3, 2, 1]
>>> samples[1]
[3, 2, 1]
>>> samples.loc[0]
[3, 2, 1]
>>> samples.loc[1]
[3, 2, 1]
>>> samples.iloc[0]
[3, 2, 1]
>>> samples.iloc[1] # expected behaviour
[6, 4, 7]
Extending this, the bug also occurs when constructing a series from an instance of
>>> samples.index.categories
IntervalIndex([(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2], (15.2, 19.0]], dtype='interval[float64, right]')
>>> new_series = pd.Series(samples.values, index=samples.index.categories)
>>> new_series
(-0.019, 3.8] [3, 2, 1]
(3.8, 7.6] [6, 4, 7]
(7.6, 11.4] [11, 8, 9]
(11.4, 15.2] [13, 14, 12]
(15.2, 19.0] [18, 19, 16]
dtype: object
>>> new_series[0]
[3, 2, 1]
>>> new_series[1]
[3, 2, 1]
I will attempt to write up some unittests for this (seems the logical grouping for |
^^ Having worked on this issue, I realised that this is expected behaviour, and changing it ends up breaking a lot of tests. The expected behaviour was already implemented in tests a few months back here.
The call stack was roughly going from checking the fallback from series
To then checking the categorical
Which obviously fell back to the IntervalIndex
In this case I initially tried modifying the list of accepted strings to include
This matter should be closed, and I will delete my PR above. |
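The last step of the chain described above, where the lookup lands on the IntervalIndex, can be sketched as follows. This is only an illustration, assuming a reasonably recent pandas: it calls `IntervalIndex.get_loc` directly on the categories produced by `pd.cut` and does not reproduce the internal call stack mentioned in the comment. The variable names are made up for the example.

```python
import pandas as pd

# The categories of a pd.cut result form an IntervalIndex of the five bins:
# (-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2], (15.2, 19.0]
categories = pd.cut(list(range(20)), 5).categories

# get_loc with a scalar returns the position of the interval that
# contains it, not the scalar itself interpreted as a position.
pos_of_0 = categories.get_loc(0)    # 0 lies in the first bin
pos_of_1 = categories.get_loc(1)    # 1 lies in the first bin as well
pos_of_5 = categories.get_loc(5)    # 5 lies in the second bin
pos_of_19 = categories.get_loc(19)  # bins are right-closed, so 19 is in the last bin
```

Because both 0 and 1 resolve to the first bin, a label-based lookup with either key returns the same row, which matches the `samples[0]` and `samples[1]` output shown earlier in the thread.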
Just tried on master. This issue still exists.
This might be expected behavior. |
take |
Can I work on this if it's still open as I'm a beginner? |
take |
take |
Code Sample, a copy-pastable example if possible
Problem description
Indexing a Series whose index consists of pd.cut() intervals performs a membership test of the provided key against the intervals, rather than a positional lookup. I found this interesting, but very unexpected. Is this intentional and documented somewhere? When I give the bins names (labels=list("abcde") in pd.cut), this no longer happens.
Expected Output
I expected it to fall back on .iloc[].
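For readers who want the positional result regardless of the index type, here is a minimal sketch, assuming a reasonably recent pandas. The series contents (`"v"` to `"z"`) and variable names are invented for illustration. It contrasts label-based lookup on interval bins with explicit `.iloc` access, and shows the named-label variant mentioned in the problem description.

```python
import pandas as pd

values = list(range(20))

# Interval-labelled bins: the index entries are intervals.
interval_index = pd.cut(values, 5).categories
by_interval = pd.Series(list("vwxyz"), index=interval_index)

# Label-based lookup with a number selects the interval containing it.
label_0 = by_interval.loc[0]   # 0 falls in (-0.019, 3.8]
label_1 = by_interval.loc[1]   # 1 falls in the same interval
label_5 = by_interval.loc[5]   # 5 falls in (3.8, 7.6]

# Explicit positional lookup is unaffected by the index type.
position_1 = by_interval.iloc[1]

# Named bins: the index entries are plain strings, so numbers are
# not matched against any interval.
named_index = pd.cut(values, 5, labels=list("abcde")).categories
by_name = pd.Series(list("vwxyz"), index=named_index)
name_b = by_name.loc["b"]
name_position_1 = by_name.iloc[1]
```

Using `.iloc` for positions and `.loc` for labels keeps the intent explicit and avoids relying on how plain `[]` resolves an integer key for a given index type.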
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.12.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.24.2
pytest: 4.4.1
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.6
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.4.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None