Imprecise intervals/labels by pd.cut #16276
Comments
I have seen PR #15309, looked at the code for
pd.Categorical([pd.Interval(0, 2, 'both'), pd.Interval(2, 4, 'right')])
# ValueError: intervals must all be closed on the same side
So, the expected input posted above does not indicate an unknown issue. Also, the meaning of the
to this:
So it seems like to get back the old (and still documented) meaning, the vectors of the Would this change come with any disastrous consequences? |
In the prior implementation it was allowed to have mixed intervals, IOW, ones that didn't conform in a uniform way. This is not allowed directly from
This is not a regression; it eliminates an untested, undocumented case, which IMHO is pretty useless. It is theoretically possible to support it in an index, but you would need a clear-cut use case. |
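To make the "uniform" constraint concrete, here is a minimal sketch (my own illustration, not code from the thread) showing that an IntervalIndex carries one `closed` setting for all of its intervals, while a standalone `pd.Interval` can be closed on both sides. The break values 0, 2, 4 are just assumed to match the example data discussed here.

```python
import pandas as pd

# An IntervalIndex has one `closed` setting shared by every interval.
right_closed = pd.IntervalIndex.from_breaks([0, 2, 4], closed="right")
print(right_closed.closed)

# With closed="right", the lowest break (0) is not inside the first interval.
first = right_closed[0]
print(0 in first, 2 in first)

# A standalone Interval can be closed on both sides and does contain 0.
both_closed = pd.Interval(0, 2, closed="both")
print(0 in both_closed)
```

This is why a right-closed index needs its lowest edge nudged down to include the minimum value.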
If you really really needed to replicate this, you can do this.
This is supported as much as scalar labels are, meaning you can directly index them, but this is NOT an IntervalIndex. |
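For readers without the original snippet, one way to get exact labels is to pass explicit `labels` to `pd.cut`. The sketch below is an assumption about what such a workaround could look like, not the code posted in the comment; the label strings and the bin edges `[0, 2, 4]` are hypothetical choices for this illustration.

```python
import numpy as np
import pandas as pd

x = np.arange(5)

# Hypothetical label strings chosen for this illustration.
labels = ["[0, 2]", "(2, 4]"]

# Explicit integer bin edges plus explicit labels: the result is a plain
# Categorical with string categories, not an IntervalIndex.
binned = pd.cut(x, bins=[0, 2, 4], include_lowest=True, labels=labels)

print(binned.categories.tolist())
print(list(binned))
```

The trade-off is that the categories are now plain strings, so interval-aware operations are not available on them.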
Here is a minimal use case:
import pandas as pd
from plotnine import ggplot, aes, geom_point, facet_wrap, lims

df = pd.DataFrame({'x': range(5),
                   'y': range(5),
                   })
# Facet on a binned version of x; the panel labels come from pd.cut
p = (ggplot(df, aes('x', 'y'))
     + geom_point()
     + facet_wrap('pd.cut(x, 2, include_lowest=True)')
     + lims(x=(0, 4))
     )
print(p)
First, the labels on the panels are not precise, so you would have to re-examine the data limits to be certain of what data is at the edges. Second, integers that could otherwise be cut along integer lines have floating-point artefacts. These two issues add up when the code that generates the plots is a script not amenable to changes specific to the data currently being plotted. Another use case, planned for the plotting system, is a tree faceting feature. In this case, without automatic precise intervals/labels, I envision a more gnarly tree-generating algorithm. |
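The same cut can be inspected without plotnine. This is a small sketch of my own, assuming the same five integer values as the data frame above; the exact label text depends on the pandas version, so only the general shape is checked here.

```python
import numpy as np
import pandas as pd

x = np.arange(5)
binned = pd.cut(x, 2, include_lowest=True)

# The categories form an IntervalIndex closed on the right.
print(binned.categories)

# The left edge of the first interval is pushed slightly below the minimum
# so that the lowest value is included.
lowest = binned.categories[0]
print(lowest.left < x.min(), lowest.closed)
```

That slightly negative left edge is what ends up in the facet labels.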
Yes, I think this summarizes things correctly. I agree that exposing a floating point lower bound is non-ideal. This was the simplest thing to do for I think IntervalIndex could indeed be adjusted to accommodate an external interval that is closed on both sides -- and this might be the nicest solution, especially if the more complex indexing behavior is not really needed -- but it would indeed take some work to make this happen. |
Closing since this was just a usage question. |
Code Sample
Output
Problem description
The cutting behaviour is correct, but the labels are awkward (given the input): they leak the internal adjustments used to do the cutting. The expected output has precise intervals/labels. This is a regression from the previous version, 0.19.2.
Expected Output
Output of pd.show_versions()
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.6-gentoo-r1
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.utf8
LANG: en_US.utf8
LOCALE: en_US.UTF-8
pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
feather: None
matplotlib: 2.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None