BUG: pd.cut regression since version 1.4.0 when operating on datetime #46218
Comments
Thanks for the report! Confirmed on main. |
@simonjayhawkins you never ping me on my good commits |
The bad is copy and pasted from the git bisect output and is the default label. All your commits are good, but some are maybe less good than others? I cc you on these since IIRC you requested me to do so. |
Just kidding around simon. my bad that wasn't clear. |
In #42227, this is the changed code that the code sample in the OP hits. It appears |
In _bins_to_cuts there is a |
moving to 1.4.3 |
Is there a recommended workaround? Some of my code stopped working and it took me a couple of hours of debugging to realize that pd.cut was broken in recent Pandas releases, and then find this page. |
@dom-insytesys comments like this are not helpful. pandas is a community project. Feel free to post a workaround for the benefit of other users or submit a PR to fix and it will be included in the next pandas patch release. |
@simonjayhawkins I'm so sorry if my comment annoyed you. I didn't intend to be rude. What I was hoping to convey was firstly that this is not an obscure corner case. One of the most common uses of pd.cut() is to bin time series data into windows. So I assume this bug affects a lot of Pandas users, myself included. Until Pandas 1.4.3 is released, I need to figure out some sort of a workaround.

Obviously, because Pandas is open-source (hoorah!) I can theoretically get the source code, find the bug, and fix it myself. Or I can use the pd.cut() source as a guide to writing my own alternative version. However, that has a pretty steep learning curve (I'm not very proficient in Python, nor familiar with Pandas internals, so I'd guess it would take me a day or more), and I was hoping that there was some sort of quick workaround that would be childishly obvious to guys like you who have a good understanding of this code.

For example, @jbrockmendel provides an explanation above that I think says that the bug involves the conversion of Timestamp data to float64. That suggests to me that if I convert the timeseries data that needs binning from datetime64[ns] to float64 (seconds since epoch?), that will avoid the bug. I can't figure out how to do this (e.g. astype('float64') produces a TypeError), but I bet it is trivial for you. That's why I was hoping one of the Pandas team would say, "here's a quick way to avoid the bug: just convert your data from datetime64 to float64 using XXXX function". That would help affected users get quickly back on course.

Again, I apologize wholeheartedly if I caused offense. I hugely appreciate the work of the volunteer community that works on Pandas. It's an amazing project. I'm just asking for some help, with the expectation that any guidance would probably be useful to other users who end up stumbling onto this page for the same reason. |
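For readers wondering what the datetime64-to-float64 conversion asked about above could look like, here is a minimal sketch. It assumes the goal is to bin on plain numbers so that pd.cut never sees datetime data; whether this sidestepped the regression on 1.4.x was not verified here. The names `EPOCH`, `ONE_SECOND`, `value`, and `window_id` are made up for illustration, and the values are deterministic stand-ins for real data.

```python
import numpy as np
import pandas as pd

EPOCH = pd.Timestamp("1970-01-01")
ONE_SECOND = pd.Timedelta(seconds=1)

# Hourly timestamps with deterministic values (stand-ins for real data).
df = pd.DataFrame({
    "clock": pd.date_range(start="2020-01-01", end="2020-01-02", periods=25),
    "value": np.arange(25.0),
})
# Two-hour window edges, kept as plain timestamps instead of an IntervalIndex.
edges = pd.date_range(start="2019-12-31 23:30", end="2020-01-01 23:30", periods=13)

# Subtracting the epoch and dividing by one second yields float64 seconds.
clock_seconds = (df["clock"] - EPOCH) / ONE_SECOND
edge_seconds = np.asarray((edges - EPOCH) / ONE_SECOND)

# labels=False returns the position of the window each row falls in
# (NaN for rows outside every window); windows are right-closed by default.
window_id = pd.cut(clock_seconds, bins=edge_seconds, labels=False)

# Rows with a NaN window are dropped by groupby.
means = df["value"].groupby(window_id).mean()
print(means)
```

The window positions can be mapped back to timestamps through `edges` if interval labels are needed, since position `k` corresponds to the window from `edges[k]` to `edges[k + 1]`.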
The quick fix is described in my bug report. Use strings. |
In the spirit of attempting to help out others, here's my crude effort at a workaround. Firstly, some code to illustrate the bug:
# Example code to illustrate pd.cut() bug
import random
import pandas as pd
df = pd.DataFrame({"clock": pd.date_range(start='2020/1/1', end='2020/1/2', periods=25), "randomdata": [random.random() for _ in range(25)]})
windows = pd.IntervalIndex.from_breaks(pd.date_range(start='2019/12/31 23:30', end='2020/1/1 23:30', periods=13))
bins = pd.cut(df.clock, bins=windows)
df['randomdata'].groupby(bins).mean()
This produces the following output:
If I understand correctly, the bug relates to the fact that df.clock has a Timestamp-like dtype. This means it gets on a fast-track code path. That can be avoided by converting the datetime to a string. Changing the line to:
bins = pd.cut(df.clock.dt.strftime("%Y/%m/%d %H:%M:%S.%f"), bins=windows)
produces the desired output:
This seems to be working for me. But I'm a little concerned that it is fragile. (What happens, for example if IntervalIndex is in a different timezone? etc.) |
moving to 1.4.4 |
Closing as a duplicate of #46218. Discussion of a solution seems further along there |
I think you closed the wrong one. |
Thanks @sergiykhan. Reopened |
I opened #47771 with a potential fix. It doesn't fix the root cause inside |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Binning is not correct in Pandas 1.4.x.
Note that the same series gets processed correctly when the dtype is not datetime64[ns].
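One way to check whether an installed pandas version bins datetime data correctly is to compare pd.cut against a direct interval-membership test. The sketch below is not the reporter's original example; the data and the names `expected` and `actual` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

clock = pd.Series(pd.date_range(start="2020-01-01", end="2020-01-02", periods=25))
windows = pd.IntervalIndex.from_breaks(
    pd.date_range(start="2019-12-31 23:30", end="2020-01-01 23:30", periods=13)
)

# Reference answer: test each right-closed window by direct comparison.
# -1 marks timestamps that fall in no window.
expected = np.full(len(clock), -1)
for position, window in enumerate(windows):
    inside = (clock > window.left) & (clock <= window.right)
    expected[inside.to_numpy()] = position

# Category codes assigned by pd.cut (-1 also marks "no window" here).
actual = pd.cut(clock, bins=windows).cat.codes.to_numpy()

print("pd.cut matches manual binning:", np.array_equal(actual, expected))
```

If the printed result is False, the installed version is presumably affected by a binning problem of the kind described here, and the manual `expected` array can serve as a stopgap.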
Expected Behavior
The expected behavior is
Installed Versions
INSTALLED VERSIONS
commit : 06d2301
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.60.1-microsoft-standard-WSL2
Version : #1 SMP Wed Aug 25 23:20:18 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.1
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 60.7.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None