BUG: behavior change in dataframe.shift(, freq=infer) when upgrading from pandas 1.1.5 to 1.2.x #40799
Comments
I think the freq inference changed with 1.2. Do you know if this is expected now, @jbrockmendel? Labelling as a regression for now.
DataFrame.shift still accepts
moving to 1.2.5
more minimal example
pd.to_datetime can handle the low precision strings, but not after being converted to a numpy float array. The PR that caused the regression changed the conversion from floats.
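To illustrate the kind of float precision problem described here, a minimal sketch follows. It is not the reporter's code: the sample values and the 0.1 s step are assumptions, and rounding to integer milliseconds is just one possible way to get an exactly regular index from low-precision float seconds.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: times in seconds with a nominal 0.1 s step,
# as they might come out of a CSV file with limited precision.
seconds = np.array([0.0, 0.1, 0.2, 0.3, 0.4])

# Most of these values have no exact binary representation, so float
# arithmetic on them can be off by a tiny amount.
print(0.1 + 0.2 == 0.3)  # False

# Round to integer milliseconds first, then reinterpret as datetimes.
millis = np.round(seconds * 1000).astype("int64")
index = pd.DatetimeIndex(millis.astype("datetime64[ms]"))

# With an exactly regular index, a frequency can be inferred.
print(pd.infer_freq(index))
```

Whether a direct float conversion with pd.to_datetime produces an exactly regular index depends on the pandas version, which is the point of this report.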
@jbrockmendel thoughts on this one?
The only way around this I see is reverting #35027. Maybe parts of it can be salvaged?
#35027 did give a very respectable perf gain, but the changes in the tests that were included with that PR indicated that there were some numerical differences. We have not had other reports, suggesting that these numerical differences are not generally an issue. This bug report shows that where the numerical precision is limited, these numerical differences become relevant. So on one hand, I don't think we should revert #35027, but I'm not sure how to fix it either.
We should maybe change the title of this issue, as the current title is misleading.
Tracing the calls to
With the following dirty hack to core/tools/datetimes.py, around line 310, the bug is solved:
    import numpy as np
    if isinstance(arg, (list, tuple, np.ndarray)):
        arg = np.array(arg, dtype="O")
Not checked, but I'm guessing this works since it doesn't use the faster code for floats. Interestingly, if we are converting a list or tuple to an object array, we may be missing the perf gain for floats in some cases. It may be that we need to patch the code in the read_csv stage, so that an object array is always passed to pd.to_datetime, to account for the lower precision of the string-based CSV storage (although that might not always be the case). I assume a float array is currently passed, otherwise we would not be getting the error?
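The same idea can be tried from user code, without patching pandas. The sketch below is an assumption-laden illustration, not a confirmed fix: the sample values and unit="s" are hypothetical, and whether the two results differ depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values: float seconds with a nominal 0.1 s step.
seconds = np.array([0.0, 0.1, 0.2, 0.3])

# Float array passed directly to pd.to_datetime.
from_floats = pd.to_datetime(seconds, unit="s")

# Same values wrapped in an object array first, mirroring the hack above.
as_objects = np.array(seconds, dtype="O")
from_objects = pd.to_datetime(as_objects, unit="s")

# Compare the two conversions and the frequency inferred from each.
print(from_floats.equals(from_objects))
print(from_floats.inferred_freq, from_objects.inferred_freq)
```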
@pandas-dev/pandas-core Not sure what to do here. I'm removing this from the 1.2.5 milestone so that we can proceed with the 1.2.5 release.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Problem description
I am reading time series results from another software package with pd.read_csv, and I put the time vector, converted to datetime, in the index. At some stage in the data processing I need to use dataframe.shift with freq='infer'. This was working fine with pandas 1.1.5 but stopped working after upgrading to pandas 1.2.0, raising a ValueError. I could not find any reference to this in the changelog, so I guess it might be considered a regression.
I worked around it by forcing the proper frequency in the shift function instead of using infer.
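The workaround can be sketched as follows. This is an illustration under assumptions, not the reporter's code: the timestamps, the 1 s step and the 1 ns jitter are made up to mimic an index whose spacing is no longer exactly regular.

```python
import pandas as pd

# Exactly regular index: 1 s between rows (integer nanoseconds since epoch).
regular = pd.to_datetime([0, 1_000_000_000, 2_000_000_000], unit="ns")
# Same index with 1 ns of jitter on the middle timestamp.
jittered = pd.to_datetime([0, 1_000_000_001, 2_000_000_000], unit="ns")

df_regular = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=regular)
df_jittered = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=jittered)

# With a regular index, the frequency can be inferred.
shifted = df_regular.shift(1, freq="infer")

# With the jittered index, no frequency can be inferred and shift raises.
try:
    df_jittered.shift(1, freq="infer")
    infer_failed = False
except ValueError:
    infer_failed = True

# Workaround: pass the known frequency explicitly instead of "infer".
workaround = df_jittered.shift(1, freq=pd.Timedelta(seconds=1))
print(infer_failed)
print(workaround.index)
```

Passing the frequency explicitly shifts every timestamp by that amount and leaves any jitter in place, so it avoids the error without making the index regular.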
Investigations
Trying to go into more depth in the issue using the above dataframe
df
returns for pandas 1.1.5
and for pandas 1.2.x
There is some kind of rounding issue when converting the float series to a datetime series. This is not the case if the index is created directly from the dates ndarray, so there might be something wrong in the read_csv part.
This works fine whatever the version of pandas:
Output of pd.show_versions()
for pandas 1.2.3
INSTALLED VERSIONS
commit : f2c8480
python : 3.9.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.14393
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252
pandas : 1.2.3
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.21.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
for pandas 1.1.5
INSTALLED VERSIONS
commit : b5958ee
python : 3.9.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.14393
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252
pandas : 1.1.5
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.21.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None