-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: objects_to_datetime64ns performance degradation for datetime columns #50537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
thanks @kenha37 for your report I can't reproduce, see below: (.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ pip install pandas==1.5.2
Collecting pandas==1.5.2
Using cached pandas-1.5.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Requirement already satisfied: python-dateutil>=2.8.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.5.2) (2.8.2)
Requirement already satisfied: numpy>=1.20.3 in ./.venv/lib/python3.8/site-packages (from pandas==1.5.2) (1.23.5)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.5.2) (2022.6)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas==1.5.2) (1.16.0)
Installing collected packages: pandas
Attempting uninstall: pandas
Found existing installation: pandas 1.4.4
Uninstalling pandas-1.4.4:
Successfully uninstalled pandas-1.4.4
Successfully installed pandas-1.5.2
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ python t.py
Starting conversion
Execution time: 0:00:00.008793
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ pip install pandas==1.4.4
Collecting pandas==1.4.4
Using cached pandas-1.4.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Requirement already satisfied: numpy>=1.18.5 in ./.venv/lib/python3.8/site-packages (from pandas==1.4.4) (1.23.5)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.4.4) (2022.6)
Requirement already satisfied: python-dateutil>=2.8.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.4.4) (2.8.2)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas==1.4.4) (1.16.0)
Installing collected packages: pandas
Attempting uninstall: pandas
Found existing installation: pandas 1.5.2
Uninstalling pandas-1.5.2:
Successfully uninstalled pandas-1.5.2
Successfully installed pandas-1.4.4
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ python t.py
Starting conversion
Execution time: 0:00:00.008626
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ cat t.py
import pandas as pd
import datetime as dt
import pytz
import logging
from pandas._libs.tslibs import (
conversion,
)
from pandas._libs import (
tslib,
)
date_range = pd.date_range(dt.datetime(2021, 1, 1, tzinfo=pytz.timezone('CET')), dt.datetime(2022, 1, 1, tzinfo=pytz.timezone('CET')), freq='1H')
df = pd.DataFrame({'dtColumn': date_range})
df['dtColumn'] = pd.to_datetime(df['dtColumn'])
data = df.iloc[:, 0:].values
print('Starting conversion')
if pd.__version__ == '1.5.2':
start = dt.datetime.now()
result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
errors='raise',
utc=False,
dayfirst=False,
yearfirst=False,
require_iso8601=True,
allow_mixed=True)
print(f'Execution time: {dt.datetime.now() - start}')
if pd.__version__ == '1.4.4':
start = dt.datetime.now()
try:
result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
errors='raise',
utc=False,
dayfirst=False,
yearfirst=False,
require_iso8601=True,
allow_mixed=True)
except ValueError:
values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
print(f'Execution time: {dt.datetime.now() - start}') |
Thank you for looking at the issue. I have done some further testing, and the issue seems to only show itself on windows systems. When trying to step through the array_to_datetime method in a debugger, it seems the code is rotating through the find_tzfile and load_tzdata functions. I have tested that this is a problem both when using virtual environments and the system interpreter. |
that's useful, thanks could you try on the |
Maybe related to #49048. |
I can confirm that installing the tzdata package removes the problem.
The main branch by itself retains the problem. |
I tried building from source on the User@DESKTOP-U8OKFP3 MINGW64 ~/pandas-dev (main)
$ pip uninstall tzdata
Found existing installation: tzdata 2022.7
Uninstalling tzdata-2022.7:
Would remove:
c:\users\user\pandas-dev\.venv\lib\site-packages\tzdata-2022.7.dist-info\*
c:\users\user\pandas-dev\.venv\lib\site-packages\tzdata\*
Proceed (Y/n)? y
Successfully uninstalled tzdata-2022.7
(.venv)
User@DESKTOP-U8OKFP3 MINGW64 ~/pandas-dev (main)
$ python t.py
Execution time: 0:00:00.011298
(.venv)
User@DESKTOP-U8OKFP3 MINGW64 ~/pandas-dev (main)
$ cat t.py
import pandas as pd
import datetime as dt
import pytz
import logging
from pandas._libs.tslibs import (
conversion,
)
from pandas._libs import (
tslib,
)
date_range = pd.date_range(dt.datetime(2021, 1, 1, tzinfo=pytz.timezone('CET')), dt.datetime(2022, 1, 1, tzinfo=pytz.timezone('CET')), freq='1H')
df = pd.DataFrame({'dtColumn': date_range})
df['dtColumn'] = pd.to_datetime(df['dtColumn'])
data = df.iloc[:, 0:].values
start = dt.datetime.now()
result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
errors='raise',
utc=False,
dayfirst=False,
yearfirst=False,
)
print(f'Execution time: {dt.datetime.now() - start}') How did you run on the main branch? |
I installed it in a pipenv using "pipenv install PATH/TO/LOCAL/PANDAS-GIT". Here I also add and remove tzdata from the pipenv to reproduce the effect of having tzdata in the environment. I have tested this on a couple of windows 11 pc's, most running Version 10.0.22621 Build 22621. |
thanks for the info no idea how to debug this TBH if I can't reproduce it, sorry - hopefully someone else can help |
I have installed pandas from source again just to be sure the install was successful, and I am still seeing the same problem when the tzdata package isn't installed. I am not sure how the timezone information normally would be installed on windows systems, but if it is installed externally from the python environment on your system, you probably wont see the issue. There are some suggested solutions in #49048, but I wouldn't know where to start to help with a solution. |
The solution is either a) require tzdata |
Not sure if that's the same core issue, however, here's a simple reproduction script of a performance issue introduced in Pandas 1.5.0, tested on Windows 10: |
@MarcoGorelli is this fixed by #51247? |
thanks for the ping - yes, that's right! let's close then |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
I have been seeing slow performance when using the pandas read_sql_query function when the returned data includes datetimes with a non UTC timezone offset.
I have located the issue to the objects_to_datetime64ns function called when constructing the dataframe.
In pandas 1.4.* the data function would try to call tslib.array_to_datetime and fail, then having a failover to conversion.datetime_to_datetime64, but for pandas 1.5.* the tslib.array_to_datetime is executed which takes a significant amount of time compared to the 1.4.* execution time.
Installed Versions
INSTALLED VERSIONS
commit : 8dab54d
python : 3.9.13.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Danish_Denmark.1252
pandas : 1.5.2
numpy : 1.24.1
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None
Prior Performance
On my computer, the difference between the two version is
Pandas 1.4.4: Execution time: 0:00:00.013004
Pandas 1.5.2: Execution time: 0:00:21.586688
The text was updated successfully, but these errors were encountered: