Skip to content

PERF: objects_to_datetime64ns performance degradation for datetime columns #50537

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
kenha37 opened this issue Jan 3, 2023 · 13 comments
Closed
3 tasks done
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Windows Windows OS

Comments

@kenha37
Copy link

kenha37 commented Jan 3, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I have been seeing slow performance when using the pandas read_sql_query function when the returned data includes datetimes with a non UTC timezone offset.

I have located the issue to the objects_to_datetime64ns function called when constructing the dataframe.
In pandas 1.4.* the data function would try to call tslib.array_to_datetime and fail, then having a failover to conversion.datetime_to_datetime64, but for pandas 1.5.* the tslib.array_to_datetime is executed which takes a significant amount of time compared to the 1.4.* execution time.

Installed Versions

INSTALLED VERSIONS

commit : 8dab54d
python : 3.9.13.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Danish_Denmark.1252
pandas : 1.5.2
numpy : 1.24.1
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Prior Performance

On my computer, the difference between the two version is
Pandas 1.4.4: Execution time: 0:00:00.013004
Pandas 1.5.2: Execution time: 0:00:21.586688

import pandas as pd
import datetime as dt
import pytz
import logging
from pandas._libs.tslibs import (
    conversion,
)
from pandas._libs import (
    tslib,
)

date_range = pd.date_range(dt.datetime(2021, 1, 1, tzinfo=pytz.timezone('CET')), dt.datetime(2022, 1, 1, tzinfo=pytz.timezone('CET')), freq='1H')
df = pd.DataFrame({'dtColumn': date_range})
df['dtColumn'] = pd.to_datetime(df['dtColumn'])
data = df.iloc[:, 0:].values

print('Starting conversion')
if pd.__version__ == '1.5.2':
    start = dt.datetime.now()
    result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
                                                errors='raise',
                                                utc=False,
                                                dayfirst=False,
                                                yearfirst=False,
                                                require_iso8601=True,
                                                allow_mixed=True)
    print(f'Execution time: {dt.datetime.now() - start}')

if pd.__version__ == '1.4.4':
    start = dt.datetime.now()
    try:
        result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
                                                    errors='raise',
                                                    utc=False,
                                                    dayfirst=False,
                                                    yearfirst=False,
                                                    require_iso8601=True,
                                                    allow_mixed=True)
    except ValueError:
        values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
    print(f'Execution time: {dt.datetime.now() - start}')
@kenha37 kenha37 added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Jan 3, 2023
@MarcoGorelli
Copy link
Member

thanks @kenha37 for your report

I can't reproduce, see below:

(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ pip install pandas==1.5.2
Collecting pandas==1.5.2
  Using cached pandas-1.5.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Requirement already satisfied: python-dateutil>=2.8.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.5.2) (2.8.2)
Requirement already satisfied: numpy>=1.20.3 in ./.venv/lib/python3.8/site-packages (from pandas==1.5.2) (1.23.5)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.5.2) (2022.6)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas==1.5.2) (1.16.0)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.4.4
    Uninstalling pandas-1.4.4:
      Successfully uninstalled pandas-1.4.4
Successfully installed pandas-1.5.2
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ python t.py
Starting conversion
Execution time: 0:00:00.008793
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ pip install pandas==1.4.4
Collecting pandas==1.4.4
  Using cached pandas-1.4.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Requirement already satisfied: numpy>=1.18.5 in ./.venv/lib/python3.8/site-packages (from pandas==1.4.4) (1.23.5)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.4.4) (2022.6)
Requirement already satisfied: python-dateutil>=2.8.1 in ./.venv/lib/python3.8/site-packages (from pandas==1.4.4) (2.8.2)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas==1.4.4) (1.16.0)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.5.2
    Uninstalling pandas-1.5.2:
      Successfully uninstalled pandas-1.5.2
Successfully installed pandas-1.4.4
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ python t.py
Starting conversion
Execution time: 0:00:00.008626
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ cat t.py
import pandas as pd
import datetime as dt
import pytz
import logging
from pandas._libs.tslibs import (
    conversion,
)
from pandas._libs import (
    tslib,
)

date_range = pd.date_range(dt.datetime(2021, 1, 1, tzinfo=pytz.timezone('CET')), dt.datetime(2022, 1, 1, tzinfo=pytz.timezone('CET')), freq='1H')
df = pd.DataFrame({'dtColumn': date_range})
df['dtColumn'] = pd.to_datetime(df['dtColumn'])
data = df.iloc[:, 0:].values

print('Starting conversion')
if pd.__version__ == '1.5.2':
    start = dt.datetime.now()
    result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
                                                errors='raise',
                                                utc=False,
                                                dayfirst=False,
                                                yearfirst=False,
                                                require_iso8601=True,
                                                allow_mixed=True)
    print(f'Execution time: {dt.datetime.now() - start}')

if pd.__version__ == '1.4.4':
    start = dt.datetime.now()
    try:
        result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
                                                    errors='raise',
                                                    utc=False,
                                                    dayfirst=False,
                                                    yearfirst=False,
                                                    require_iso8601=True,
                                                    allow_mixed=True)
    except ValueError:
        values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
    print(f'Execution time: {dt.datetime.now() - start}')

@MarcoGorelli MarcoGorelli added Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 4, 2023
@kenha37
Copy link
Author

kenha37 commented Jan 4, 2023

Thank you for looking at the issue.

I have done some further testing, and the issue seems to only show itself on windows systems.
I only have ubuntu and windows computers available, and the issue is consistent on windows but doesn't appear on ubuntu.

When trying to step through the array_to_datetime method in a debugger, it seems the code is rotating through the find_tzfile and load_tzdata functions.
Counting the function calls to find_tzfile, it is called 43809 times in the above code.

I have tested that this is a problem both when using virtual environments and the system interpreter.

@MarcoGorelli MarcoGorelli added Regression Functionality that used to work in a prior pandas version Windows Windows OS and removed Needs Info Clarification about behavior needed to assess issue labels Jan 4, 2023
@MarcoGorelli
Copy link
Member

that's useful, thanks

could you try on the main branch too please?

@lithomas1
Copy link
Member

Maybe related to #49048.

@kenha37
Copy link
Author

kenha37 commented Jan 4, 2023

Maybe related to #49048.

I can confirm that installing the tzdata package removes the problem.

that's useful, thanks

could you try on the main branch too please?

The main branch by itself retains the problem.

@MarcoGorelli
Copy link
Member

I tried building from source on the main branch on windows, and uninstalling tzdata, and can't reproduce:

User@DESKTOP-U8OKFP3 MINGW64 ~/pandas-dev (main)
$ pip uninstall tzdata
Found existing installation: tzdata 2022.7
Uninstalling tzdata-2022.7:
  Would remove:
    c:\users\user\pandas-dev\.venv\lib\site-packages\tzdata-2022.7.dist-info\*
    c:\users\user\pandas-dev\.venv\lib\site-packages\tzdata\*
Proceed (Y/n)? y
  Successfully uninstalled tzdata-2022.7
(.venv)
User@DESKTOP-U8OKFP3 MINGW64 ~/pandas-dev (main)
$ python t.py
Execution time: 0:00:00.011298
(.venv)
User@DESKTOP-U8OKFP3 MINGW64 ~/pandas-dev (main)
$ cat t.py
import pandas as pd
import datetime as dt
import pytz
import logging
from pandas._libs.tslibs import (
    conversion,
)
from pandas._libs import (
    tslib,
)

date_range = pd.date_range(dt.datetime(2021, 1, 1, tzinfo=pytz.timezone('CET')), dt.datetime(2022, 1, 1, tzinfo=pytz.timezone('CET')), freq='1H')
df = pd.DataFrame({'dtColumn': date_range})
df['dtColumn'] = pd.to_datetime(df['dtColumn'])
data = df.iloc[:, 0:].values

start = dt.datetime.now()
result, tz_parsed = tslib.array_to_datetime(data.ravel("K"),
                                            errors='raise',
                                            utc=False,
                                            dayfirst=False,
                                            yearfirst=False,
                                            )
print(f'Execution time: {dt.datetime.now() - start}')

How did you run on the main branch?

@kenha37
Copy link
Author

kenha37 commented Jan 5, 2023

I installed it in a pipenv using "pipenv install PATH/TO/LOCAL/PANDAS-GIT".

Here I also add and remove tzdata from the pipenv to reproduce the effect of having tzdata in the environment.

I have tested this on a couple of windows 11 pc's, most running Version 10.0.22621 Build 22621.

@MarcoGorelli
Copy link
Member

thanks for the info

no idea how to debug this TBH if I can't reproduce it, sorry - hopefully someone else can help

@kenha37 kenha37 closed this as completed Jan 6, 2023
@kenha37 kenha37 reopened this Jan 6, 2023
@kenha37
Copy link
Author

kenha37 commented Jan 6, 2023

I have installed pandas from source again just to be sure the install was successful, and I am still seeing the same problem when the tzdata package isn't installed.

I am not sure how the timezone information normally would be installed on windows systems, but if it is installed externally from the python environment on your system, you probably wont see the issue.

There are some suggested solutions in #49048, but I wouldn't know where to start to help with a solution.

@jbrockmendel
Copy link
Member

The solution is either

a) require tzdata
b) edit is_utz_zoneinfo as suggested #49048 (comment)

@tkarni
Copy link

tkarni commented Feb 4, 2023

Not sure if that's the same core issue, however, here's a simple reproduction script of a performance issue introduced in Pandas 1.5.0, tested on Windows 10:
https://stackoverflow.com/questions/74765179/python-pandas-1-5-x-takes-too-long-to-retrieve-dict-values-when-the-values-are
This is too resolved by merely installing tzdata.

@jbrockmendel
Copy link
Member

@MarcoGorelli is this fixed by #51247?

@MarcoGorelli
Copy link
Member

thanks for the ping - yes, that's right! let's close then

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Windows Windows OS
Projects
None yet
Development

No branches or pull requests

5 participants