PERF: timezoned series created 6x faster than non-timezoned series #56860

Open
2 of 3 tasks
soerenwolfers opened this issue Jan 13, 2024 · 4 comments
Labels
Constructors (Series/DataFrame/Index/pd.array Constructors), Performance (Memory or execution speed performance), Timestamp (pd.Timestamp and associated methods), Timezones (Timezone data dtype)

Comments

@soerenwolfers

soerenwolfers commented Jan 13, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

In a Jupyter notebook:

import pandas as pd
import numpy as np
n = 1_000_000
%time a = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz=None), n))
%time b = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz='UTC'), n))
CPU times: user 1.88 s, sys: 11.9 ms, total: 1.89 s
Wall time: 1.88 s
CPU times: user 315 ms, sys: 3.99 ms, total: 319 ms
Wall time: 319 ms
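
Outside a notebook, roughly the same comparison can be run as a plain script. This is a minimal sketch and not part of the original report: the helper name `time_construction` is hypothetical, it uses `time.perf_counter` in place of `%time`, and it times only the `pd.Series(...)` call (not `np.repeat`), so the numbers will not match the output above exactly.

```python
import time

import numpy as np
import pandas as pd


def time_construction(tz, n=1_000_000):
    """Build a Series of n identical Timestamps; return (seconds, series).

    Hypothetical helper for illustration only.
    """
    # np.repeat on a Timestamp scalar yields an object-dtype array,
    # so pd.Series has to infer the datetime dtype from the objects.
    values = np.repeat(pd.Timestamp("2019-01-01 00:00:00", tz=tz), n)
    start = time.perf_counter()
    series = pd.Series(values)
    elapsed = time.perf_counter() - start
    return elapsed, series


if __name__ == "__main__":
    for tz in (None, "UTC"):
        elapsed, _ = time_construction(tz)
        print(f"tz={tz!r}: {elapsed:.3f} s")
```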

Installed Versions

INSTALLED VERSIONS

commit : a671b5a
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-76-generic
Version : #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_GB.utf8
LANG : C.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.1.4
numpy : 1.24.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : 0.57.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.21.0
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None

Prior Performance

No response

@soerenwolfers added the Needs Triage and Performance labels Jan 13, 2024
@soerenwolfers changed the title from "PERF: timezoned series created 10x faster than non-timezoned series" to "PERF: timezoned series created 6x faster than non-timezoned series" Jan 13, 2024
@soerenwolfers reopened this Jan 13, 2024
@rhshadrach
Member

Thanks for the report, confirmed on main. Almost all the time is being spent in Series construction. Further investigations are welcome.

@rhshadrach added the Timezones, Constructors, and Timestamp labels and removed the Needs Triage label Jan 14, 2024
@jrmylow
Contributor

jrmylow commented Jan 16, 2024

I've started looking into this. Posting a summary note before I pick it up again later:

  • Using cProfile, the big difference shows up in the call to objects_to_datetime64:
    def objects_to_datetime64(
  • This then calls into Cython code, which cProfile does not pick up:
    cpdef array_to_datetime(
  • The key difference between the two cases appears to happen in the next call, which is in conversion.pyx:
    cdef int64_t parse_pydatetime(
  • Specifically, because case b's values have tzinfo set, it calls convert_datetime_to_tsobject, while case a calls (<_Timestamp>val)._as_creso(creso, round_ok=True)._value

At this stage I still need to find a way to confirm whether there is a significant performance difference between these two branches, and whether there is a reasonable fix.
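
The cProfile step described above can be reproduced along these lines. This is a sketch under assumptions: the helper name `profile_construction` is hypothetical, and the function names that appear in the report depend on the installed pandas version. Per the note above, the Cython internals are not broken down in this output, so it only narrows the problem to the Python-level caller.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_construction(tz, n=100_000, top=10):
    """Profile pd.Series construction and return the pstats report as text.

    Hypothetical helper for illustration only.
    """
    values = np.repeat(pd.Timestamp("2019-01-01 00:00:00", tz=tz), n)
    profiler = cProfile.Profile()
    profiler.enable()
    pd.Series(values)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the outer Python-level callers come first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    for tz in (None, "UTC"):
        print(f"--- tz={tz!r} ---")
        print(profile_construction(tz))
```

Comparing the two reports side by side should show where the cumulative time diverges between case a and case b.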

@jrmylow
Contributor

jrmylow commented Jan 16, 2024

If I had to guess, it would seem that the _TSObject created in case b is lighter/faster than the _Timestamp created in case a?

@mroeschke
Member

The difference is in the resolution conversion to ns:

In [13]: %time a = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz=None), n))
CPU times: user 809 ms, sys: 6.04 ms, total: 815 ms
Wall time: 813 ms

In [14]: %time a = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz=None).as_unit("ns"), n))
CPU times: user 344 ms, sys: 3.31 ms, total: 347 ms
Wall time: 347 ms

AFAICT, the non-tz case actually does a resolution conversion from "s" to "ns" while the tz case just creates a new object "replacing" the "s" resolution with "ns".

Technically, both should return "s" resolution here.
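
To check the resolutions involved on a given installation, the units can be inspected directly. This is a small sketch assuming pandas 2.0 or later, where `Timestamp.unit` and `Timestamp.as_unit` are available. The unit parsed from the string and the dtype the Series infers may differ between pandas versions, so they are printed rather than assumed.

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp("2019-01-01 00:00:00")
ts_ns = ts.as_unit("ns")  # explicit conversion to nanosecond resolution

# The unit parsed from the string depends on the pandas version,
# so it is printed rather than assumed.
print("parsed unit:   ", ts.unit)
print("converted unit:", ts_ns.unit)

n = 1_000
a = pd.Series(np.repeat(ts, n))
b = pd.Series(np.repeat(ts_ns, n))

# Both routes should represent the same instants, whatever dtype is inferred.
print("dtypes:", a.dtype, b.dtype)
print("equal: ", bool((a == b).all()))
```

Converting the scalar once with `as_unit("ns")` before repeating it is the workaround timed in In [14] above. It moves the resolution conversion out of the per-element loop.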
