PERF: timezoned series created 6x faster than non-timezoned series #56860

soerenwolfers · 2024-01-13T13:04:01Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

In a Jupyter notebook:

import pandas as pd
import numpy as np
n = 1_000_000
%time a = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz=None), n))
%time b = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz='UTC'), n))

CPU times: user 1.88 s, sys: 11.9 ms, total: 1.89 s
Wall time: 1.88 s
CPU times: user 315 ms, sys: 3.99 ms, total: 319 ms
Wall time: 319 ms
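
For anyone reproducing this outside Jupyter, the comparison above can be sketched as a plain script. This is a minimal sketch, not part of the original report: the helper name `time_series_construction` is hypothetical, `n` is reduced to keep the run short, and only the `pd.Series(...)` call is timed (the `%time` lines above also include the `np.repeat` call). Absolute numbers will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def time_series_construction(tz, n=100_000):
    """Hypothetical helper: time pd.Series() on n copies of one Timestamp."""
    ts = pd.Timestamp("2019-01-01 00:00:00", tz=tz)
    values = np.repeat(ts, n)  # object-dtype array of n references to ts
    start = time.perf_counter()
    series = pd.Series(values)  # only this call is timed
    elapsed = time.perf_counter() - start
    return elapsed, series


naive_seconds, naive = time_series_construction(tz=None)
utc_seconds, aware = time_series_construction(tz="UTC")
print(f"tz=None:  {naive_seconds:.3f} s")
print(f"tz='UTC': {utc_seconds:.3f} s")
```
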

Installed Versions

INSTALLED VERSIONS

commit : a671b5a
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-76-generic
Version : #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_GB.utf8
LANG : C.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.1.4
numpy : 1.24.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : 0.57.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.21.0
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None

Prior Performance

No response

rhshadrach · 2024-01-14T13:39:13Z

Thanks for the report, confirmed on main. Almost all the time is being spent in Series construction. Further investigations are welcome.

jrmylow · 2024-01-16T12:26:34Z

I've started poking into this, posting a summary note before I pick this up later:

Using cProfile, the big difference shows up in the call to objects_to_datetime64 here:

pandas/pandas/core/arrays/datetimes.py

Line 2368 in e379692

def objects_to_datetime64(
This then calls out to Cython code, so the next step doesn't get picked up by cProfile:

pandas/pandas/_libs/tslib.pyx

Line 414 in e379692

cpdef array_to_datetime(
The key difference in the two cases appears to happen in the next call, which is conversion.pyx.

pandas/pandas/_libs/tslibs/conversion.pyx

Line 773 in e379692

cdef int64_t parse_pydatetime(
Specifically, because case b's values carry a tzinfo attribute, it calls convert_datetime_to_tsobject, while case a calls (<_Timestamp>val)._as_creso(creso, round_ok=True)._value.

At this stage I probably still need to find a way to confirm if there is a significant performance difference between these branches, and whether there is a reasonable fix.
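
The cProfile step described above could be reproduced roughly as follows. This is only a sketch under assumptions: the helper name `profile_series_construction` is hypothetical, the array size is arbitrary, and which pandas-internal functions appear in the report depends on the installed version. As noted above, time spent inside Cython code is attributed to the calling Python function rather than broken out.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_series_construction(values, top=10):
    """Hypothetical helper: profile pd.Series(values) and return the report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    pd.Series(values)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the expensive call chain is listed first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


naive_values = np.repeat(pd.Timestamp("2019-01-01 00:00:00"), 100_000)
report = profile_series_construction(naive_values)
print(report)
```

Running the same helper on the tz='UTC' array and comparing the two reports side by side is one way to see where the cumulative time diverges.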

jrmylow · 2024-01-16T12:32:06Z

If I had to guess - it would seem that the _TSObject that is created in case b is lighter/faster than the _Timestamp created in case a?

mroeschke · 2024-01-30T17:12:30Z

The difference is in the resolution conversion to ns:

In [13]: %time a = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz=None), n))
CPU times: user 809 ms, sys: 6.04 ms, total: 815 ms
Wall time: 813 ms

In [14]: %time a = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz=None).as_unit("ns"), n))
CPU times: user 344 ms, sys: 3.31 ms, total: 347 ms
Wall time: 347 ms

AFAICT, the non-tz case actually does a resolution conversion from "s" to "ns" while the tz case just creates a new object "replacing" the "s" resolution with "ns".

Technically, both should return "s" resolution here.
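
To check the resolution point above, the unit of the scalar can be inspected before the array is built. A minimal sketch, assuming pandas 2.0 or later (where `Timestamp.unit` and `Timestamp.as_unit` exist); the unit inferred from the string may differ between versions, so it is printed rather than assumed.

```python
import pandas as pd

ts = pd.Timestamp("2019-01-01 00:00:00")
print(ts.unit)  # resolution inferred when parsing the string

# Convert the scalar once, up front, as in the In [14] timing above,
# instead of leaving the conversion to Series construction.
ts_ns = ts.as_unit("ns")
print(ts_ns.unit)
```
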

soerenwolfers added the "Needs Triage" and "Performance" labels Jan 13, 2024

soerenwolfers changed the title ~~PERF: timezoned series created 10x faster than non-timezoned series~~ PERF: timezoned series created 6x faster than non-timezoned series Jan 13, 2024

soerenwolfers mentioned this issue Jan 13, 2024

DATE cast: "read_parquet, then cast" is orders of magnitude faster than "cast from read_parquet" on timezoned column duckdb/duckdb#10221

Closed

1 task

soerenwolfers closed this as completed Jan 13, 2024

soerenwolfers reopened this Jan 13, 2024

rhshadrach added the "Timezones", "Constructors" and "Timestamp" labels and removed the "Needs Triage" label Jan 14, 2024

jrmylow mentioned this issue Jan 30, 2024

PERF: Performance Discrepancy in MultiIndex Creation with and without Timezones #57147

Closed

3 tasks
