
BUG: to_datetime very slow with unsigned ints for unix seconds #42606


Closed
1 of 3 tasks
cdeil opened this issue Jul 19, 2021 · 6 comments · Fixed by #43268
Labels: Datetime (Datetime data dtype), Performance (Memory or execution speed performance)

Comments

@cdeil
Contributor

cdeil commented Jul 19, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


I had a data pipeline that was terribly slow. It turned out that all the time was spent in pd.to_datetime calls with Unix-second integers as input, because I was passing unsigned ints. With signed ints it's roughly 250-500x faster (see the timings below).

Can anyone reproduce this performance issue? And is it possible to improve on this gotcha, e.g. by forcing a typecast on input, or some other way that doesn't require a copy and extra memory? (A user-side workaround sketch follows the timings below.)

In [10]: %timeit index = pd.to_datetime(np.arange(1_000_000), unit="s", utc=True)
7.83 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [23]: %time index = pd.to_datetime(np.arange(1_000_000).astype("uint32"), unit="s", utc=True)
CPU times: user 4.03 s, sys: 12.5 ms, total: 4.05 s
Wall time: 4.04 s

In [24]: %time index = pd.to_datetime(np.arange(1_000_000).astype("uint64"), unit="s", utc=True)
CPU times: user 8.42 s, sys: 12.9 ms, total: 8.44 s
Wall time: 8.45 s

In [25]: %time index = pd.to_datetime(np.arange(1_000_000).astype("int64"), unit="s", utc=True)
CPU times: user 13.1 ms, sys: 4.05 ms, total: 17.2 ms
Wall time: 15.9 ms

In [26]: %time index = pd.to_datetime(np.arange(1_000_000).astype("int32"), unit="s", utc=True)
CPU times: user 13.4 ms, sys: 4.54 ms, total: 18 ms
Wall time: 16.5 ms
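
As a user-side stopgap (a sketch only, not a pandas-level fix), casting the unsigned input to int64 before calling to_datetime takes the fast path; Unix-second values always fit in int64, so the cast is lossless:

import numpy as np
import pandas as pd

secs_u32 = np.arange(1_000_000, dtype="uint32")

# uint32 -> int64 changes the itemsize, so astype makes one copy
index32 = pd.to_datetime(secs_u32.astype("int64"), unit="s", utc=True)

# A uint64 array whose values are all below 2**63 can instead be
# reinterpreted in place with .view("int64"), which avoids the copy
secs_u64 = np.arange(1_000_000, dtype="uint64")
index64 = pd.to_datetime(secs_u64.view("int64"), unit="s", utc=True)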

Output of pd.show_versions()

In [11]: pd.show_versions()

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.10.final.0
python-bits : 64
OS : Darwin
OS-release : 20.5.0
Version : Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.2.4
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 49.6.0.post20210108
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : 7.24.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.05.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : 1.4.18
tables : None
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 1.2.0
xlwt : None
numba : 0.53.1

@cdeil cdeil added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Jul 19, 2021
@cdeil cdeil changed the title from "BUG:" to "BUG: to_datetime very slow with big-endian ints for unix seconds" Jul 19, 2021
@cdeil cdeil changed the title from "BUG: to_datetime very slow with big-endian ints for unix seconds" to "BUG: to_datetime very slow with unsigned ints for unix seconds" Jul 19, 2021
@jreback
Contributor

jreback commented Jul 19, 2021

this is fixed for 1.3

@cdeil
Contributor Author

cdeil commented Jul 19, 2021

I still see this with pandas 1.3.0 and numpy 1.20.3.

(hpred) cdeil@Christophs-MacBook-Pro hpred % pip install -U pandas
Requirement already satisfied: pandas in /Users/cdeil/opt/anaconda3/envs/hpred/lib/python3.8/site-packages (1.2.4)
Collecting pandas
  Downloading pandas-1.3.0-cp38-cp38-macosx_10_9_x86_64.whl (11.4 MB)
     |████████████████████████████████| 11.4 MB 6.5 MB/s 
Requirement already satisfied: python-dateutil>=2.7.3 in /Users/cdeil/opt/anaconda3/envs/hpred/lib/python3.8/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in /Users/cdeil/opt/anaconda3/envs/hpred/lib/python3.8/site-packages (from pandas) (2021.1)
Requirement already satisfied: numpy>=1.17.3 in /Users/cdeil/opt/anaconda3/envs/hpred/lib/python3.8/site-packages (from pandas) (1.20.3)
Requirement already satisfied: six>=1.5 in /Users/cdeil/opt/anaconda3/envs/hpred/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.2.4
    Uninstalling pandas-1.2.4:
      Successfully uninstalled pandas-1.2.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdpbox 0.2.1 requires sklearn, which is not installed.
dtale 1.49.0 requires flask-ngrok; python_version > "3.0", which is not installed.
dtale 1.49.0 requires kaleido; python_version > "3.6", which is not installed.
pdpbox 0.2.1 requires matplotlib==3.1.1, but you have matplotlib 3.4.2 which is incompatible.
Successfully installed pandas-1.3.0
(hpred) cdeil@Christophs-MacBook-Pro hpred % ipython
Python 3.8.10 | packaged by conda-forge | (default, May 10 2021, 22:58:09) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: %time index = pd.to_datetime(np.arange(1_000_000).astype("uint32"), unit="s", utc=True)
CPU times: user 4.02 s, sys: 7.98 ms, total: 4.03 s
Wall time: 4.03 s

In [4]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.8.10.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.5.0
Version          : Darwin Kernel Version 20.5.0: Sat May  8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.3.0
numpy            : 1.20.3
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.1.2
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : 6.2.4
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 3.0.1
IPython          : 7.24.1
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 2021.05.0
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : None
odfpy            : None
openpyxl         : 3.0.7
pandas_gbq       : None
pyarrow          : 4.0.1
pyxlsb           : None
s3fs             : None
scipy            : 1.6.3
sqlalchemy       : 1.4.18
tables           : None
tabulate         : 0.8.9
xarray           : 0.18.2
xlrd             : 1.2.0
xlwt             : None
numba            : 0.53.1

@cdeil
Contributor Author

cdeil commented Jul 19, 2021

After changing my pipeline to use signed integers, the bottleneck becomes joining ~100 Series, each with on the order of a million data points and a sorted datetime index, as in this example:

import numpy as np
import pandas as pd

n_rows = 1_000_000
n_cols = 100

# ~100 Series with sorted datetime indexes, shifted by one second
# per column so they only partially overlap
series = {}
for i in range(n_cols):
    data = i * np.arange(n_rows)
    index = pd.to_datetime(np.arange(n_rows) + i, unit="s", utc=True)
    series[f"col{i:03d}"] = pd.Series(data, index)

# align all Series on the union of their indexes
df = pd.DataFrame(series)

Is calling pd.DataFrame(series) OK, or is there a better way?

In my real code with IoT sensor time series, the indexes are integer Unix seconds and sorted, but slightly different for each series, so I need some kind of join / concat (see the concat-based sketch below).

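For comparison, here is a concat-based sketch; whether it is actually faster than pd.DataFrame(series) for this shape of data is an open question and would need benchmarking:

import numpy as np
import pandas as pd

n_rows, n_cols = 1_000_000, 100
series = {
    f"col{i:03d}": pd.Series(
        i * np.arange(n_rows),
        index=pd.to_datetime(np.arange(n_rows) + i, unit="s", utc=True),
    )
    for i in range(n_cols)
}

# Concatenating along the columns outer-joins on the union of the indexes,
# so it should produce the same aligned frame as pd.DataFrame(series)
df = pd.concat(series, axis=1)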

@realead
Contributor

realead commented Jul 19, 2021

Profiling with

au = np.arange(1_000_000).astype("uint64")
%prun  -s cumulative pd.to_datetime(au, unit="s", utc=True)

gives:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   11.079   11.079 {built-in method builtins.exec}
        1    0.000    0.000   11.079   11.079 <string>:1(<module>)
        1    0.000    0.000   11.079   11.079 datetimes.py:605(to_datetime)
        1    0.000    0.000   11.077   11.077 datetimes.py:259(_convert_listlike_datetimes)
        1   11.077   11.077   11.077   11.077 {pandas._libs.tslib.array_with_unit_to_datetime}
        1    0.000    0.000    0.001    0.001 datetimes.py:135(_maybe_cache)

The whole time is spent in pandas._libs.tslib.array_with_unit_to_datetime.

@realead
Contributor

realead commented Jul 20, 2021

Here is a line profiling of array_with_unit_to_datetime (which, like any line profiling, should be taken with a grain of salt):

Function: array_with_unit_to_datetime at line 189

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   189                                           def array_with_unit_to_datetime(
   ...
   288         2          0.0      0.0      0.0      try:
   289         2          0.0      0.0      0.0          for i in range(n):
   290   2000000    1364135.0      0.7      4.0              val = values[i]
   291                                           
   292   2000000     989321.0      0.5      2.9              if checknull_with_nat_and_na(val):
   293                                                           iresult[i] = NPY_NAT
   294                                           
   295   2000000    2394646.0      1.2      6.9              elif is_integer_object(val) or is_float_object(val):
   296                                           
   297   2000000    6870335.0      3.4     19.9                  if val != val or val == NPY_NAT:
   298                                                               iresult[i] = NPY_NAT
   299                                                           else:
   300   2000000     893807.0      0.4      2.6                      try:
   301   2000000   21944205.0     11.0     63.7                          iresult[i] = cast_from_unit(val, unit)

Note: the check if val != val or val == NPY_NAT: is responsible for about 20% of the time and is probably not needed, because checknull_with_nat_and_na already covers these cases.

However, compared to the fast int64 case, the real issue is this check:

if is_raise:
    # try a quick conversion to i8/f8
    # if we have nulls that are not type-compat
    # then need to iterate
    if values.dtype.kind == "i" or values.dtype.kind == "f":

because for values.dtype.kind == "u" the fast branch is not taken.
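
A quick check of dtype.kind shows why unsigned arrays miss that branch:

import numpy as np

np.arange(3, dtype="int64").dtype.kind   # 'i' -> quick-conversion path taken
np.arange(3, dtype="uint64").dtype.kind  # 'u' -> not "i" or "f", falls back to the slow per-element loop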

@realead
Contributor

realead commented Jul 20, 2021

A similar issue was fixed by #35027 for floats. The question is whether simply extending the check to

if values.dtype.kind == "i" or values.dtype.kind == "f" or values.dtype.kind == "u":

will do the trick? @arw2019
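
For illustration, the change in context would look roughly like this (a sketch against the snippet quoted above, not the final patch):

if is_raise:
    # try a quick conversion to i8/f8
    # if we have nulls that are not type-compat
    # then need to iterate
    if (
        values.dtype.kind == "i"
        or values.dtype.kind == "f"
        or values.dtype.kind == "u"  # proposed: let unsigned ints take the fast path too
    ):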

@realead realead added the Performance (Memory or execution speed performance) label Jul 20, 2021
@simonjayhawkins simonjayhawkins added the Datetime (Datetime data dtype) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Aug 2, 2021
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Aug 2, 2021
@mroeschke mroeschke removed the Bug label Aug 21, 2021