
BUG: to_datetime very slow with unsigned ints for unix seconds #42606


Closed
1 of 3 tasks
cdeil opened this issue Jul 19, 2021 · 6 comments · Fixed by #43268
Labels: Datetime (Datetime data dtype), Performance (Memory or execution speed performance)

Comments

@cdeil
Contributor

cdeil commented Jul 19, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


I had a data pipeline that was terribly slow. It turned out that all the time was spent in pd.to_datetime calls with Unix-second integers as input, because I was passing unsigned ints. With signed ints it's roughly 250-500x faster (see the timings below).

Can anyone reproduce this performance issue? And is it possible to improve on this gotcha, e.g. by forcing a typecast on input, or some other way that doesn't require a copy and extra memory? (A user-side workaround sketch follows the timings below.)

In [10]: %timeit index = pd.to_datetime(np.arange(1_000_000), unit="s", utc=True)
7.83 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [23]: %time index = pd.to_datetime(np.arange(1_000_000).astype("uint32"), unit="s", utc=True)
CPU times: user 4.03 s, sys: 12.5 ms, total: 4.05 s
Wall time: 4.04 s

In [24]: %time index = pd.to_datetime(np.arange(1_000_000).astype("uint64"), unit="s", utc=True)
CPU times: user 8.42 s, sys: 12.9 ms, total: 8.44 s
Wall time: 8.45 s

In [25]: %time index = pd.to_datetime(np.arange(1_000_000).astype("int64"), unit="s", utc=True)
CPU times: user 13.1 ms, sys: 4.05 ms, total: 17.2 ms
Wall time: 15.9 ms

In [26]: %time index = pd.to_datetime(np.arange(1_000_000).astype("int32"), unit="s", utc=True)
CPU times: user 13.4 ms, sys: 4.54 ms, total: 18 ms
Wall time: 16.5 ms
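
As a user-side stopgap (a sketch only, not a pandas-level fix), casting the unsigned input to int64 before calling to_datetime takes the fast path; Unix-second values always fit in int64, so the cast is lossless:

import numpy as np
import pandas as pd

secs_u32 = np.arange(1_000_000, dtype="uint32")

# uint32 -> int64 changes the itemsize, so astype makes one copy
index32 = pd.to_datetime(secs_u32.astype("int64"), unit="s", utc=True)

# A uint64 array whose values are all below 2**63 can instead be
# reinterpreted in place with .view("int64"), which avoids the copy
secs_u64 = np.arange(1_000_000, dtype="uint64")
index64 = pd.to_datetime(secs_u64.view("int64"), unit="s", utc=True)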

Output of pd.show_versions()

In [11]: pd.show_versions()

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.10.final.0
python-bits : 64
OS : Darwin
OS-release : 20.5.0
Version : Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.2.4
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 49.6.0.post20210108
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : 7.24.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.05.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : 1.4.18
tables : None
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 1.2.0
xlwt : None
numba : 0.53.1

@cdeil cdeil added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Jul 19, 2021
@cdeil cdeil changed the title from "BUG:" to "BUG: to_datetime very slow with big-endian ints for unix seconds" Jul 19, 2021
@cdeil cdeil changed the title from "BUG: to_datetime very slow with big-endian ints for unix seconds" to "BUG: to_datetime very slow with unsigned ints for unix seconds" Jul 19, 2021
@jreback
Contributor

jreback commented Jul 19, 2021

this is fixed for 1.3

@cdeil
Contributor Author

cdeil commented Jul 19, 2021

I still see this with pandas 1.3.0 and numpy 1.20.3.

(hpred) cdeil@Christophs-MacBook-Pro hpred % pip install -U pandas
Requirement already satisfied: pandas in /Users/cdeil/opt/anaconda3/envs/hpred/lib/python3.8/site-packages (1.2.4)
Collecting pandas
  Downloading pandas-1.3.0-cp38-cp38-macosx_10_9_x86_64.whl (11.4 MB)
     |████████████████████████████████| 11.4 MB 6.5 MB/s 
Requirement already satisfied: python-dateutil>=2.7.3 in /Users/cdeil/opt/anaconda3/envs/hpred/lib/python3.8/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in /Users/cdeil/opt/anaconda3/envs/hpred/lib/python3.8/site-packages (from pandas) (2021.1)
Requirement already satisfied: numpy>=1.17.3 in /Users/cdeil/opt/anaconda3/envs/hpred/lib/python3.8/site-packages (from pandas) (1.20.3)
Requirement already satisfied: six>=1.5 in /Users/cdeil/opt/anaconda3/envs/hpred/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.2.4
    Uninstalling pandas-1.2.4:
      Successfully uninstalled pandas-1.2.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdpbox 0.2.1 requires sklearn, which is not installed.
dtale 1.49.0 requires flask-ngrok; python_version > "3.0", which is not installed.
dtale 1.49.0 requires kaleido; python_version > "3.6", which is not installed.
pdpbox 0.2.1 requires matplotlib==3.1.1, but you have matplotlib 3.4.2 which is incompatible.
Successfully installed pandas-1.3.0
(hpred) cdeil@Christophs-MacBook-Pro hpred % ipython
Python 3.8.10 | packaged by conda-forge | (default, May 10 2021, 22:58:09) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: %time index = pd.to_datetime(np.arange(1_000_000).astype("uint32"), unit="s", utc=True)
CPU times: user 4.02 s, sys: 7.98 ms, total: 4.03 s
Wall time: 4.03 s

In [4]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.8.10.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.5.0
Version          : Darwin Kernel Version 20.5.0: Sat May  8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.3.0
numpy            : 1.20.3
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.1.2
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : 6.2.4
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 3.0.1
IPython          : 7.24.1
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 2021.05.0
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : None
odfpy            : None
openpyxl         : 3.0.7
pandas_gbq       : None
pyarrow          : 4.0.1
pyxlsb           : None
s3fs             : None
scipy            : 1.6.3
sqlalchemy       : 1.4.18
tables           : None
tabulate         : 0.8.9
xarray           : 0.18.2
xlrd             : 1.2.0
xlwt             : None
numba            : 0.53.1

@cdeil
Contributor Author

cdeil commented Jul 19, 2021

After changing my pipeline to use signed integers, the bottleneck becomes joining ~100 Series, each with on the order of a million data points and a sorted datetime index, as in this example:

import numpy as np
import pandas as pd

n_rows = 1_000_000
n_cols = 100

# ~100 Series with sorted datetime indexes, shifted by one second
# per column so they only partially overlap
series = {}
for i in range(n_cols):
    data = i * np.arange(n_rows)
    index = pd.to_datetime(np.arange(n_rows) + i, unit="s", utc=True)
    series[f"col{i:03d}"] = pd.Series(data, index)

# align all Series on the union of their indexes
df = pd.DataFrame(series)

Is calling pd.DataFrame(series) OK, or is there a better way?

In my real code with IoT sensor time series, the indexes are integer Unix seconds and sorted, but slightly different for each series, so I need some kind of join / concat (see the concat-based sketch below).

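For comparison, here is a concat-based sketch; whether it is actually faster than pd.DataFrame(series) for this shape of data is an open question and would need benchmarking:

import numpy as np
import pandas as pd

n_rows, n_cols = 1_000_000, 100
series = {
    f"col{i:03d}": pd.Series(
        i * np.arange(n_rows),
        index=pd.to_datetime(np.arange(n_rows) + i, unit="s", utc=True),
    )
    for i in range(n_cols)
}

# Concatenating along the columns outer-joins on the union of the indexes,
# so it should produce the same aligned frame as pd.DataFrame(series)
df = pd.concat(series, axis=1)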

@realead
Contributor

realead commented Jul 19, 2021

Profiling with

au = np.arange(1_000_000).astype("uint64")
%prun  -s cumulative pd.to_datetime(au, unit="s", utc=True)

gives:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   11.079   11.079 {built-in method builtins.exec}
        1    0.000    0.000   11.079   11.079 <string>:1(<module>)
        1    0.000    0.000   11.079   11.079 datetimes.py:605(to_datetime)
        1    0.000    0.000   11.077   11.077 datetimes.py:259(_convert_listlike_datetimes)
        1   11.077   11.077   11.077   11.077 {pandas._libs.tslib.array_with_unit_to_datetime}
        1    0.000    0.000    0.001    0.001 datetimes.py:135(_maybe_cache)

The whole time is spent in pandas._libs.tslib.array_with_unit_to_datetime.

@realead
Contributor

realead commented Jul 20, 2021

Here is a line profiling of array_with_unit_to_datetime (which, like any line profiling, should be taken with a grain of salt):

Function: array_with_unit_to_datetime at line 189

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   189                                           def array_with_unit_to_datetime(
   ...
   288         2          0.0      0.0      0.0      try:
   289         2          0.0      0.0      0.0          for i in range(n):
   290   2000000    1364135.0      0.7      4.0              val = values[i]
   291                                           
   292   2000000     989321.0      0.5      2.9              if checknull_with_nat_and_na(val):
   293                                                           iresult[i] = NPY_NAT
   294                                           
   295   2000000    2394646.0      1.2      6.9              elif is_integer_object(val) or is_float_object(val):
   296                                           
   297   2000000    6870335.0      3.4     19.9                  if val != val or val == NPY_NAT:
   298                                                               iresult[i] = NPY_NAT
   299                                                           else:
   300   2000000     893807.0      0.4      2.6                      try:
   301   2000000   21944205.0     11.0     63.7                          iresult[i] = cast_from_unit(val, unit)

Note: the check if val != val or val == NPY_NAT: is responsible for about 20% of the time and is probably not needed, because checknull_with_nat_and_na already covers these cases.

However, compared to the fast int64 case, the real issue is this check:

if is_raise:
    # try a quick conversion to i8/f8
    # if we have nulls that are not type-compat
    # then need to iterate
    if values.dtype.kind == "i" or values.dtype.kind == "f":

because for values.dtype.kind == "u" the fast branch is not taken.
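
A quick check of dtype.kind shows why unsigned arrays miss that branch:

import numpy as np

np.arange(3, dtype="int64").dtype.kind   # 'i' -> quick-conversion path taken
np.arange(3, dtype="uint64").dtype.kind  # 'u' -> not "i" or "f", falls back to the slow per-element loop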

@realead
Contributor

realead commented Jul 20, 2021

A similar issue was fixed by #35027 for floats. The question is whether simply extending the check to

if values.dtype.kind == "i" or values.dtype.kind == "f" or values.dtype.kind == "u":

will do the trick? @arw2019
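
For illustration, the change in context would look roughly like this (a sketch against the snippet quoted above, not the final patch):

if is_raise:
    # try a quick conversion to i8/f8
    # if we have nulls that are not type-compat
    # then need to iterate
    if (
        values.dtype.kind == "i"
        or values.dtype.kind == "f"
        or values.dtype.kind == "u"  # proposed: let unsigned ints take the fast path too
    ):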

@realead realead added the Performance (Memory or execution speed performance) label Jul 20, 2021
@simonjayhawkins simonjayhawkins added the Datetime (Datetime data dtype) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Aug 2, 2021
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Aug 2, 2021
@mroeschke mroeschke removed the Bug label Aug 21, 2021