Skip to content

PERF: to_datime fastpath for %Y%m%d is slower #17410

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Sep 1, 2017 · 9 comments · Fixed by #50242
Closed

PERF: to_datime fastpath for %Y%m%d is slower #17410

jorisvandenbossche opened this issue Sep 1, 2017 · 9 comments · Fixed by #50242
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance

Comments

@jorisvandenbossche
Copy link
Member

We have a check for whether format == '%Y%m%d', but this actually seems to be slower:

In [86]: s = pd.Series(['20120101']*1000000)

In [87]: %timeit pd.to_datetime(s)
229 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [88]: %timeit pd.to_datetime(s, format='%Y%m%d')
749 ms ± 67.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@jreback
Copy link
Contributor

jreback commented Sep 1, 2017

something changed recently? with this; the path now excepts every time (inside the _attemptYYMMDD). this should be much faster.

@jreback jreback added Difficulty Intermediate Performance Memory or execution speed performance Datetime Datetime data dtype labels Sep 1, 2017
@jreback jreback added this to the Next Major Release milestone Sep 1, 2017
@pratapvardhan
Copy link
Contributor

pratapvardhan commented Sep 1, 2017

I suspect, for format='%Y%m%d' it goes through _attempt_YYYYMMDD in pandas.core.tools.datetimes and is reconverting the type from object to int to object again and then having lib.try_parse_year_month_day prior to tslib.array_to_datetime seems taking up time. Whereas, without format, it directly uses tslib.array_to_datetime

@chris-b1
Copy link
Contributor

chris-b1 commented Sep 1, 2017

One change, though not especially recent, is that the iso 8601 path will now handle '%Y%m%d', so it will be much faster than it once was. Still seems like it would be possible to make _attempt_YYYYMMDD faster, not clear to me why we're casting back and forth to objects.

@mroeschke
Copy link
Member

I looked into this briefly in https://github.com/mroeschke/pandas/tree/non_object_parsing and was able to get it down ~2x from master by getting rid of casting to object, but it's still slower than not providing a format:

In [13]: pd.__version__
Out[13]: '0.24.0.dev0+890.gd57115285.dirty'

In [14]: s = pd.Series(['20120101']*1000000)

In [15]: %timeit pd.to_datetime(s)
135 ms ± 852 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Branch
In [16]: %timeit pd.to_datetime(s, format='%Y%m%d')
240 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# master
In [3]: %timeit pd.to_datetime(s, format='%Y%m%d')
531 ms ± 34.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Given that this branch can provide a slowdown, it might be worth removing this path for now.

@jreback
Copy link
Contributor

jreback commented Nov 4, 2018

this used to be way faster than naive parsing
something changed -

@jreback
Copy link
Contributor

jreback commented Nov 4, 2018

This patch makes the speed almost the same. But since its just duplicating a lot of paths, maybe just easier to simplify this (e.g. if format is %Y%m%d then just remove the format and parse).

I think this got way slower because of floating point math was introduced (e.g. / rather than //)

diff --git a/pandas/_libs/tslibs/conversion.pyx b/pandas/_libs/tslibs/conversion.pyx
index 8cf42bf93..2f303e1f2 100644
--- a/pandas/_libs/tslibs/conversion.pyx
+++ b/pandas/_libs/tslibs/conversion.pyx
@@ -481,6 +481,34 @@ cdef _TSObject convert_str_to_tsobject(object ts, object tz, object unit,
     return convert_to_tsobject(ts, tz, unit, dayfirst, yearfirst)
 
 
+def try_parse_year_month_day(int64_t[:] years, int64_t[:] months,
+                             int64_t[:] days):
+    cdef:
+        Py_ssize_t i, n
+        int64_t[:] iresult
+        npy_datetimestruct dts
+
+    n = len(years)
+    if len(months) != n or len(days) != n:
+        raise ValueError('Length of years/months/days must all be equal')
+    result = np.empty(n, dtype='M8[ns]')
+    iresult = result.view('i8')
+
+    dts.hour = 0
+    dts.min = 0
+    dts.sec = 0
+    dts.us = 0
+    dts.ps = 0
+
+    for i in range(n):
+        dts.year = years[i]
+        dts.month = months[i]
+        dts.day = days[i]
+        result[i] = dtstruct_to_dt64(&dts)
+
+    return np.asarray(result)
+
+
 cdef inline check_overflows(_TSObject obj):
     """
     Check that we haven't silently overflowed in timezone conversion
diff --git a/pandas/_libs/tslibs/parsing.pyx b/pandas/_libs/tslibs/parsing.pyx
index 71bb8f796..cd911153c 100644
--- a/pandas/_libs/tslibs/parsing.pyx
+++ b/pandas/_libs/tslibs/parsing.pyx
@@ -10,7 +10,6 @@ from cython import Py_ssize_t
 
 from cpython.datetime cimport datetime
 
-
 import numpy as np
 
 # Avoid import from outside _libs
@@ -19,7 +18,6 @@ if sys.version_info.major == 2:
 else:
     from io import StringIO
 
-
 # dateutil compat
 from dateutil.tz import (tzoffset,
                          tzlocal as _dateutil_tzlocal,
@@ -465,23 +463,6 @@ def try_parse_date_and_time(object[:] dates, object[:] times,
     return result.base  # .base to access underlying ndarray
 
 
-def try_parse_year_month_day(object[:] years, object[:] months,
-                             object[:] days):
-    cdef:
-        Py_ssize_t i, n
-        object[:] result
-
-    n = len(years)
-    if len(months) != n or len(days) != n:
-        raise ValueError('Length of years/months/days must all be equal')
-    result = np.empty(n, dtype='O')
-
-    for i in range(n):
-        result[i] = datetime(int(years[i]), int(months[i]), int(days[i]))
-
-    return result.base  # .base to access underlying ndarray
-
-
 def try_parse_datetime_components(object[:] years,
                                   object[:] months,
                                   object[:] days,
diff --git a/pandas/core/tools/datetimes.py b/pandas/core/tools/datetimes.py
index dcba51d26..c82140eff 100644
--- a/pandas/core/tools/datetimes.py
+++ b/pandas/core/tools/datetimes.py
@@ -244,6 +244,7 @@ def _convert_listlike_datetimes(arg, box, format, name=None, tz=None,
             if format == '%Y%m%d':
                 try:
                     result = _attempt_YYYYMMDD(arg, errors=errors)
+                    return DatetimeIndex._simple_new(result, name=name, tz=tz)
                 except (ValueError, TypeError, tslibs.OutOfBoundsDatetime):
                     raise ValueError("cannot convert the input to "
                                      "'%Y%m%d' date format")
@@ -713,12 +714,9 @@ def _attempt_YYYYMMDD(arg, errors):
     """
 
     def calc(carg):
-        # calculate the actual result
-        carg = carg.astype(object)
-        parsed = parsing.try_parse_year_month_day(carg / 10000,
-                                                  carg / 100 % 100,
-                                                  carg % 100)
-        return tslib.array_to_datetime(parsed, errors=errors)[0]
+        return conversion.try_parse_year_month_day(carg // 10000,
+                                                   (carg // 100) % 100,
+                                                   carg % 100)
 
     def calc_with_mask(carg, mask):
         result = np.empty(carg.shape, dtype='M8[ns]')

@jreback
Copy link
Contributor

jreback commented Nov 4, 2018

of course for this example, cache=True helps quite a bit :>

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@MarcoGorelli
Copy link
Member

MarcoGorelli commented Dec 5, 2022

This path is buggy anyway, I'd suggest just removing it #50054

@MarcoGorelli
Copy link
Member

Sending it down the same path that it would go down if format isn't specified isn't always possible, as it doesn't handle 6-digit %Y%m%d dates:

In [5]: to_datetime('199934', format='%Y%m%d')
Out[5]: Timestamp('1999-03-04 00:00:00')

In [6]: to_datetime('199934')
---------------------------------------------------------------------------
ParserError: month must be in 1..12: 199934 present at position 0

I still think we should remove this path though, even if it means going down the strptime path - if the %Y%m%d fastpath is buggy, then it doesn't matter how fast it is

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance
Projects
None yet
7 participants