PERF: pd.to_datetime, unit='s' much slower for float64 than for int64 #35027

Merged
merged 59 commits into from
Sep 19, 2020

Conversation

arw2019
Member

@arw2019 arw2019 commented Jun 27, 2020

Following the discussion in #20445, this PR addresses the performance of to_datetime on a uniformly typed array of floats. The aim is to speed up float input by converting with astype instead of looping over the elements.

@arw2019
Member Author

arw2019 commented Jun 29, 2020

As a quick measure of the efficacy of this PR, I ran the checks from #20445 on the current branch and on master.

The set-up is:

import numpy as np
import pandas as pd

timestamp_seconds_int = pd.Series(np.random.randint(1521685107 - 604800, 1521685107, 1000000, dtype='int64'))
timestamp_seconds_float = timestamp_seconds_int.astype('float64')

For ints

%%timeit -r 3 
pd.to_datetime(timestamp_seconds_int, unit='s')

I get

current branch: 34.8 ms ± 934 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
master: 35.1 ms ± 694 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)

and for floats

%%timeit -r 3 
pd.to_datetime(timestamp_seconds_float, unit='s')

I get

current branch: 30.9 ms ± 228 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
master: 15 s ± 448 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
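
For anyone who wants to repeat the comparison outside IPython, the set-up above can be wrapped in a small script. This is a minimal sketch, not part of the PR: the helper names build_inputs and time_conversion are made up for illustration, and it uses the standard-library timeit module in place of the %%timeit magic, so the absolute numbers will differ from those quoted above.

```python
import timeit

import numpy as np
import pandas as pd


def build_inputs(n=1_000_000, seed=0):
    """Return matching int64 and float64 Series of epoch seconds."""
    rng = np.random.default_rng(seed)
    end = 1521685107
    # One week of timestamps ending at `end`, as in the set-up above.
    ints = pd.Series(rng.integers(end - 604800, end, n, dtype="int64"))
    return ints, ints.astype("float64")


def time_conversion(series, repeat=3, number=1):
    """Best wall-clock time in seconds for pd.to_datetime(series, unit='s')."""
    timer = timeit.Timer(lambda: pd.to_datetime(series, unit="s"))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    ints, floats = build_inputs()
    print(f"int64:   {time_conversion(ints) * 1e3:.1f} ms")
    print(f"float64: {time_conversion(floats) * 1e3:.1f} ms")
```

Taking the minimum over the repeats is a design choice that reduces noise from other processes; %%timeit reports mean and standard deviation instead.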

@WillAyd added the Performance and Datetime labels Jun 29, 2020
@arw2019
Member Author

arw2019 commented Jul 9, 2020

@WillAyd @jbrockmendel I have something that works on some Linux environments but not on others, and not on Mac/Windows. I'm not sure what the problem is or how best to go about debugging it.

@simonjayhawkins
Member

ping @WillAyd @jbrockmendel

# fill by comparing to NPY_NAT constant
mask = fresult == NPY_NAT
fresult[mask] = 0
fvalues = fresult * (<float64_t>m)
Member

This cast doesn't seem right and is probably the cause of your portability issue. What warning does this generate without the cast?

Member Author

The traceback if I remove this line:

  File "/workspaces/pandas-arw2019/pandas/core/tools/datetimes.py", line 802, in to_datetime
    values = convert_listlike(arg._values, format)
  File "/workspaces/pandas-arw2019/pandas/core/tools/datetimes.py", line 338, in _convert_listlike_datetimes
    result, tz_parsed = tslib.array_with_unit_to_datetime(
  File "pandas/_libs/tslib.pyx", line 262, in pandas._libs.tslib.array_with_unit_to_datetime
    if ((fvalues < Timestamp.min.value).any()
SystemError: /home/conda/feedstock_root/build_artifacts/python_1591030388223/work/Objects/object.c:769: bad argument to internal function
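
The masking idea in the snippet under review (fill missing entries with zero, scale in float space, check the bounds, then restore NaT) can be sketched in plain NumPy. This is an illustration under stated assumptions rather than the Cython code in this PR: the helper name float_seconds_to_datetime64 and the OverflowError are made up for the example, missing values are detected with NaN instead of a comparison against NPY_NAT, and fractional seconds are rounded to the nearest nanosecond.

```python
import numpy as np

NS_PER_SECOND = 1_000_000_000
NAT_I8 = np.iinfo(np.int64).min  # int64 value NumPy uses to encode NaT


def float_seconds_to_datetime64(seconds):
    """Convert float epoch seconds to datetime64[ns]; NaN becomes NaT."""
    seconds = np.asarray(seconds, dtype=np.float64)
    mask = np.isnan(seconds)
    # Fill missing entries with 0 so the arithmetic below stays well defined.
    scaled = np.where(mask, 0.0, seconds) * NS_PER_SECOND
    # Bounds check in float space, before anything is cast to int64.
    if (np.abs(scaled) >= 2.0**63).any():
        raise OverflowError("value out of bounds for datetime64[ns]")
    i8 = np.round(scaled).astype(np.int64)
    # Restore the missing entries as NaT.
    i8[mask] = NAT_I8
    return i8.view("M8[ns]")


if __name__ == "__main__":
    print(float_seconds_to_datetime64([0.0, 1.5, np.nan]))
```

The whole conversion is a handful of array operations, which is where the speed-up over an element-by-element loop comes from.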

@arw2019
Member Author

arw2019 commented Sep 15, 2020

Benchmark results are below. I would appreciate any comments on how to triage the areas that take a hit.

asv continuous -f 1.1 upstream/master HEAD -b tslibs

       before           after         ratio
     [fd20f7d3]       [b308ba71]
     <pd-todatetime-unit_s-float-vs-int~1^2>       <pd-todatetime-unit_s-float-vs-int>
+     2.36±0.04μs       2.97±0.2μs     1.26  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'end', 'year', 'QS', 3)
+      8.51±0.2μs       10.6±0.7μs     1.25  tslibs.offsets.OffestDatetimeArithmetic.time_apply_np_dt64(<SemiMonthBegin: day_of_month=15>)
+     2.39±0.04μs       2.98±0.2μs     1.25  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'end', 'year', 'QS', 12)
+     2.33±0.02μs       2.89±0.3μs     1.24  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'start', 'year', None, 12)
+      11.2±0.6μs       13.8±0.3μs     1.23  tslibs.offsets.OffestDatetimeArithmetic.time_subtract_10(<BusinessQuarterEnd: startingMonth=3>)
+      5.82±0.2μs       7.13±0.4μs     1.23  tslibs.normalize.Normalize.time_normalize_i8_timestamps(0, tzfile('/usr/share/zoneinfo/Asia/Tokyo'))
+     2.48±0.03μs       2.95±0.2μs     1.19  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'end', 'quarter', 'QS', 3)
+      5.77±0.1μs       6.74±0.2μs     1.17  tslibs.normalize.Normalize.time_normalize_i8_timestamps(1, tzfile('/usr/share/zoneinfo/Asia/Tokyo'))
+     2.47±0.01μs      2.87±0.06μs     1.16  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'start', 'year', 'QS', 3)
+     13.0±0.07μs       15.0±0.8μs     1.16  tslibs.offsets.OffestDatetimeArithmetic.time_apply_np_dt64(<DateOffset: days=2, months=2>)
+     2.47±0.01μs      2.85±0.04μs     1.15  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'start', 'year', 'QS', 12)
+      8.14±0.2μs       9.29±0.2μs     1.14  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(100, 'start', 'quarter', 'QS', 12)
+         566±1μs          641±8μs     1.13  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(10000, 'end', 'quarter', 'QS', 5)
+         295±3ns          333±6ns     1.13  tslibs.period.PeriodProperties.time_property('min', 'quarter')
+       538±0.9μs         607±10μs     1.13  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(10000, 'end', 'year', 'B', 5)
+     1.72±0.04μs       1.94±0.3μs     1.13  tslibs.fields.TimeGetDateField.time_get_date_field(0, 'q')
+         350±8ns         395±20ns     1.13  tslibs.period.PeriodProperties.time_property('min', 'day')
+     2.35±0.03μs       2.64±0.2μs     1.13  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'start', 'quarter', None, 3)
+     7.20±0.06μs       8.10±0.5μs     1.12  tslibs.fields.TimeGetDateField.time_get_date_field(100, 'h')
+         558±2μs          628±7μs     1.12  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(10000, 'end', 'quarter', 'QS', 3)
+     2.23±0.01μs      2.51±0.06μs     1.12  tslibs.tz_convert.TimeTZConvert.time_tz_convert_from_utc(100, datetime.timezone.utc)
+     2.34±0.01μs      2.63±0.07μs     1.12  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'end', 'quarter', None, 5)
+     2.48±0.01μs       2.78±0.2μs     1.12  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'end', 'month', 'B', 12)
+         617±4μs         691±40μs     1.12  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(10000, 'start', 'month', 'B', 12)
+         549±1μs         613±30μs     1.12  tslibs.fields.TimeGetDateField.time_get_date_field(10000, 'm')
+     2.39±0.02μs      2.67±0.04μs     1.11  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'end', 'quarter', None, 3)
+     8.28±0.07μs       9.22±0.8μs     1.11  tslibs.offsets.OffestDatetimeArithmetic.time_apply_np_dt64(<MonthEnd>)
+         303±1ns          336±6ns     1.11  tslibs.period.PeriodProperties.time_property('min', 'qyear')
+     2.45±0.04μs       2.71±0.1μs     1.11  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'start', 'year', 'QS', 5)
+         386±4ns         427±10ns     1.11  tslibs.period.PeriodProperties.time_property('min', 'is_leap_year')
+         541±3μs          598±9μs     1.11  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(10000, 'end', 'year', None, 12)
+         538±2μs         594±10μs     1.11  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(10000, 'start', 'year', 'B', 5)
+     2.48±0.02μs       2.74±0.3μs     1.10  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'start', 'month', 'B', 3)
+         557±1μs         614±10μs     1.10  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(10000, 'start', 'year', None, 12)
+         550±8μs         606±10μs     1.10  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(10000, 'end', 'year', None, 5)
+     1.72±0.05μs       1.90±0.1μs     1.10  tslibs.fields.TimeGetDateField.time_get_date_field(0, 'woy')
+     2.37±0.01μs       2.61±0.3μs     1.10  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'start', 'year', None, 5)
-         573±9ms         520±10ms     0.91  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 1000000, datetime.timezone.utc)
-      7.30±0.3μs      6.61±0.04μs     0.91  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 5000, datetime.timezone(datetime.timedelta(seconds=3600)))
-      6.37±0.3μs       5.78±0.2μs     0.91  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 2011, tzfile('/usr/share/zoneinfo/Asia/Tokyo'))
-        707±20μs          640±8μs     0.91  tslibs.resolution.TimeResolution.time_get_resolution('m', 10000, tzfile('/usr/share/zoneinfo/Asia/Tokyo'))
-      11.7±0.2μs      10.5±0.08μs     0.90  tslibs.period.TimePeriodArrToDT64Arr.time_periodarray_to_dt64arr(100, 4006)
-     2.29±0.05μs      2.06±0.02μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 4000, None)
-     2.31±0.07μs      2.08±0.02μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 9000, tzlocal())
-      8.12±0.8μs      7.31±0.05μs     0.90  tslibs.timedelta.TimedeltaConstructor.time_from_iso_format
-     2.27±0.09μs      2.04±0.03μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 2011, tzlocal())
-      8.03±0.2μs      7.21±0.04μs     0.90  tslibs.resolution.TimeResolution.time_get_resolution('us', 100, None)
-     2.15±0.07μs      1.93±0.01μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 5000, datetime.timezone.utc)
-     2.27±0.03μs      2.04±0.02μs     0.90  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 7000, tzlocal())
-     2.30±0.08μs      2.05±0.01μs     0.89  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 4006, None)
-      7.45±0.3μs      6.63±0.04μs     0.89  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1011, datetime.timezone(datetime.timedelta(seconds=3600)))
-      2.33±0.2μs      2.06±0.05μs     0.89  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 5000, tzlocal())
-         125±4ms          111±2ms     0.88  tslibs.resolution.TimeResolution.time_get_resolution('us', 10000, tzlocal())
-      6.12±0.1μs       5.39±0.2μs     0.88  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 4006, <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>)
-      2.34±0.1μs      2.05±0.01μs     0.88  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 12000, tzlocal())
-     2.35±0.04μs      2.06±0.01μs     0.87  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 8000, tzlocal())
-      2.38±0.2μs      2.06±0.02μs     0.86  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 8000, None)
-      2.45±0.3μs      2.09±0.06μs     0.85  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 3000, None)
-     2.33±0.07μs      1.99±0.04μs     0.85  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 6000, datetime.timezone.utc)
-     2.25±0.07μs      1.91±0.05μs     0.85  tslibs.period.TimePeriodArrToDT64Arr.time_periodarray_to_dt64arr(0, 1000)
-      6.77±0.3μs       5.74±0.1μs     0.85  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 8000, tzfile('/usr/share/zoneinfo/Asia/Tokyo'))
-      2.46±0.2μs      2.08±0.02μs     0.85  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1011, None)
-      2.46±0.2μs      2.06±0.02μs     0.84  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1000, None)
-        7.08±1μs      5.91±0.09μs     0.83  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 5000, tzfile('/usr/share/zoneinfo/Asia/Tokyo'))
-      2.50±0.8μs      2.04±0.02μs     0.82  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 4000, tzlocal())
-      6.64±0.1μs       5.41±0.3μs     0.81  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 5000, <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>)
-     1.62±0.08μs      1.29±0.02μs     0.80  tslibs.resolution.TimeResolution.time_get_resolution('h', 1, None)
-      2.40±0.4μs      1.91±0.02μs     0.79  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 7000, datetime.timezone.utc)
-      2.59±0.1μs      2.03±0.03μs     0.79  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 6000, None)
-        7.80±2μs      5.83±0.02μs     0.75  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1011, tzfile('/usr/share/zoneinfo/Asia/Tokyo'))
-        8.13±2μs       5.46±0.2μs     0.67  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>)
asv continuous -f 1.1 upstream/master HEAD -b timeseries

       before           after         ratio
     [fd20f7d3]       [a3f42df7]
     <pd-todatetime-unit_s-float-vs-int~3^2>       <pd-todatetime-unit_s-float-vs-int>
+        559±10μs        799±100μs     1.43  timeseries.DatetimeIndex.time_unique('repeated')
+      2.64±0.2ms       3.68±0.1ms     1.39  timeseries.Lookup.time_lookup_and_cleanup
+     1.94±0.08ms       2.49±0.3ms     1.28  timeseries.ResampleDataFrame.time_method('max')
+         304±8μs         380±60μs     1.25  timeseries.ToDatetimeCacheSmallCount.time_unique_date_strings(True, 500)
+      2.15±0.1ms       2.54±0.1ms     1.18  timeseries.DatetimeIndex.time_unique('tz_local')
+     2.88±0.07ms       3.30±0.2ms     1.15  timeseries.ToDatetimeNONISO8601.time_same_offset
+      3.45±0.1ms       3.89±0.2ms     1.13  timeseries.ToDatetimeCache.time_unique_seconds_and_unit(False)
+      3.05±0.1ms      3.40±0.09ms     1.12  timeseries.ToDatetimeISO8601.time_iso8601_nosep
+      1.23±0.04s       1.37±0.07s     1.11  timeseries.DatetimeIndex.time_to_time('tz_local')
+      5.36±0.1ms       5.94±0.2ms     1.11  timeseries.AsOf.time_asof('DataFrame')
+      15.5±0.1ms       17.1±0.4ms     1.10  timeseries.ToDatetimeFormat.time_no_exact
+      3.64±0.1ms       4.01±0.2ms     1.10  timeseries.ToDatetimeCache.time_dup_seconds_and_unit(False)
+      4.76±0.2ms       5.24±0.1ms     1.10  timeseries.ToDatetimeYYYYMMDD.time_format_YYYYMMDD

)
self.timestamp_seconds_float = self.timestamp_seconds_int.astype("float64")

def to_datetime_int(self):
Member

I think these need to be named def time_foo.

Member Author

rewrote this
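
For context on the naming comment: asv discovers benchmarks by method-name prefix, and timing benchmarks must start with time_. A minimal sketch of what such a benchmark class could look like follows. The class and method names here are hypothetical and are not taken from the merged code; only the setup method and the time_ prefix are asv conventions.

```python
import pandas as pd


class ToDatetimeSecondsIntVsFloat:
    """Hypothetical asv benchmark comparing int64 and float64 input."""

    def setup(self):
        # asv calls setup() before timing each benchmark method.
        self.ts_sec = pd.Series(range(1521585107, 1521685107), dtype="int64")
        self.ts_sec_float = self.ts_sec.astype("float64")

    def time_sec_int64(self):
        pd.to_datetime(self.ts_sec, unit="s")

    def time_sec_float64(self):
        pd.to_datetime(self.ts_sec_float, unit="s")
```

A method called to_datetime_int would be treated as an ordinary helper and never timed, which is why the rename matters.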

Contributor

@jreback jreback left a comment

Looks good, just a couple of comments. Can you post an updated benchmark?

result = values.astype('M8[ns]')
if unit == "ns":
if issubclass(values.dtype.type, (np.integer, np.float_)):
result = values.astype("M8[ns]")
Contributor

does astype of floats directly to M8 work?

Contributor

i guess it does (as you use it below), but do we have a test specifically for float with unit='ns'?

also can try .astype(..., copy=False)

Member Author

> does astype of floats directly to M8 work?

would it be better to do, here and below:

ivalues = values.view("i8")
result = ivalues.astype("M8[ns]")

Member Author

> i guess it does (as you use it below), but do we have a test specifically for float with unit='ns'?

I'll look some more, but I don't think we do. I'll add one unless I find an existing test.
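
The casts discussed in this thread can be compared directly in NumPy. A minimal sketch, assuming only NumPy, with array contents made up for illustration: astype converts each value, whereas view("i8") on a float64 array reinterprets the underlying bytes, so for float input the two produce different integers.

```python
import numpy as np

# Epoch offsets stored as float64 (illustrative values, in nanoseconds).
values = np.array([1.0, 2.0, 1_600_000_000.0], dtype="float64")

# astype converts each value: 1.0 -> 1
converted = values.astype("i8")

# view reinterprets the same 8 bytes as int64: 1.0 -> its IEEE 754 bit pattern
reinterpreted = values.view("i8")

# For whole-number input, casting the floats directly and going through
# int64 first give the same datetimes.
via_int = converted.astype("M8[ns]")
direct = values.astype("M8[ns]")

# copy=False skips the copy when the array already has the requested dtype.
same_object = values.astype("float64", copy=False) is values

print(converted[0], reinterpreted[0])
print((via_int == direct).all(), same_object)
```

On this evidence, view("i8") is only appropriate when the array already holds 8-byte integer-like data such as datetime64; for float64 input it would need to be astype("i8").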

@arw2019
Member Author

arw2019 commented Sep 19, 2020

@jreback with your suggestion (bb8c35b) benchmarks are now (largely) unaffected:

asv continuous -f 1.1 upstream/master HEAD -b tslibs
       before           after         ratio
     [80f0a74c]       [bb8c35bf]
     <GH35612^2>       <pd-todatetime-unit_s-float-vs-int>
+     1.72±0.02ms      1.93±0.02ms     1.13  tslibs.normalize.Normalize.time_normalize_i8_timestamps(1000000, datetime.timezone(datetime.timedelta(seconds=3600)))
+      23.1±0.1μs       25.7±0.2μs     1.12  tslibs.normalize.Normalize.time_normalize_i8_timestamps(10000, datetime.timezone(datetime.timedelta(seconds=3600)))
-      2.89±0.3μs      2.63±0.06μs     0.91  tslibs.fields.TimeGetTimedeltaField.time_get_timedelta_field(100, 'ns')
-      25.4±0.1μs      23.0±0.07μs     0.91  tslibs.normalize.Normalize.time_normalize_i8_timestamps(10000, None)
-      2.89±0.2μs      2.62±0.09μs     0.91  tslibs.fields.TimeGetTimedeltaField.time_get_timedelta_field(100, 'us')
-      25.3±0.2μs      22.9±0.09μs     0.90  tslibs.normalize.Normalize.time_normalize_i8_timestamps(10000, datetime.timezone.utc)
-      1.87±0.2μs      1.69±0.03μs     0.90  tslibs.fields.TimeGetTimedeltaField.time_get_timedelta_field(0, 's')
-      2.14±0.2μs      1.91±0.07μs     0.89  tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('date', 0, datetime.timezone(datetime.timedelta(seconds=3600)))

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@jreback jreback added this to the 1.2 milestone Sep 19, 2020
@jreback jreback merged commit c1484b1 into pandas-dev:master Sep 19, 2020
@jreback
Contributor

jreback commented Sep 19, 2020

thanks @arw2019 very nice!

@arw2019
Member Author

arw2019 commented Sep 19, 2020

thanks @jreback @WillAyd @jbrockmendel for reviewing!!!

Labels
Datetime, Performance
Development

Successfully merging this pull request may close these issues.

pd.to_datetime, unit='s' much slower for float64 than for int64
5 participants