Skip to content

Handle "today" and "now" in cython instead of C #18666

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 9, 2017

Conversation

jbrockmendel
Copy link
Member

ATM the Timestamp constructor (specifically tslibs.conversion.convert_str_to_tsobject) catches "now" and "today" before passing strings to _string_to_dts, which eventually dispatches to np_datetime_strings. By contrast, to_datetime (specifically tslib.array_to_datetime) does not handle "now" or "today" internally.

This PR makes array_to_datetime handle "now" and "today" internally, removing a big chunk of np_datetime_strings in the process.

NB: This does not change the fact that to_datetime("today") != Timestamp("today"), just makes this fact more obvious. My hope is that making this obvious will make it easier to change/fix in the future.

asv results are as usual all over the place, but if anything look slightly faster.

return -1;
}


Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an 49 line version of time.time()

*out_local = 0;
}

return convert_datetime_to_datetimestruct(PANDAS_FR_s, rawtime, out);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side-benefit: this is the only usage of convert_datetime_to_datetimestruct outside of np_datetime.c, so after this we can get rid of the extra layer there.

@jreback jreback added Clean Datetime Datetime data dtype labels Dec 7, 2017
@jreback
Copy link
Contributor

jreback commented Dec 7, 2017

looks fine. can you run some asv's to see if anything changes significantly. (we should also have ones for Timestamp('now') and pd.to_datetime(['now']), if we don't already)

@codecov
Copy link

codecov bot commented Dec 7, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@9ec7212). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #18666   +/-   ##
=========================================
  Coverage          ?   91.58%           
=========================================
  Files             ?      153           
  Lines             ?    51272           
  Branches          ?        0           
=========================================
  Hits              ?    46960           
  Misses            ?     4312           
  Partials          ?        0
Flag Coverage Δ
#multiple 89.45% <ø> (?)
#single 40.68% <ø> (?)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ec7212...6273f51. Read the comment docs.

@jbrockmendel
Copy link
Member Author

Yesterday morning the asvs were mixed leaning towards slightly faster post-PR. Now they're distinctly slower post-PR. Need to look into this.

@jbrockmendel
Copy link
Member Author

asv results are back to All Over The Place. ipython %timeit results make sense in py2 and are confusing in py3.

In py2 %timeit shows a small slowdown in to_datetime and a slightly-larger-but-still-small speedup in Timestamp for str inputs. The Timestamp outcome in particular makes sense because this PR only affects it by removing if/elif checks that are never True in the status quo.

In py3 %timeit shows significant slowdowns across the board, including in Timestmap. No idea what the deal is there.

@jbrockmendel
Copy link
Member Author

Just pushed with an update that delays checking for "now" and "today" until as late as possible -- the idea being that these are fairly rare cases.

Ran %timeit overnight to aggregate 10k runs for each of the relevant cases:

pd.Timestamp('2016/12/07 09:26 PM')       # _string_to_dts raises, falls back to dateutil
pd.to_datetime('2016/12/07 09:26 PM')     # _string_to_dts raises, falls back to dateutil
pd.Timestamp('2016-12-07 09:26')          # handled by _string_to_dts
pd.to_datetime('2016-12-07 09:26')        # handled by _string_to_dts
pd.Timestamp('now')
pd.to_datetime('now')
pd.Timestamp('today')
pd.to_datetime('today')

Results:

                                     py27                                          py35                                    
                                   master        PR          diff           t    master        PR          diff           t
Timestamp   2016/12/07 09:26 PM  0.000201  0.000195 -5.132584e-06  -12.424955  0.000164  0.000159 -5.005332e-06  -17.471177
            2016-12-07 09:26     0.000001  0.000001 -3.821889e-08  -10.961579  0.000001  0.000001 -8.732139e-08  -29.778350
            now                  0.000003  0.000003 -2.443057e-07  -27.362516  0.000003  0.000003 -3.389956e-08   -5.326515
            today                0.000003  0.000003 -1.795984e-07  -20.129438  0.000003  0.000003 -2.423781e-08   -3.798430
to_datetime 2016/12/07 09:26 PM  0.000311  0.000299 -1.228289e-05  -23.529981  0.000258  0.000255 -3.012189e-06   -7.910489
            2016-12-07 09:26     0.000075  0.000071 -4.063470e-06  -24.860716  0.000064  0.000062 -2.034531e-06  -17.767926
            now                  0.000075  0.000190  1.149521e-04  414.750570  0.000063  0.000171  1.078343e-04  501.837848
            today                0.000075  0.000230  1.546294e-04  503.477338  0.000064  0.000218  1.540099e-04  624.599432

i.e. in all cases except to_datetime('now') and to_datetime('today') the PR is more performant by a small but clear margin. The remaining two cases could be made more performant; in their current form they are optimized to clarify the contrast with Timestamp(...) and not for perf.

Finally asv results:

taskset 3 asv continuous -f 1.1 -E virtualenv master HEAD -b timeseries
[...]
    before     after       ratio
  [13f62672] [6273f51e]
+  312.89ms   419.98ms      1.34  timeseries.AsOfDataFrame.time_asof_nan
+   31.78ms    36.66ms      1.15  timeseries.DatetimeIndex.time_dti_factorize
-  344.11μs   309.34μs      0.90  timeseries.DatetimeIndex.time_timeseries_is_month_start
-   15.21ms    13.65ms      0.90  timeseries.TimeSeries.time_add_irregular
-   25.76ms    21.88ms      0.85  timeseries.DatetimeIndex.time_to_pydatetime
-  450.18μs   381.62μs      0.85  timeseries.DatetimeIndex.time_reset_index
-    8.66ms     7.19ms      0.83  timeseries.DatetimeIndex.time_infer_freq_daily
-   14.90ms    12.16ms      0.82  timeseries.AsOfDataFrame.time_asof_single
-   10.85ms     8.85ms      0.82  timeseries.TimeSeries.time_sort_index_non_monotonic
-    4.78ms     3.88ms      0.81  timeseries.AsOf.time_asof
-    1.95ms     1.57ms      0.81  timeseries.ToDatetime.time_cache_false_with_dup_string_dates_and_format
-    1.28ms     1.03ms      0.80  timeseries.ResampleSeries.time_1min_5min_mean
-   25.07μs    19.71μs      0.79  timeseries.DatetimeAccessor.time_dt_accessor
-    2.83ms     2.22ms      0.78  timeseries.ResampleDataFrame.time_mean_string
-    2.82ms     2.20ms      0.78  timeseries.ResampleDataFrame.time_mean_numpy
-    4.47ms     3.46ms      0.77  timeseries.AsOf.time_asof_nan
-    4.92ms     3.78ms      0.77  timeseries.ToDatetime.time_cache_true_with_dup_seconds_and_unit
-   16.18ms    12.28ms      0.76  timeseries.AsOfDataFrame.time_asof_nan_single
-    3.78ms     2.85ms      0.75  timeseries.ResampleDataFrame.time_max_string
-    2.26ms     1.69ms      0.75  timeseries.DatetimeIndex.time_add_timedelta
-    2.88ms     2.13ms      0.74  timeseries.ToDatetime.time_cache_true_with_dup_string_dates_and_format
-    3.12ms     2.29ms      0.74  timeseries.ToDatetime.time_cache_true_with_dup_string_tzoffset_dates
-   12.42ms     8.96ms      0.72  timeseries.Iteration.time_iter_datetimeindex_preexit
-  624.27ms   445.29ms      0.71  timeseries.ToDatetime.time_format_no_exact
-  113.36ms    80.72ms      0.71  timeseries.ToDatetime.time_cache_false_with_dup_string_tzoffset_dates
-    6.33ms     4.39ms      0.69  timeseries.DatetimeAccessor.time_dt_accessor_normalize
-  164.86μs   110.51μs      0.67  timeseries.DatetimeIndex.time_unique
-  222.24ms   148.55ms      0.67  timeseries.DatetimeIndex.time_dti_time
-   27.67ms    18.34ms      0.66  timeseries.DatetimeIndex.time_to_date
-   17.54ms    11.63ms      0.66  timeseries.DatetimeIndex.time_infer_freq_none
-    4.37ms     2.88ms      0.66  timeseries.ResampleDataFrame.time_min_numpy
-   21.73μs    14.18μs      0.65  offset.YearBegin.time_timeseries_year_incr
-    8.40ms     5.41ms      0.64  timeseries.ToDatetime.time_format_YYYYMMDD
-    4.57ms     2.88ms      0.63  timeseries.ResampleDataFrame.time_max_numpy
-    3.08ms     1.89ms      0.61  timeseries.ResampleSeries.time_resample_datetime64
-   45.07μs    26.64μs      0.59  timeseries.AsOf.time_asof_single
-    2.66ms     1.55ms      0.58  timeseries.ToDatetime.time_cache_false_with_dup_string_dates
-     3.03s      1.73s      0.57  timeseries.Iteration.time_iter_periodindex
-    4.74ms     2.70ms      0.57  timeseries.TimeSeries.time_large_lookup_value
-    4.57ms     2.59ms      0.57  timeseries.ToDatetime.time_cache_false_with_unique_seconds_and_unit
-  110.23μs    61.07μs      0.55  timeseries.TimeSeries.time_timeseries_slice_minutely
-  869.47ms   474.60ms      0.55  timeseries.Iteration.time_iter_datetimeindex
-   42.84μs    22.99μs      0.54  offset.Day.time_timeseries_day_incr
-   40.65μs    21.41μs      0.53  offset.Day.time_timeseries_day_apply
-    7.07ms     3.55ms      0.50  timeseries.ToDatetime.time_iso8601_nosep
-   34.21ms    16.77ms      0.49  timeseries.Iteration.time_iter_periodindex_preexit
-   42.31μs    19.26μs      0.46  timeseries.AsOf.time_asof_single_early
-   12.42ms     5.04ms      0.41  timeseries.ResampleSeries.time_timestamp_downsample_mean

As mentioned elsewhere, time_asof_nan and time_dti_factorize are among the ones that I have come to completely ignore.

@@ -706,6 +716,19 @@ cpdef array_to_datetime(ndarray[object] values, errors='raise',
return oresult


cdef inline bint _parse_today_now(str val, int64_t* iresult):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are moving these routines to conversion, yes? (obviously not in this PR)

@jreback jreback added this to the 0.22.0 milestone Dec 9, 2017
@jreback jreback merged commit 34a8d36 into pandas-dev:master Dec 9, 2017
@jreback
Copy link
Contributor

jreback commented Dec 9, 2017

thansk!

@jbrockmendel
Copy link
Member Author

Shoot, I was wrong, this does change the behavior of to_datetime('today'). (See #18705).

0.21.0

>>> pd.Timestamp.utcnow()  # for context
Timestamp('2017-12-10 01:47:39.139989+0000', tz='UTC')
>>> pd.to_datetime('today')
Timestamp('2017-12-09 00:00:00')

Post-PR:

>>> pd.Timestamp.utcnow()  # for context
Timestamp('2017-12-10 01:48:13.939360+0000', tz='UTC')
>>> pd.to_datetime('today')
Timestamp('2017-12-10 00:00:00')

I actually think this new behavior is better, but the intent of the PR was to be behavior-preserving. I can make a follow-up with a quick fix. LMK.

@jreback
Copy link
Contributor

jreback commented Dec 10, 2017

ok great

and add some explicit tests which lock things down (until / unless we change them)

@jbrockmendel
Copy link
Member Author

As in revert to previous behavior?

@jreback
Copy link
Contributor

jreback commented Dec 10, 2017

revert to previous and put in place some tests for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants