Ensure TDA.__init__ validates freq #24666


Merged: 11 commits, Jan 9, 2019

Conversation

jbrockmendel
Member

Users should not be able to construct invalid instances with the public constructor.

De-duplicates some code.

@codecov

codecov bot commented Jan 8, 2019

Codecov Report

Merging #24666 into master will increase coverage by <.01%.
The diff coverage is 96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24666      +/-   ##
==========================================
+ Coverage   92.38%   92.38%   +<.01%     
==========================================
  Files         166      166              
  Lines       52327    52310      -17     
==========================================
- Hits        48342    48327      -15     
+ Misses       3985     3983       -2
Flag Coverage Δ
#multiple 90.8% <96%> (-0.01%) ⬇️
#single 43.06% <88%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/timedeltas.py 90.25% <100%> (+0.02%) ⬆️
pandas/core/arrays/timedeltas.py 87.86% <95.45%> (-0.24%) ⬇️
pandas/util/testing.py 88.09% <0%> (+0.09%) ⬆️

Last update ea8c9bf...02acf9d.

Contributor

@TomAugspurger TomAugspurger left a comment

What's the perf impact with and without a user-provided freq vs. master?

@jreback jreback added Timedelta Timedelta data type Clean labels Jan 8, 2019
@jreback jreback added this to the 0.24.0 milestone Jan 8, 2019
Contributor

@jreback jreback left a comment

lgtm. I think @TomAugspurger has a comment.

@TomAugspurger
Contributor

Setup

import numpy as np
import pandas as pd

data = np.arange(10_000, dtype='i8') * 24 * 3600 * 10**9
%timeit pd.arrays.TimedeltaArray(data)
%timeit pd.arrays.TimedeltaArray(data, freq='D')
%timeit pd.timedelta_range("1H", periods=10_000)

master

>>> %timeit pd.arrays.TimedeltaArray(data)
2.65 µs ± 61.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit pd.arrays.TimedeltaArray(data, freq='D')
70.3 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> %timeit pd.timedelta_range("1H", periods=10_000)
209 µs ± 3.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

PR (with master merged in)

>>> %timeit pd.arrays.TimedeltaArray(data)
19.6 µs ± 2.67 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit pd.arrays.TimedeltaArray(data, freq='D')
588 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit pd.timedelta_range("1H", periods=10_000)
135 µs ± 5.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The one with freq="D" passed (70us -> 588us) is expected, and can maybe be improved.

The first one, with no freq, is surprising (2.65us -> 20us). I'll see where the time is being spent now.

@TomAugspurger
Contributor

~80% of TDA.__init__ is spent in _from_sequence. Of that, ~75% is in sequence_to_td64ns, and ~10% in _simple_new.

Unfortunately, nothing in sequence_to_td64ns really stands out... The largest offender is

   845                                               # Convert whatever we have into timedelta64[ns] dtype
   846         1         21.0     21.0     31.8      if is_object_dtype(data) or is_string_dtype(data):
   847                                                   # no need to make a copy, need to convert if string-dtyped
   848                                                   data = objects_to_td64ns(data, unit=unit, errors=errors)
   849                                                   copy = False

It's just the cumulative effect of all the is_<foo>_dtype checks that's adding up.
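For intuition, here's a minimal numpy-only sketch (hypothetical illustration, not pandas code) of single-character `dtype.kind` dispatch, the cheaper alternative to chained `is_<foo>_dtype` calls that comes up later in this thread; the function name `classify_kind` is invented for the example:

```python
import numpy as np

def classify_kind(dtype):
    # One attribute lookup plus string comparisons, instead of a
    # separate is_<foo>_dtype function call per branch.
    kind = np.dtype(dtype).kind
    if kind == 'O':
        return 'object'
    if kind in 'iu':
        return 'integer'
    if kind == 'f':
        return 'float'
    if kind == 'm':
        return 'timedelta'
    if kind == 'M':
        return 'datetime'
    return 'other'

print(classify_kind(np.arange(3).dtype))   # -> integer
print(classify_kind('timedelta64[ns]'))    # -> timedelta
print(classify_kind(np.dtype(object)))     # -> object
```

Each `is_<foo>_dtype` helper has to normalize its argument to a dtype before comparing, so a chain of them repeats that work; the single `kind` lookup does it once.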

@TomAugspurger
Contributor

TomAugspurger commented Jan 8, 2019

Passing dtype to the is_foo_dtype checks helps a tad. Brings it down to 13us (so only a 5x slowdown, instead of 8x).

diff --git a/pandas/core/arrays/timedeltas.py b/pandas/core/arrays/timedeltas.py
index d876862c9..e11d841c6 100644
--- a/pandas/core/arrays/timedeltas.py
+++ b/pandas/core/arrays/timedeltas.py
@@ -843,17 +843,18 @@ def sequence_to_td64ns(data, copy=False, unit="ns", errors="raise"):
         data = data._data
 
     # Convert whatever we have into timedelta64[ns] dtype
-    if is_object_dtype(data) or is_string_dtype(data):
+    dtype = getattr(data, 'dtype', data)
+    if is_object_dtype(dtype) or is_string_dtype(dtype):
         # no need to make a copy, need to convert if string-dtyped
         data = objects_to_td64ns(data, unit=unit, errors=errors)
         copy = False
 
-    elif is_integer_dtype(data):
+    elif is_integer_dtype(dtype):
         # treat as multiples of the given unit
         data, copy_made = ints_to_td64ns(data, unit=unit)
         copy = copy and not copy_made
 
-    elif is_float_dtype(data):
+    elif is_float_dtype(dtype):
         # treat as multiples of the given unit.  If after converting to nanos,
         #  there are fractional components left, these are truncated
         #  (i.e. NOT rounded)
@@ -863,14 +864,14 @@ def sequence_to_td64ns(data, copy=False, unit="ns", errors="raise"):
         data[mask] = iNaT
         copy = False
 
-    elif is_timedelta64_dtype(data):
+    elif is_timedelta64_dtype(dtype):
         if data.dtype != _TD_DTYPE:
             # non-nano unit
             # TODO: watch out for overflows
             data = data.astype(_TD_DTYPE)
             copy = False
 
-    elif is_datetime64_dtype(data):
+    elif is_datetime64_dtype(dtype):
         # GH#23539
         warnings.warn("Passing datetime64-dtype data to TimedeltaIndex is "
                       "deprecated, will raise a TypeError in a future "

I see two options

  1. continue to speed up sequence_to_td64ns (possibly a fastpath argument, when the input is known to be an ndarray[i8] or m8[ns]).
  2. split out the freq inference of sequence_to_td64ns into a separate method.

Either of these can be done during the RC I think.

@TomAugspurger
Contributor

So my request in
#24666 (review) is my only blocker right now.

@jbrockmendel
Member Author

It's just the cumulative effect of all the is_<foo>_dtype checks that's adding up.

These are checks we do a ton across the code-base, so I'm definitely on board for optimizing them.

Two options come to mind:

  • use data.dtype.kind == 'O', which %timeit is telling me takes ~75ns vs ~1.25us for an is_<foo>_dtype check
  • do a lib.infer_dtype call up-front

The data.dtype.kind checks seem more straightforward to me.

continue to speed up sequence_to_td64ns (possibly a fastpath argument, when the input is known to be an ndarray[i8] or m8[ns]).

certainly moving the i8 and m8[ns] cases to the top of the checks can short-circuit checking for other dtypes.

split out the freq inference of sequence_to_td64ns into a separate method.

The freq inference in sequence_to_td64ns is pretty trivial. Do you mean the freq validation in _from_sequence?
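The short-circuiting idea can be sketched as follows; this is a hypothetical standalone function (the name, the `TD_NS` constant, and the structure are illustrative, not the actual sequence_to_td64ns):

```python
import numpy as np

TD_NS = np.dtype('timedelta64[ns]')  # assumption: stands in for pandas' _TD_DTYPE

def to_td64ns_sketch(data):
    """Check the cheap, common i8 / m8[ns] cases first, so the expensive
    object/string handling is only reached when actually needed."""
    dtype = getattr(data, 'dtype', None)
    if dtype == TD_NS:
        return data                                   # already m8[ns]; nothing to do
    if dtype is not None and dtype.kind in 'iu':
        # treat integers as multiples of nanoseconds
        return data.astype('i8', copy=False).view(TD_NS)
    # slower paths (object, string, float, non-nano units) would go here
    raise TypeError(f"sketch does not handle dtype {dtype!r}")

arr = np.arange(3, dtype='i8')
print(to_td64ns_sketch(arr).dtype)   # -> timedelta64[ns]
```

With this ordering, the hot i8 and m8[ns] inputs never pay for the object/string dtype checks at all.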

@jbrockmendel
Member Author

just pushed with faster dtype checks

Contributor

@TomAugspurger TomAugspurger left a comment

How's the perf for the non-user-provided-freq case compared to master now?

@jbrockmendel
Member Author

How's the perf for the non-user-provided-freq case compared to master now?

__init__ is ~2.8x slower than master, _simple_new is about 2.5x faster

setup

In [3]: arr = np.arange(1000)
In [4]: arr2 = arr.view('timedelta64[ns]')

PR

In [5]: %timeit pd.core.arrays.TimedeltaArray(arr)
The slowest run took 30.48 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.04 µs per loop

In [6]: %timeit pd.core.arrays.TimedeltaArray(arr2)
The slowest run took 14.43 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.47 µs per loop

In [7]: %timeit pd.core.arrays.TimedeltaArray._simple_new(arr)
The slowest run took 63.68 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 880 ns per loop

In [8]: %timeit pd.core.arrays.TimedeltaArray._simple_new(arr2)
The slowest run took 24.56 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 893 ns per loop

master

In [5]: %timeit pd.core.arrays.TimedeltaArray(arr)
The slowest run took 40.33 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.21 µs per loop

In [6]: %timeit pd.core.arrays.TimedeltaArray(arr2)
The slowest run took 11.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.94 µs per loop

In [7]: %timeit pd.core.arrays.TimedeltaArray._simple_new(arr)
The slowest run took 10.87 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.5 µs per loop

In [8]: %timeit pd.core.arrays.TimedeltaArray._simple_new(arr2)
The slowest run took 17.49 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.18 µs per loop

@@ -880,7 +863,7 @@ def sequence_to_td64ns(data, copy=False, unit="ns", errors="raise"):
         data[mask] = iNaT
         copy = False
 
-    elif is_timedelta64_dtype(data):
+    elif data.dtype.kind == 'm':
Contributor

do these actually make a difference? I am -1 on changing these; this is the reason we have the is_* routines.

Member Author

The is_foo_dtype call takes about 16x longer than this check. Tom says these checks are the main source of his perf concern.

Contributor

As an alternative, sequence_to_td64ns could take a fastpath argument when we know that data is an ndarray of the correct dtype. Then you could have (roughly)

if fastpath:
    is_float = is_integer = is_object = is_string = is_timedelta = is_datetime = False
else:
    is_object = is_object_dtype(data)
    is_...


if is_object:
    ...
elif is_...

if not fastpath:
    data = np.array(data, copy=copy)
elif copy:
    data = data.copy()

Member Author

or use the even-faster _simple_new? fastpath is a pattern we're still trying to deprecate elsewhere

Contributor

I am -1 on actually changing to the in [....] checks. The entire point is consistency in use. I am not convinced these are actual perf issues in the real world. Microseconds on a single construction pales in comparison to inconsistent code.

Contributor

If you can use _simple_new then great.

Contributor

_simple_new cannot be used, as that does ~zero validation on the input.

fastpath is a pattern we're still trying to deprecate elsewhere

We're deprecating that from public methods. sequence_to_td64ns isn't public, is it?

@TomAugspurger
Contributor

Just to be clear, my only blocker for the RC was the validation of the dtype in _simple_new, which has been fixed.

My blockers for the final release are

  1. DTA/TDA.__init__ not accepting strings like TimedeltaArray(np.array(['1H', '2H'])).
  2. The __init__ methods being simple and not doing unnecessary work. We know that they should only accept ndarray[m8[ns]] (or i8), or a TDA / Series / Index boxing those, so checking for additional dtypes is unnecessary.

So I'm perfectly fine with filing issues and going away.
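Blocker 2 roughly amounts to a constructor along these lines; a hypothetical sketch only (class name and error message are invented, and this is not the eventual pandas implementation):

```python
import numpy as np

class TimedeltaArraySketch:
    """Accept only ndarray[i8] or ndarray[m8[ns]]; reject everything else
    (strings, objects, non-nano units) instead of coercing it."""

    def __init__(self, values):
        values = np.asarray(values)
        if values.dtype.kind in 'iu':
            # integers are viewed as nanosecond counts
            values = values.astype('i8', copy=False).view('timedelta64[ns]')
        elif values.dtype != np.dtype('timedelta64[ns]'):
            raise TypeError(
                "only int64 or timedelta64[ns] ndarrays are accepted, "
                f"got dtype {values.dtype}")
        self._data = values

TimedeltaArraySketch(np.arange(5))          # ok: ints viewed as m8[ns]
try:
    TimedeltaArraySketch(np.array(['1H']))  # rejected, unlike this PR's behaviour
except TypeError as exc:
    print("raised:", exc)
```

The point is that all the coercion logic (strings, objects, floats) lives elsewhere, keeping __init__ to a dtype check and a view.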

@jreback
Contributor

jreback commented Jan 9, 2019

I agree this is not a blocker

@jbrockmendel
Member Author

so... revert the dtype checks to use is_foo_dtype and we're good?

@TomAugspurger
Contributor

TomAugspurger commented Jan 9, 2019 via email

@jreback
Contributor

jreback commented Jan 9, 2019

yep

@jbrockmendel
Member Author

Done

@jreback jreback merged commit decc8ce into pandas-dev:master Jan 9, 2019
@jreback
Contributor

jreback commented Jan 9, 2019

thanks

@jorisvandenbossche
Member

Shouldn't starting to use sequence_to_td64 wait until the discussion in #24684?

@jorisvandenbossche
Member

OK too late :)
(anyway, this only impacts performance, not behaviour, I think, so that is not a blocker for the RC)

@TomAugspurger
Contributor

This does change behavior. I believe that as of this PR, TimedeltaArray.__init__ accepts strings:

In [3]: pd.arrays.TimedeltaArray(np.array(['1H']))
Out[3]:
<TimedeltaArray>
['01:00:00']
Length: 1, dtype: timedelta64[ns]

previously that would have raised a TypeError.

My vote is for raising an error, but not necessarily blocking the RC for that change.

@jbrockmendel jbrockmendel deleted the td64 branch January 9, 2019 17:08
@jreback
Contributor

jreback commented Jan 9, 2019

hmm, we should have the same guarantees as in DTA, right? that we only accept arrays of typed objects?

@TomAugspurger
Contributor

TomAugspurger commented Jan 9, 2019 via email

@jbrockmendel
Member Author

that we only accept arrays of typed objects?

right, inasmuch as it has been decided that crippling the public constructors is A Good Idea.

@TomAugspurger
Contributor

TomAugspurger commented Jan 9, 2019 via email

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019