Ensure TDA.__init__ validates freq #24666


Merged: 11 commits, Jan 9, 2019

Conversation

jbrockmendel
Member

Users should not be able to construct invalid instances with the public constructor.

De-duplicates some code.

@codecov

codecov bot commented Jan 8, 2019

Codecov Report

Merging #24666 into master will increase coverage by <.01%.
The diff coverage is 96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24666      +/-   ##
==========================================
+ Coverage   92.38%   92.38%   +<.01%     
==========================================
  Files         166      166              
  Lines       52327    52310      -17     
==========================================
- Hits        48342    48327      -15     
+ Misses       3985     3983       -2
Flag Coverage Δ
#multiple 90.8% <96%> (-0.01%) ⬇️
#single 43.06% <88%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/timedeltas.py 90.25% <100%> (+0.02%) ⬆️
pandas/core/arrays/timedeltas.py 87.86% <95.45%> (-0.24%) ⬇️
pandas/util/testing.py 88.09% <0%> (+0.09%) ⬆️

Last update ea8c9bf...02acf9d.

Contributor

@TomAugspurger TomAugspurger left a comment

What's the perf impact with and without a user-provided freq vs. master?

@jreback jreback added Timedelta Timedelta data type Clean labels Jan 8, 2019
@jreback jreback added this to the 0.24.0 milestone Jan 8, 2019
Contributor

@jreback jreback left a comment

lgtm. I think @TomAugspurger has a comment.

@TomAugspurger
Contributor

Setup

import numpy as np
import pandas as pd

data = np.arange(10_000, dtype='i8') * 24 * 3600 * 10**9
%timeit pd.arrays.TimedeltaArray(data)
%timeit pd.arrays.TimedeltaArray(data, freq='D')
%timeit pd.timedelta_range("1H", periods=10_000)

master

>>> %timeit pd.arrays.TimedeltaArray(data)
2.65 µs ± 61.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit pd.arrays.TimedeltaArray(data, freq='D')
70.3 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> %timeit pd.timedelta_range("1H", periods=10_000)
209 µs ± 3.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

PR (with master merged in)

>>> %timeit pd.arrays.TimedeltaArray(data)
19.6 µs ± 2.67 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit pd.arrays.TimedeltaArray(data, freq='D')
588 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit pd.timedelta_range("1H", periods=10_000)
135 µs ± 5.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The one with freq="D" passed (70us -> 588us) is expected, and can maybe be improved.

The first one, with no freq, is surprising (2.65us -> 20us). I'll see where the time is being spent now.

@TomAugspurger
Contributor

~80% of TDA.__init__ is spent in _from_sequence. Of that, ~75% is in sequence_to_td64ns, and ~10% in _simple_new.

Unfortunately, nothing in sequence_to_td64ns really stands out... The largest offender is

   845                                               # Convert whatever we have into timedelta64[ns] dtype
   846         1         21.0     21.0     31.8      if is_object_dtype(data) or is_string_dtype(data):
   847                                                   # no need to make a copy, need to convert if string-dtyped
   848                                                   data = objects_to_td64ns(data, unit=unit, errors=errors)
   849                                                   copy = False

It's just the cumulative effect of all the is_<foo>_dtype checks that's adding up.
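For intuition, here's a minimal numpy-only sketch (hypothetical illustration, not pandas code) of single-character `dtype.kind` dispatch, the cheaper alternative to chained `is_<foo>_dtype` calls that comes up later in this thread; the function name `classify_kind` is invented for the example:

```python
import numpy as np

def classify_kind(dtype):
    # One attribute lookup plus string comparisons, instead of a
    # separate is_<foo>_dtype function call per branch.
    kind = np.dtype(dtype).kind
    if kind == 'O':
        return 'object'
    if kind in 'iu':
        return 'integer'
    if kind == 'f':
        return 'float'
    if kind == 'm':
        return 'timedelta'
    if kind == 'M':
        return 'datetime'
    return 'other'

print(classify_kind(np.arange(3).dtype))   # -> integer
print(classify_kind('timedelta64[ns]'))    # -> timedelta
print(classify_kind(np.dtype(object)))     # -> object
```

Each `is_<foo>_dtype` helper has to normalize its argument to a dtype before comparing, so a chain of them repeats that work; the single `kind` lookup does it once.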

@TomAugspurger
Contributor

TomAugspurger commented Jan 8, 2019

Passing dtype to the is_foo_dtype checks helps a tad. Brings it down to 13us (so only a 5x slowdown, instead of 8x).

diff --git a/pandas/core/arrays/timedeltas.py b/pandas/core/arrays/timedeltas.py
index d876862c9..e11d841c6 100644
--- a/pandas/core/arrays/timedeltas.py
+++ b/pandas/core/arrays/timedeltas.py
@@ -843,17 +843,18 @@ def sequence_to_td64ns(data, copy=False, unit="ns", errors="raise"):
         data = data._data
 
     # Convert whatever we have into timedelta64[ns] dtype
-    if is_object_dtype(data) or is_string_dtype(data):
+    dtype = getattr(data, 'dtype', data)
+    if is_object_dtype(dtype) or is_string_dtype(dtype):
         # no need to make a copy, need to convert if string-dtyped
         data = objects_to_td64ns(data, unit=unit, errors=errors)
         copy = False
 
-    elif is_integer_dtype(data):
+    elif is_integer_dtype(dtype):
         # treat as multiples of the given unit
         data, copy_made = ints_to_td64ns(data, unit=unit)
         copy = copy and not copy_made
 
-    elif is_float_dtype(data):
+    elif is_float_dtype(dtype):
         # treat as multiples of the given unit.  If after converting to nanos,
         #  there are fractional components left, these are truncated
         #  (i.e. NOT rounded)
@@ -863,14 +864,14 @@ def sequence_to_td64ns(data, copy=False, unit="ns", errors="raise"):
         data[mask] = iNaT
         copy = False
 
-    elif is_timedelta64_dtype(data):
+    elif is_timedelta64_dtype(dtype):
         if data.dtype != _TD_DTYPE:
             # non-nano unit
             # TODO: watch out for overflows
             data = data.astype(_TD_DTYPE)
             copy = False
 
-    elif is_datetime64_dtype(data):
+    elif is_datetime64_dtype(dtype):
         # GH#23539
         warnings.warn("Passing datetime64-dtype data to TimedeltaIndex is "
                       "deprecated, will raise a TypeError in a future "

I see two options

  1. continue to speed up sequence_to_td64ns (possibly a fastpath argument, when the input is known to be an ndarray[i8] or m8[ns]).
  2. split out the freq inference of sequence_to_td64ns into a separate method.

Either of these can be done during the RC I think.

@TomAugspurger
Contributor

So my request in
#24666 (review) is my only blocker right now.

@jbrockmendel
Member Author

It's just the cumulative effect of all the is_<foo>_dtype checks that's adding up.

These are checks we do a ton across the code-base, so I'm definitely on board for optimizing them.

Two options come to mind:

  • use data.dtype.kind == 'O', which %timeit is telling me takes ~75ns vs ~1.25us for an is_<foo>_dtype check
  • do a lib.infer_dtype call up-front

The data.dtype.kind checks seem more straightforward to me.

continue to speed up sequence_to_td64ns (possibly a fastpath argument, when the input is known to be an ndarray[i8] or m8[ns]).

certainly moving the i8 and m8[ns] cases to the top of the checks can short-circuit checking for other dtypes.

split out the freq inference of sequence_to_td64ns into a separate method.

The freq inference in sequence_to_td64ns is pretty trivial. Do you mean the freq validation in _from_sequence?
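The short-circuiting idea can be sketched as follows; this is a hypothetical standalone function (the name, the `TD_NS` constant, and the structure are illustrative, not the actual sequence_to_td64ns):

```python
import numpy as np

TD_NS = np.dtype('timedelta64[ns]')  # assumption: stands in for pandas' _TD_DTYPE

def to_td64ns_sketch(data):
    """Check the cheap, common i8 / m8[ns] cases first, so the expensive
    object/string handling is only reached when actually needed."""
    dtype = getattr(data, 'dtype', None)
    if dtype == TD_NS:
        return data                                   # already m8[ns]; nothing to do
    if dtype is not None and dtype.kind in 'iu':
        # treat integers as multiples of nanoseconds
        return data.astype('i8', copy=False).view(TD_NS)
    # slower paths (object, string, float, non-nano units) would go here
    raise TypeError(f"sketch does not handle dtype {dtype!r}")

arr = np.arange(3, dtype='i8')
print(to_td64ns_sketch(arr).dtype)   # -> timedelta64[ns]
```

With this ordering, the hot i8 and m8[ns] inputs never pay for the object/string dtype checks at all.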

@jbrockmendel
Member Author

just pushed with faster dtype checks

Contributor

@TomAugspurger TomAugspurger left a comment

How's the perf for the non-user-provided-freq case compared to master now?

@jbrockmendel
Member Author

How's the perf for the non-user-provided-freq case compared to master now?

__init__ is ~2.8x slower than master, _simple_new is about 2.5x faster

setup

In [3]: arr = np.arange(1000)
In [4]: arr2 = arr.view('timedelta64[ns]')

PR

In [5]: %timeit pd.core.arrays.TimedeltaArray(arr)
The slowest run took 30.48 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.04 µs per loop

In [6]: %timeit pd.core.arrays.TimedeltaArray(arr2)
The slowest run took 14.43 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.47 µs per loop

In [7]: %timeit pd.core.arrays.TimedeltaArray._simple_new(arr)
The slowest run took 63.68 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 880 ns per loop

In [8]: %timeit pd.core.arrays.TimedeltaArray._simple_new(arr2)
The slowest run took 24.56 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 893 ns per loop

master

In [5]: %timeit pd.core.arrays.TimedeltaArray(arr)
The slowest run took 40.33 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.21 µs per loop

In [6]: %timeit pd.core.arrays.TimedeltaArray(arr2)
The slowest run took 11.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.94 µs per loop

In [7]: %timeit pd.core.arrays.TimedeltaArray._simple_new(arr)
The slowest run took 10.87 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.5 µs per loop

In [8]: %timeit pd.core.arrays.TimedeltaArray._simple_new(arr2)
The slowest run took 17.49 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.18 µs per loop

@@ -880,7 +863,7 @@ def sequence_to_td64ns(data, copy=False, unit="ns", errors="raise"):
         data[mask] = iNaT
         copy = False
 
-    elif is_timedelta64_dtype(data):
+    elif data.dtype.kind == 'm':
Contributor

do these actually make a difference? I am -1 on changing these; this is the reason we have the is_* routines.

Member Author

The is_foo_dtype call takes about 16x longer than this check. Tom says these checks are the main source of his perf concern.

Contributor

As an alternative, sequence_to_td64ns could take a fastpath argument when we know that data is an ndarray of the correct dtype. Then you could have (roughly)

if fastpath:
    is_float = is_integer = is_object = is_string = is_timedelta = is_datetime = False
else:
    is_object = is_object_dtype(data)
    is_...


if is_object:
    ...
elif is_...

if not fastpath:
    data = np.array(data, copy=copy)
elif copy:
    data = data.copy()

Member Author

or use the even-faster _simple_new? fastpath is a pattern we're still trying to deprecate elsewhere

Contributor

I am -1 on actually changing to the in [....] checks. The entire point is consistency in use. I am not convinced these are actual perf issues in the real world. Microseconds on a single construction pales in comparison to inconsistent code.

Contributor

If you can use _simple_new then great.

Contributor

_simple_new cannot be used, as that does ~zero validation on the input.

fastpath is a pattern we're still trying to deprecate elsewhere

We're deprecating that from public methods. sequence_to_td64ns isn't public, is it?

@TomAugspurger
Contributor

Just to be clear, my only blocker for the RC was the validation of the dtype in _simple_new, which has been fixed.

My blockers for the final release are

  1. DTA/TDA.__init__ not accepting strings like TimedeltaArray(np.array(['1H', '2H'])).
  2. The __init__ methods being simple and not doing unnecessary work. We know that they should only accept ndarray[m8[ns]] (or i8), or a TDA / Series / Index boxing those, so checking for additional dtypes is unnecessary.

So I'm perfectly fine with filing issues and going away.
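Blocker 2 roughly amounts to a constructor along these lines; a hypothetical sketch only (class name and error message are invented, and this is not the eventual pandas implementation):

```python
import numpy as np

class TimedeltaArraySketch:
    """Accept only ndarray[i8] or ndarray[m8[ns]]; reject everything else
    (strings, objects, non-nano units) instead of coercing it."""

    def __init__(self, values):
        values = np.asarray(values)
        if values.dtype.kind in 'iu':
            # integers are viewed as nanosecond counts
            values = values.astype('i8', copy=False).view('timedelta64[ns]')
        elif values.dtype != np.dtype('timedelta64[ns]'):
            raise TypeError(
                "only int64 or timedelta64[ns] ndarrays are accepted, "
                f"got dtype {values.dtype}")
        self._data = values

TimedeltaArraySketch(np.arange(5))          # ok: ints viewed as m8[ns]
try:
    TimedeltaArraySketch(np.array(['1H']))  # rejected, unlike this PR's behaviour
except TypeError as exc:
    print("raised:", exc)
```

The point is that all the coercion logic (strings, objects, floats) lives elsewhere, keeping __init__ to a dtype check and a view.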

@jreback
Contributor

jreback commented Jan 9, 2019

I agree this is not a blocker

@jbrockmendel
Member Author

so... revert the dtype checks to use is_foo_dtype and we're good?

@TomAugspurger
Contributor

TomAugspurger commented Jan 9, 2019 via email

@jreback
Contributor

jreback commented Jan 9, 2019

yep

@jbrockmendel
Member Author

Done

@jreback jreback merged commit decc8ce into pandas-dev:master Jan 9, 2019
@jreback
Contributor

jreback commented Jan 9, 2019

thanks

@jorisvandenbossche
Member

Shouldn't starting to use sequence_to_td64 wait until the discussion in #24684?

@jorisvandenbossche
Member

OK too late :)
(anyway, this only impacts performance, not behaviour, I think, so that is not a blocker for the RC)

@TomAugspurger
Contributor

This does change behavior. I believe that as of this PR, TimedeltaArray.__init__ accepts strings:

In [3]: pd.arrays.TimedeltaArray(np.array(['1H']))
Out[3]:
<TimedeltaArray>
['01:00:00']
Length: 1, dtype: timedelta64[ns]

previously that would have raised a TypeError.

My vote is for raising an error, but not necessarily blocking the RC for that change.

@jbrockmendel jbrockmendel deleted the td64 branch January 9, 2019 17:08
@jreback
Contributor

jreback commented Jan 9, 2019

hmm, we should have the same guarantees as in DTA, right? that we only accept arrays of typed objects?

@TomAugspurger
Contributor

TomAugspurger commented Jan 9, 2019 via email

@jbrockmendel
Member Author

that we only accept arrays of typed objects?

right, inasmuch as it has been decided that crippling the public constructors is A Good Idea.

@TomAugspurger
Contributor

TomAugspurger commented Jan 9, 2019 via email

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019