-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Ensure TDA.__init__ validates freq #24666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #24666 +/- ##
==========================================
+ Coverage 92.38% 92.38% +<.01%
==========================================
Files 166 166
Lines 52327 52310 -17
==========================================
- Hits 48342 48327 -15
+ Misses 3985 3983 -2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the perf impact with and without a user-provided freq vs. master?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. I think @TomAugspurger has a comment.
Setup import numpy as np
import pandas as pd
data = np.arange(10_000, dtype='i8') * 24 * 3600 * 10**9
%timeit pd.arrays.TimedeltaArray(data)
%timeit pd.arrays.TimedeltaArray(data, freq='D')
%timeit pd.timedelta_range("1H", periods=10_000) master
PR (with master merged in)
The one with The firs one, no freq, is surprising. (2.65us -> 20us). I'll see where the time is being spent now. |
~80% of Unfortunately, nothing in
It's just the cumulative effect of all the |
Passing diff --git a/pandas/core/arrays/timedeltas.py b/pandas/core/arrays/timedeltas.py
index d876862c9..e11d841c6 100644
--- a/pandas/core/arrays/timedeltas.py
+++ b/pandas/core/arrays/timedeltas.py
@@ -843,17 +843,18 @@ def sequence_to_td64ns(data, copy=False, unit="ns", errors="raise"):
data = data._data
# Convert whatever we have into timedelta64[ns] dtype
- if is_object_dtype(data) or is_string_dtype(data):
+ dtype = getattr(data, 'dtype', data)
+ if is_object_dtype(dtype) or is_string_dtype(dtype):
# no need to make a copy, need to convert if string-dtyped
data = objects_to_td64ns(data, unit=unit, errors=errors)
copy = False
- elif is_integer_dtype(data):
+ elif is_integer_dtype(dtype):
# treat as multiples of the given unit
data, copy_made = ints_to_td64ns(data, unit=unit)
copy = copy and not copy_made
- elif is_float_dtype(data):
+ elif is_float_dtype(dtype):
# treat as multiples of the given unit. If after converting to nanos,
# there are fractional components left, these are truncated
# (i.e. NOT rounded)
@@ -863,14 +864,14 @@ def sequence_to_td64ns(data, copy=False, unit="ns", errors="raise"):
data[mask] = iNaT
copy = False
- elif is_timedelta64_dtype(data):
+ elif is_timedelta64_dtype(dtype):
if data.dtype != _TD_DTYPE:
# non-nano unit
# TODO: watch out for overflows
data = data.astype(_TD_DTYPE)
copy = False
- elif is_datetime64_dtype(data):
+ elif is_datetime64_dtype(dtype):
# GH#23539
warnings.warn("Passing datetime64-dtype data to TimedeltaIndex is "
"deprecated, will raise a TypeError in a future " I see two options
Either of these can be done during the RC I think. |
So my request in |
These are checks we do a ton across the code-base, so I'm definitely on board for optimizing them. Two options come to mind:
The data.dtype.kind checks seem more straightforward to me.
certainly moving the i8 and m8[ns] cases to the top of the checks can short-circuit checking for other dtypes.
The freq inference in sequence_to_td64ns is pretty trivial. Do you mean the freq validation in _from_sequence? |
just pushed with faster dtype checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hows the perf for the non-user-provided-freq case compare to master now?
setup
PR
master
|
pandas/core/arrays/timedeltas.py
Outdated
@@ -880,7 +863,7 @@ def sequence_to_td64ns(data, copy=False, unit="ns", errors="raise"): | |||
data[mask] = iNaT | |||
copy = False | |||
|
|||
elif is_timedelta64_dtype(data): | |||
elif data.dtype.kind == 'm': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do these actually make a difference? I am -1 on changing these, this is the reason we have the is_* routines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The is_foo_dtype call takes about 16x longer than this check. Tom says these checks are the main source of his perf concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an alternative, sequence_to_td64ns
could take a fastpath
argument when we know that data
is an ndarray of the correct dtype. Then you could have (roughly)
if fastpath:
is_float = is_integer = is_object = is_string = is_timedelta = is_datetime = False
else:
is_object = is_object_dtype(data)
is_...
if is_object:
...
elif is_...
if not fastpath:
data = np.array(data, copy=copy)
elif copy:
data = data.copy()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or use the even-faster _simple_new? fastpath is a pattern we're still trying to deprecate elsewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am -1 on actually changing to the in [....] checks. The entire point is consistency in use. I am not convinced these are actual perf issues in the real world. micro seconds on a single construction pales in comparision to inconsistent code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can use _simple_new
then great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_simple_new
cannot be used, as that does ~zero validation on the input.
fastpath is a pattern we're still trying to deprecate elsewhere
We're deprecating that from public methods. sequnce_to_td64ns isn't public is it?
Just to be clear, my only blocker for the RC was the validation of the dtype in simple_new, which has been fixed. My blockers for the final release are
So I'm perfectly fine with filing issues and going away. |
I agree this is not a blocker |
so... revert the dtype checks to use is_foo_dtype and we're good? |
Fine for the RC.
…On Wed, Jan 9, 2019 at 8:45 AM jbrockmendel ***@***.***> wrote:
so... revert the dtype checks to use is_foo_dtype and we're good?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#24666 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIi097Mbr-IidMWtASBP_sngm_r3Jks5vBgCRgaJpZM4Z0gKU>
.
|
yep |
Done |
thanks |
Shouldn't starting to use |
OK too late :) |
This does change behavior. I believe that as of this PR, In [3]: pd.arrays.TimedeltaArray(np.array(['1H']))
Out[3]:
<TimedeltaArray>
['01:00:00']
Length: 1, dtype: timedelta64[ns] previously that would have raised a TypeError. My vote is for raising an error, but not necessarily blocking the RC for that change. |
hmm we should have the guarantees as in DTA right? that we only accept arrays of typed objects? |
Yes they should be consistent.
TDA and DTA both recently started using sequcne_to_*ns in their
`__init__`s, which meant they started parsing strings.
…On Wed, Jan 9, 2019 at 11:34 AM Jeff Reback ***@***.***> wrote:
hmm we should have the guarantees as in DTA right? that we only accept
arrays of typed objects?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#24666 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIunCa4gkSJdhfr9nLBTNZ_K2IQtJks5vBigGgaJpZM4Z0gKU>
.
|
right, inasmuch as it has been decided that crippling the public constructors is A Good Idea. |
If you disagree, then #24684 is
probably the place to discuss, though it's probably all been said at this point.
Not to shut that discussion down premature though... Apologies if it came off that way.
…On Wed, Jan 9, 2019 at 12:10 PM jbrockmendel ***@***.***> wrote:
that we only accept arrays of typed objects?
right, inasmuch as it has been decided that crippling the public
constructors is A Good Idea.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#24666 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIvUAN3TxZV5OZDEhXHwpdYMQpk1Tks5vBjCjgaJpZM4Z0gKU>
.
|
Users should not be able to construct invalid instances with the public constructor.
De-duplicates some code.