-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Prevent addition overflow with TimedeltaIndex #14816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Prevent addition overflow with TimedeltaIndex #14816
Conversation
Current coverage is 85.33% (diff: 89.18%)@@ master #14816 diff @@
==========================================
Files 144 144
Lines 51043 51058 +15
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43552 43569 +17
+ Misses 7491 7489 -2
Partials 0 0
|
@@ -812,15 +812,21 @@ def unique1d(values): | |||
return uniques | |||
|
|||
|
|||
def _checked_add_with_arr(arr, b): | |||
def _checked_add_with_arr(arr, b, arr_nans=None, b_nans=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't we make this signature more consistent maybe
_checked_add_with_arr(a, b, a_mask=None, b_mask=None)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, this function is different than most in nanops.py
. we normally actually just call isnull
directly on the arrays, so we don't first send in the .view('i8')
for datetimelikes (this is done in the function). I think you should make this more consistent e.g. ideally this signature would actually be: _check_add_with_arr(a, b, skipna=True)
if you can do this I would prefer over passing the masks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure but you do understand that that isn't what we're getting when doing TDI addition / subtraction. In addition, there is no need to check for isnull
since the work already done for us during instantiation of the TDI instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making the signature _checked_add_with_arr(a, b, a_mask=None, b_mask=None)
is OK, but I agree with @gfyoung that passing the actual masks to the function like it is now is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem is future readers will just be plain confused
change this to compute the mask internally like all other functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not with good documentation they won't. I'm not convinced by your consistency argument because you're going to sacrifice performance. Every function in nanops.py
calls isnull
, but we have no reason to do this because we already know the null
values before calling the function. So why do duplicate work?
The other option is that we would call the specific attribute of the TDI object internally but then that makes no other universal to all pandas
objects.
|
||
Parameters | ||
---------- | ||
arr : array addend. | ||
b : array or scalar addend. | ||
arr_nans : array indicating which elements are NaN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a_mask : boolean array or None
array indicated which elements to exclude from checking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Can you show the results of the related benchmarks before/after PR? |
ce48455
to
7990eec
Compare
Setup script: import pandas as pd
import numpy as np
checked_add = pd.core.nanops._checked_add_with_arr
arr = np.arange(1000000)
arrpos = np.arange(1000000)
arrneg = np.arange(-1000000, 0)
arrmixed = np.array([1, -1]).repeat(500000) Timing info is with 100 loops, best of 3 On
On
|
@jreback , @jorisvandenbossche : How does the performance look to you both? If it looks good, should be ready to merge if there are no other concerns. |
Those timings look good. Can you add a test for timedelta where the overflow is raised? (for the cases where you added the check) |
you didn't address my api comments |
this is inconsistent with the rest of the code |
@jreback @jorisvandenbossche : Do you have anything more to add about this API consistency issue, especially in light of my response? |
7990eec
to
e57f3bb
Compare
As a possible way out of the discussion: @jreback, as your main point was a lack of consistency with the rest of the functions in |
would be ok with a move to core/algorithms.py and api as currently (meaning the last iteration) |
d238fbb
to
59ba12b
Compare
Changed |
59ba12b
to
4f8a12d
Compare
@@ -4,7 +4,8 @@ | |||
|
|||
import numpy as np | |||
|
|||
from pandas.core.algorithms import factorize, match, unique, value_counts | |||
from pandas.core.algorithms import (checked_add_with_arr, factorize, match, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NO! this is not a top-level function, nor should be actually exposed to the user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the name is fine, but no reason to expose this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be ok with a move to core/algorithms.py and api as currently
Those were your words. Fair enough. Will remove.
4f8a12d
to
7a3e54e
Compare
@gfyoung moving to an internal pandas module and exposing as a public method are 2 very separate and distinct things. |
@jreback : Yes, I am aware, so then what did you mean by "a move to |
api as you have currently, e.g. |
@jreback : Ah, I see. Signature is |
@jreback , @jorisvandenbossche : Moved checked add function to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Can you add a test for timedelta where the overflow is raised? (for the cases where you added the check)
-
Can you show the output of the
Algorithms
benchmark, just to be sure they run?
|
||
self.arr = np.arange(1000000) | ||
self.arrpos = np.arange(1000000) | ||
self.arrneg = np.arange(-1000000, 0) | ||
self.arrmixed = np.array([1, -1]).repeat(500000) | ||
|
||
self.arr_nan = np.random.choice([True, False], size=1000000) | ||
self.arrmixed_nan = np.random.choice([True, False], size=1000000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between those two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are the masks for self.arr
and self.arrmixed
respectively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But they are defined exactly the same, that's the reason I am confused on why two different variables are needed and the mask is not just reused. (but they are of course both random, so not exactly the same)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fail to see your confusion here. Why re-use the mask when that won't be the case in reality (i.e. the masks will likely be different in most cases)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, no problem (I also merged BTW :-)), I was just skimming the code, and saw two apparantly identical lines of code, hence my comment (which was not needed)
|
||
def time_add_overflow_both_arg_nan(self): | ||
self.checked_add(self.arr, self.arrmixed, arr_mask=self.arr_nan, | ||
b_mask=self.arrmixed_arr_nan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arrmixed_arr_nan
should this be arrmixed_nan
? (above as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Will fix.
@@ -56,6 +56,7 @@ Bug Fixes | |||
|
|||
|
|||
|
|||
Bug in ``TimedeltaIndex`` addition where overflow was being allowed without error (:issue:`14816`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move to 0.20.0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Helper function to broadcast arrays / scalars to the desired shape. | ||
|
||
This function is compatible with different versions of NumPy and is | ||
implemented for performance reasons. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"implemented for performance reasons" is a bit unclear I find. We broadcast for performance reasons, but the function is not implemented for that reason, that is to broadcast for older numpy versions.
I would just keep your original comment about performance and broadcasting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Done.
@@ -40,6 +41,94 @@ | |||
# top-level algos # | |||
# --------------- # | |||
|
|||
def checked_add_with_arr(arr, b, arr_mask=None, b_mask=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this function somewhere more below in this file? As it does not belong to the "top-level algos" (the title just above this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453. In addition, move checked add function to core/algorithms.
|
7a3e54e
to
a086db6
Compare
@jreback , @jorisvandenbossche : Addressed all of the comments and ran the benchmarks as requested. Everything is green, so ready to merge if there is nothing else. |
@gfyoung thanks! |
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453. In addition, move checked add function to core/algorithms.
Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453. In addition, move checked add function to core/algorithms.
Expands checked-add array addition introduced in #14237 to include all other addition cases (i.e.
TimedeltaIndex
andTimedelta
). Follow-up to #14453.