BUG: Prevent addition overflow with TimedeltaIndex #14816

gfyoung · 2016-12-07T15:50:04Z

Expands checked-add array addition introduced in #14237 to include all other addition cases (i.e.
TimedeltaIndex and Timedelta). Follow-up to #14453.
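
As a rough illustration of what a checked add does, here is a pure-Python sketch. It is not the pandas implementation: the name `checked_add` and the list-based inputs are assumptions made for this example. The idea is to reject any sum that falls outside the signed 64-bit range instead of letting it wrap around silently.

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1


def checked_add(arr, b):
    """Add b (a scalar or a sequence as long as arr) to arr element-wise.

    Raises OverflowError if any sum would not fit in a signed 64-bit
    integer. Hypothetical helper, for illustration only.
    """
    # Treat a scalar addend as if it were repeated for every element.
    addends = b if isinstance(b, (list, tuple)) else [b] * len(arr)
    if len(addends) != len(arr):
        raise ValueError("arr and b must have the same length")

    result = []
    for x, y in zip(arr, addends):
        total = x + y  # Python ints are unbounded, so this never wraps
        if not INT64_MIN <= total <= INT64_MAX:
            raise OverflowError("Overflow in int64 addition")
        result.append(total)
    return result
```

With this sketch, `checked_add([INT64_MAX], 1)` raises `OverflowError`, whereas a plain fixed-width int64 addition would wrap around to a negative value.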

codecov-io · 2016-12-07T19:09:07Z

Current coverage is 85.33% (diff: 89.18%)

Merging #14816 into master will increase coverage by <.01%

@@             master     #14816   diff @@
==========================================
  Files           144        144          
  Lines         51043      51058    +15   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43552      43569    +17   
+ Misses         7491       7489     -2   
  Partials          0          0

Powered by Codecov. Last update dd8cba2...a086db6

jreback · 2016-12-09T21:43:40Z

pandas/core/nanops.py

@@ -812,15 +812,21 @@ def unique1d(values):
    return uniques


-def _checked_add_with_arr(arr, b):
+def _checked_add_with_arr(arr, b, arr_nans=None, b_nans=None):


why don't we make this signature more consistent maybe
_checked_add_with_arr(a, b, a_mask=None, b_mask=None)

btw, this function is different from most in nanops.py. we normally just call isnull directly on the arrays, so we don't first send in the .view('i8') for datetimelikes (that is done inside the function). I think you should make this more consistent, e.g. ideally the signature would actually be: _checked_add_with_arr(a, b, skipna=True)

if you can do this I would prefer over passing the masks

Sure, but you do understand that this isn't what we're getting when doing TDI addition / subtraction? In addition, there is no need to check for isnull since that work is already done for us during instantiation of the TDI instance.

Making the signature _checked_add_with_arr(a, b, a_mask=None, b_mask=None) is OK, but I agree with @gfyoung that passing the actual masks to the function like it is now is good.

the problem is future readers will just be plain confused

change this to compute the mask internally like all other functions

Not with good documentation they won't. I'm not convinced by your consistency argument because you're going to sacrifice performance. Every function in nanops.py calls isnull, but we have no reason to do this because we already know the null values before calling the function. So why duplicate the work?

The other option is that we would call the specific attribute of the TDI object internally, but then the function would no longer be universal to all pandas objects.
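
The trade-off being debated can be sketched as follows. This is a pure-Python illustration under stated assumptions, not the pandas code: `NAT` stands in for the int64 sentinel that marks missing datetimelike values, and `compute_mask` and `checked_add_masked` are hypothetical names. Positions flagged in a mask are skipped by the overflow check, and a caller that already holds the mask can pass it in rather than have it recomputed. For simplicity the sketch writes the sentinel back at masked positions.

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1
NAT = INT64_MIN  # assumption: missing values are stored as the int64 minimum


def compute_mask(values):
    """Scan the values for the missing-value sentinel (one O(n) pass)."""
    return [v == NAT for v in values]


def checked_add_masked(arr, b, arr_mask=None, b_mask=None):
    """Element-wise arr + b for equal-length lists of ints.

    Positions flagged True in either mask are excluded from the overflow
    check and yield NAT. An omitted mask means "nothing is missing".
    """
    n = len(arr)
    arr_mask = [False] * n if arr_mask is None else arr_mask
    b_mask = [False] * n if b_mask is None else b_mask

    result = []
    for x, y, x_missing, y_missing in zip(arr, b, arr_mask, b_mask):
        if x_missing or y_missing:
            result.append(NAT)  # missing in, missing out; no check needed
            continue
        total = x + y
        if not INT64_MIN <= total <= INT64_MAX:
            raise OverflowError("Overflow in int64 addition")
        result.append(total)
    return result


# A caller that already tracks its missing values passes the mask directly.
left = [5, NAT, INT64_MAX]
right = [1, -1, 0]
left_mask = compute_mask(left)  # in practice the caller would have this cached
masked_result = checked_add_masked(left, right, arr_mask=left_mask)
```

Without the mask, the second position (`NAT + -1`) would be reported as an overflow even though it only represents a missing value, which is why the check needs to know which elements to ignore.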

jreback · 2016-12-09T21:44:45Z

pandas/core/nanops.py


    Parameters
    ----------
    arr : array addend.
    b : array or scalar addend.
+    arr_nans : array indicating which elements are NaN


a_mask : boolean array or None, indicating which elements to exclude from checking

jorisvandenbossche · 2016-12-10T10:32:28Z

Can you show the results of the related benchmarks before/after PR?

gfyoung · 2016-12-11T01:49:27Z

Setup script:

import pandas as pd
import numpy as np

checked_add = pd.core.nanops._checked_add_with_arr

arr = np.arange(1000000)
arrpos = np.arange(1000000)
arrneg = np.arange(-1000000, 0)
arrmixed = np.array([1, -1]).repeat(500000)

Timing info is with 100 loops, best of 3

On master:

checked_add(arr, 1): 8.39 ms / loop
checked_add(arr, -1): 8.38 ms / loop
checked_add(arr, 0): 8.37 ms / loop
checked_add(arr, arrpos): 9.53 ms / loop
checked_add(arr, arrneg): 9.34 ms / loop
checked_add(arr, arrmixed): 14.9 ms / loop

On PR:

checked_add(arr, 1): 8.35 ms / loop
checked_add(arr, -1): 8.34 ms / loop
checked_add(arr, 0): 8.42 ms / loop
checked_add(arr, arrpos): 10.1 ms / loop
checked_add(arr, arrneg): 9.6 ms / loop
checked_add(arr, arrmixed): 16.8 ms / loop
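
Timings in the "100 loops, best of 3" form can be gathered with the standard library's `timeit` module. The sketch below is only an illustration: `toy_add` is a hypothetical stand-in for the function under test, and the input is much smaller than the arrays in the setup script above.

```python
import timeit


def toy_add(values, b):
    """Hypothetical stand-in for the function being benchmarked."""
    return [v + b for v in values]


values = list(range(10000))

# repeat=3 runs the measurement three times; number=100 calls the
# function 100 times per run. Each entry is the total time for one run.
run_totals = timeit.repeat(lambda: toy_add(values, 1), repeat=3, number=100)

# "Best of 3" is the fastest run, divided by the loop count to get a
# per-call figure, then converted from seconds to milliseconds.
best_ms_per_loop = min(run_totals) / 100 * 1000
print("toy_add(values, 1): {:.3f} ms / loop".format(best_ms_per_loop))
```

Taking the minimum rather than the mean is the usual choice here, since slower runs mostly reflect interference from other processes rather than the code being measured.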

gfyoung · 2016-12-11T09:46:00Z

@jreback , @jorisvandenbossche : How does the performance look to you both? If it looks good, should be ready to merge if there are no other concerns.

jorisvandenbossche · 2016-12-11T14:36:15Z

Those timings look good. Can you add a test for timedelta where the overflow is raised? (for the cases where you added the check)

jreback · 2016-12-11T14:37:37Z

you didn't address my api comments

jreback · 2016-12-11T15:02:22Z

this is inconsistent with the rest of the code

gfyoung · 2016-12-11T17:41:37Z

@jreback : I did address your API comments here. It also explains why I don't think consistency is an issue here. I don't understand why you want to make this internal function "consistent," especially when it doesn't really make sense to align it with the rest of the other functions.

gfyoung · 2016-12-15T16:08:39Z

@jreback @jorisvandenbossche : Do you have anything more to add about this API consistency issue, especially in light of my response?

jorisvandenbossche · 2016-12-15T16:22:34Z

As a possible way out of the discussion: @jreback, as your main point was a lack of consistency with the rest of the functions in nanops.py, would the current version be more OK to you if we moved it to another location? (Actually, this function is not really a "nanop" anyway: it is an adapted add to deal with integer overflow in the first place, not NaNs.) E.g. core/algorithms.py or somewhere else.

jreback · 2016-12-15T18:10:18Z

would be ok with a move to core/algorithms.py and api as currently (meaning the last iteration)

gfyoung · 2016-12-15T18:51:50Z

Changed _checked_add_with_arr to checked_add_with_arr in core/algorithms.py ~~since we are exposing it in core/api.py. Added whatsnew to v0.20.0 since this is technically an API change.~~

jreback · 2016-12-15T18:54:51Z

pandas/core/api.py

@@ -4,7 +4,8 @@

 import numpy as np

-from pandas.core.algorithms import factorize, match, unique, value_counts
+from pandas.core.algorithms import (checked_add_with_arr, factorize, match,


NO! this is not a top-level function, nor should be actually exposed to the user

the name is fine, but no reason to expose this

"would be ok with a move to core/algorithms.py and api as currently"

Those were your words. Fair enough. Will remove.

jreback · 2016-12-15T19:09:48Z

@gfyoung moving to an internal pandas module and exposing as a public method are 2 very separate and distinct things.

gfyoung · 2016-12-15T19:11:58Z

@jreback : Yes, I am aware, so what did you mean by "a move to core/algorithms and api" then?

jreback · 2016-12-15T19:14:35Z

api as you have currently, e.g. (a, b, a_mask, b_mask) (though I think you have something slightly different)

gfyoung · 2016-12-15T19:16:12Z

@jreback : Ah, I see. Signature is (arr, b, arr_mask=None, b_mask=None).

gfyoung · 2016-12-16T04:56:11Z

@jreback , @jorisvandenbossche : Moved checked add function to core/algorithms.py, and everything is still passing. Ready to merge if there are no other concerns.

jorisvandenbossche

Can you add a test for timedelta where the overflow is raised? (for the cases where you added the check)
Can you show the output of the Algorithms benchmark, just to be sure they run?

jorisvandenbossche · 2016-12-16T09:26:09Z

asv_bench/benchmarks/algorithms.py


        self.arr = np.arange(1000000)
        self.arrpos = np.arange(1000000)
        self.arrneg = np.arange(-1000000, 0)
        self.arrmixed = np.array([1, -1]).repeat(500000)

+        self.arr_nan = np.random.choice([True, False], size=1000000)
+        self.arrmixed_nan = np.random.choice([True, False], size=1000000)


What's the difference between those two?

Those are the masks for self.arr and self.arrmixed respectively.

But they are defined exactly the same, that's the reason I am confused on why two different variables are needed and the mask is not just reused. (but they are of course both random, so not exactly the same)

I fail to see your confusion here. Why re-use the mask when that won't be the case in reality (i.e. the masks will likely be different in most cases)?

yeah, no problem (I also merged BTW :-)), I was just skimming the code, and saw two apparently identical lines of code, hence my comment (which was not needed)

jorisvandenbossche · 2016-12-16T09:27:09Z

asv_bench/benchmarks/algorithms.py

+
+    def time_add_overflow_both_arg_nan(self):
+        self.checked_add(self.arr, self.arrmixed, arr_mask=self.arr_nan,
+                         b_mask=self.arrmixed_arr_nan)


arrmixed_arr_nan should this be arrmixed_nan ? (above as well)

Good catch. Will fix.

jorisvandenbossche · 2016-12-16T09:27:59Z

doc/source/whatsnew/v0.19.2.txt

@@ -56,6 +56,7 @@ Bug Fixes



+Bug in ``TimedeltaIndex`` addition where overflow was being allowed without error (:issue:`14816`)


Can you move to 0.20.0 ?

jorisvandenbossche · 2016-12-16T09:31:01Z

pandas/core/algorithms.py

+        Helper function to broadcast arrays / scalars to the desired shape.
+
+        This function is compatible with different versions of NumPy and is
+        implemented for performance reasons.


I find "implemented for performance reasons" a bit unclear. We broadcast for performance reasons, but that is not why the function itself is implemented: it exists so that we can broadcast on older numpy versions.
I would just keep your original comment about performance and broadcasting

Fair enough. Done.

jorisvandenbossche · 2016-12-16T09:33:36Z

pandas/core/algorithms.py

@@ -40,6 +41,94 @@
 # top-level algos #
 # --------------- #

+def checked_add_with_arr(arr, b, arr_mask=None, b_mask=None):


Can you put this function further down in this file? It does not belong to the "top-level algos" (the title just above this)

Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453. In addition, move checked add function to core/algorithms.

gfyoung · 2016-12-17T01:51:34Z

@jorisvandenbossche :

· Discovering benchmarks
· Running 16 total benchmarks (1 commits * 1 environments * 16 benchmarks)
[  0.00%] ·· Building for existing-py_home_user_miniconda3_envs_pandas-dev_bin_python
[  0.00%] ·· Benchmarking existing-py_home_user_miniconda3_envs_pandas-dev_bin_python
[  6.25%] ··· Running ...ms.Algorithms.time_add_overflow_both_arg_nan    16.24ms
[ 12.50%] ··· Running ...s.Algorithms.time_add_overflow_first_arg_nan    15.98ms
[ 18.75%] ··· Running ...ithms.Algorithms.time_add_overflow_mixed_arr    16.67ms
[ 25.00%] ··· Running algorithms.Algorithms.time_add_overflow_neg_arr     8.93ms
[ 31.25%] ··· Running ...thms.Algorithms.time_add_overflow_neg_scalar     8.02ms
[ 37.50%] ··· Running algorithms.Algorithms.time_add_overflow_pos_arr     9.03ms
[ 43.75%] ··· Running ...thms.Algorithms.time_add_overflow_pos_scalar     8.07ms
[ 50.00%] ··· Running ....Algorithms.time_add_overflow_second_arg_nan    15.96ms
[ 56.25%] ··· Running ...hms.Algorithms.time_add_overflow_zero_scalar     8.12ms
[ 62.50%] ··· Running algorithms.Algorithms.time_duplicated_float        18.64ms
[ 68.75%] ··· Running algorithms.Algorithms.time_duplicated_int          15.32ms
[ 75.00%] ··· Running ...rithms.Algorithms.time_duplicated_int_unique    50.73μs
[ 81.25%] ··· Running algorithms.Algorithms.time_factorize_float          7.59ms
[ 87.50%] ··· Running algorithms.Algorithms.time_factorize_int            7.49ms
[ 93.75%] ··· Running algorithms.Algorithms.time_factorize_string        29.15ms
[100.00%] ··· Running algorithms.Algorithms.time_match_strings            1.91ms

gfyoung · 2016-12-17T07:05:32Z

@jreback , @jorisvandenbossche : Addressed all of the comments and ran the benchmarks as requested. Everything is green, so ready to merge if there is nothing else.

jorisvandenbossche · 2016-12-17T23:25:22Z

@gfyoung thanks!

jreback reviewed Dec 9, 2016

jreback added the Bug and Timedelta (Timedelta data type) labels Dec 10, 2016

gfyoung force-pushed the checked-arr-add-is-nans branch 2 times, most recently from ce48455 to 7990eec Compare December 11, 2016 01:43

gfyoung force-pushed the checked-arr-add-is-nans branch from 7990eec to e57f3bb Compare December 15, 2016 16:09

gfyoung force-pushed the checked-arr-add-is-nans branch 2 times, most recently from d238fbb to 59ba12b Compare December 15, 2016 18:51

gfyoung force-pushed the checked-arr-add-is-nans branch from 59ba12b to 4f8a12d Compare December 15, 2016 18:53

jreback reviewed Dec 15, 2016

gfyoung force-pushed the checked-arr-add-is-nans branch from 4f8a12d to 7a3e54e Compare December 15, 2016 18:59

jorisvandenbossche reviewed Dec 16, 2016

BUG: Prevent addition overflow with TimedeltaIndex

a086db6

Expands checked-add array addition introduced in pandas-devgh-14237 to include all other addition cases (i.e. TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453. In addition, move checked add function to core/algorithms.

gfyoung force-pushed the checked-arr-add-is-nans branch from 7a3e54e to a086db6 Compare December 17, 2016 01:51

jorisvandenbossche merged commit bdbebc4 into pandas-dev:master Dec 17, 2016

gfyoung deleted the checked-arr-add-is-nans branch December 18, 2016 01:02

jorisvandenbossche added this to the 0.20.0 milestone Dec 24, 2016
