Skip to content

BUG: Prevent addition overflow with TimedeltaIndex #14816

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Dec 7, 2016

Expands checked-add array addition introduced in #14237 to include all other addition cases (i.e.
TimedeltaIndex and Timedelta). Follow-up to #14453.

@codecov-io
Copy link

codecov-io commented Dec 7, 2016

Current coverage is 85.33% (diff: 89.18%)

Merging #14816 into master will increase coverage by <.01%

@@             master     #14816   diff @@
==========================================
  Files           144        144          
  Lines         51043      51058    +15   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43552      43569    +17   
+ Misses         7491       7489     -2   
  Partials          0          0          

Powered by Codecov. Last update dd8cba2...a086db6

@@ -812,15 +812,21 @@ def unique1d(values):
return uniques


def _checked_add_with_arr(arr, b):
def _checked_add_with_arr(arr, b, arr_nans=None, b_nans=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we make this signature more consistent maybe
_checked_add_with_arr(a, b, a_mask=None, b_mask=None)

Copy link
Contributor

@jreback jreback Dec 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, this function is different than most in nanops.py. we normally actually just call isnull directly on the arrays, so we don't first send in the .view('i8') for datetimelikes (this is done in the function). I think you should make this more consistent e.g. ideally this signature would actually be: _check_add_with_arr(a, b, skipna=True)

if you can do this I would prefer over passing the masks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure but you do understand that that isn't what we're getting when doing TDI addition / subtraction. In addition, there is no need to check for isnull since the work already done for us during instantiation of the TDI instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the signature _checked_add_with_arr(a, b, a_mask=None, b_mask=None) is OK, but I agree with @gfyoung that passing the actual masks to the function like it is now is good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem is future readers will just be plain confused

change this to compute the mask internally like all other functions

Copy link
Member Author

@gfyoung gfyoung Dec 11, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not with good documentation they won't. I'm not convinced by your consistency argument because you're going to sacrifice performance. Every function in nanops.py calls isnull, but we have no reason to do this because we already know the null values before calling the function. So why do duplicate work?

The other option is that we would call the specific attribute of the TDI object internally but then that makes no other universal to all pandas objects.


Parameters
----------
arr : array addend.
b : array or scalar addend.
arr_nans : array indicating which elements are NaN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a_mask : boolean array or None
    array indicated which elements to exclude from checking

Copy link
Member Author

@gfyoung gfyoung Dec 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jorisvandenbossche
Copy link
Member

Can you show the results of the related benchmarks before/after PR?

@jreback jreback added Bug Timedelta Timedelta data type labels Dec 10, 2016
@gfyoung gfyoung force-pushed the checked-arr-add-is-nans branch 2 times, most recently from ce48455 to 7990eec Compare December 11, 2016 01:43
@gfyoung
Copy link
Member Author

gfyoung commented Dec 11, 2016

Setup script:

import pandas as pd
import numpy as np

checked_add = pd.core.nanops._checked_add_with_arr

arr = np.arange(1000000)
arrpos = np.arange(1000000)
arrneg = np.arange(-1000000, 0)
arrmixed = np.array([1, -1]).repeat(500000)

Timing info is with 100 loops, best of 3

On master:

  • checked_add(arr, 1): 8.39 ms / loop
  • checked_add(arr, -1): 8.38 ms / loop
  • checked_add(arr, 0): 8.37 ms / loop
  • checked_add(arr, arrpos): 9.53 ms / loop
  • checked_add(arr, arrneg): 9.34 ms / loop
  • checked_add(arr, arrmixed): 14.9 ms / loop

On PR:

  • checked_add(arr, 1): 8.35 ms / loop
  • checked_add(arr, -1): 8.34 ms / loop
  • checked_add(arr, 0): 8.42 ms / loop
  • checked_add(arr, arrpos): 10.1 ms / loop
  • checked_add(arr, arrneg): 9.6 ms / loop
  • checked_add(arr, arrmixed): 16.8 ms / loop

@gfyoung
Copy link
Member Author

gfyoung commented Dec 11, 2016

@jreback , @jorisvandenbossche : How does the performance look to you both? If it looks good, should be ready to merge if there are no other concerns.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 11, 2016

Those timings look good. Can you add a test for timedelta where the overflow is raised? (for the cases where you added the check)

@jreback
Copy link
Contributor

jreback commented Dec 11, 2016

you didn't address my api comments

@jreback
Copy link
Contributor

jreback commented Dec 11, 2016

this is inconsistent with the rest of the code

@gfyoung
Copy link
Member Author

gfyoung commented Dec 11, 2016

@jreback : I did address your API comments here. It also explains why I don't think consistency is an issue here. I don't understand why you want to make this internal function "consistent," especially when it doesn't really make sense to align it with the rest of the other functions.

@gfyoung
Copy link
Member Author

gfyoung commented Dec 15, 2016

@jreback @jorisvandenbossche : Do you have anything more to add about this API consistency issue, especially in light of my response?

@gfyoung gfyoung force-pushed the checked-arr-add-is-nans branch from 7990eec to e57f3bb Compare December 15, 2016 16:09
@jorisvandenbossche
Copy link
Member

As a possible way out of the discussion: @jreback, as your main point was a lack of consistency with the rest of the functions in nanops.py, would the current version be more OK to you when we moved it to another location (and actually, this function is not really a "nanop" anyway, it is a adapted add to deal with integer overflow in the first place, not NaNs). Eg core/algorithms.py or somewhere else.

@jreback
Copy link
Contributor

jreback commented Dec 15, 2016

would be ok with a move to core/algorithms.py and api as currently (meaning the last iteration)

@gfyoung gfyoung force-pushed the checked-arr-add-is-nans branch 2 times, most recently from d238fbb to 59ba12b Compare December 15, 2016 18:51
@gfyoung
Copy link
Member Author

gfyoung commented Dec 15, 2016

Changed _checked_add_with_arr to checked_add_with_arr in core/algorithms.py since we are exposing it in core/api.py. Added whatsnew to v0.20.0 since this is technically an API change.

@gfyoung gfyoung force-pushed the checked-arr-add-is-nans branch from 59ba12b to 4f8a12d Compare December 15, 2016 18:53
@@ -4,7 +4,8 @@

import numpy as np

from pandas.core.algorithms import factorize, match, unique, value_counts
from pandas.core.algorithms import (checked_add_with_arr, factorize, match,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NO! this is not a top-level function, nor should be actually exposed to the user

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name is fine, but no reason to expose this

Copy link
Member Author

@gfyoung gfyoung Dec 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be ok with a move to core/algorithms.py and api as currently

Those were your words. Fair enough. Will remove.

@gfyoung gfyoung force-pushed the checked-arr-add-is-nans branch from 4f8a12d to 7a3e54e Compare December 15, 2016 18:59
@jreback
Copy link
Contributor

jreback commented Dec 15, 2016

@gfyoung moving to an internal pandas module and exposing as a public method are 2 very separate and distinct things.

@gfyoung
Copy link
Member Author

gfyoung commented Dec 15, 2016

@jreback : Yes, I am aware, so then what did you mean by "a move to core/algorithms and api" then?

@jreback
Copy link
Contributor

jreback commented Dec 15, 2016

api as you have currently, e.g. (a, b, a_mask, b_mask) (though I think you have something slightly different)

@gfyoung
Copy link
Member Author

gfyoung commented Dec 15, 2016

@jreback : Ah, I see. Signature is (arr, b, arr_mask=None, b_mask=None).

@gfyoung
Copy link
Member Author

gfyoung commented Dec 16, 2016

@jreback , @jorisvandenbossche : Moved checked add function to core/algorithms.py, and everything is still passing. Ready to merge if there are no other concerns.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Can you add a test for timedelta where the overflow is raised? (for the cases where you added the check)

  • Can you show the output of the Algorithms benchmark, just to be sure they run?


self.arr = np.arange(1000000)
self.arrpos = np.arange(1000000)
self.arrneg = np.arange(-1000000, 0)
self.arrmixed = np.array([1, -1]).repeat(500000)

self.arr_nan = np.random.choice([True, False], size=1000000)
self.arrmixed_nan = np.random.choice([True, False], size=1000000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between those two?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are the masks for self.arr and self.arrmixed respectively.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But they are defined exactly the same, that's the reason I am confused on why two different variables are needed and the mask is not just reused. (but they are of course both random, so not exactly the same)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fail to see your confusion here. Why re-use the mask when that won't be the case in reality (i.e. the masks will likely be different in most cases)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, no problem (I also merged BTW :-)), I was just skimming the code, and saw two apparantly identical lines of code, hence my comment (which was not needed)


def time_add_overflow_both_arg_nan(self):
self.checked_add(self.arr, self.arrmixed, arr_mask=self.arr_nan,
b_mask=self.arrmixed_arr_nan)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arrmixed_arr_nan should this be arrmixed_nan ? (above as well)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Will fix.

@@ -56,6 +56,7 @@ Bug Fixes



Bug in ``TimedeltaIndex`` addition where overflow was being allowed without error (:issue:`14816`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move to 0.20.0 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Helper function to broadcast arrays / scalars to the desired shape.

This function is compatible with different versions of NumPy and is
implemented for performance reasons.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"implemented for performance reasons" is a bit unclear I find. We broadcast for performance reasons, but the function is not implemented for that reason, that is to broadcast for older numpy versions.
I would just keep your original comment about performance and broadcasting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Done.

@@ -40,6 +41,94 @@
# top-level algos #
# --------------- #

def checked_add_with_arr(arr, b, arr_mask=None, b_mask=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this function somewhere more below in this file? As it does not belong to the "top-level algos" (the title just above this)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Expands checked-add array addition introduced in
pandas-devgh-14237 to include all other addition cases (i.e.
TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453.

In addition, move checked add function to core/algorithms.
@gfyoung
Copy link
Member Author

gfyoung commented Dec 17, 2016

@jorisvandenbossche :

· Discovering benchmarks
· Running 16 total benchmarks (1 commits * 1 environments * 16 benchmarks)
[  0.00%] ·· Building for existing-py_home_user_miniconda3_envs_pandas-dev_bin_python
[  0.00%] ·· Benchmarking existing-py_home_user_miniconda3_envs_pandas-dev_bin_python
[  6.25%] ··· Running ...ms.Algorithms.time_add_overflow_both_arg_nan    16.24ms
[ 12.50%] ··· Running ...s.Algorithms.time_add_overflow_first_arg_nan    15.98ms
[ 18.75%] ··· Running ...ithms.Algorithms.time_add_overflow_mixed_arr    16.67ms
[ 25.00%] ··· Running algorithms.Algorithms.time_add_overflow_neg_arr     8.93ms
[ 31.25%] ··· Running ...thms.Algorithms.time_add_overflow_neg_scalar     8.02ms
[ 37.50%] ··· Running algorithms.Algorithms.time_add_overflow_pos_arr     9.03ms
[ 43.75%] ··· Running ...thms.Algorithms.time_add_overflow_pos_scalar     8.07ms
[ 50.00%] ··· Running ....Algorithms.time_add_overflow_second_arg_nan    15.96ms
[ 56.25%] ··· Running ...hms.Algorithms.time_add_overflow_zero_scalar     8.12ms
[ 62.50%] ··· Running algorithms.Algorithms.time_duplicated_float        18.64ms
[ 68.75%] ··· Running algorithms.Algorithms.time_duplicated_int          15.32ms
[ 75.00%] ··· Running ...rithms.Algorithms.time_duplicated_int_unique    50.73μs
[ 81.25%] ··· Running algorithms.Algorithms.time_factorize_float          7.59ms
[ 87.50%] ··· Running algorithms.Algorithms.time_factorize_int            7.49ms
[ 93.75%] ··· Running algorithms.Algorithms.time_factorize_string        29.15ms
[100.00%] ··· Running algorithms.Algorithms.time_match_strings            1.91ms

@gfyoung gfyoung force-pushed the checked-arr-add-is-nans branch from 7a3e54e to a086db6 Compare December 17, 2016 01:51
@gfyoung
Copy link
Member Author

gfyoung commented Dec 17, 2016

@jreback , @jorisvandenbossche : Addressed all of the comments and ran the benchmarks as requested. Everything is green, so ready to merge if there is nothing else.

@jorisvandenbossche jorisvandenbossche merged commit bdbebc4 into pandas-dev:master Dec 17, 2016
@jorisvandenbossche
Copy link
Member

@gfyoung thanks!

@gfyoung gfyoung deleted the checked-arr-add-is-nans branch December 18, 2016 01:02
ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
Expands checked-add array addition introduced in
pandas-devgh-14237 to include all other addition cases (i.e.
TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453.

In addition, move checked add function to core/algorithms.
@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Dec 24, 2016
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
Expands checked-add array addition introduced in
pandas-devgh-14237 to include all other addition cases (i.e.
TimedeltaIndex and Timedelta). Follow-up to pandas-devgh-14453.

In addition, move checked add function to core/algorithms.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants