-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix np.inf + np.nan sum issue on groupby mean #52964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jbrockmendel @rhshadrach please review. Please let me know if it 's correct. Then I'll write test cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not look correct to me; in particular, np.inf + np.nan
should be np.nan
and if I'm understanding this code it would come out to np.inf
. In addition, this doesn't look like it handles -np.inf
.
Instead, I would guess that when val
is +/- np.inf, then compensation
should be set to 0.
This additional check would make the current implementation more complex in the inner for loop, and I'm wondering if the additional numerical accuracy is worth the perf hit. But maybe branch prediction means the perf hit would be minimal in what I would expect to be the typical case (no inf).
Thanks @rhshadrach for the review. We have to make something like
And yeah we are introducing additional check which will impact the performance. Also you are right about fact that sacrificing performance for numerical accuracy is worth or not. What should we do? Is it okay that we live with this bug ? or fix it ? |
Can you benchmark the proposed fix? |
sure @rhshadrach.. Can you help me with that..? direct me to any resource or docs on standardized way to do benchmark ?
|
Yes - |
instead of checking val for inf or -inf we can check for compensation is nan or not:
Benchmark against Main Branch
Benchmark against branch which contains proposed fix
The overhead is less than 2 ms @rhshadrach |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, cc @mroeschke for a 2nd eye
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -1075,6 +1075,8 @@ def group_mean( | |||
y = val - compensation[lab, j] | |||
t = sumx[lab, j] + y | |||
compensation[lab, j] = t - sumx[lab, j] - y | |||
if compensation[lab, j] != compensation[lab, j]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is to check for nan? can you use an explicit check? If not, please add a comment about what you're doing and why
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jbrockmendel, I tried do it explicitly but it is not working. I tried to use utils.is_nan
since it is no gil
it is not compiling. So I just used this way to determine Nan (That's what we are doing in utils.is_nan
as well.)
I will add appropriate comment in the code
group_mean(actual, counts, data, labels, is_datetimelike=False) | ||
|
||
tm.assert_numpy_array_equal( | ||
actual, np.array([[np.inf, 3], [3, np.inf]], dtype="float64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: please define the expected on a previous line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@parthi-siva - looks like the 32-bit is failing here. |
@rhshadrach Linux 32-Bit test failing issue fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, cc @jbrockmendel - good here?
Thanks @parthi-siva |
* BUG: Fix np.inf + np.nan sum issue on groupby mean * BUG: Change variable name * TST: add test case to validate the fix * Bug: Set Compensation to 0 when it is NaN * TST: Fix failing test * Remove Space * Add Comments * TST: assign expected to seperate variable * Update comment * TST: Fix issue with Linux32 dtype ValueError --------- Co-authored-by: Richard Shadrach <[email protected]>
* BUG: Fix np.inf + np.nan sum issue on groupby mean * BUG: Change variable name * TST: add test case to validate the fix * Bug: Set Compensation to 0 when it is NaN * TST: Fix failing test * Remove Space * Add Comments * TST: assign expected to seperate variable * Update comment * TST: Fix issue with Linux32 dtype ValueError --------- Co-authored-by: Richard Shadrach <[email protected]>
np.inf
value was introduced #50367