Fix Regression when using sum/cumsum on Groupby objects #44526


Closed

Conversation

@CloseChoice (Member)

NOTE: This will reduce performance significantly. Hopefully someone can point me to a better check for NaN that can be done with the GIL released.
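For what it's worth, one GIL-free way to test for NaN is self-comparison, since NaN is the only float that compares unequal to itself. A minimal Cython sketch (the helper name is illustrative, not from this PR):

```cython
from numpy cimport float64_t

cdef inline bint is_nan(float64_t x) nogil:
    # NaN is the only value for which x != x; this compiles down to
    # a plain C comparison, so the GIL can stay released.
    return x != x
```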

@mzeitlin11 (Member) left a review comment:

With how INF and NEGINF are defined here,

    cdef:
        float64_t INF = <float64_t>np.inf
        float64_t NEGINF = -INF

can we avoid the GIL by just doing an equality check?
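A sketch of that suggestion (helper name illustrative): because INF and NEGINF are plain C doubles fixed at module import, comparing against them needs no Python API and can run with the GIL released.

```cython
import numpy as np
from numpy cimport float64_t

cdef:
    float64_t INF = <float64_t>np.inf
    float64_t NEGINF = -INF

cdef inline bint is_inf(float64_t x) nogil:
    # INF and NEGINF are ordinary C doubles captured at import time,
    # so equality against them is a GIL-free C comparison.
    return x == INF or x == NEGINF
```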

@phofl (Member) commented Nov 19, 2021

Performance was reduced by introducing Kahan summation. I don't think we can reduce the per-element cost further; we have to find a way around the GIL.
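For context, Kahan (compensated) summation tracks a correction term that recovers the low-order bits lost to rounding, at the cost of extra arithmetic per element. A simplified sketch of the scheme, not the exact pandas kernel:

```cython
from numpy cimport float64_t

cdef float64_t kahan_sum(const float64_t[:] values) nogil:
    cdef:
        float64_t total = 0.0, comp = 0.0, y, t
        Py_ssize_t i
    for i in range(values.shape[0]):
        y = values[i] - comp      # apply the running correction
        t = total + y             # low-order bits of y can be lost here
        comp = (t - total) - y    # recover exactly what was lost
        total = t
    return total
```

This is also where the inf bug enters: once `total` is infinite, `(t - total) - y` evaluates to inf - inf = NaN, which then poisons every later step.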

@CloseChoice (Member, Author)

Got around the GIL. What remains now is the additional checks for np.nan and -np.nan in each iteration. But what are these failing CI checks about?

@mzeitlin11 (Member)

> Got around the GIL. What remains now is the additional checks for np.nan and -np.nan in each iteration. But what are these failing CI checks about?

When you see that, it's usually a build error. In this case:

pandas/_libs/groupby.c:12588:18: error: ‘__pyx_v_t’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
12588 |               if (__pyx_t_13) {
      |                  ^
pandas/_libs/groupby.c:12228:28: note: ‘__pyx_v_t’ was declared here
12228 |   __pyx_t_5numpy_float32_t __pyx_v_t;
      |                            ^~~~~~~~~
pandas/_libs/groupby.c: In function ‘__pyx_fuse_9__pyx_pw_6pandas_5_libs_7groupby_57group_cumsum’:
pandas/_libs/groupby.c:13372:18: error: ‘__pyx_v_t’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
13372 |               if (__pyx_t_13) {
      |                  ^
pandas/_libs/groupby.c:13012:28: note: ‘__pyx_v_t’ was declared here
13012 |   __pyx_t_5numpy_float64_t __pyx_v_t;
      |                            ^~~~~~~~~
cc1: all warnings being treated as errors
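The warning fires because the generated C variable for `t` is only assigned inside a conditional branch while -Werror treats maybe-uninitialized as fatal. A common fix on the Cython side is to give the variable a value at declaration; a sketch, with the variable name taken from the error above:

```cython
cdef float64_t t = 0.0  # initialized so the generated C never reads it unset
```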

@mzeitlin11 added the Groupby and Regression (functionality that used to work in a prior pandas version) labels on Nov 19, 2021
@@ -51,7 +51,14 @@ from pandas._libs.missing cimport checknull
cdef int64_t NPY_NAT = get_nat()
_int64_max = np.iinfo(np.int64).max

cdef float64_t NaN = <float64_t>np.NaN
cdef:
    float32_t MINfloat32 = np.NINF
Contributor: we have some common definitions of these elsewhere in a pxd

Contributor: FYI, these are being added independently here: #44522

@mzeitlin11 (Member)

What do the benchmarks look like here? Since we already have regression #39622 from using Kahan summation (and this adds decent complexity), I would like to propose returning to the previous simple summation for 1.3.5, which would fix both that regression and this one. Maybe we could expose Kahan summation as an optional keyword instead (which a user can simply avoid if their data has infs)?

@CloseChoice (Member, Author)

> What do the benchmarks look like here? Since we already have regression #39622 from using Kahan summation (and this adds decent complexity), I would like to propose returning to the previous simple summation for 1.3.5, which would fix both that regression and this one. Maybe we could expose Kahan summation as an optional keyword instead (which a user can simply avoid if their data has infs)?

Here are the performance measures:

import numpy as np
import pandas as pd
import timeit

arr = np.ones(1000000)
s = pd.Series(np.arange(len(arr)))
s = s / 10

%timeit -r7 -n1000 s.groupby(arr).sum()
# on this branch
# 15.4 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# on 1.3.4
# 15.4 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

idx = pd.date_range(start="1/1/2000", end="1/1/2001", freq="T")
s = pd.Series(np.random.randn(len(idx)), index=idx)
%timeit -r7 -n1000 s.resample("1D").mean()
# on this branch
# 4.67 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# on 1.3.4
# 4.61 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So there is a performance regression of 2-3%.
Maybe the best option is just to go back to plain summation and expose Kahan summation as a keyword. If a user opts into the slower but numerically more stable summation, I guess they can afford the extra time to get correct results for the special case with np.inf.
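To illustrate the proposal (the `kahan` keyword is hypothetical and was never implemented):

```python
# hypothetical opt-in keyword, not part of the pandas API
df.groupby(key).sum(kahan=True)   # slower, compensated summation
df.groupby(key).sum()             # default: plain summation
```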

@phofl (Member) commented Nov 22, 2021

Could you please run the asvs?

@CloseChoice (Member, Author)

After running the asv benchmark, there is a remarkable performance hit:

from asv_bench.benchmarks.groupby import *
self = GroupByCythonAgg()
self.setup("float64", "sum")
%timeit -r 30 -n 100 self.time_frame_agg("float64", "sum")
# this PR
# 29.7 ms ± 257 µs per loop (mean ± std. dev. of 30 runs, 100 loops each)
# master
# 20.1 ms ± 279 µs per loop (mean ± std. dev. of 30 runs, 100 loops each)

@simonjayhawkins (Member)

> What do the benchmarks look like here? Since we already have regression #39622 from using Kahan summation (and this adds decent complexity), I would like to propose returning to the previous simple summation for 1.3.5, which would fix both that regression and this one. Maybe we could expose Kahan summation as an optional keyword instead (which a user can simply avoid if their data has infs)?

needs a release note. adding the 1.3.5 milestone to match the linked issue.

thoughts on backporting these changes?

@jreback (Contributor) commented Nov 27, 2021

move this off 1.3.x; it's too late to change anything

@CloseChoice force-pushed the FIX-nan-inf-confusion branch from 27ef25b to 6f6da9a on November 28, 2021
@CloseChoice (Member, Author)

Thanks @jbrockmendel for pointing me to numpy's isinf function. This also improves the performance:

In [1]: from asv_bench.benchmarks.groupby import *

In [2]: self = GroupByCythonAgg()

In [3]: self.setup("float64", "sum")

In [4]: %timeit -r 30 -n 100 self.time_frame_agg("float64", "sum")
# This PR
# 25.3 ms ± 377 µs per loop (mean ± std. dev. of 30 runs, 100 loops each)
# master
# 19.8 ms ± 355 µs per loop (mean ± std. dev. of 30 runs, 100 loops each)
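For reference, C99's isinf is callable with the GIL released (Cython exposes it via libc.math). A minimal sketch of guarding the compensation term with it, assuming that is roughly where the check slots in; details may differ from the actual diff:

```cython
from libc.math cimport isinf
from numpy cimport float64_t

cdef inline float64_t safe_comp(float64_t t, float64_t total, float64_t y) nogil:
    # Skip the Kahan correction once the running sum hits +/-inf,
    # since (t - total) - y would otherwise produce inf - inf = NaN.
    if isinf(t):
        return 0.0
    return (t - total) - y
```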

github-actions bot commented Jan 8, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions bot added the Stale label on Jan 8, 2022
@mroeschke (Member)

Thanks for the PR, but it appears to have gone stale. If interested in continuing please merge the main branch and we can reopen.

@mroeschke closed this on Jan 25, 2022
Labels: Groupby · Regression (functionality that used to work in a prior pandas version) · Stale

Linked issue that merging this pull request may close:
BUG: inconsistent result when groupby then sum values that contain inf