Fix Regression when using sum/cumsum on Groupby objects #44526


Closed

Conversation

@CloseChoice (Member)

NOTE: This will reduce performance significantly. Hopefully someone can point me to a better check for NaN that can be done with the GIL released.
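For what it's worth, one GIL-free way to test for NaN is self-comparison, since NaN is the only float that compares unequal to itself. A minimal Cython sketch (the helper name is illustrative, not from this PR):

```cython
from numpy cimport float64_t

cdef inline bint is_nan(float64_t x) nogil:
    # NaN is the only value for which x != x; this compiles down to
    # a plain C comparison, so the GIL can stay released.
    return x != x
```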

@mzeitlin11 (Member) left a review comment:

With how INF and NEGINF are defined here,

    cdef:
        float64_t INF = <float64_t>np.inf
        float64_t NEGINF = -INF

can we avoid the GIL by just doing an equality check?
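A sketch of that suggestion (helper name illustrative): because INF and NEGINF are plain C doubles fixed at module import, comparing against them needs no Python API and can run with the GIL released.

```cython
import numpy as np
from numpy cimport float64_t

cdef:
    float64_t INF = <float64_t>np.inf
    float64_t NEGINF = -INF

cdef inline bint is_inf(float64_t x) nogil:
    # INF and NEGINF are ordinary C doubles captured at import time,
    # so equality against them is a GIL-free C comparison.
    return x == INF or x == NEGINF
```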

@phofl (Member) commented Nov 19, 2021

Performance was reduced by introducing Kahan summation. I don't think we can reduce the per-element cost further; we have to find a way around the GIL.
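For context, Kahan (compensated) summation tracks a correction term that recovers the low-order bits lost to rounding, at the cost of extra arithmetic per element. A simplified sketch of the scheme, not the exact pandas kernel:

```cython
from numpy cimport float64_t

cdef float64_t kahan_sum(const float64_t[:] values) nogil:
    cdef:
        float64_t total = 0.0, comp = 0.0, y, t
        Py_ssize_t i
    for i in range(values.shape[0]):
        y = values[i] - comp      # apply the running correction
        t = total + y             # low-order bits of y can be lost here
        comp = (t - total) - y    # recover exactly what was lost
        total = t
    return total
```

This is also where the inf bug enters: once `total` is infinite, `(t - total) - y` evaluates to inf - inf = NaN, which then poisons every later step.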

@CloseChoice (Member, Author)

Got around the GIL. What remains now is the additional checks for np.nan and -np.nan in each iteration. But what are these failing CI checks about?

@mzeitlin11 (Member)

> Got around the GIL. What remains now is the additional checks for np.nan and -np.nan in each iteration. But what are these failing CI checks about?

When you see that, it's usually a build error. In this case:

pandas/_libs/groupby.c:12588:18: error: ‘__pyx_v_t’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
12588 |               if (__pyx_t_13) {
      |                  ^
pandas/_libs/groupby.c:12228:28: note: ‘__pyx_v_t’ was declared here
12228 |   __pyx_t_5numpy_float32_t __pyx_v_t;
      |                            ^~~~~~~~~
pandas/_libs/groupby.c: In function ‘__pyx_fuse_9__pyx_pw_6pandas_5_libs_7groupby_57group_cumsum’:
pandas/_libs/groupby.c:13372:18: error: ‘__pyx_v_t’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
13372 |               if (__pyx_t_13) {
      |                  ^
pandas/_libs/groupby.c:13012:28: note: ‘__pyx_v_t’ was declared here
13012 |   __pyx_t_5numpy_float64_t __pyx_v_t;
      |                            ^~~~~~~~~
cc1: all warnings being treated as errors
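The warning fires because the generated C variable for `t` is only assigned inside a conditional branch while -Werror treats maybe-uninitialized as fatal. A common fix on the Cython side is to give the variable a value at declaration; a sketch, with the variable name taken from the error above:

```cython
cdef float64_t t = 0.0  # initialized so the generated C never reads it unset
```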

@mzeitlin11 added the Groupby and Regression (functionality that used to work in a prior pandas version) labels on Nov 19, 2021
@@ -51,7 +51,14 @@ from pandas._libs.missing cimport checknull
cdef int64_t NPY_NAT = get_nat()
_int64_max = np.iinfo(np.int64).max

cdef float64_t NaN = <float64_t>np.NaN
cdef:
    float32_t MINfloat32 = np.NINF
Contributor: we have some common definitions of these elsewhere in a pxd

Contributor: FYI, these are being added independently here: #44522

@mzeitlin11 (Member)

What do the benchmarks look like here? Since we already have regression #39622 from using Kahan summation (and this adds decent complexity), I would like to propose returning to the previous simple summation for 1.3.5, which would fix both that regression and this one. Maybe we could expose Kahan summation as an optional keyword instead (which a user can simply avoid if their data has infs)?

@CloseChoice (Member, Author)

> What do the benchmarks look like here? Since we already have regression #39622 from using Kahan summation (and this adds decent complexity), I would like to propose returning to the previous simple summation for 1.3.5, which would fix both that regression and this one. Maybe we could expose Kahan summation as an optional keyword instead (which a user can simply avoid if their data has infs)?

Here are the performance measures:

import numpy as np
import pandas as pd
import timeit

arr = np.ones(1000000)
s = pd.Series(np.arange(len(arr)))
s = s / 10

%timeit -r7 -n1000 s.groupby(arr).sum()
# on this branch
# 15.4 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# on 1.3.4
# 15.4 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

idx = pd.date_range(start="1/1/2000", end="1/1/2001", freq="T")
s = pd.Series(np.random.randn(len(idx)), index=idx)
%timeit -r7 -n1000 s.resample("1D").mean()
# on this branch
# 4.67 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# on 1.3.4
# 4.61 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So there is a performance regression of 2-3%.
Maybe the best option is just to go back to plain summation and expose Kahan summation as a keyword. If a user opts into the slower but numerically more stable summation, I guess they can afford the extra time to get correct results for the special case with np.inf.
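To illustrate the proposal (the `kahan` keyword is hypothetical and was never implemented):

```python
# hypothetical opt-in keyword, not part of the pandas API
df.groupby(key).sum(kahan=True)   # slower, compensated summation
df.groupby(key).sum()             # default: plain summation
```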

@phofl (Member) commented Nov 22, 2021

Could you please run the asvs?

@CloseChoice (Member, Author)

After running the asv benchmark, there is a remarkable performance hit:

from asv_bench.benchmarks.groupby import *
self = GroupByCythonAgg()
self.setup("float64", "sum")
%timeit -r 30 -n 100 self.time_frame_agg("float64", "sum")
# this PR
# 29.7 ms ± 257 µs per loop (mean ± std. dev. of 30 runs, 100 loops each)
# master
# 20.1 ms ± 279 µs per loop (mean ± std. dev. of 30 runs, 100 loops each)

@simonjayhawkins (Member)

> What do the benchmarks look like here? Since we already have regression #39622 from using Kahan summation (and this adds decent complexity), I would like to propose returning to the previous simple summation for 1.3.5, which would fix both that regression and this one. Maybe we could expose Kahan summation as an optional keyword instead (which a user can simply avoid if their data has infs)?

needs a release note. adding the 1.3.5 milestone to match the linked issue.

thoughts on backporting these changes?

@jreback (Contributor) commented Nov 27, 2021

move this off 1.3.x; it's too late to change anything

@CloseChoice force-pushed the FIX-nan-inf-confusion branch from 27ef25b to 6f6da9a on November 28, 2021
@CloseChoice (Member, Author)

Thanks @jbrockmendel for pointing me to numpy's isinf function. This also improves the performance:

In [1]: from asv_bench.benchmarks.groupby import *

In [2]: self = GroupByCythonAgg()

In [3]: self.setup("float64", "sum")

In [4]: %timeit -r 30 -n 100 self.time_frame_agg("float64", "sum")
# This PR
# 25.3 ms ± 377 µs per loop (mean ± std. dev. of 30 runs, 100 loops each)
# master
# 19.8 ms ± 355 µs per loop (mean ± std. dev. of 30 runs, 100 loops each)
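For reference, C99's isinf is callable with the GIL released (Cython exposes it via libc.math). A minimal sketch of guarding the compensation term with it, assuming that is roughly where the check slots in; details may differ from the actual diff:

```cython
from libc.math cimport isinf
from numpy cimport float64_t

cdef inline float64_t safe_comp(float64_t t, float64_t total, float64_t y) nogil:
    # Skip the Kahan correction once the running sum hits +/-inf,
    # since (t - total) - y would otherwise produce inf - inf = NaN.
    if isinf(t):
        return 0.0
    return (t - total) - y
```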

github-actions bot commented Jan 8, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions bot added the Stale label on Jan 8, 2022
@mroeschke (Member)

Thanks for the PR, but it appears to have gone stale. If interested in continuing please merge the main branch and we can reopen.

@mroeschke closed this on Jan 25, 2022
Labels: Groupby · Regression (functionality that used to work in a prior pandas version) · Stale

Linked issue that merging this pull request may close:
BUG: inconsistent result when groupby then sum values that contain inf