BUG: Fix np.inf + np.nan sum issue on groupby mean #52964


Merged
merged 22 commits into pandas-dev:main on May 23, 2023

Conversation

parthi-siva (Contributor)

@parthi-siva parthi-siva marked this pull request as draft April 27, 2023 15:30
@parthi-siva parthi-siva marked this pull request as ready for review April 27, 2023 15:38
@parthi-siva (Contributor, Author)

@jbrockmendel @rhshadrach please review. Please let me know if it's correct; then I'll write test cases.

@rhshadrach (Member) left a comment

This does not look correct to me; in particular, np.inf + np.nan should be np.nan and if I'm understanding this code it would come out to np.inf. In addition, this doesn't look like it handles -np.inf.

Instead, I would guess that when val is +/- np.inf, then compensation should be set to 0.

This additional check would make the current implementation more complex in the inner for loop, and I'm wondering if the additional numerical accuracy is worth the perf hit. But maybe branch prediction means the perf hit would be minimal in what I would expect to be the typical case (no inf).
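
A plain-Python sketch of the Kahan update in question (illustrative only, not the actual Cython implementation) shows how the compensation term poisons the running sum once an inf appears:

import math

def kahan_add(sumx, compensation, val):
    y = val - compensation
    t = sumx + y
    compensation = t - sumx - y  # inf - inf == nan once t is inf
    return t, compensation

s, c = kahan_add(0.0, 0.0, math.inf)  # s = inf, c = nan
s, c = kahan_add(s, c, 1.0)           # y = 1.0 - nan = nan, so s becomes nan
print(s)  # nan, although inf + 1.0 should stay inf

Resetting compensation to 0 whenever an inf enters the sum keeps the total at inf.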

@parthi-siva parthi-siva marked this pull request as draft April 28, 2023 02:05
@rhshadrach rhshadrach added Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 28, 2023
@parthi-siva (Contributor, Author)

Thanks @rhshadrach for the review.

We have to set compensation to 0 not only when val is +/- np.inf but also when sumx[lab, j] is +/- np.inf; only then are the results consistent.

something like

if val == np.inf or val == -np.inf or sumx[lab, j] == np.inf or sumx[lab, j] == -np.inf:
    compensation[lab, j] = 0.0
else:
    compensation[lab, j] = t - sumx[lab, j] - y

And yes, we are introducing an additional check that will impact performance. You are also right to ask whether the extra numerical accuracy is worth the performance hit.

What should we do? Is it okay to live with this bug, or should we fix it?

@rhshadrach (Member)

Can you benchmark the proposed fix?

@parthi-siva (Contributor, Author) commented Apr 28, 2023

Sure @rhshadrach. Can you help me with that? Could you direct me to any resource or docs on a standardized way to benchmark?

%timeit will be sufficient, right?

@rhshadrach (Member)

Yes - %timeit is a good approach. Generate some test data (e.g. with np.random.randint and np.random.rand) of sufficient size that the groupby operation takes a decent amount of time - maybe 80 ms or more. Then post the timings when you run it on main vs. this PR.

@parthi-siva (Contributor, Author) commented May 1, 2023

Instead of checking val for inf or -inf, we can check whether compensation is NaN:

if not isna_entry:
    nobs[lab, j] += 1
    y = val - compensation[lab, j]
    t = sumx[lab, j] + y
    compensation[lab, j] = t - sumx[lab, j] - y
    # compensation != compensation is True only when it is NaN
    if compensation[lab, j] != compensation[lab, j]:
        compensation[lab, j] = 0.0
    sumx[lab, j] = t
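
With the compensation reset in place, the group mean no longer depends on where np.inf appears (a minimal sketch of the behavior in the linked issue, not the PR's actual test; the key and value data are arbitrary):

import numpy as np
import pandas as pd

# Both groups hold the same values, with np.inf in different positions;
# both group means should be inf. Without the reset, the group whose
# inf comes first could produce nan instead.
df = pd.DataFrame(
    {
        "key": [0, 0, 0, 1, 1, 1],
        "val": [np.inf, 1.0, 1.0, 1.0, 1.0, np.inf],
    }
)
print(df.groupby("key")["val"].mean())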

Benchmark against Main Branch

In [5]: n = 5000000
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "z": [rand.randint(0, 1) for i in range(n)],
   ...:         "x1": [rand.randint(0, n) for i in range(n)],
   ...:         "x2": [rand.randint(0, n) for i in range(n)],
   ...:     }
   ...: )

In [6]: %timeit df.groupby("z").mean()
79.8 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df.groupby("z").mean()
79.9 ms ± 789 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit df.groupby("z").mean()
79.8 ms ± 614 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit df.groupby("z").mean()
80.5 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: %timeit df.groupby("z").mean()
80.1 ms ± 650 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: %timeit df.groupby("z").mean()
79.2 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit df.groupby("z").mean()
79.4 ms ± 674 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit df.groupby("z").mean()
79.4 ms ± 743 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: %timeit df.groupby("z").mean()
83.2 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Benchmark against the branch containing the proposed fix

In [9]: n = 5000000
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "z": [rand.randint(0, 1) for i in range(n)],
   ...:         "x1": [rand.randint(0, n) for i in range(n)],
   ...:         "x2": [rand.randint(0, n) for i in range(n)],
   ...:     }
   ...: )

In [10]: %timeit df.groupby("z").mean()
81.5 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: %timeit df.groupby("z").mean()
81.8 ms ± 891 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit df.groupby("z").mean()
81.5 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit df.groupby("z").mean()
82 ms ± 754 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: %timeit df.groupby("z").mean()
83.8 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [15]: %timeit df.groupby("z").mean()
81.3 ms ± 544 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: %timeit df.groupby("z").mean()
81.7 ms ± 768 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [17]: %timeit df.groupby("z").mean()
81.6 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [18]: %timeit df.groupby("z").mean()
81.7 ms ± 500 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The overhead is less than 2 ms, @rhshadrach.

@parthi-siva parthi-siva marked this pull request as ready for review May 1, 2023 09:45
@rhshadrach (Member) left a comment

Looks good, cc @mroeschke for a 2nd eye

@parthi-siva (Contributor, Author) left a comment

Done

@parthi-siva parthi-siva requested a review from rhshadrach May 2, 2023 02:37
@@ -1075,6 +1075,8 @@ def group_mean(
                 y = val - compensation[lab, j]
                 t = sumx[lab, j] + y
                 compensation[lab, j] = t - sumx[lab, j] - y
+                if compensation[lab, j] != compensation[lab, j]:
jbrockmendel (Member)

Is this to check for NaN? Can you use an explicit check? If not, please add a comment about what you're doing and why.

@parthi-siva (Contributor, Author) commented May 3, 2023

Hi @jbrockmendel, I tried to do it explicitly but it didn't work. I tried to use utils.is_nan, but since this block is nogil it doesn't compile. So I used this self-inequality check to detect NaN (that's what we do in utils.is_nan as well).
I will add an appropriate comment in the code.
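
For reference, the self-inequality trick relies on IEEE 754, where NaN is the only value not equal to itself (a quick illustration, separate from the PR):

import numpy as np

for x in (np.nan, np.inf, -np.inf, 0.0):
    print(x, x != x)
# Only np.nan prints True.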

group_mean(actual, counts, data, labels, is_datetimelike=False)

tm.assert_numpy_array_equal(
    actual, np.array([[np.inf, 3], [3, np.inf]], dtype="float64")
)
Member

nitpick: please define the expected on a previous line
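
What the nitpick asks for would look something like this (a sketch; the variable name is conventional, not taken from the PR):

expected = np.array([[np.inf, 3], [3, np.inf]], dtype="float64")
tm.assert_numpy_array_equal(actual, expected)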

parthi-siva (Contributor, Author)

Done

@parthi-siva parthi-siva requested a review from rhshadrach May 4, 2023 03:17
@parthi-siva parthi-siva requested a review from jbrockmendel May 7, 2023 06:23
@rhshadrach (Member)

@parthi-siva - looks like the 32-bit build is failing here.

https://github.com/pandas-dev/pandas/actions/runs/4911671351/jobs/8769921340?pr=52964#step:4:1956

@parthi-siva (Contributor, Author)

@rhshadrach The failing Linux 32-bit test is now fixed.

@rhshadrach (Member) left a comment

lgtm, cc @jbrockmendel - good here?

@rhshadrach rhshadrach added this to the 2.1 milestone May 17, 2023
@mroeschke mroeschke merged commit 04134d5 into pandas-dev:main May 23, 2023
@mroeschke (Member)

Thanks @parthi-siva

DeaMariaLeon pushed a commit to DeaMariaLeon/pandas that referenced this pull request May 24, 2023
* BUG: Fix np.inf + np.nan sum issue on groupby mean

* BUG: Change variable name

* TST: add test case to validate the fix

* Bug: Set Compensation to 0 when it is NaN

* TST: Fix failing test

* Remove Space

* Add Comments

* TST: assign expected to separate variable

* Update comment

* TST: Fix issue with Linux32 dtype ValueError

---------

Co-authored-by: Richard Shadrach <[email protected]>
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
Successfully merging this pull request may close these issues.

BUG: Groupby mean differs depending on where np.inf value was introduced