BUG: Fix np.inf + np.nan sum issue on groupby mean #52964


Merged
merged 22 commits into pandas-dev:main on May 23, 2023

Conversation

parthi-siva (Contributor)

@parthi-siva parthi-siva marked this pull request as draft April 27, 2023 15:30
@parthi-siva parthi-siva marked this pull request as ready for review April 27, 2023 15:38
@parthi-siva (Contributor, Author)

@jbrockmendel @rhshadrach please review. Please let me know if it's correct; then I'll write test cases.

@rhshadrach (Member) left a comment

This does not look correct to me; in particular, np.inf + np.nan should be np.nan and if I'm understanding this code it would come out to np.inf. In addition, this doesn't look like it handles -np.inf.

Instead, I would guess that when val is +/- np.inf, then compensation should be set to 0.

This additional check would make the current implementation more complex in the inner for loop, and I'm wondering if the additional numerical accuracy is worth the perf hit. But maybe branch prediction means the perf hit would be minimal in what I would expect to be the typical case (no inf).
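
A plain-Python sketch of the Kahan update in question (illustrative only, not the actual Cython implementation) shows how the compensation term poisons the running sum once an inf appears:

import math

def kahan_add(sumx, compensation, val):
    y = val - compensation
    t = sumx + y
    compensation = t - sumx - y  # inf - inf == nan once t is inf
    return t, compensation

s, c = kahan_add(0.0, 0.0, math.inf)  # s = inf, c = nan
s, c = kahan_add(s, c, 1.0)           # y = 1.0 - nan = nan, so s becomes nan
print(s)  # nan, although inf + 1.0 should stay inf

Resetting compensation to 0 whenever an inf enters the sum keeps the total at inf.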

@parthi-siva parthi-siva marked this pull request as draft April 28, 2023 02:05
@rhshadrach rhshadrach added Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 28, 2023
@parthi-siva (Contributor, Author)

Thanks @rhshadrach for the review.

We have to set compensation to 0 not only when val is +/- np.inf but also when sumx[lab, j] is +/- np.inf; only then are the results consistent.

something like

if val == np.inf or val == -np.inf or sumx[lab, j] == np.inf or sumx[lab, j] == -np.inf:
    compensation[lab, j] = 0.0
else:
    compensation[lab, j] = t - sumx[lab, j] - y

And yes, we are introducing an additional check that will impact performance. You are also right to ask whether the extra numerical accuracy is worth the performance hit.

What should we do? Is it okay to live with this bug, or should we fix it?

@rhshadrach (Member)

Can you benchmark the proposed fix?

@parthi-siva (Contributor, Author) commented Apr 28, 2023

Sure @rhshadrach. Can you help me with that? Could you direct me to any resource or docs on a standardized way to benchmark?

%timeit will be sufficient, right?

@rhshadrach (Member)

Yes - %timeit is a good approach. Generate some test data (e.g. with np.random.randint and np.random.rand) of sufficient size that the groupby operation takes a decent amount of time - maybe 80 ms or more. Then post the timings when you run it on main vs. this PR.

@parthi-siva (Contributor, Author) commented May 1, 2023

Instead of checking val for inf or -inf, we can check whether compensation is NaN:

if not isna_entry:
    nobs[lab, j] += 1
    y = val - compensation[lab, j]
    t = sumx[lab, j] + y
    compensation[lab, j] = t - sumx[lab, j] - y
    # compensation != compensation is True only when it is NaN
    if compensation[lab, j] != compensation[lab, j]:
        compensation[lab, j] = 0.0
    sumx[lab, j] = t
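
With the compensation reset in place, the group mean no longer depends on where np.inf appears (a minimal sketch of the behavior in the linked issue, not the PR's actual test; the key and value data are arbitrary):

import numpy as np
import pandas as pd

# Both groups hold the same values, with np.inf in different positions;
# both group means should be inf. Without the reset, the group whose
# inf comes first could produce nan instead.
df = pd.DataFrame(
    {
        "key": [0, 0, 0, 1, 1, 1],
        "val": [np.inf, 1.0, 1.0, 1.0, 1.0, np.inf],
    }
)
print(df.groupby("key")["val"].mean())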

Benchmark against Main Branch

In [5]: n = 5000000
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "z": [rand.randint(0, 1) for i in range(n)],
   ...:         "x1": [rand.randint(0, n) for i in range(n)],
   ...:         "x2": [rand.randint(0, n) for i in range(n)],
   ...:     }
   ...: )

In [6]: %timeit df.groupby("z").mean()
79.8 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df.groupby("z").mean()
79.9 ms ± 789 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit df.groupby("z").mean()
79.8 ms ± 614 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit df.groupby("z").mean()
80.5 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: %timeit df.groupby("z").mean()
80.1 ms ± 650 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: %timeit df.groupby("z").mean()
79.2 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit df.groupby("z").mean()
79.4 ms ± 674 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit df.groupby("z").mean()
79.4 ms ± 743 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: %timeit df.groupby("z").mean()
83.2 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Benchmark against the branch containing the proposed fix

In [9]: n = 5000000
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "z": [rand.randint(0, 1) for i in range(n)],
   ...:         "x1": [rand.randint(0, n) for i in range(n)],
   ...:         "x2": [rand.randint(0, n) for i in range(n)],
   ...:     }
   ...: )

In [10]: %timeit df.groupby("z").mean()
81.5 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: %timeit df.groupby("z").mean()
81.8 ms ± 891 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit df.groupby("z").mean()
81.5 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit df.groupby("z").mean()
82 ms ± 754 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: %timeit df.groupby("z").mean()
83.8 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [15]: %timeit df.groupby("z").mean()
81.3 ms ± 544 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: %timeit df.groupby("z").mean()
81.7 ms ± 768 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [17]: %timeit df.groupby("z").mean()
81.6 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [18]: %timeit df.groupby("z").mean()
81.7 ms ± 500 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The overhead is less than 2 ms, @rhshadrach.

@parthi-siva parthi-siva marked this pull request as ready for review May 1, 2023 09:45
@rhshadrach (Member) left a comment

Looks good, cc @mroeschke for a 2nd eye

@parthi-siva (Contributor, Author) left a comment

Done

@parthi-siva parthi-siva requested a review from rhshadrach May 2, 2023 02:37
@@ -1075,6 +1075,8 @@ def group_mean(
                 y = val - compensation[lab, j]
                 t = sumx[lab, j] + y
                 compensation[lab, j] = t - sumx[lab, j] - y
+                if compensation[lab, j] != compensation[lab, j]:
jbrockmendel (Member)

Is this to check for NaN? Can you use an explicit check? If not, please add a comment about what you're doing and why.

@parthi-siva (Contributor, Author) commented May 3, 2023

Hi @jbrockmendel, I tried to do it explicitly but it didn't work. I tried to use utils.is_nan, but since this block is nogil it doesn't compile. So I used this self-inequality check to detect NaN (that's what we do in utils.is_nan as well).
I will add an appropriate comment in the code.
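
For reference, the self-inequality trick relies on IEEE 754, where NaN is the only value not equal to itself (a quick illustration, separate from the PR):

import numpy as np

for x in (np.nan, np.inf, -np.inf, 0.0):
    print(x, x != x)
# Only np.nan prints True.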

group_mean(actual, counts, data, labels, is_datetimelike=False)

tm.assert_numpy_array_equal(
    actual, np.array([[np.inf, 3], [3, np.inf]], dtype="float64")
)
Member

nitpick: please define the expected on a previous line
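
What the nitpick asks for would look something like this (a sketch; the variable name is conventional, not taken from the PR):

expected = np.array([[np.inf, 3], [3, np.inf]], dtype="float64")
tm.assert_numpy_array_equal(actual, expected)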

parthi-siva (Contributor, Author)

Done

@parthi-siva parthi-siva requested a review from rhshadrach May 4, 2023 03:17
@parthi-siva parthi-siva requested a review from jbrockmendel May 7, 2023 06:23
@rhshadrach (Member)

@parthi-siva - looks like the 32-bit build is failing here.

https://github.com/pandas-dev/pandas/actions/runs/4911671351/jobs/8769921340?pr=52964#step:4:1956

@parthi-siva (Contributor, Author)

@rhshadrach The failing Linux 32-bit test is now fixed.

@rhshadrach (Member) left a comment

lgtm, cc @jbrockmendel - good here?

@rhshadrach rhshadrach added this to the 2.1 milestone May 17, 2023
@mroeschke mroeschke merged commit 04134d5 into pandas-dev:main May 23, 2023
@mroeschke (Member)

Thanks @parthi-siva

DeaMariaLeon pushed a commit to DeaMariaLeon/pandas that referenced this pull request May 24, 2023
* BUG: Fix np.inf + np.nan sum issue on groupby mean

* BUG: Change variable name

* TST: add test case to validate the fix

* Bug: Set Compensation to 0 when it is NaN

* TST: Fix failing test

* Remove Space

* Add Comments

* TST: assign expected to separate variable

* Update comment

* TST: Fix issue with Linux32 dtype ValueError

---------

Co-authored-by: Richard Shadrach <[email protected]>
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
Successfully merging this pull request may close these issues.

BUG: Groupby mean differs depending on where np.inf value was introduced