Commit 85f2d92

Make group_mean compatible with NaT
NaT is the datetime equivalent of NaN and is represented by the lowest possible 64-bit integer, -(2**63). Previously, groupby.mean() calculations could not handle this value, which led to #43132.

At a high level, we slightly modify `group_mean` so that it does not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `_treat_as_na` to work with float64.

This PR adds an integration test and a unit test for the new functionality. In contrast to other tests in classes, I've tried to keep each individual test's scope as small as possible.

Additionally, I've taken the liberty to:

* Add a docstring for the group_mean algorithm.
* Change the algorithm to use guard clauses instead of else/if.
* Add a comment that we're using Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes #43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry => different format but it's there
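
The behaviour described above can be illustrated with a small pure-Python sketch. This is not the Cython implementation in `groupby.pyx`: the function name `group_mean_sketch`, its flat-list signature, and the `ngroups` argument are hypothetical and chosen for illustration only. It assumes, as the message states, that NaT is stored as -(2**63), and it uses guard clauses plus Kahan (compensated) summation in the spirit of the changes listed.

```python
NAT = -(2**63)  # integer sentinel for NaT, per the commit message


def group_mean_sketch(values, labels, ngroups, is_datetimelike=False):
    """Hypothetical helper: mean per group label, skipping NaN (and NaT)."""
    sums = [0.0] * ngroups
    comps = [0.0] * ngroups  # Kahan compensation term per group
    counts = [0] * ngroups

    for val, lab in zip(values, labels):
        # Guard clauses: skip anything that must not be counted.
        if lab < 0:
            continue  # assumption: a negative label means "no group"
        if val != val:
            continue  # NaN is the only value not equal to itself
        if is_datetimelike and val == NAT:
            continue  # NaT sentinel, only meaningful for datetimelike data

        # Kahan summation: carry the rounding error of each addition forward.
        y = val - comps[lab]
        t = sums[lab] + y
        comps[lab] = (t - sums[lab]) - y
        sums[lab] = t
        counts[lab] += 1

    # Groups without any valid observation yield NaN.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Two valid values and one NaT in a single group.
result = group_mean_sketch(
    [2.0, 4.0, float(NAT)], [0, 0, 0], ngroups=1, is_datetimelike=True
)
```

With `is_datetimelike=True` the sentinel is skipped and `result` is `[3.0]`; with the default `False` the sentinel is treated as an ordinary number and dominates the mean, which is the kind of failure the commit addresses.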
1 parent 8954bf1 commit 85f2d92

2 files changed: +7 -5 lines changed

pandas/_libs/groupby.pyx

+6 -3
@@ -676,16 +676,18 @@ def group_mean(floating[:, ::1] out,
                ndarray[floating, ndim=2] values,
                const intp_t[::1] labels,
                Py_ssize_t min_count=-1,
-               bint is_datetimelike = False) -> None:
+               bint is_datetimelike=False) -> None:
     """
-    Compute the mean per label given a label assignment for each value. NaN values are ignored.
+    Compute the mean per label given a label assignment for each value.
+    NaN values are ignored.
 
     Parameters
     ----------
     out : np.ndarray[floating]
         Values into which this method will write its results.
     counts : np.ndarray[int64]
-        A zeroed array of the same shape as labels, populated by group sizes during algorithm.
+        A zeroed array of the same shape as labels,
+        populated by group sizes during algorithm.
     values : np.ndarray[floating]
         2-d array of the values to find the mean of.
     labels : np.ndarray[np.intp]
@@ -750,6 +752,7 @@ def group_mean(floating[:, ::1] out,
                     continue
                 out[i, j] = sumx[i, j] / count
 
+
 @cython.wraparound(False)
 @cython.boundscheck(False)
 def group_ohlc(floating[:, ::1] out,

pandas/tests/groupby/test_libgroupby.py

+1 -2
@@ -238,12 +238,11 @@ def test_cython_group_transform_algos():
 
 
 def test_cython_group_mean_timedelta():
-    is_datetimelike = True
     actual = np.zeros(shape=(1, 1), dtype="float64")
     counts = np.array([0], dtype="int64")
     data = (
         np.array(
-            [np.datetime64(2, "ns"), np.datetime64(4, "ns"), np.datetime64("NaT")],
+            [np.timedelta64(2, "ns"), np.timedelta64(4, "ns"), np.timedelta64("NaT")],
             dtype="m8[ns]",
         )[:, None]
         .view("int64")
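
To see what this test input looks like once it is viewed as integers, here is a short NumPy-only sketch. It does not call `group_mean` itself; the variable names are illustrative, and the masking step is a stand-in for the NaT handling, not the library's code path.

```python
import numpy as np

# Same values as the test input: two timedeltas and one NaT.
data = np.array(
    [np.timedelta64(2, "ns"), np.timedelta64(4, "ns"), np.timedelta64("NaT")],
    dtype="m8[ns]",
)

# Viewed as int64, NaT appears as the smallest 64-bit integer.
as_int = data.view("int64")
nat_value = np.iinfo(np.int64).min

# Averaging the raw integers lets the NaT sentinel dominate the result ...
naive_mean = as_int.astype("float64").mean()

# ... while masking NaT first averages only the valid entries.
valid = as_int[~np.isnat(data)]
masked_mean = valid.astype("float64").mean()
```

Here `as_int[2]` equals `nat_value`, that is -(2**63), and `masked_mean` is 3.0, the mean of the 2 ns and 4 ns entries.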
