Make group_mean compatible with NaT

AlexeyGy · AlexeyGy · commit 530802d26792 · 2021-09-10T18:58:36.000Z
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64-bit integer, -(2**63). Previously, we could not support this value in any groupby.mean() calculations, which led to pandas-dev#43132.

On a high level, we slightly modify `group_mean` so that it does not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `_treat_as_na` to work with float64.

This PR adds an integration test and a unit test for the new functionality. In contrast to other tests in these classes, I've tried to keep each individual test's scope as small as possible.

Additionally, I've taken the liberty to:

* Add a docstring for the group_mean algorithm.
* Change the algorithm to use guard clauses instead of else/if.
* Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file).

- [x] closes pandas-dev#43132
- [x] tests added / passed
- [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
- [x] whatsnew entry => different format but it's there
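For readers unfamiliar with the pieces mentioned above, the idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the Cython code in this PR: `group_mean_sketch`, `NAT_AS_FLOAT`, and the `ngroups` argument are hypothetical names, and the real `group_mean` has a different signature (it writes into preallocated `out` and `counts` arrays). The sketch shows the three ingredients: NaT viewed as int64 equals -(2**63), a guard clause skips it when `is_datetimelike` is set, and the running sum uses Kahan compensation.

```python
import numpy as np

# NaT viewed as int64 is the smallest 64-bit integer, -(2**63).
nat_as_int = np.array(["NaT"], dtype="datetime64[ns]").view("int64")[0]
assert nat_as_int == np.iinfo(np.int64).min

# -(2**63) is exactly representable as a float64, so it survives the cast.
NAT_AS_FLOAT = float(np.iinfo(np.int64).min)  # hypothetical helper constant


def group_mean_sketch(values, labels, ngroups, is_datetimelike=False):
    """Per-group mean of a 1-D float64 array (illustrative sketch only).

    NaN is always skipped; the NaT sentinel is skipped only when
    is_datetimelike is True. Groups without valid values yield NaN.
    """
    sums = np.zeros(ngroups, dtype="float64")
    comp = np.zeros(ngroups, dtype="float64")  # Kahan compensation terms
    counts = np.zeros(ngroups, dtype="int64")

    for val, lab in zip(values, labels):
        if lab < 0:
            continue  # guard clause: row belongs to no group
        if np.isnan(val):
            continue  # guard clause: missing float value
        if is_datetimelike and val == NAT_AS_FLOAT:
            continue  # guard clause: NaT must not be counted
        counts[lab] += 1
        # Kahan summation: carry the rounding error of each addition forward.
        y = val - comp[lab]
        t = sums[lab] + y
        comp[lab] = (t - sums[lab]) - y
        sums[lab] = t

    out = np.full(ngroups, np.nan)
    valid = counts > 0
    out[valid] = sums[valid] / counts[valid]
    return out


data = np.array([1.0, NAT_AS_FLOAT, 3.0])
labels = np.zeros(len(data), dtype=np.intp)
skipping_nat = group_mean_sketch(data, labels, 1, is_datetimelike=True)
counting_nat = group_mean_sketch(data, labels, 1, is_datetimelike=False)
```

With `is_datetimelike=True` the sentinel is ignored and the mean is taken over the two remaining values; without the flag, the sentinel is treated as an ordinary number and dominates the result, which is the kind of failure reported in pandas-dev#43132.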
diff --git a/pandas/tests/groupby/test_libgroupby.py b/pandas/tests/groupby/test_libgroupby.py
@@ -248,7 +248,7 @@ def test_cython_group_mean_timedelta():
         .view("int64")
         .astype("float64")
     )
-    labels = np.zeros(len(data), dtype="int64")
+    labels = np.zeros(len(data), np.intp)
 
     group_mean(actual, counts, data, labels, is_datetimelike=True)
 
