-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Make group_mean compatible with NaT #43467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
970f302
177f1ab
9c5228c
eb3134c
ea26fb9
bb70636
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -673,10 +673,45 @@ def group_mean(floating[:, ::1] out, | |
int64_t[::1] counts, | ||
ndarray[floating, ndim=2] values, | ||
const intp_t[::1] labels, | ||
Py_ssize_t min_count=-1) -> None: | ||
Py_ssize_t min_count=-1, | ||
bint is_datetimelike=False, | ||
const uint8_t[:, ::1] mask=None, | ||
uint8_t[:, ::1] result_mask=None | ||
) -> None: | ||
""" | ||
Compute the mean per label given a label assignment for each value. | ||
NaN values are ignored. | ||
|
||
Parameters | ||
---------- | ||
out : np.ndarray[floating] | ||
Values into which this method will write its results. | ||
counts : np.ndarray[int64] | ||
A zeroed array of the same shape as labels, | ||
populated by group sizes during algorithm. | ||
values : np.ndarray[floating] | ||
2-d array of the values to find the mean of. | ||
labels : np.ndarray[np.intp] | ||
Array containing unique label for each group, with its | ||
ordering matching up to the corresponding record in `values`. | ||
min_count : Py_ssize_t | ||
Only used in add and prod. Always -1. | ||
is_datetimelike : bool | ||
True if `values` contains datetime-like entries. | ||
mask : ndarray[bool, ndim=2], optional | ||
Not used. | ||
result_mask : ndarray[bool, ndim=2], optional | ||
Not used. | ||
|
||
Notes | ||
----- | ||
This method modifies the `out` parameter rather than returning an object. | ||
`counts` is modified to hold group sizes | ||
""" | ||
|
||
cdef: | ||
Py_ssize_t i, j, N, K, lab, ncounts = len(counts) | ||
floating val, count, y, t | ||
floating val, count, y, t, nan_val | ||
floating[:, ::1] sumx, compensation | ||
int64_t[:, ::1] nobs | ||
Py_ssize_t len_values = len(values), len_labels = len(labels) | ||
|
@@ -686,12 +721,13 @@ def group_mean(floating[:, ::1] out, | |
if len_values != len_labels: | ||
raise ValueError("len(index) != len(labels)") | ||
|
||
nobs = np.zeros((<object>out).shape, dtype=np.int64) | ||
# the below is equivalent to `np.zeros_like(out)` but faster | ||
nobs = np.zeros((<object>out).shape, dtype=np.int64) | ||
sumx = np.zeros((<object>out).shape, dtype=(<object>out).base.dtype) | ||
compensation = np.zeros((<object>out).shape, dtype=(<object>out).base.dtype) | ||
|
||
N, K = (<object>values).shape | ||
nan_val = NPY_NAT if is_datetimelike else NAN | ||
mzeitlin11 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
with nogil: | ||
for i in range(N): | ||
|
@@ -703,7 +739,7 @@ def group_mean(floating[:, ::1] out, | |
for j in range(K): | ||
val = values[i, j] | ||
# not nan | ||
if val == val: | ||
if val == val and not (is_datetimelike and val == NPY_NAT): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Similar comment here - would rather be explicit about the cast that's occurring to make this work (eg hopefully this is int equality, and not float equality since comparing floats is problematic). On a brief investigation, one potential flaw here is that with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We have an implicit guarantee that datetimelike arrays will be returned as We can prevent any problems via asserting that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think that is possible because the type A test for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If there's an issue with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I was wrong. There is a We could change that path through the code to not do the cast for is_datetimelike but that would necessitate changing the input values data type to a union type which includes int64. Re-using There is some risk through the union data type though, If I look at the algorithm of What do you think? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
nobs[lab, j] += 1 | ||
y = val - compensation[lab, j] | ||
t = sumx[lab, j] + y | ||
|
@@ -714,7 +750,7 @@ def group_mean(floating[:, ::1] out, | |
for j in range(K): | ||
count = nobs[i, j] | ||
if nobs[i, j] == 0: | ||
out[i, j] = NAN | ||
out[i, j] = nan_val | ||
else: | ||
AlexeyGy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
out[i, j] = sumx[i, j] / count | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why adding these if they arent used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, so you can use the same call from the python code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, originally, I did not want to add useless extra params but after this comment here remarked that an additional path in the code is not wanted, I did not see any other way than to add these parameters here.
I also thought about actually using the params but that has nothing to do with the original scope of this PR so I chose not to invest more time here.