-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Groupby quantiles incorrect bins #33200 #33644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
15a27ea
8e6af0e
5832ba9
662c102
e70e3ca
b0c309b
131a8c1
d18bdc4
e7c0eac
07fa080
24779c6
c6f41a4
6b28d91
f21e332
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -778,47 +778,48 @@ def group_quantile(ndarray[float64_t] out, | |
if not mask[i]: | ||
non_na_counts[lab] += 1 | ||
|
||
# Get an index of values sorted by labels and then values | ||
order = (values, labels) | ||
sort_arr = np.lexsort(order).astype(np.int64, copy=False) | ||
|
||
with nogil: | ||
for i in range(ngroups): | ||
# Figure out how many group elements there are | ||
grp_sz = counts[i] | ||
non_na_sz = non_na_counts[i] | ||
if labels.any(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is this required for? Do we have a test case when this is False? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Normally, if labels are not provided or the dataframe is empty, an error message will rise when applying the quantile. However, there are cases where certain operations lead to empty labels, as when time resampling some types of empty dataframe: Here is an example of the pipeline: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this accounted for in the test that you've created? If not, can you add a test for it? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. add two cases that take the if labels.any(): is False path.
otherwise labels.max() can raise |
||
# Get an index of values sorted by labels and then values | ||
labels[labels==-1] = np.max(labels) + 1 | ||
mabelvj marked this conversation as resolved.
Show resolved
Hide resolved
mabelvj marked this conversation as resolved.
Show resolved
Hide resolved
|
||
order = (values, labels) | ||
sort_arr= np.lexsort(order).astype(np.int64, copy=False) | ||
with nogil: | ||
for i in range(ngroups): | ||
# Figure out how many group elements there are | ||
grp_sz = counts[i] | ||
non_na_sz = non_na_counts[i] | ||
|
||
if non_na_sz == 0: | ||
out[i] = NaN | ||
else: | ||
# Calculate where to retrieve the desired value | ||
# Casting to int will intentionally truncate result | ||
idx = grp_start + <int64_t>(q * <float64_t>(non_na_sz - 1)) | ||
|
||
val = values[sort_arr[idx]] | ||
# If requested quantile falls evenly on a particular index | ||
# then write that index's value out. Otherwise interpolate | ||
q_idx = q * (non_na_sz - 1) | ||
frac = q_idx % 1 | ||
|
||
if frac == 0.0 or interp == INTERPOLATION_LOWER: | ||
out[i] = val | ||
if non_na_sz == 0: | ||
out[i] = NaN | ||
else: | ||
next_val = values[sort_arr[idx + 1]] | ||
if interp == INTERPOLATION_LINEAR: | ||
out[i] = val + (next_val - val) * frac | ||
elif interp == INTERPOLATION_HIGHER: | ||
out[i] = next_val | ||
elif interp == INTERPOLATION_MIDPOINT: | ||
out[i] = (val + next_val) / 2.0 | ||
elif interp == INTERPOLATION_NEAREST: | ||
if frac > .5 or (frac == .5 and q > .5): # Always OK? | ||
# Calculate where to retrieve the desired value | ||
# Casting to int will intentionally truncate result | ||
idx = grp_start + <int64_t>(q * <float64_t>(non_na_sz - 1)) | ||
|
||
val = values[sort_arr[idx]] | ||
# If requested quantile falls evenly on a particular index | ||
# then write that index's value out. Otherwise interpolate | ||
q_idx = q * (non_na_sz - 1) | ||
frac = q_idx % 1 | ||
|
||
if frac == 0.0 or interp == INTERPOLATION_LOWER: | ||
out[i] = val | ||
else: | ||
next_val = values[sort_arr[idx + 1]] | ||
if interp == INTERPOLATION_LINEAR: | ||
out[i] = val + (next_val - val) * frac | ||
elif interp == INTERPOLATION_HIGHER: | ||
out[i] = next_val | ||
else: | ||
out[i] = val | ||
|
||
# Increment the index reference in sorted_arr for the next group | ||
grp_start += grp_sz | ||
elif interp == INTERPOLATION_MIDPOINT: | ||
out[i] = (val + next_val) / 2.0 | ||
elif interp == INTERPOLATION_NEAREST: | ||
if frac > .5 or (frac == .5 and q > .5): # Always OK? | ||
out[i] = next_val | ||
else: | ||
out[i] = val | ||
|
||
# Increment the index reference in sorted_arr for the next group | ||
grp_start += grp_sz | ||
|
||
|
||
# ---------------------------------------------------------------------- | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this also a bug for DataFrameGroupBy? If so may want to note that, otherwise seems pretty good to me. @WillAyd any other thoughts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works for both. Groupby.quantile processes both Series and Dataframes, and it is the cython funtion called there the one that has been fixed. Maybe just Groupby.quantile would suffice, since that's the function changed.