-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Groupby quantiles incorrect bins #33200 #33644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 7 commits
15a27ea
8e6af0e
5832ba9
662c102
e70e3ca
b0c309b
131a8c1
d18bdc4
e7c0eac
07fa080
24779c6
c6f41a4
6b28d91
f21e332
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -778,6 +778,10 @@ def group_quantile(ndarray[float64_t] out, | |
if not mask[i]: | ||
non_na_counts[lab] += 1 | ||
|
||
if labels.any(): | ||
# Put '-1' (NaN) labels as the last group so it does not interfere | ||
# with the calculations. | ||
labels[labels == -1] = np.max(labels) + 1 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it appears that modifying labels is causing a segfault in test_quantile_missing_group_values_no_segfaults on windows on the second call in the loop. I suspect this is the reason for the timeouts. |
||
# Get an index of values sorted by labels and then values | ||
order = (values, labels) | ||
sort_arr = np.lexsort(order).astype(np.int64, copy=False) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -181,14 +181,23 @@ def test_quantile_missing_group_values_no_segfaults(): | |
grp.quantile() | ||
|
||
|
||
def test_quantile_missing_group_values_correct_results(): | ||
# GH 28662 | ||
data = np.array([1.0, np.nan, 3.0, np.nan]) | ||
df = pd.DataFrame(dict(key=data, val=range(4))) | ||
@pytest.mark.parametrize( | ||
"key, val, expected_key, expected_val", | ||
[ | ||
([1.0, np.nan, 3.0, np.nan], range(4), [1.0, 3.0], [0.0, 2.0]), | ||
([1.0, np.nan, 2.0, 2.0], range(4), [1.0, 2.0], [0.0, 2.5]), | ||
(["a", "b", "b", np.nan], range(4), ["a", "b"], [0, 1.5]), | ||
], | ||
) | ||
def test_quantile_missing_group_values_correct_results( | ||
key, val, expected_key, expected_val | ||
): | ||
# GH 28662, GH 33200 | ||
df = pd.DataFrame({"key": key, "val": val}) | ||
|
||
result = df.groupby("key").quantile() | ||
result = df.groupby("key").quantile(0.5) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you add another test that does not explicitly set a quantile (e.g. like the original) |
||
expected = pd.DataFrame( | ||
[1.0, 3.0], index=pd.Index([1.0, 3.0], name="key"), columns=["val"] | ||
expected_val, index=pd.Index(expected_key, name="key"), columns=["val"] | ||
) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this required for? Do we have a test case when this is False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally, if labels are not provided or the dataframe is empty, an error message will rise when applying the quantile. However, there are cases where certain operations lead to empty labels, as when time resampling some types of empty dataframe:
Here is an example of the pipeline:
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=34258&view=logs&j=bef1c175-2c1b-51ae-044a-2437c76fc339&t=770e7bb1-09f5-5ebf-b63b-578d2906aac9&l=127
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this accounted for in the test that you've created? If not, can you add a test for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add two cases that take the if labels.any(): is False path.
otherwise labels.max() can raise
ValueError: zero-size array to reduction operation maximum which has no identity