-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Groupby quantiles incorrect bins #33200 #33644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Note this is a duplicate of #33571
Updated the PR. I did not notice the other PR. It was not referenced in the original issue #33200 . |
c15b8b8
to
c56744b
Compare
c56744b
to
8e6af0e
Compare
# Figure out how many group elements there are | ||
grp_sz = counts[i] | ||
non_na_sz = non_na_counts[i] | ||
if labels.any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this required for? Do we have a test case when this is False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally, if labels are not provided or the dataframe is empty, an error message will rise when applying the quantile. However, there are cases where certain operations lead to empty labels, as when time resampling some types of empty dataframe:
Here is an example of the pipeline:
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=34258&view=logs&j=bef1c175-2c1b-51ae-044a-2437c76fc339&t=770e7bb1-09f5-5ebf-b63b-578d2906aac9&l=127
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this accounted for in the test that you've created? If not, can you add a test for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add two cases that take the if labels.any(): is False path.
What is this required for?
otherwise labels.max() can raise ValueError: zero-size array to reduction operation maximum which has no identity
1daaef1
to
b8a19bf
Compare
b8a19bf
to
5832ba9
Compare
2175657
to
662c102
Compare
BTW this should also close #33569 |
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -698,6 +698,7 @@ Groupby/resample/rolling | |||
- Bug in :meth:`DataFrame.resample` where an ``AmbiguousTimeError`` would be raised when the resulting timezone aware :class:`DatetimeIndex` had a DST transition at midnight (:issue:`25758`) | |||
- Bug in :meth:`DataFrame.groupby` where a ``ValueError`` would be raised when grouping by a categorical column with read-only categories and ``sort=False`` (:issue:`33410`) | |||
- Bug in :meth:`GroupBy.first` and :meth:`GroupBy.last` where None is not preserved in object dtype (:issue:`32800`) | |||
- Bug in :meth:`SeriesGroupBy.quantile` causes the quantiles to be shifted when the ``by`` axis contains ``NaN`` (:issue:`33200`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this also a bug for DataFrameGroupBy? If so may want to note that, otherwise seems pretty good to me. @WillAyd any other thoughts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works for both. Groupby.quantile processes both Series and Dataframes, and it is the cython funtion called there the one that has been fixed. Maybe just Groupby.quantile would suffice, since that's the function changed.
b3cf5b3
to
b0c309b
Compare
xref #33300 to track. |
@mabelvj This PR seems to be timing out on Windows. is this related to the changes here? |
@mabelvj Can you merge master? The quantile tests were moved to /tests/groupby/test_quantile.py. |
330e24d
to
131a8c1
Compare
pandas/_libs/groupby.pyx
Outdated
if labels.any(): | ||
# Put '-1' (NaN) labels as the last group so it does not interfere | ||
# with the calculations. | ||
labels[labels == -1] = np.max(labels) + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it appears that modifying labels is causing a segfault in test_quantile_missing_group_values_no_segfaults on windows on the second call in the loop. I suspect this is the reason for the timeouts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit on test. Implementation lgtm - @jreback
# Figure out how many group elements there are | ||
grp_sz = counts[i] | ||
non_na_sz = non_na_counts[i] | ||
if labels.any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this accounted for in the test that you've created? If not, can you add a test for it?
|
||
result = df.groupby("key").quantile() | ||
result = df.groupby("key").quantile(0.5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add another test that does not explicitly set a quantile (e.g. like the original)
thanks @mabelvj for the patch. |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon! If these instruction are inaccurate, feel free to suggest an improvement. |
…incorrect bins)
…bins) (#34382) Co-authored-by: Mabel Villalba <[email protected]>
Maintain the order of the bins in group_quantile. Updated tests #33200
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff