
BUG: Groupby quantiles incorrect bins #33200 #33644


Merged: 14 commits, May 25, 2020

Conversation

Contributor

@mabelvj mabelvj commented Apr 19, 2020

Maintain the order of the bins in group_quantile. Updated tests #33200

Member

@WillAyd WillAyd left a comment


Thanks for the PR. Note this is a duplicate of #33571

Contributor Author

mabelvj commented Apr 19, 2020

Updated the PR. I did not notice the other PR; it was not referenced in the original issue #33200.

@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from c15b8b8 to c56744b Compare April 19, 2020 21:09
@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from c56744b to 8e6af0e Compare April 20, 2020 15:30
@mabelvj mabelvj requested review from WillAyd and dsaxton April 21, 2020 10:00
# Figure out how many group elements there are
grp_sz = counts[i]
non_na_sz = non_na_counts[i]
if labels.any():
Member

What is this required for? Do we have a test case when this is False?

Contributor Author

@mabelvj mabelvj Apr 26, 2020

Normally, if labels are not provided or the DataFrame is empty, an error is raised when applying the quantile. However, there are cases where certain operations produce empty labels, such as when time-resampling some kinds of empty DataFrame:

Here is an example of the pipeline:
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=34258&view=logs&j=bef1c175-2c1b-51ae-044a-2437c76fc339&t=770e7bb1-09f5-5ebf-b63b-578d2906aac9&l=127

Member

Is this accounted for in the test that you've created? If not, can you add a test for it?

Member

@simonjayhawkins simonjayhawkins May 20, 2020

Added two test cases that take the `if labels.any():` False path.

What is this required for?

Otherwise `labels.max()` can raise `ValueError: zero-size array to reduction operation maximum which has no identity`.
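A minimal NumPy sketch of the guard being discussed (the helper name `shift_nan_labels` is hypothetical; the real logic lives inside the cython `group_quantile`):

```python
import numpy as np

def shift_nan_labels(labels):
    # Hypothetical helper mirroring the guard discussed above: without
    # the .any() check, labels.max() on a zero-size array raises
    # "ValueError: zero-size array to reduction operation maximum ..."
    if labels.any():
        labels = labels.copy()
        # Move -1 (NaN) labels past the last real group
        labels[labels == -1] = labels.max() + 1
    return labels

print(shift_nan_labels(np.array([0, -1, 1, -1])))       # [0 2 1 2]
print(shift_nan_labels(np.array([], dtype=np.intp)))    # []
```

For an all-zero label array (a single group with no NaNs), `labels.any()` is also False, but there is no `-1` to remap there, so skipping the relabelling is harmless.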

@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from 1daaef1 to b8a19bf Compare April 26, 2020 16:37
@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from b8a19bf to 5832ba9 Compare April 26, 2020 16:39
@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from 2175657 to 662c102 Compare April 26, 2020 17:30
Member

dsaxton commented Apr 26, 2020

BTW this should also close #33569

@@ -698,6 +698,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrame.resample` where an ``AmbiguousTimeError`` would be raised when the resulting timezone aware :class:`DatetimeIndex` had a DST transition at midnight (:issue:`25758`)
- Bug in :meth:`DataFrame.groupby` where a ``ValueError`` would be raised when grouping by a categorical column with read-only categories and ``sort=False`` (:issue:`33410`)
- Bug in :meth:`GroupBy.first` and :meth:`GroupBy.last` where None is not preserved in object dtype (:issue:`32800`)
- Bug in :meth:`SeriesGroupBy.quantile` where the quantiles would be shifted when the ``by`` axis contains ``NaN`` (:issue:`33200`)
Member

Is this also a bug for DataFrameGroupBy? If so may want to note that, otherwise seems pretty good to me. @WillAyd any other thoughts here?

Contributor Author

@mabelvj mabelvj Apr 29, 2020

It works for both. `GroupBy.quantile` processes both Series and DataFrames, and it is the cython function called there that has been fixed. Maybe just `GroupBy.quantile` would suffice in the note, since that is the function that changed.
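As a sketch, both code paths can be exercised with a minimal frame like the one below (column names are made up for illustration); with the fix, rows whose key is NaN are simply excluded instead of shifting the bins:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"key": [1.0, np.nan, 2.0, np.nan], "val": [10.0, 20.0, 30.0, 40.0]}
)

# SeriesGroupBy path
series_result = df.groupby("key")["val"].quantile()
# DataFrameGroupBy path: same cython group_quantile underneath
frame_result = df.groupby("key").quantile()

# NaN keys are excluded and the bins stay aligned:
# group 1.0 -> 10.0, group 2.0 -> 30.0
print(series_result.tolist())
print(frame_result["val"].tolist())
```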

@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from b3cf5b3 to b0c309b Compare April 29, 2020 17:01
@mabelvj mabelvj requested a review from WillAyd April 30, 2020 10:49
@simonjayhawkins
Member

xref #33300 to track.

@simonjayhawkins
Member

@mabelvj This PR seems to be timing out on Windows. Is this related to the changes here?

Member

dsaxton commented May 9, 2020

@mabelvj Can you merge master? The quantile tests were moved to /tests/groupby/test_quantile.py.

@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from 330e24d to 131a8c1 Compare May 11, 2020 17:41
if labels.any():
# Put '-1' (NaN) labels in the last group so they do not interfere
# with the calculations.
labels[labels == -1] = np.max(labels) + 1
Member

It appears that modifying labels is causing a segfault in test_quantile_missing_group_values_no_segfaults on Windows on the second call in the loop. I suspect this is the reason for the timeouts.
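A small sketch of the hazard described above, outside cython (variable names are illustrative): relabelling the caller's array in place destroys the -1 sentinel, so a second pass over the same labels sees different input; working on a copy avoids mutating shared state.

```python
import numpy as np

labels = np.array([0, -1, 1], dtype=np.intp)

# In-place relabelling mutates the shared buffer: after the first pass
# the -1 sentinel is gone, so a second pass behaves differently (and,
# with a read-only buffer on the cython side, mutation can crash).
mutated = labels  # BAD: aliases the caller's array
mutated[mutated == -1] = mutated.max() + 1
print(labels.tolist())   # [0, 2, 1]: the caller's array changed too

# Safer: relabel a copy so repeated calls see identical input
labels2 = np.array([0, -1, 1], dtype=np.intp)
work = labels2.copy()
work[work == -1] = work.max() + 1
print(labels2.tolist())  # [0, -1, 1]: original untouched
```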

Member

@simonjayhawkins simonjayhawkins left a comment

Thanks @mabelvj lgtm @dsaxton @WillAyd

Member

@WillAyd WillAyd left a comment

minor nit on test. Implementation lgtm - @jreback

# Figure out how many group elements there are
grp_sz = counts[i]
non_na_sz = non_na_counts[i]
if labels.any():
Member

Is this accounted for in the test that you've created? If not, can you add a test for it?

@jreback jreback added this to the 1.1 milestone May 20, 2020

result = df.groupby("key").quantile()
result = df.groupby("key").quantile(0.5)
Contributor

Can you add another test that does not explicitly set a quantile (e.g. like the original)?
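A sketch of the requested test, assuming the test name and data are hypothetical (this is not necessarily the test that was actually committed):

```python
import numpy as np
import pandas as pd

def test_quantile_default_q_with_nan_key():
    # Hypothetical test: call .quantile() with its default q=0.5,
    # as in the original report, instead of passing 0.5 explicitly
    df = pd.DataFrame({"key": [0.0, np.nan, 1.0], "val": [1.0, 2.0, 3.0]})
    result = df.groupby("key").quantile()
    expected = pd.DataFrame(
        {"val": [1.0, 3.0]}, index=pd.Index([0.0, 1.0], name="key")
    )
    pd.testing.assert_frame_equal(result, expected)

test_quantile_default_q_with_nan_key()
```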

@simonjayhawkins simonjayhawkins mentioned this pull request May 25, 2020
@jreback jreback modified the milestones: 1.1, 1.0.4 May 25, 2020
@jreback jreback merged commit 2740fb4 into pandas-dev:master May 25, 2020
Contributor

jreback commented May 25, 2020

thanks @mabelvj for the patch.


lumberbot-app bot commented May 25, 2020

Owee, I'm MrMeeseeks, look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
$ git checkout 1.0.x
$ git pull
  2. Cherry-pick the first parent of this PR on top of the older branch:
$ git cherry-pick -m1 2740fb44e5e2ff64087ca7de5999eb30352722f0
  3. You will likely have some merge/cherry-pick conflicts; fix them and commit:
$ git commit -am 'Backport PR #33644: BUG: Groupby quantiles incorrect bins #33200'
  4. Push to a named branch:
$ git push YOURFORK 1.0.x:auto-backport-of-pr-33644-on-1.0.x
  5. Create a PR against branch 1.0.x; I would have named this PR:

"Backport PR #33644 on branch 1.0.x"

and applied the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.
