
BUG: Groupby quantiles incorrect bins #33200 #33644


Merged: 14 commits, May 25, 2020

Conversation

Contributor

@mabelvj mabelvj commented Apr 19, 2020

Maintain the order of the bins in group_quantile. Updated tests #33200

Member

@WillAyd WillAyd left a comment


Thanks for the PR. Note this is a duplicate of #33571

Contributor Author

mabelvj commented Apr 19, 2020

Updated the PR. I did not notice the other PR; it was not referenced in the original issue #33200.

@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from c15b8b8 to c56744b Compare April 19, 2020 21:09
@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from c56744b to 8e6af0e Compare April 20, 2020 15:30
@mabelvj mabelvj requested review from WillAyd and dsaxton April 21, 2020 10:00
# Figure out how many group elements there are
grp_sz = counts[i]
non_na_sz = non_na_counts[i]
if labels.any():
Member

What is this required for? Do we have a test case when this is False?

Contributor Author

@mabelvj mabelvj Apr 26, 2020

Normally, if labels are not provided or the DataFrame is empty, an error is raised when applying the quantile. However, there are cases where certain operations produce empty labels, such as when time-resampling some kinds of empty DataFrame:

Here is an example of the pipeline:
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=34258&view=logs&j=bef1c175-2c1b-51ae-044a-2437c76fc339&t=770e7bb1-09f5-5ebf-b63b-578d2906aac9&l=127

Member

Is this accounted for in the test that you've created? If not, can you add a test for it?

Member

@simonjayhawkins simonjayhawkins May 20, 2020

Added two test cases that take the `if labels.any():` False path.

What is this required for?

Otherwise `labels.max()` can raise `ValueError: zero-size array to reduction operation maximum which has no identity`.
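A minimal NumPy sketch of the guard being discussed (the helper name `shift_nan_labels` is hypothetical; the real logic lives inside the cython `group_quantile`):

```python
import numpy as np

def shift_nan_labels(labels):
    # Hypothetical helper mirroring the guard discussed above: without
    # the .any() check, labels.max() on a zero-size array raises
    # "ValueError: zero-size array to reduction operation maximum ..."
    if labels.any():
        labels = labels.copy()
        # Move -1 (NaN) labels past the last real group
        labels[labels == -1] = labels.max() + 1
    return labels

print(shift_nan_labels(np.array([0, -1, 1, -1])))       # [0 2 1 2]
print(shift_nan_labels(np.array([], dtype=np.intp)))    # []
```

For an all-zero label array (a single group with no NaNs), `labels.any()` is also False, but there is no `-1` to remap there, so skipping the relabelling is harmless.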

@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from 1daaef1 to b8a19bf Compare April 26, 2020 16:37
@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from b8a19bf to 5832ba9 Compare April 26, 2020 16:39
@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from 2175657 to 662c102 Compare April 26, 2020 17:30
Member

dsaxton commented Apr 26, 2020

BTW this should also close #33569

@@ -698,6 +698,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrame.resample` where an ``AmbiguousTimeError`` would be raised when the resulting timezone aware :class:`DatetimeIndex` had a DST transition at midnight (:issue:`25758`)
- Bug in :meth:`DataFrame.groupby` where a ``ValueError`` would be raised when grouping by a categorical column with read-only categories and ``sort=False`` (:issue:`33410`)
- Bug in :meth:`GroupBy.first` and :meth:`GroupBy.last` where None is not preserved in object dtype (:issue:`32800`)
- Bug in :meth:`SeriesGroupBy.quantile` where the quantiles would be shifted when the ``by`` axis contains ``NaN`` (:issue:`33200`)
Member

Is this also a bug for DataFrameGroupBy? If so may want to note that, otherwise seems pretty good to me. @WillAyd any other thoughts here?

Contributor Author

@mabelvj mabelvj Apr 29, 2020

It works for both. `GroupBy.quantile` processes both Series and DataFrames, and it is the cython function called there that has been fixed. Maybe just `GroupBy.quantile` would suffice in the note, since that is the function that changed.
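As a sketch, both code paths can be exercised with a minimal frame like the one below (column names are made up for illustration); with the fix, rows whose key is NaN are simply excluded instead of shifting the bins:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"key": [1.0, np.nan, 2.0, np.nan], "val": [10.0, 20.0, 30.0, 40.0]}
)

# SeriesGroupBy path
series_result = df.groupby("key")["val"].quantile()
# DataFrameGroupBy path: same cython group_quantile underneath
frame_result = df.groupby("key").quantile()

# NaN keys are excluded and the bins stay aligned:
# group 1.0 -> 10.0, group 2.0 -> 30.0
print(series_result.tolist())
print(frame_result["val"].tolist())
```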

@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from b3cf5b3 to b0c309b Compare April 29, 2020 17:01
@mabelvj mabelvj requested a review from WillAyd April 30, 2020 10:49
@simonjayhawkins
Member

xref #33300 to track.

@simonjayhawkins
Member

@mabelvj This PR seems to be timing out on Windows. Is this related to the changes here?

Member

dsaxton commented May 9, 2020

@mabelvj Can you merge master? The quantile tests were moved to /tests/groupby/test_quantile.py.

@mabelvj mabelvj force-pushed the 33200-groupby-quantile branch from 330e24d to 131a8c1 Compare May 11, 2020 17:41
if labels.any():
# Put '-1' (NaN) labels in the last group so they do not interfere
# with the calculations.
labels[labels == -1] = np.max(labels) + 1
Member

It appears that modifying labels is causing a segfault in test_quantile_missing_group_values_no_segfaults on Windows on the second call in the loop. I suspect this is the reason for the timeouts.
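A small sketch of the hazard described above, outside cython (variable names are illustrative): relabelling the caller's array in place destroys the -1 sentinel, so a second pass over the same labels sees different input; working on a copy avoids mutating shared state.

```python
import numpy as np

labels = np.array([0, -1, 1], dtype=np.intp)

# In-place relabelling mutates the shared buffer: after the first pass
# the -1 sentinel is gone, so a second pass behaves differently (and,
# with a read-only buffer on the cython side, mutation can crash).
mutated = labels  # BAD: aliases the caller's array
mutated[mutated == -1] = mutated.max() + 1
print(labels.tolist())   # [0, 2, 1]: the caller's array changed too

# Safer: relabel a copy so repeated calls see identical input
labels2 = np.array([0, -1, 1], dtype=np.intp)
work = labels2.copy()
work[work == -1] = work.max() + 1
print(labels2.tolist())  # [0, -1, 1]: original untouched
```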

Member

@simonjayhawkins simonjayhawkins left a comment

Thanks @mabelvj lgtm @dsaxton @WillAyd

Member

@WillAyd WillAyd left a comment

minor nit on test. Implementation lgtm - @jreback

# Figure out how many group elements there are
grp_sz = counts[i]
non_na_sz = non_na_counts[i]
if labels.any():
Member

Is this accounted for in the test that you've created? If not, can you add a test for it?

@jreback jreback added this to the 1.1 milestone May 20, 2020

result = df.groupby("key").quantile()
result = df.groupby("key").quantile(0.5)
Contributor

Can you add another test that does not explicitly set a quantile (e.g. like the original)?
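A sketch of the requested test, assuming the test name and data are hypothetical (this is not necessarily the test that was actually committed):

```python
import numpy as np
import pandas as pd

def test_quantile_default_q_with_nan_key():
    # Hypothetical test: call .quantile() with its default q=0.5,
    # as in the original report, instead of passing 0.5 explicitly
    df = pd.DataFrame({"key": [0.0, np.nan, 1.0], "val": [1.0, 2.0, 3.0]})
    result = df.groupby("key").quantile()
    expected = pd.DataFrame(
        {"val": [1.0, 3.0]}, index=pd.Index([0.0, 1.0], name="key")
    )
    pd.testing.assert_frame_equal(result, expected)

test_quantile_default_q_with_nan_key()
```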

@simonjayhawkins simonjayhawkins mentioned this pull request May 25, 2020
@jreback jreback modified the milestones: 1.1, 1.0.4 May 25, 2020
@jreback jreback merged commit 2740fb4 into pandas-dev:master May 25, 2020
Contributor

jreback commented May 25, 2020

thanks @mabelvj for the patch.


lumberbot-app bot commented May 25, 2020

Owee, I'm MrMeeseeks, look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
$ git checkout 1.0.x
$ git pull
  2. Cherry-pick the first parent of this PR on top of the older branch:
$ git cherry-pick -m1 2740fb44e5e2ff64087ca7de5999eb30352722f0
  3. You will likely have some merge/cherry-pick conflicts; fix them and commit:
$ git commit -am 'Backport PR #33644: BUG: Groupby quantiles incorrect bins #33200'
  4. Push to a named branch:
$ git push YOURFORK 1.0.x:auto-backport-of-pr-33644-on-1.0.x
  5. Create a PR against branch 1.0.x; I would have named this PR:

"Backport PR #33644 on branch 1.0.x"

and applied the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.
