Fixed segfaults and incorrect results in GroupBy.quantile with NA Values in Grouping #29173

WillAyd · 2019-10-22T22:39:06Z

closes Crash during groupby quantile #28882
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jbrockmendel · 2019-10-23T00:13:44Z

pandas/tests/groupby/test_function.py

+
+    # Random segfaults; would have been guaranteed in loop
+    grp = df.groupby("key")
+    for _ in range(100):


I take it this doesn't reliably segfault on the first try?

... or i could just read the comment three lines up. never mind

jbrockmendel · 2019-10-23T00:15:21Z

pandas/tests/groupby/test_function.py

+def test_quantile_missing_group_values_correct_results():
+    # GH 28662
+    data = np.array([1.0, np.nan, 3.0, np.nan])
+    df = pd.DataFrame(dict(key=data, val=range(4)))


any reason not to re-use the setup from the other test? could just assert the result after the loop?

WillAyd · 2019-10-23

Different frame shapes. I think segfaults were occurring with oddly sized frames and bad results with evenly sized ones. There might be some other confounding factor.
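For readers following along, the loop-based check discussed above can be sketched roughly as below. This is a minimal illustration, not the test as merged: the three-row frame is an assumed example of an "oddly sized" input with an NA grouping key, and the `results` / `all_equal` names are made up here.

```python
import numpy as np
import pandas as pd

# Assumed input: an odd-length frame whose grouping column contains NaN.
data = np.array([1.0, np.nan, 1.0])
df = pd.DataFrame({"key": data, "val": range(3)})

grp = df.groupby("key")

# A memory error that only shows up intermittently is more likely to
# surface if the same call is repeated many times.
results = [grp.quantile() for _ in range(100)]

# On a fixed build every iteration should return an identical frame,
# and the NaN key should not appear as a group.
first = results[0]
all_equal = all(first.equals(other) for other in results[1:])
print(all_equal)
print(list(first.index))
```

Repeating the call cannot prove the absence of a segfault; it only raises the odds that an intermittent one is hit during the test run.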

jreback

lgtm. I suppose we could mark this for a potential 0.25.3, but we aren't planning one as of yet @pandas-dev/pandas-core

WillAyd · 2019-10-25T19:14:39Z

All green on this. FWIW I think this is worth a 0.25.3 release since it's a regression, and I'm happy to volunteer for that if others agree.

jbrockmendel · 2019-10-26T00:06:19Z

lgtm cc @jreback

simonjayhawkins · 2019-10-28T12:41:02Z

pandas/tests/groupby/test_function.py

+
+    result = df.groupby("key").quantile()
+    expected = pd.DataFrame(
+        [1.0, 3.0], index=pd.Index([1.0, 3.0], name="key"), columns=["val"]


the expected should be [0, 2], but also on 0.24.2 the columns index name was the q value. not sure if this is important.

>>> pd.__version__
'0.24.2'
>>>
>>> import numpy as np
>>>
>>> data = np.array([1.0, np.nan, 3.0, np.nan])
>>> df = pd.DataFrame(dict(key=data, val=range(4)))
>>>
>>>
>>> df.groupby("key").quantile()
0.5  val
key
1.0  0.0
3.0  2.0

WillAyd · 2019-10-28

Keep in mind that this is a groupby and each element belongs to its own group. [1, 3] is essentially the identity of the non-NA groupings and should be correct.

W.r.t. the column index name, that probably goes back to https://github.com/pandas-dev/pandas/pull/20405/files#r208366338, which was an inconsistency in 0.24.2.
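One way to settle which expected values are right, without relying on `GroupBy.quantile` itself, is to compute the per-key medians by hand. The sketch below does that for the frame from the test; the `valid` and `reference` names are illustrative only. On this frame the hand computation gives 0.0 for key 1.0 and 2.0 for key 3.0, which matches the 0.24.2 output quoted above. Whether `df.groupby("key").quantile()` agrees depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

data = np.array([1.0, np.nan, 3.0, np.nan])
df = pd.DataFrame({"key": data, "val": range(4)})

# Reference computation: drop rows whose key is NA, then take the
# 0.5 quantile of `val` for each remaining key with plain NumPy.
valid = df[df["key"].notna()]
reference = {
    float(key): float(np.quantile(valid.loc[valid["key"] == key, "val"], 0.5))
    for key in sorted(valid["key"].unique())
}
print(reference)

# The groupby result can then be compared against the reference by eye.
result = df.groupby("key").quantile()
print(result)
```

Each non-NA key here owns exactly one row, so the median of the group is simply that row's `val`.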

jreback · 2019-10-30T12:11:15Z

> All green on this. FWIW I think worth a 0.25.3 release since it's a regression and happy to volunteer for that if others agree

@jorisvandenbossche @TomAugspurger

ok by me

TomAugspurger · 2019-10-30T12:58:46Z

Doesn't matter to me. How far are we from 1.0? ~1 month? Just FYI Will, there are some strange issues with the pdf build in the docker container. I had to build it outside of docker. I can probably help out with that if needed.

WillAyd · 2019-10-30T18:29:31Z

Sounds good. I'll do the back port and start on the release today / tomorrow. I'll reach out as questions come up

WillAyd · 2019-10-30T18:34:47Z

doc/source/whatsnew/v1.0.0.rst

@@ -410,6 +410,7 @@ Groupby/resample/rolling
 - Bug in :meth:`DataFrame.groupby` not offering selection by column name when ``axis=1`` (:issue:`27614`)
 - Bug in :meth:`DataFrameGroupby.agg` not able to use lambda function with named aggregation (:issue:`27519`)
 - Bug in :meth:`DataFrame.groupby` losing column name information when grouping by a categorical column (:issue:`28787`)
+- Bug in :meth:`DataFrameGroupBy.quantile` where NA values in the grouping could cause segfaults or incorrect results (:issue:`28882`)


Should have removed this before merging; will clean up after backport

* Backport PR #27826 for 0.25.3 release
* BUG: Fix groupby quantile segfault

  Validate that q is between 0 and 1. Closes #27470
* prettier
* Backport PR #29173 for 0.25.3 release
* Backport PR #29296 for 0.25.3 release
* Backport PR #29294 for 0.25.3 whatsnew

WillAyd added 4 commits October 22, 2019 15:32

Tests for NA handling

cd9d872

Impl and test fixes

ff6fe6a

Whatsnew

50e7091

Black

69ca2b4

WillAyd added Groupby Segfault Non-Recoverable Error labels Oct 22, 2019

jbrockmendel added the quantile quantile method label Oct 22, 2019

jbrockmendel reviewed Oct 23, 2019

Merge remote-tracking branch 'upstream/master' into quantile-segfault

b9a89cd

jreback added this to the 1.0 milestone Oct 23, 2019

jreback approved these changes Oct 23, 2019

WillAyd mentioned this pull request Oct 24, 2019

Drop Python 3.5 support #29034

Closed

WillAyd added 2 commits October 24, 2019 11:34

Merge remote-tracking branch 'upstream/master' into quantile-segfault

c8b6675

Merge remote-tracking branch 'upstream/master' into quantile-segfault

57889db

simonjayhawkins reviewed Oct 28, 2019

WillAyd merged commit d0fe636 into pandas-dev:master Oct 30, 2019

WillAyd deleted the quantile-segfault branch October 30, 2019 18:29

WillAyd added a commit to WillAyd/pandas that referenced this pull request Oct 31, 2019

Backport PR pandas-dev#29173: Fixed segfaults and incorrect results in GroupBy.quantile with NA Values in Grouping

7d2b2c5

WillAyd added a commit to WillAyd/pandas that referenced this pull request Oct 31, 2019

Backport PR pandas-dev#29173 for 0.25.3 release

cdacc3f

Reksbril pushed a commit to Reksbril/pandas that referenced this pull request Nov 18, 2019

Fixed segfaults and incorrect results in GroupBy.quantile with NA Values in Grouping (pandas-dev#29173)

8ea3d49

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

Fixed segfaults and incorrect results in GroupBy.quantile with NA Values in Grouping (pandas-dev#29173)

6b31015

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

Fixed segfaults and incorrect results in GroupBy.quantile with NA Values in Grouping (pandas-dev#29173)

a859803

simonjayhawkins mentioned this pull request Apr 19, 2020

BUG: Debug grouped quantile with NA values #33571

Closed

5 tasks
