Skip to content

Fixed segfaults and incorrect results in GroupBy.quantile with NA Values in Grouping #29173

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 30, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -410,6 +410,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrame.groupby` not offering selection by column name when ``axis=1`` (:issue:`27614`)
- Bug in :meth:`DataFrameGroupby.agg` not able to use lambda function with named aggregation (:issue:`27519`)
- Bug in :meth:`DataFrame.groupby` losing column name information when grouping by a categorical column (:issue:`28787`)
- Bug in :meth:`DataFrameGroupBy.quantile` where NA values in the grouping could cause segfaults or incorrect results (:issue:`28882`)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have removed this before merging; will clean up after backport


Reshaping
^^^^^^^^^
Expand Down
3 changes: 3 additions & 0 deletions pandas/_libs/groupby.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -763,6 +763,9 @@ def group_quantile(ndarray[float64_t] out,
with nogil:
for i in range(N):
lab = labels[i]
if lab == -1: # NA group label
continue

counts[lab] += 1
if not mask[i]:
non_na_counts[lab] += 1
Expand Down
23 changes: 23 additions & 0 deletions pandas/tests/groupby/test_function.py
Original file line number Diff line number Diff line change
Expand Up @@ -1373,6 +1373,29 @@ def test_quantile_out_of_bounds_q_raises():
g.quantile(-1)


def test_quantile_missing_group_values_no_segfaults():
# GH 28662
data = np.array([1.0, np.nan, 1.0])
df = pd.DataFrame(dict(key=data, val=range(3)))

# Random segfaults; would have been guaranteed in loop
grp = df.groupby("key")
for _ in range(100):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i take it this is doesn't reliably segfault on the first try?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or i could just read the comment three lines up. never mind

grp.quantile()


def test_quantile_missing_group_values_correct_results():
# GH 28662
data = np.array([1.0, np.nan, 3.0, np.nan])
df = pd.DataFrame(dict(key=data, val=range(4)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to re-use the setup from the other test? could just assert the result after the loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different frame shapes. I think segfaults were occurring with oddly sized frames and bad results with evenly sized. Might be some other conflating factor


result = df.groupby("key").quantile()
expected = pd.DataFrame(
[1.0, 3.0], index=pd.Index([1.0, 3.0], name="key"), columns=["val"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the expected should be [0, 2], but also on 0.24.2 the columns index name was the q value. not sure if this is important.

>>> pd.__version__
'0.24.2'
>>>
>>> import numpy as np
>>>
>>> data = np.array([1.0, np.nan, 3.0, np.nan])
>>> df = pd.DataFrame(dict(key=data, val=range(4)))
>>>
>>>
>>> df.groupby("key").quantile()
0.5  val
key
1.0  0.0
3.0  2.0

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind that this is a groupby and each element belongs to its own group. [1, 3] is essentially the identity of the non-NA groupings and should be correct

w.r.t. the column index name that probably goes back to https://github.com/pandas-dev/pandas/pull/20405/files#r208366338 which was an inconsistency in 0.24.2

)
tm.assert_frame_equal(result, expected)


# pipe
# --------------------------------

Expand Down