Fixed segfaults and incorrect results in GroupBy.quantile with NA Values in Grouping #29173
    # Random segfaults; would have been guaranteed in loop
    grp = df.groupby("key")
    for _ in range(100):
I take it this doesn't reliably segfault on the first try?
... or i could just read the comment three lines up. never mind
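The pattern in the hunk above, calling `quantile` many times on one GroupBy object so that an intermittent crash becomes close to certain, can be sketched roughly as follows. The frame here is a hypothetical odd-sized example chosen for illustration, not the PR's actual test data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: an odd number of rows with NA values in the grouping key.
df = pd.DataFrame({"key": [1.0, np.nan, 1.0, np.nan, 2.0], "val": range(5)})

grp = df.groupby("key")
results = []
for _ in range(100):
    # On an affected build any iteration might crash the interpreter;
    # on a fixed build every iteration should return the same frame.
    results.append(grp.quantile())

# Every repetition should agree with the first one.
assert all(r.equals(results[0]) for r in results)
```

Reusing the same `grp` object keeps the loop cheap, and comparing the repetitions also catches nondeterministic results, not only crashes.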
def test_quantile_missing_group_values_correct_results():
    # GH 28662
    data = np.array([1.0, np.nan, 3.0, np.nan])
    df = pd.DataFrame(dict(key=data, val=range(4)))
Any reason not to reuse the setup from the other test? We could just assert the result after the loop.
Different frame shapes. I think segfaults were occurring with odd-sized frames and bad results with even-sized ones. There might be some other confounding factor.
lgtm. I suppose we could mark for a potential 0.25.3, but aren't planning as of yet @pandas-dev/pandas-core
All green on this. FWIW I think it's worth a 0.25.3 release since it's a regression, and I'm happy to volunteer for that if others agree.
lgtm cc @jreback
    result = df.groupby("key").quantile()
    expected = pd.DataFrame(
        [1.0, 3.0], index=pd.Index([1.0, 3.0], name="key"), columns=["val"]
The expected values should be [0, 2]. Also, on 0.24.2 the columns index name was the q value; not sure if this is important.
>>> import pandas as pd
>>> pd.__version__
'0.24.2'
>>> import numpy as np
>>> data = np.array([1.0, np.nan, 3.0, np.nan])
>>> df = pd.DataFrame(dict(key=data, val=range(4)))
>>> df.groupby("key").quantile()
0.5  val
key
1.0  0.0
3.0  2.0
Keep in mind that this is a groupby and each element belongs to its own group. [1, 3] is essentially the identity of the non-NA groupings and should be correct
W.r.t. the column index name, that probably goes back to https://github.com/pandas-dev/pandas/pull/20405/files#r208366338, which was an inconsistency in 0.24.2.
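To make the two readings in this thread concrete, here is a minimal sketch that computes the per-group quantile with NumPy only, so it does not depend on the `GroupBy.quantile` code path under review. `reference_group_quantile` is a hypothetical helper written for this illustration; it is not part of pandas or of this PR.

```python
import numpy as np
import pandas as pd


def reference_group_quantile(df, key, col, q=0.5):
    """Hypothetical helper: per-group quantile computed with NumPy only."""
    # Rows whose key is NA belong to no group, mirroring groupby's default.
    valid = df[df[key].notna()]
    out = {}
    for k in sorted(valid[key].unique()):
        values = valid.loc[valid[key] == k, col].to_numpy(dtype=float)
        out[k] = float(np.quantile(values, q))
    return pd.Series(out, name=col).rename_axis(key)


data = np.array([1.0, np.nan, 3.0, np.nan])
df = pd.DataFrame(dict(key=data, val=range(4)))
reference = reference_group_quantile(df, "key", "val")
print(reference)
```

With this frame the index holds the non-NA keys 1.0 and 3.0, and each of those groups contains a single row, whose `val` is 0 and 2 respectively.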
@jorisvandenbossche @TomAugspurger ok by me
Doesn't matter to me. How far are we from 1.0? ~1 month?
Just FYI Will, there are some strange issues with the PDF build in the Docker container. I had to build it outside of Docker. I can probably help out with that if needed.
On Wed, Oct 30, 2019 at 7:11 AM Jeff Reback wrote:
> All green on this. FWIW I think worth a 0.25.3 release since its a regression and happy to volunteer for that if others agree
> @jorisvandenbossche @TomAugspurger ok by me
Sounds good. I'll do the backport and start on the release today / tomorrow. I'll reach out as questions come up.
@@ -410,6 +410,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrame.groupby` not offering selection by column name when ``axis=1`` (:issue:`27614`)
- Bug in :meth:`DataFrameGroupby.agg` not able to use lambda function with named aggregation (:issue:`27519`)
- Bug in :meth:`DataFrame.groupby` losing column name information when grouping by a categorical column (:issue:`28787`)
- Bug in :meth:`DataFrameGroupBy.quantile` where NA values in the grouping could cause segfaults or incorrect results (:issue:`28882`)
Should have removed this before merging; will clean up after backport
…n GroupBy.quantile with NA Values in Grouping
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff