-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: errors and segfaults in groupby cython transforms (#16771) #26134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26134 +/- ##
==========================================
- Coverage 91.99% 91.98% -0.01%
==========================================
Files 175 175
Lines 52387 52387
==========================================
- Hits 48191 48187 -4
- Misses 4196 4200 +4
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26134 +/- ##
==========================================
- Coverage 92% 91.98% -0.03%
==========================================
Files 175 175
Lines 52371 52383 +12
==========================================
Hits 48184 48184
- Misses 4187 4199 +12
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't reviewed in depth but some quick comments. Can you also post ASV results for this to ensure no regression?
Hmm so I don't think the behavior here is correct as it kind of defeats the purpose of the In [15]: import pandas as pd
...:
...: x_vals = [1]
...: x_cats = range(2)
...: y = [1]
...: df = pd.DataFrame(dict(x=pd.Categorical(x_vals, x_cats), y=y)) Using In [15]: df.groupby('x', observed=True).y.cumsum()
Out[15]:
0 1
Name: y, dtype: int64 Definitely the current random number that gets yielded with that being False is incorrect: In [16]: df.groupby('x', observed=False).y.cumsum()
Out[16]:
0 4825136601
Name: y, dtype: int64 But at the same time what you have here in this PR I think is forcing For consistency the expectation should be a DataFrame with all of the categories in the index and probably np.nan where the value is not actually observed. Kind of wonky given this is a transform but I think that's the best approach |
Alternately it may make more sense to just raise when a user tries to run a transform with a categorical in the grouper with unobserved categories and I can't imagine a scenario where running a transform and getting back a differently indexed result would be useful in the scope of pandas usage and this would certainly be simpler to integrate into the code base |
@jreback @TomAugspurger if you have thoughts on comment above |
The above comments seem a little confused? As these are transforms, there's no such thing as observed=True or observed=False behaviour; either way, the output index will be the same as the input index. As observed=False is the default, I think we should continue to support transforms with it set; otherwise you'd have to pass observed=True every time you wanted to call a transform. |
@jreback Requested changes made. @WillAyd ASV shows no significant difference.
|
@jreback Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of comments, ping on green (either if making the change or showing why you cannot)
@jreback Done. Some checks are failing, but seems unrelated to this PR? |
thanks @adbull |
git diff upstream/master -u -- "*.py" | flake8 --diff