-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: groupby ffill adds labels as extra column (#21521) #26162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26162 +/- ##
===========================================
- Coverage 91.98% 40.75% -51.24%
===========================================
Files 175 175
Lines 52377 52382 +5
===========================================
- Hits 48180 21346 -26834
- Misses 4197 31036 +26839
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26162 +/- ##
==========================================
- Coverage 91.68% 91.67% -0.02%
==========================================
Files 174 174
Lines 50703 50697 -6
==========================================
- Hits 46488 46477 -11
- Misses 4215 4220 +5
Continue to review full report at Codecov.
|
@jreback Done |
Failing tests seem unrelated to this PR? |
Try merging master just fixed a few things in CI
…Sent from my iPhone
On Apr 23, 2019, at 7:26 AM, Adam Bull ***@***.***> wrote:
Failing tests seem unrelated to this PR?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
pandas/core/groupby/generic.py
Outdated
output = OrderedDict( | ||
(grp.name, grp.grouper) for grp in self.grouper.groupings) | ||
(grp.name, grp.grouper) for grp in self.grouper.groupings | ||
if grp.in_axis and grp.name in self._selected_obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these conditions ever different? Thought the point of in_axis
was to state if the grouping was a column in the selected object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in df.groupby(labels)[selection].ffill()
, the first test checks whether labels
is a column in df
, or something else (e.g. a separate series). The second test then checks whether labels
is in selection
.
You need both tests, as e.g. labels
could be in df
but not selection
; or could not be in df
, but have the same name as a column in selection
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK thanks. Sorry for all of the questions here but main thing I am trying to avoid is introducing new inconsistencies. Wondering why we don't see the same problem using shift
as we do using _fill
as they ultimately dispatch to _get_cythonized_result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have an issue with shift
because it doesn't try to add the group labels back:
>>> df = pd.DataFrame(dict(x=[1]))
>>> df.groupby('x').shift()
Empty DataFrame
Columns: []
Index: [0]
>>> df.groupby('x').ffill()
x
0 1
You could argue there's some inconsistency between shift
and _fill
here, but it's an inconsistency that's been in pandas for a long time. If we did want to change that, would probably be a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could argue there's some inconsistency between
shift
and_fill
here, but it's an inconsistency that's been in pandas for a long time. If we did want to change that, would probably be a separate PR?
Yea that's valid. The only thing I'm trying to avoid this is changing the output format once here and changing again in a subsequent release - it just makes for a lesser end user experience.
Do we need to add the group labels back for fill?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question -- none of the other groupby transforms add labels back, so maybe we shouldn't? In general I'd prefer inconsistency to breaking changes, but we'll be breaking with 0.24 either way. Up to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agreed - going to be breaking in any case but would rather break once in the next major release than have to break in the next major release then break again in the subsequent one.
Can you see what breaks if you try not to add the labels back in? May be indicative of effort involved which could guide the right path forward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you leave the labels off, it just breaks the tests you'd expect: test_group_fill_methods
, test_pad_stable_sorting
and test_pct_change
in tests/groupby/test_transform.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right those tests might have the wrong expectation.
Just so we are clear here is the current behavior with this PR:
>>> df = pd.DataFrame([['a', 1], ['b', 2], ['a', np.nan], ['b', np.nan]], columns=['key', 'val'])
>>> df
key val
0 a 1.0
1 b 2.0
2 a NaN
3 b NaN
>>> df.groupby('key').ffill()
key val
0 a 1.0
1 b 2.0
2 a 1.0
3 b 2.0
>>> df.groupby('key').shift() # or rank, cumsum, etc...
val
0 NaN
1 NaN
2 1.0
3 2.0
If you get rid of the subclassed _fill
implementation you'd get the following:
>>> df.groupby('key').ffill()
val
0 1.0
1 2.0
2 1.0
3 2.0
>>> df.groupby('key').shift() # or rank, cumsum, etc...
val
0 NaN
1 NaN
2 1.0
3 2.0
I don't think the fill methods should make an exception on the shape of the returned value when compared to any other transformation functions, hence why I'd rather you just remove _fill
and update tests instead of the patch as is
56b75ae
to
dd793c5
Compare
@WillAyd As discussed, I've changed this to not return group labels at all, for consistency with other groupby transforms. |
@adbull nice - looks good on initial glance. I'll dive deeper tomorrow. Thanks for sticking with the feedback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm outside of @jreback comments
ok looks good. can you merge master. also have a look at the example in the docs: http://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html#transformation; its the note box
This example is showing the incorrect behavior (I believe your patch will fix it); do your test sufficiently cover this case? |
@adbull if you can merge master and update |
@jreback Done. Patch will fix the example in the docs, and the test covers that case. |
lgtm. ping on green. |
you have a linting issue |
Issue is from a different PR, I didn't edit test_groupby.py |
@jreback looks green |
Thanks @adbull - nice change! |
git diff upstream/master -u -- "*.py" | flake8 --diff