Skip to content

BUG: groupby ffill adds labels as extra column (#21521) #26162

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 15, 2019

Conversation

adbull
Copy link
Contributor

@adbull adbull commented Apr 20, 2019

@codecov
Copy link

codecov bot commented Apr 20, 2019

Codecov Report

Merging #26162 into master will decrease coverage by 51.23%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26162       +/-   ##
===========================================
- Coverage   91.98%   40.75%   -51.24%     
===========================================
  Files         175      175               
  Lines       52377    52382        +5     
===========================================
- Hits        48180    21346    -26834     
- Misses       4197    31036    +26839
Flag Coverage Δ
#multiple ?
#single 40.75% <ø> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/generic.py 13.14% <ø> (-75.89%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/tools/numeric.py 10.44% <0%> (-89.56%) ⬇️
... and 132 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dc86509...e010ffe. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 20, 2019

Codecov Report

Merging #26162 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26162      +/-   ##
==========================================
- Coverage   91.68%   91.67%   -0.02%     
==========================================
  Files         174      174              
  Lines       50703    50697       -6     
==========================================
- Hits        46488    46477      -11     
- Misses       4215     4220       +5
Flag Coverage Δ
#multiple 90.18% <ø> (-0.01%) ⬇️
#single 41.18% <ø> (-0.17%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 97.23% <ø> (-0.01%) ⬇️
pandas/core/groupby/generic.py 88.93% <ø> (-0.08%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.01% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.6% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6e43a4...01057a4. Read the comment docs.

@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Apr 21, 2019
@adbull
Copy link
Contributor Author

adbull commented Apr 22, 2019

@jreback Done

@adbull
Copy link
Contributor Author

adbull commented Apr 23, 2019

Failing tests seem unrelated to this PR?

@WillAyd
Copy link
Member

WillAyd commented Apr 23, 2019 via email

output = OrderedDict(
(grp.name, grp.grouper) for grp in self.grouper.groupings)
(grp.name, grp.grouper) for grp in self.grouper.groupings
if grp.in_axis and grp.name in self._selected_obj)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these conditions ever different? Thought the point of in_axis was to state if the grouping was a column in the selected object

Copy link
Contributor Author

@adbull adbull Apr 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in df.groupby(labels)[selection].ffill(), the first test checks whether labels is a column in df, or something else (e.g. a separate series). The second test then checks whether labels is in selection.

You need both tests, as e.g. labels could be in df but not selection; or could not be in df, but have the same name as a column in selection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK thanks. Sorry for all of the questions here but main thing I am trying to avoid is introducing new inconsistencies. Wondering why we don't see the same problem using shift as we do using _fill as they ultimately dispatch to _get_cythonized_result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have an issue with shift because it doesn't try to add the group labels back:

>>> df = pd.DataFrame(dict(x=[1]))
>>> df.groupby('x').shift()
Empty DataFrame
Columns: []
Index: [0]
>>> df.groupby('x').ffill()
   x
0  1

You could argue there's some inconsistency between shift and _fill here, but it's an inconsistency that's been in pandas for a long time. If we did want to change that, would probably be a separate PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could argue there's some inconsistency between shift and _fill here, but it's an inconsistency that's been in pandas for a long time. If we did want to change that, would probably be a separate PR?

Yea that's valid. The only thing I'm trying to avoid this is changing the output format once here and changing again in a subsequent release - it just makes for a lesser end user experience.

Do we need to add the group labels back for fill?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question -- none of the other groupby transforms add labels back, so maybe we shouldn't? In general I'd prefer inconsistency to breaking changes, but we'll be breaking with 0.24 either way. Up to you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agreed - going to be breaking in any case but would rather break once in the next major release than have to break in the next major release then break again in the subsequent one.

Can you see what breaks if you try not to add the labels back in? May be indicative of effort involved which could guide the right path forward

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you leave the labels off, it just breaks the tests you'd expect: test_group_fill_methods, test_pad_stable_sorting and test_pct_change in tests/groupby/test_transform.py.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right those tests might have the wrong expectation.

Just so we are clear here is the current behavior with this PR:

>>> df = pd.DataFrame([['a', 1], ['b', 2], ['a', np.nan], ['b', np.nan]], columns=['key', 'val'])
>>> df
  key  val
0   a  1.0
1   b  2.0
2   a  NaN
3   b  NaN

>>> df.groupby('key').ffill()
  key  val
0   a  1.0
1   b  2.0
2   a  1.0
3   b  2.0

>>> df.groupby('key').shift()  # or rank, cumsum, etc...
   val
0  NaN
1  NaN
2  1.0
3  2.0

If you get rid of the subclassed _fill implementation you'd get the following:

>>> df.groupby('key').ffill()
   val
0  1.0
1  2.0
2  1.0
3  2.0

>>> df.groupby('key').shift()  # or rank, cumsum, etc...
   val
0  NaN
1  NaN
2  1.0
3  2.0

I don't think the fill methods should make an exception on the shape of the returned value when compared to any other transformation functions, hence why I'd rather you just remove _fill and update tests instead of the patch as is

@adbull adbull force-pushed the groupby-ffill branch 3 times, most recently from 56b75ae to dd793c5 Compare April 27, 2019 08:45
@adbull
Copy link
Contributor Author

adbull commented Apr 27, 2019

@WillAyd As discussed, I've changed this to not return group labels at all, for consistency with other groupby transforms.

@WillAyd
Copy link
Member

WillAyd commented Apr 27, 2019

@adbull nice - looks good on initial glance. I'll dive deeper tomorrow.

Thanks for sticking with the feedback

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm outside of @jreback comments

@jreback
Copy link
Contributor

jreback commented May 7, 2019

ok looks good. can you merge master. also have a look at the example in the docs: http://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html#transformation; its the note box

Some functions will automatically transform the input when applied to a GroupBy object, but returning an object of the same shape as the original. Passing as_index=False will not affect these transformation methods.
For example: fillna, ffill, bfill, shift..

This example is showing the incorrect behavior (I believe your patch will fix it); do your test sufficiently cover this case?

@jreback
Copy link
Contributor

jreback commented May 12, 2019

@adbull if you can merge master and update

@adbull
Copy link
Contributor Author

adbull commented May 12, 2019

@jreback Done. Patch will fix the example in the docs, and the test covers that case.

@jreback jreback added this to the 0.25.0 milestone May 12, 2019
@jreback
Copy link
Contributor

jreback commented May 12, 2019

lgtm. ping on green.

@jreback
Copy link
Contributor

jreback commented May 12, 2019

you have a linting issue

@adbull
Copy link
Contributor Author

adbull commented May 13, 2019

Issue is from a different PR, I didn't edit test_groupby.py

@adbull
Copy link
Contributor Author

adbull commented May 13, 2019

@jreback looks green

@WillAyd WillAyd merged commit 3b24fb6 into pandas-dev:master May 15, 2019
@WillAyd
Copy link
Member

WillAyd commented May 15, 2019

Thanks @adbull - nice change!

@adbull adbull deleted the groupby-ffill branch May 15, 2019 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: groupby().ffill() adds group labels as extra column
3 participants