Skip to content

pct change bug issue 21200 #21235

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 37 commits into from
Dec 12, 2018
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
37 commits
Select commit Hold shift + click to select a range
ee2c034
pct change bug issue 21200
simonariddell May 28, 2018
47b5908
add whatsnew
simonariddell May 28, 2018
41beb4a
fix pep8 issues
simonariddell May 28, 2018
25efd37
Removed vectorization from groupby pct change
simonariddell May 30, 2018
849fac4
try using cache arguments to keep vectorization
simonariddell Jun 4, 2018
eaede34
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Jun 9, 2018
64097cc
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Jun 19, 2018
fcd7183
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Jun 19, 2018
e8681d6
Merge branch 'groupby_pctchange_bug' of https://github.com/SimonAleck…
simonariddell Jun 24, 2018
77c3935
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Jun 24, 2018
d037e65
Update test to address WillAyds code suggestion
simonariddell Jul 29, 2018
a4fcb33
incorporate Will CRs
simonariddell Sep 27, 2018
64d2e4f
cr update
simonariddell Sep 28, 2018
eabfe1f
merge upstream
simonariddell Sep 28, 2018
a63553a
incorporate CR and merge in master
simonariddell Sep 29, 2018
c72c112
Merge branch 'master' of https://github.com/pandas-dev/pandas
simonariddell Nov 23, 2018
a745209
merge in master
simonariddell Nov 23, 2018
5d3fc2d
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Nov 25, 2018
ac9d005
Merge remote-tracking branch 'upstream/master'
simonariddell Nov 25, 2018
1f1f705
Merge branch 'master' into groupby_pctchange_bug
simonariddell Nov 25, 2018
795d7be
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Nov 25, 2018
f0a9c6d
Merge branch 'groupby_pctchange_bug' of https://github.com/SimonAleck…
simonariddell Nov 27, 2018
a979f79
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Nov 27, 2018
5eb79c2
Merge branch 'groupby_pctchange_bug' of https://github.com/SimonAleck…
simonariddell Nov 27, 2018
b11d20f
BUG,TST: GH21234 groupby pct_change
simonariddell Nov 27, 2018
f480299
BUG,TST: GH21234 groupby pct_change
simonariddell Nov 28, 2018
a3c95cb
BUG,TST: GH21235 groupby pct_change whatsnew
simonariddell Dec 2, 2018
e5289a2
BUG, TST: GH 21235 groupby pct_change whatsnew
simonariddell Dec 9, 2018
1ccc507
Merge remote-tracking branch 'upstream/master'
simonariddell Dec 9, 2018
7f4a244
merge master
simonariddell Dec 9, 2018
c26cf60
BUG, TST: GH 21235 groupby pct_change whatsnew
simonariddell Dec 9, 2018
877b433
BUG, TST: GH 21235 groupby pct_change whatsnew
simonariddell Dec 9, 2018
c8b686e
Merge remote-tracking branch 'upstream/master'
simonariddell Dec 11, 2018
faf532e
Merge remote-tracking branch 'upstream/master'
simonariddell Dec 12, 2018
6d87075
Merge branch 'master' into groupby_pctchange_bug
simonariddell Dec 12, 2018
ad934b7
BUG, TST: GH 21235 groupby pct_change whatsnew
simonariddell Dec 12, 2018
01d705f
BUG, TST: GH 21235 groupby pct_change test parametrize
simonariddell Dec 12, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 20 additions & 8 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -2075,12 +2075,19 @@ def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None,
fill_method=fill_method,
limit=limit, freq=freq,
axis=axis))
if fill_method:
new = DataFrameGroupBy(self._obj_with_exclusions,
grouper=self.grouper)
new.obj = getattr(new, fill_method)(limit=limit)
new._reset_cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of these calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a fill method is chosen, I want a groupby instance that has the properly filled data as its obj. When I assign new.obj = getattr(new, fill_method)(limit=limit), the new._cache is not updated, as a result it does not reflect the latest change.
By creating a new groupby object with the data object with the correct filled values I am able to call new.shift(periods=periods, freq=freq) later, where the shift method is called on data that has already been properly filled.
The 'trick' is that by assigning the filled data back to the groupby instance, I am able to still have access to the helper methods like shift.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give me an example of where these three lines differ from just doing self.ffill() (I'm assuming fill_method=='ffill' and limit==1)?

Haven't seen any other aggregation function have to make these types of calls so I'm still confused as to why this is necessary. A concrete example would be immensely helpful

Copy link
Contributor Author

@simonariddell simonariddell Jul 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can simplify those three lines, but I don't believe I can simply use self.ffill(). Let me try to explain:
The first line can be replaced with new = copy(self). However, the reason I call the fill method and reassign it to the .obj attribute, is that later on line 2087 I want to call the groupby implementation of the shift method on new. Whereas type(self.ffill()) returns a DataFrame. I used this 'trick' of doing the fill and reassigning it back to the .obj attribute in order to still have access to the groupby helper methods. If there is a more idiomatic way to do this, please let me know.

In this case if I were to remove those three lines and replace with with self.ffill(), the code would immediately fail when I tried to call a groupby method on the returned dataframe object.

Is this sufficiently clear/concrete?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably just easier if I share code myself at this point :-). Building off of what you have here's my stab at implementation:

https://github.com/WillAyd/pandas/blob/ebe4fffd935f2f16b6e5c635cfa4883e6bacc5ea/pandas/core/groupby/groupby.py#L2070

AFAICT it passes all of the tests you have and therefore provides the same behavior. However, it has the advantage of being more terse and not requiring the construction of a secondary GroupBy object.

Not saying you need to exactly mimic this but I'm sharing to highlight how I think this can be done more succinctly again without the need of constructing entirely new objects. If you know of where this would fail then we probably want to add a test case to handle whatever that difference is. Otherwise if they are the same I'd prefer if you adopt something a little more in line with what I've suggested.

No worries on timing - this is open source so any and all contributions are welcome at any time.

Copy link
Contributor Author

@simonariddell simonariddell Jul 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing some code, that was very helpful. I updated my test to highlight a case where your code would fail, but what I wrote wouldn't. Specifically (1) I added a test case for no fill method, and (2) I added more np.nan in the test data. This causes your code to fail as it does not apply the fill_method to the shifted data.

I agree making a second object seems wrong. However, It's not clear to me how to avoid this, unless there are additional helper methods that I'm just not aware of or finding.

else:
new = self

filled = getattr(self, fill_method)(limit=limit).drop(
self.grouper.names, axis=1)
shifted = filled.shift(periods=periods, freq=freq)
obj = new.obj.drop(self.grouper.names, axis=1)
shifted = new.shift(periods=periods, freq=freq).\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note - I thought this would fail LINTing but regardless we tend to prefer implicit line continuation, so would be better to do .drop( on the first line and continue along with the subsequent code on the next line, so long as that all fits

drop(self.grouper.names, axis=1)
return (obj / shifted) - 1

return (filled / shifted) - 1

@Substitution(name='groupby')
@Appender(_doc_template)
Expand Down Expand Up @@ -3942,11 +3949,16 @@ def _apply_to_column_groupbys(self, func):
return func(self)

def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None):
"""Calculate percent change of each value to previous entry in group"""
filled = getattr(self, fill_method)(limit=limit)
shifted = filled.shift(periods=periods, freq=freq)
"""Calcuate pct_change of each value to previous entry in group"""
if fill_method:
new = SeriesGroupBy(self.obj, grouper=self.grouper)
new.obj = getattr(new, fill_method)(limit=limit)
new._reset_cache()
else:
new = self

return (filled / shifted) - 1
shifted = new.shift(periods=periods, freq=freq)
return (new.obj / shifted) - 1


class NDFrameGroupBy(GroupBy):
Expand Down
43 changes: 22 additions & 21 deletions pandas/tests/groupby/test_transform.py
Original file line number Diff line number Diff line change
Expand Up @@ -722,35 +722,36 @@ def interweave(list_obj):


@pytest.mark.parametrize("test_series", [True, False])
@pytest.mark.parametrize("shuffle", [True, False])
@pytest.mark.parametrize("periods,fill_method,limit", [
(1, 'ffill', None), (1, 'ffill', 1),
(1, 'bfill', None), (1, 'bfill', 1),
(-1, 'ffill', None), (-1, 'ffill', 1),
(-1, 'bfill', None), (-1, 'bfill', 1)])
def test_pct_change(test_series, periods, fill_method, limit):
vals = [np.nan, np.nan, 1, 2, 4, 10, np.nan, np.nan]
exp_vals = Series(vals).pct_change(periods=periods,
fill_method=fill_method,
limit=limit).tolist()

df = DataFrame({'key': ['a'] * len(vals) + ['b'] * len(vals),
'vals': vals * 2})
(-1, 'bfill', None), (-1, 'bfill', 1),
])
def test_pct_change(test_series, shuffle, periods, fill_method, limit):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment with the issue number for this test (like GH #21200, #21621), looks like the other issue is covered here as well.

vals = [3, np.nan, 1, 2, 4, 10, np.nan, 4]
keys = ['a', 'b']
key_v = np.repeat(keys, len(vals))
df = DataFrame({'key': key_v, 'vals': vals * 2})
if shuffle:
order = np.random.RandomState(seed=42).permutation(len(df))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something but what's the point of this? If you are seeding your random call isn't this just always going to return the same order? If so would be preferable just to be explicit about that order, though not sure if its necessary in the first place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, you're right. Nixed it.

df = df.reindex(order).reset_index(drop=True)

df = getattr(df.groupby('key'), fill_method)(limit=limit)
grp = df.groupby('key')

def get_result(grp_obj):
return grp_obj.pct_change(periods=periods,
fill_method=fill_method,
limit=limit)
exp = grp['vals'].shift(0) / grp['vals'].shift(periods) - 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind changing exp to expected while you are touching this? The latter is a quasi-standard we've been applying to tests

Copy link
Contributor Author

@simonariddell simonariddell Jul 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change exp to expected, and wait for a response on the other comment before updating. I'm still active on this, apologies for the delay in responding to your helpful comments. Thanks!

exp = exp.to_frame('vals')

if test_series:
exp = pd.Series(exp_vals * 2)
exp.name = 'vals'
grp = grp['vals']
result = get_result(grp)
tm.assert_series_equal(result, exp)
result = grp['vals'].pct_change(periods=periods,
fill_method=fill_method,
limit=limit)
tm.assert_series_equal(result, exp.loc[:, 'vals'])
else:
exp = DataFrame({'vals': exp_vals * 2})
result = get_result(grp)
result = grp.pct_change(periods=periods,
fill_method=fill_method,
limit=limit)
tm.assert_frame_equal(result, exp)


Expand Down