pct change bug issue 21200 #21235


Merged: 37 commits, Dec 12, 2018

Commits:
ee2c034
pct change bug issue 21200
simonariddell May 28, 2018
47b5908
add whatsnew
simonariddell May 28, 2018
41beb4a
fix pep8 issues
simonariddell May 28, 2018
25efd37
Removed vectorization from groupby pct change
simonariddell May 30, 2018
849fac4
try using cache arguments to keep vectorization
simonariddell Jun 4, 2018
eaede34
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Jun 9, 2018
64097cc
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Jun 19, 2018
fcd7183
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Jun 19, 2018
e8681d6
Merge branch 'groupby_pctchange_bug' of https://github.com/SimonAleck…
simonariddell Jun 24, 2018
77c3935
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Jun 24, 2018
d037e65
Update test to address WillAyds code suggestion
simonariddell Jul 29, 2018
a4fcb33
incorporate Will CRs
simonariddell Sep 27, 2018
64d2e4f
cr update
simonariddell Sep 28, 2018
eabfe1f
merge upstream
simonariddell Sep 28, 2018
a63553a
incorporate CR and merge in master
simonariddell Sep 29, 2018
c72c112
Merge branch 'master' of https://github.com/pandas-dev/pandas
simonariddell Nov 23, 2018
a745209
merge in master
simonariddell Nov 23, 2018
5d3fc2d
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Nov 25, 2018
ac9d005
Merge remote-tracking branch 'upstream/master'
simonariddell Nov 25, 2018
1f1f705
Merge branch 'master' into groupby_pctchange_bug
simonariddell Nov 25, 2018
795d7be
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Nov 25, 2018
f0a9c6d
Merge branch 'groupby_pctchange_bug' of https://github.com/SimonAleck…
simonariddell Nov 27, 2018
a979f79
BUG,TST: Remove case where vectorization fails in pct_change groupby …
simonariddell Nov 27, 2018
5eb79c2
Merge branch 'groupby_pctchange_bug' of https://github.com/SimonAleck…
simonariddell Nov 27, 2018
b11d20f
BUG,TST: GH21234 groupby pct_change
simonariddell Nov 27, 2018
f480299
BUG,TST: GH21234 groupby pct_change
simonariddell Nov 28, 2018
a3c95cb
BUG,TST: GH21235 groupby pct_change whatsnew
simonariddell Dec 2, 2018
e5289a2
BUG, TST: GH 21235 groupby pct_change whatsnew
simonariddell Dec 9, 2018
1ccc507
Merge remote-tracking branch 'upstream/master'
simonariddell Dec 9, 2018
7f4a244
merge master
simonariddell Dec 9, 2018
c26cf60
BUG, TST: GH 21235 groupby pct_change whatsnew
simonariddell Dec 9, 2018
877b433
BUG, TST: GH 21235 groupby pct_change whatsnew
simonariddell Dec 9, 2018
c8b686e
Merge remote-tracking branch 'upstream/master'
simonariddell Dec 11, 2018
faf532e
Merge remote-tracking branch 'upstream/master'
simonariddell Dec 12, 2018
6d87075
Merge branch 'master' into groupby_pctchange_bug
simonariddell Dec 12, 2018
ad934b7
BUG, TST: GH 21235 groupby pct_change whatsnew
simonariddell Dec 12, 2018
01d705f
BUG, TST: GH 21235 groupby pct_change test parametrize
simonariddell Dec 12, 2018
24 changes: 8 additions & 16 deletions pandas/core/groupby/groupby.py
@@ -2070,17 +2070,10 @@ def shift(self, periods=1, freq=None, axis=0):
def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None,
axis=0):
"""Calculate pct_change of each value to previous entry in group"""
if freq is not None or axis != 0:
return self.apply(lambda x: x.pct_change(periods=periods,
fill_method=fill_method,
limit=limit, freq=freq,
axis=axis))

filled = getattr(self, fill_method)(limit=limit).drop(
self.grouper.names, axis=1)
shifted = filled.shift(periods=periods, freq=freq)

return (filled / shifted) - 1
return self.apply(lambda x: x.pct_change(periods=periods,
                                         fill_method=fill_method,
                                         limit=limit, freq=freq,
                                         axis=axis))

Review comment (Member): Isn't this essentially eliminating a vectorized solution across the board? How are the ASVs comparing?

Reply (Contributor Author): pct_change is blacklisted in benchmarks/groupby.py.

@Substitution(name='groupby')
@Appender(_doc_template)
@@ -3942,11 +3935,10 @@ def _apply_to_column_groupbys(self, func):
return func(self)

def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None):
"""Calculate percent change of each value to previous entry in group"""
filled = getattr(self, fill_method)(limit=limit)
shifted = filled.shift(periods=periods, freq=freq)

return (filled / shifted) - 1
"""Calculate pct_change of each value to previous entry in group"""
return self.apply(lambda x: x.pct_change(periods=periods,
fill_method=fill_method,
limit=limit, freq=freq))


class NDFrameGroupBy(GroupBy):
36 changes: 25 additions & 11 deletions pandas/tests/groupby/test_transform.py
@@ -722,19 +722,29 @@ def interweave(list_obj):


@pytest.mark.parametrize("test_series", [True, False])
@pytest.mark.parametrize("shuffle", [True, False])
@pytest.mark.parametrize("periods,fill_method,limit", [
(1, 'ffill', None), (1, 'ffill', 1),
(1, 'bfill', None), (1, 'bfill', 1),
(-1, 'ffill', None), (-1, 'ffill', 1),
(-1, 'bfill', None), (-1, 'bfill', 1)])
def test_pct_change(test_series, periods, fill_method, limit):
vals = [np.nan, np.nan, 1, 2, 4, 10, np.nan, np.nan]
exp_vals = Series(vals).pct_change(periods=periods,
fill_method=fill_method,
limit=limit).tolist()

df = DataFrame({'key': ['a'] * len(vals) + ['b'] * len(vals),
'vals': vals * 2})
def test_pct_change(test_series, shuffle, periods, fill_method, limit):

Review comment (Member): Could you add a comment with the issue number for this test (like GH #21200, #21621)? It looks like the other issue is covered here as well.
vals = [3, np.nan, 1, 2, 4, 10, np.nan, np.nan]
keys = ['a', 'b']
key_v = [k for j in list(map(lambda x: [x] * len(vals), keys)) for k in j]

Review comment (Contributor): if you really want to do this, use np.repeat

df = DataFrame({'key': key_v, 'vals': vals * 2})
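The reviewer's np.repeat suggestion would replace the nested comprehension with something like this (a sketch using the same vals/keys values as the test):

```python
import numpy as np

vals = [3, np.nan, 1, 2, 4, 10, np.nan, np.nan]
keys = ['a', 'b']

# Each key repeated len(vals) times, in order: ['a', 'a', ..., 'b', 'b', ...]
key_v = np.repeat(keys, len(vals))
```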
if shuffle:
    order = np.random.RandomState(seed=42).permutation(len(df))
    df = df.reindex(order).reset_index(drop=True)

Review comment (Member): Maybe I'm missing something, but what's the point of this? If you are seeding your random call, isn't this just always going to return the same order? If so, it would be preferable just to be explicit about that order, though I'm not sure it's necessary in the first place.

Reply (Contributor Author): Eh, you're right. Nixed it.
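On the seeding point: a freshly constructed RandomState with a fixed seed produces the same permutation on every call, so the "shuffle" is really a fixed reordering (minimal sketch):

```python
import numpy as np

n = 16  # length of the test frame (2 keys * 8 vals)
order_first = np.random.RandomState(seed=42).permutation(n)
order_second = np.random.RandomState(seed=42).permutation(n)

# Identical results every time: the "random" order is deterministic,
# which is why the reviewer suggested spelling out an explicit order.
same = (order_first == order_second).all()
```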

manual_apply = []
for k in keys:
    subgroup = Series(df.loc[df.key == k, 'vals'].values)
    manual_apply.append(subgroup.pct_change(periods=periods,
                                            fill_method=fill_method,
                                            limit=limit))
exp_vals = pd.concat(manual_apply).reset_index(drop=True)

Review comment (Contributor): why aren't you just using this? df.loc[df.key == k, 'vals'], which is a Series

Review comment (Member): This loop / concat mechanism is rather non-idiomatic - what's the reasoning for this instead of just doing a comparison to the anonymous function as was done in the original issue?

Reply (Contributor Author, Jun 24, 2018): I chose to do this as it means the test will ultimately assert that the result of a groupby apply is the same as looping over the keys and doing a normal apply, as the two should be identical. Or, more specifically, the groupby apply ought to replicate the results of a normal apply looped over the groups. I'm experimenting now with what an anonymous comparison would look like, but my naive thought is that a comparison to a normal apply is more to-the-point on what we want to test?

Edit: Fixed it in the newest CR. I like your suggestion more now that I see it in code.
exp = pd.DataFrame(exp_vals, columns=['A'])

Review comment (Contributor): concat already returns a frame, use ignore_index=True in the concat

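The ignore_index suggestion folds the reset_index step into the concat itself (sketch with hypothetical two-part data, not the test's values):

```python
import pandas as pd

parts = [pd.Series([1.0, 2.0]), pd.Series([3.0, 4.0])]

# ignore_index=True renumbers the concatenated result 0..n-1 directly,
# removing the need for a separate reset_index(drop=True) call.
combined = pd.concat(parts, ignore_index=True)
frame = combined.to_frame(name='A')
```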
grp = df.groupby('key')

def get_result(grp_obj):
@@ -743,14 +753,18 @@ def get_result(grp_obj):
limit=limit)

if test_series:
exp = pd.Series(exp_vals * 2)
exp.name = 'vals'
exp = exp.loc[:, 'A']
grp = grp['vals']
result = get_result(grp)

Review comment (Member): Not that your change introduced it but can we get rid of the get_result function? Not sure why this was set up this way and adds unnecessary fluff

Reply (Contributor Author): Removed
df.insert(0, 'A', result)

Review comment (Member): What is the point of all of the insert / sort calls? Would rather just construct the expected value and simply compare that to the result of a pct_change call

Reply (Contributor Author): Agree. Fixed.
result = df.sort_values(by='key')
result = result.loc[:, 'A']
result = result.reset_index(drop=True)

Review comment (Contributor): don't use inplace in testing

tm.assert_series_equal(result, exp)
else:
exp = DataFrame({'vals': exp_vals * 2})
result = get_result(grp)
result.reset_index(drop=True, inplace=True)
result.columns = ['A']
tm.assert_frame_equal(result, exp)
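For reference, the `tm` helpers used by this test are pandas' testing asserts (imported as pandas.util.testing at the time of this PR, now public as pandas.testing); a minimal sketch of the frame comparison:

```python
import pandas as pd
import pandas.testing as tm  # modern public home of assert_frame_equal

left = pd.DataFrame({'A': [1.0, 2.0]})
right = pd.DataFrame({'A': [1.0, 2.0]})

# Raises AssertionError on any mismatch in values, dtype, index, or
# columns; returns None when the frames are equal.
tm.assert_frame_equal(left, right)
passed = True
```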

