pct_change bug, issue #21200 (#21235)
```diff
@@ -2075,12 +2075,19 @@ def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None,
                                             fill_method=fill_method,
                                             limit=limit, freq=freq,
                                             axis=axis))
-        filled = getattr(self, fill_method)(limit=limit).drop(
-            self.grouper.names, axis=1)
-        shifted = filled.shift(periods=periods, freq=freq)
-        return (filled / shifted) - 1
+        if fill_method:
+            new = DataFrameGroupBy(self._obj_with_exclusions,
+                                   grouper=self.grouper)
+            new.obj = getattr(new, fill_method)(limit=limit)
+            new._reset_cache()
+        else:
+            new = self
+
+        obj = new.obj.drop(self.grouper.names, axis=1)
+        shifted = new.shift(periods=periods, freq=freq).\
+            drop(self.grouper.names, axis=1)
+        return (obj / shifted) - 1

     @Substitution(name='groupby')
     @Appender(_doc_template)
```

Review comment (on the backslash continuation in `shifted = ...`): Side note - I thought this would fail linting, but regardless we tend to prefer implicit line continuation, so it would be better to do
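The implicit-continuation style the reviewer prefers over a trailing backslash looks like this (a generic illustration with made-up data, not the PR's actual code):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Backslash continuation (discouraged):
# result = df.shift(1).\
#     drop('a', axis=1)

# Implicit continuation inside parentheses (preferred):
result = (df.shift(1)
            .drop('a', axis=1))
print(result['b'].tolist())  # [nan, 3.0]
```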
```diff
@@ -3942,11 +3949,16 @@ def _apply_to_column_groupbys(self, func):
         return func(self)

     def pct_change(self, periods=1, fill_method='pad', limit=None, freq=None):
-        """Calculate percent change of each value to previous entry in group"""
-        filled = getattr(self, fill_method)(limit=limit)
-        shifted = filled.shift(periods=periods, freq=freq)
-        return (filled / shifted) - 1
+        """Calculate pct_change of each value to previous entry in group"""
+        if fill_method:
+            new = SeriesGroupBy(self.obj, grouper=self.grouper)
+            new.obj = getattr(new, fill_method)(limit=limit)
+            new._reset_cache()
+        else:
+            new = self
+
+        shifted = new.shift(periods=periods, freq=freq)
+        return (new.obj / shifted) - 1


 class NDFrameGroupBy(GroupBy):
```
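The behavior the patched methods aim for - fill within each group, shift within each group, then divide - can be emulated with public groupby helpers. A minimal sketch (column names and values here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'vals': [1.0, np.nan, 2.0, 4.0, np.nan, 12.0]})

# Equivalent of a per-group pct_change(fill_method='ffill'):
filled = df.groupby('key')['vals'].ffill()    # fill within each group
shifted = filled.groupby(df['key']).shift(1)  # shift within each group
result = filled / shifted - 1
print(result.tolist())  # [nan, 0.0, 1.0, nan, 0.0, 2.0]
```

Note that the fill must happen per group: a global `ffill` would leak the last value of group 'a' into the leading NaNs of group 'b', which is exactly the class of bug reported in #21200.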
```diff
@@ -722,19 +722,30 @@ def interweave(list_obj):


 @pytest.mark.parametrize("test_series", [True, False])
+@pytest.mark.parametrize("shuffle", [True, False])
 @pytest.mark.parametrize("periods,fill_method,limit", [
     (1, 'ffill', None), (1, 'ffill', 1),
     (1, 'bfill', None), (1, 'bfill', 1),
     (-1, 'ffill', None), (-1, 'ffill', 1),
-    (-1, 'bfill', None), (-1, 'bfill', 1)])
-def test_pct_change(test_series, periods, fill_method, limit):
-    vals = [np.nan, np.nan, 1, 2, 4, 10, np.nan, np.nan]
-    exp_vals = Series(vals).pct_change(periods=periods,
-                                       fill_method=fill_method,
-                                       limit=limit).tolist()
-
-    df = DataFrame({'key': ['a'] * len(vals) + ['b'] * len(vals),
-                    'vals': vals * 2})
+    (-1, 'bfill', None), (-1, 'bfill', 1),
+])
+def test_pct_change(test_series, shuffle, periods, fill_method, limit):
+    vals = [3, np.nan, 1, 2, 4, 10, np.nan, 4]
+    keys = ['a', 'b']
+    key_v = np.repeat(keys, len(vals))
+    df = DataFrame({'key': key_v, 'vals': vals * 2})
+    if shuffle:
+        order = np.random.RandomState(seed=42).permutation(len(df))
+        df = df.reindex(order).reset_index(drop=True)
+
+    manual_apply = []
+    for k in keys:
+        ind = df.loc[df.key == k, 'vals']
+        manual_apply.append(ind.pct_change(periods=periods,
+                                           fill_method=fill_method,
+                                           limit=limit))
+    exp_vals = pd.concat(manual_apply, ignore_index=True)
+    exp = pd.DataFrame(exp_vals.values, columns=['A'])
     grp = df.groupby('key')

     def get_result(grp_obj):

@@ -743,14 +754,21 @@ def get_result(grp_obj):
                            limit=limit)

     if test_series:
-        exp = pd.Series(exp_vals * 2)
-        exp.name = 'vals'
+        exp = exp.loc[:, 'A']
         grp = grp['vals']
-    result = get_result(grp)
+        result = get_result(grp)
+        df.insert(0, 'A', result)
+        result = df.sort_values(by='key')
+        result = result.loc[:, 'A']
+        result = result.reset_index(drop=True)
         tm.assert_series_equal(result, exp)
     else:
-        exp = DataFrame({'vals': exp_vals * 2})
-    result = get_result(grp)
-    result = result.reset_index(drop=True)
+        result = get_result(grp)
+        df.insert(0, 'A', result)
+        result = df.sort_values(by='key')
+        result = result.loc[:, ['A']]
+        result = result.reset_index(drop=True)
         tm.assert_frame_equal(result, exp)
```

Review comments on this test:

- On the test function definition: Could you add a comment with the issue number for this test (like
- On the seeded shuffle: Maybe I'm missing something, but what's the point of this? If you are seeding your random call, isn't this always going to return the same order? If so, it would be preferable just to be explicit about that order, though I'm not sure it's necessary in the first place.
  Reply: Eh, you're right. Nixed it.
- On the manual loop: why aren't you just using this? This loop / concat mechanism is rather non-idiomatic - what's the reasoning for this instead of just comparing to the anonymous function as was done in the original issue?
  Reply: I chose to do this because it means the test asserts that the result of a groupby apply is the same as looping over the keys and doing a normal apply, as the two should be identical. Edit: Fixed it in the newest CR. I like your suggestion more now that I see it in code.
- On the result handling: Not that your change introduced it, but can we get rid of the
  Reply: Removed
- On the insert / sort calls: What is the point of all of the insert / sort calls? I would rather just construct the expected value and simply compare that to the result of a pct_change call.
  Reply: Agree. Fixed.
- On `reset_index`: don't use inplace in testing
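For reference, a small sketch of the single-Series `pct_change` behavior that each group in the test is expected to reproduce (values chosen for illustration; note that passing `fill_method` to `pct_change` emits a deprecation warning in recent pandas):

```python
import numpy as np
import pandas as pd

# pct_change fills, shifts, then divides:
# filled  = [1, 1, 2]
# shifted = [NaN, 1, 1]
# result  = filled / shifted - 1
s = pd.Series([1.0, np.nan, 2.0])
result = s.pct_change(fill_method='ffill')
print(result.tolist())  # [nan, 0.0, 1.0]
```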
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of these calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a fill method is chosen, I want a groupby instance that has the properly filled data as its obj. When I assign new.obj = getattr(new, fill_method)(limit=limit), the new._cache is not updated, as a result it does not reflect the latest change.
By creating a new groupby object with the data object with the correct filled values I am able to call new.shift(periods=periods, freq=freq) later, where the shift method is called on data that has already been properly filled.
The 'trick' is that by assigning the filled data back to the groupby instance, I am able to still have access to the helper methods like shift.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give me an example of where these three lines differ from just doing
self.ffill()
(I'm assumingfill_method
=='ffill' andlimit
==1)?Haven't seen any other aggregation function have to make these types of calls so I'm still confused as to why this is necessary. A concrete example would be immensely helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can simplify those three lines, but I don't believe I can simply use self.ffill(). Let me try to explain:
The first line can be replaced with new = copy(self). However, the reason I call the fill method and reassign it to the .obj attribute, is that later on line 2087 I want to call the groupby implementation of the shift method on new. Whereas type(self.ffill()) returns a DataFrame. I used this 'trick' of doing the fill and reassigning it back to the .obj attribute in order to still have access to the groupby helper methods. If there is a more idiomatic way to do this, please let me know.
In this case if I were to remove those three lines and replace with with self.ffill(), the code would immediately fail when I tried to call a groupby method on the returned dataframe object.
Is this sufficiently clear/concrete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably just easier if I share code myself at this point :-). Building off of what you have here's my stab at implementation:
https://github.com/WillAyd/pandas/blob/ebe4fffd935f2f16b6e5c635cfa4883e6bacc5ea/pandas/core/groupby/groupby.py#L2070
AFAICT it passes all of the tests you have and therefore provides the same behavior. However, it has the advantage of being more terse and not requiring the construction of a secondary
GroupBy
object.Not saying you need to exactly mimic this but I'm sharing to highlight how I think this can be done more succinctly again without the need of constructing entirely new objects. If you know of where this would fail then we probably want to add a test case to handle whatever that difference is. Otherwise if they are the same I'd prefer if you adopt something a little more in line with what I've suggested.
No worries on timing - this is open source so any and all contributions are welcome at any time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing some code, that was very helpful. I updated my test to highlight a case where your code would fail, but what I wrote wouldn't. Specifically (1) I added a test case for no fill method, and (2) I added more np.nan in the test data. This causes your code to fail as it does not apply the fill_method to the shifted data.
I agree making a second object seems wrong. However, It's not clear to me how to avoid this, unless there are additional helper methods that I'm just not aware of or finding.
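The distinction discussed above - that calling a fill method on a GroupBy hands back a plain DataFrame, which no longer exposes per-group helpers such as `shift` - can be seen in a toy example. Regrouping the filled result is one public-API way to regain those helpers without touching private attributes like `_cache` (a sketch with made-up data, not the PR's code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'vals': [1.0, np.nan, 2.0, np.nan]})
grp = df.groupby('key')

# Calling a fill method on the GroupBy returns a plain DataFrame,
# so groupby helpers like the per-group .shift() are gone:
filled = grp.ffill()
print(type(filled))  # <class 'pandas.core.frame.DataFrame'>

# Regrouping the filled result restores the groupby methods:
regrouped = filled.groupby(df['key'])
shifted = regrouped.shift(1)  # shifts within each original group
```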