pct change bug issue 21200 #21235
Conversation
Hello @SimonAlecks! Thanks for updating the PR.
Comment last updated on November 27, 2018 at 06:03 Hours UTC
Just eyeballing this, I do not like that the implementation is going to vary depending on the monotonicity of the group, as that has relatively large performance implications that the standard user will not understand, nor should they have to care about. Is there any way to preserve the vectorized solution and come to the same result?
@WillAyd It's hard for me to think of a scenario where the vectorized solution would work if the data is not in order with respect to groups, although perhaps the monotonicity requirement is overkill? I'm new to Pandas source code, so I'll investigate the existing helper functions that may preserve the intended logic but lessen the requirements. I'm also a little concerned though that the vectorized solution may be getting the first element of each group wrong, if it's calculating its percentage change with respect to the last element of the previous group. I'll check this when I get home.
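For context, a minimal reproduction of the cross-group contamination being discussed (values invented for illustration):

import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                   'vals': [1.0, 2.0, 3.0, 6.0]})

# Correct per-group result: [NaN, 1.0, NaN, 1.0]
# Buggy vectorized path: the first 'b' row is divided by the last 'a'
# row, giving [NaN, 1.0, 0.5, 1.0]
df.groupby('grp')['vals'].pct_change()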
Codecov Report
@@ Coverage Diff @@
## master #21235 +/- ##
==========================================
+ Coverage 92.21% 92.21% +<.01%
==========================================
Files 162 162
Lines 51763 51768 +5
==========================================
+ Hits 47733 47740 +7
+ Misses 4030 4028 -2
Continue to review full report at Codecov.
After changing the UT to have a value in the 0th mock array element, the UT begins to fail. The non-apply-based vectorized solution doesn't seem to work, as it calculates a pct change between the first element of a group and the last element of the preceding group. It seems to me the simplest solution to this issue is to remove the potential to use a vectorized solution. In addition to solving the original issue, this also solves the other issue I discovered above, with the additional bonus of letting the user expect the same index structure whether or not the vectorized path was used (granted, this could easily be fixed). I suppose we could try preserving the vectorized solution, then writing a little algo to find the first element of each group and setting it to NaN (see the sketch below). I have no prior knowledge as to whether this would improve performance vs. calling apply; my intuition says at this point you'd be better off just using apply in the first place, but I could very well be wrong. I also left my updated UT in, as it would catch these cases if they occurred again, or if a vectorized solution was reimplemented. Any thoughts @WillAyd? PS: Does anyone have a link they can share with me on how to troubleshoot/debug the continuous-integration/travis-ci/pr failure?
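A rough sketch of that "set the first element of each group to NaN" idea for periods=1, assuming groups are contiguous (illustrative only, not the PR's code):

import numpy as np
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                   'vals': [1.0, 2.0, 3.0, 6.0]})

filled = df['vals'].ffill()
pct = filled / filled.shift(1) - 1
# Rows where the group label changes from the previous row would compare
# across group boundaries, so invalidate them:
pct[df['grp'] != df['grp'].shift(1)] = np.nan  # -> [NaN, 1.0, NaN, 1.0]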
you are killing performance here, this is why it was changed originally. The real fix is to do this in a cythonized version, similar to the way shift/fillna are done.
df.insert(0, 'A', result)
result = df.sort_values(by='key')
result = result.loc[:, 'A']
result = result.reset_index(drop=True)
don't use inplace in testing
def test_pct_change(test_series, shuffle, periods, fill_method, limit):
    vals = [3, np.nan, 1, 2, 4, 10, np.nan, np.nan]
    keys = ['a', 'b']
    key_v = [k for j in list(map(lambda x: [x] * len(vals), keys)) for k in j]
if you really want to do this, use np.repeat
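For illustration, the comprehension above collapses to a single call (with keys = ['a', 'b'] and len(vals) == 8, as in the test):

import numpy as np

key_v = np.repeat(['a', 'b'], 8)  # ['a', 'a', ..., 'b', 'b']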
df = df.reindex(order).reset_index(drop=True)

manual_apply = [] | ||
for k in keys: |
why aren't you just using this? df.loc[df.key == k, 'vals'], which is a Series
                                   fill_method=fill_method,
                                   limit=limit))
exp_vals = pd.concat(manual_apply).reset_index(drop=True)
exp = pd.DataFrame(exp_vals, columns=['A'])
concat already returns a frame, use ignore_index=True in the concat
Just looking at this, can't you just change the following line: pandas/pandas/core/groupby/groupby.py Line 2081 in c85ab08
To be self.shift instead of filled.shift?
Hi William, in my most recent commit I tried to preserve vectorization, while also not making costly calls to is_monotonic, and preventing group contamination within the vectorization (e.g. the first element of group B being divided by an element of group A). I did this by only allowing the code to attempt vectorization if it can confirm that the series is monotonic, and if the truth value of whether or not it is monotonic already exists in the cache. If these conditions are met, it will do a vectorized pct change operation, and then a vectorized operation to set any values that shouldn't exist to np.nan (e.g. for shift=1 the first element of a group would be divided by the last element of the prior group, but this is an invalid value). I expect this is too fancy given the problem and my knowledge of the pandas source, so I'm prepared to nix it if that's the case, and just keep it simple and/or try to implement it in Cython. Please just let me know; otherwise I will extend it to the dataframe method as well.
This isn't the correct approach and is way too convoluted. My previous suggestion worked fine for your tests, save the tests that introduce allowing None as an argument.
If however you can find a scenario where what I suggested doesn't work then this needs to be implemented in Cython.
df = DataFrame({'key': ['a'] * len(vals) + ['b'] * len(vals),
                'vals': vals * 2})
(-1, 'bfill', None), (-1, 'bfill', 1),
(-1, None, None), (-1, None, 1),
None is not a valid fill_method on master so this is changing the behavior
pandas/core/groupby/groupby.py
Outdated
shifted = filled.shift(periods=periods, freq=freq)

return (filled / shifted) - 1
return self.apply(lambda x: x.pct_change(periods=periods,
Isn't this essentially eliminating a vectorized solution across the board? How are the ASVs comparing?
pct_change is blacklisted in benchmarks/groupby.py.
…method, improve test. pandas-dev#21235
For the reasons I mentioned above, a vectorized solution will return incorrect values for pct_change under a set of conditions. I attempted to solve these conditions in the CR above, which William pointed out was too convoluted (I agree). The additional change you suggested (self.shift instead of filled.shift) does not work either, for the reasons I mentioned above; the new test I wrote would fail under that change. My suggestion is to update the test, remove the vectorized solution, and create a new issue to implement this in Cython, which I'm interested in doing.
Not sure what you mean by the benchmarks being blacklisted, but for illustration purposes here is how your latest compares to master:

       before           after         ratio
[415012f4] [eaede342]
+ 571±1μs 975±4ms 1708.33 groupby.GroupByMethods.time_dtype_as_group('float', 'pct_change', 'transformation')
+ 572±2μs 974±0.6ms 1702.23 groupby.GroupByMethods.time_dtype_as_group('float', 'pct_change', 'direct')
+ 592±3μs 635±8ms 1072.93 groupby.GroupByMethods.time_dtype_as_group('int', 'pct_change', 'transformation')
+ 606±8μs 628±3ms 1036.87 groupby.GroupByMethods.time_dtype_as_group('int', 'pct_change', 'direct')
+ 559±10μs 514±10ms 920.59 groupby.GroupByMethods.time_dtype_as_field('float', 'pct_change', 'direct')
+ 529±0.9μs 481±5ms 908.79 groupby.GroupByMethods.time_dtype_as_field('float', 'pct_change', 'transformation')
+ 603±8μs 460±20ms 762.64 groupby.GroupByMethods.time_dtype_as_field('int', 'pct_change', 'direct')
+ 593±1μs 445±3ms 750.02 groupby.GroupByMethods.time_dtype_as_field('int', 'pct_change', 'transformation')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

As you can see (and I think you have already recognized) this isn't a great solution and kills performance.
I'm not entirely clear on what your reasons are. It's not a requirement that grouped elements be monotonic in any of the Cythonized code. If you are getting an error on the test, can you be explicit about what exactly that error is? IIRC, with the last set of tests, what I suggested worked fine. I have a hunch that there may be confusion about the test as it was before (i.e. explicitly passing …). Final note: thanks for helping on this. I hope you are taking this feedback constructively and not in any negative fashion (I believe you are, but just want to be explicit). Understanding how / when to interface with the Cythonized code is one of the more complicated aspects of this package.
…method. Incorporate CR suggestion.
I'm very grateful for the feedback, and I definitely appreciate your time. I apologize for the delay since the last commit; I've been working on understanding the code more clearly. In this commit I've attempted to solve the problem by incorporating your prior feedback. My previous posts were incorrect, and you correctly pointed out to me how the shift operator works. In this latest CR I've attempted to fix the problem in line with the spirit of the original implementation. The first step was to add an explicit check for fill_method, which did not exist before. I initially attempted to implement your one-line suggestion, but I noticed that I couldn't simply call self.shift(), as the data first needed to have the proper fill method applied; however, the fill method returns the obj attribute, not the groupby object. To remedy this without creating a function that changed the state of the code, I used the strategy shown below. Once this is done, the result is relatively self-explanatory and pretty similar to the original code. Lastly, I ran an ASV benchmark locally, and the performance seemed similar, but I need to spend some more time teaching myself how the benchmarking system and code works, as it's possible I made an error.
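The strategy in question is the three-line block quoted in the review below; reproduced here with comments for readability:

# Build a second GroupBy over the same data, run the fill method on it,
# and assign the filled frame back to .obj so that groupby helpers such
# as .shift() subsequently operate on the filled values:
new = DataFrameGroupBy(self._obj_with_exclusions,
                       grouper=self.grouper)
new.obj = getattr(new, fill_method)(limit=limit)
new._reset_cache()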
Getting closer
pandas/core/groupby/groupby.py
Outdated
new = DataFrameGroupBy(self._obj_with_exclusions,
                       grouper=self.grouper)
new.obj = getattr(new, fill_method)(limit=limit)
new._reset_cache()
What's the point of these calls?
If a fill method is chosen, I want a groupby instance that has the properly filled data as its obj. When I assign new.obj = getattr(new, fill_method)(limit=limit), the new._cache is not updated; as a result, it does not reflect the latest change.
By creating a new groupby object whose data object has the correct filled values, I am able to call new.shift(periods=periods, freq=freq) later, where the shift method is called on data that has already been properly filled.
The 'trick' is that by assigning the filled data back to the groupby instance, I still have access to the helper methods like shift.
Can you give me an example of where these three lines differ from just doing self.ffill() (I'm assuming fill_method == 'ffill' and limit == 1)?
Haven't seen any other aggregation function have to make these types of calls so I'm still confused as to why this is necessary. A concrete example would be immensely helpful
I can simplify those three lines, but I don't believe I can simply use self.ffill(). Let me try to explain:
The first line can be replaced with new = copy(self). However, the reason I call the fill method and reassign it to the .obj attribute is that later, on line 2087, I want to call the groupby implementation of the shift method on new, whereas self.ffill() returns a DataFrame. I used this 'trick' of doing the fill and reassigning it back to the .obj attribute in order to still have access to the groupby helper methods. If there is a more idiomatic way to do this, please let me know.
In this case, if I were to remove those three lines and replace them with self.ffill(), the code would immediately fail when I tried to call a groupby method on the returned DataFrame object.
Is this sufficiently clear/concrete?
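A small illustration of the type mismatch being described (example data invented):

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'vals': [1.0, None, 2.0]})
gb = df.groupby('key')

# gb.ffill() hands back a plain DataFrame, so groupby-aware methods
# such as the groupby .shift() are no longer available on the result:
print(type(gb.ffill()))  # <class 'pandas.core.frame.DataFrame'>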
Probably just easier if I share code myself at this point :-). Building off of what you have here's my stab at implementation:
AFAICT it passes all of the tests you have and therefore provides the same behavior. However, it has the advantage of being more terse and not requiring the construction of a secondary GroupBy object.
Not saying you need to exactly mimic this, but I'm sharing to highlight how I think this can be done more succinctly, again without the need of constructing entirely new objects. If you know of where this would fail then we probably want to add a test case to handle whatever that difference is. Otherwise, if they are the same, I'd prefer if you adopt something a little more in line with what I've suggested.
No worries on timing - this is open source so any and all contributions are welcome at any time.
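The snippet itself was not captured in this thread; a rough sketch of the terser fill-then-shift-within-groups idea being described, written as a free-standing function (my reconstruction, not the original code):

import pandas as pd

def grouped_pct_change(df, by, periods=1, fill_method='ffill', limit=None):
    # Fill within groups first, then regroup the *filled* values so the
    # shift also stays inside each group; no secondary GroupBy over the
    # original object is needed.
    gb = df.groupby(by)
    filled = getattr(gb, fill_method)(limit=limit)
    shifted = filled.groupby(df[by]).shift(periods=periods)
    return filled / shifted - 1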
Thanks for sharing some code, that was very helpful. I updated my test to highlight a case where your code would fail but mine wouldn't. Specifically, (1) I added a test case for no fill method, and (2) I added more np.nan values to the test data. This causes your code to fail, as it does not apply the fill_method to the shifted data.
I agree that making a second object seems wrong. However, it's not clear to me how to avoid this, unless there are additional helper methods that I'm just not aware of or not finding.
df = df.reindex(order).reset_index(drop=True)

manual_apply = [] | ||
for k in keys: |
This loop / concat mechanism is rather non-idiomatic - what's the reasoning for this instead of just doing a comparison to the anonymous function as was done in the original issue?
I chose to do this as it means the test will ultimately assert that the result of a groupby apply is the same as looping over the keys and doing a normal apply, as the two should be identical. Or, more specifically, the groupby apply ought to replicate the results of a normal apply looped over the groups.
I'm experimenting now with what an anonymous comparison would look like, but my naive thought is that a comparison to a normal apply is more to-the-point on what we want to test?
Edit: Fixed it in the newest CR. I like your suggestion more now that I see it in code.
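A hypothetical, simplified form of the suggested comparison, reusing the test data quoted above: rather than looping over keys and concatenating, compare the groupby result directly against a per-group apply of an anonymous function.

import numpy as np
import pandas as pd

vals = [3, np.nan, 1, 2, 4, 10, np.nan, np.nan]
df = pd.DataFrame({'key': ['a'] * len(vals) + ['b'] * len(vals),
                   'vals': vals * 2})

result = df.groupby('key')['vals'].pct_change()
expected = df.groupby('key')['vals'].apply(lambda s: s.pct_change())
pd.testing.assert_series_equal(result, expected)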
@@ -743,14 +754,21 @@ def get_result(grp_obj):
                          limit=limit)

if test_series:
    exp = pd.Series(exp_vals * 2)
    exp.name = 'vals'
    exp = exp.loc[:, 'A']
    grp = grp['vals']
result = get_result(grp)
Not that your change introduced it, but can we get rid of the get_result function? Not sure why this was set up this way; it adds unnecessary fluff.
Removed
grp = grp['vals']
result = get_result(grp)
df.insert(0, 'A', result)
What is the point of all of the insert / sort calls? Would rather just construct the expected value and simply compare that to the result of a pct_change call
Agree. Fixed.
…method. Incorporate CR suggestion.
…s/pandas into groupby_pctchange_bug
…method. Incorporate CR suggestion.
Incorporated most requested changes, with clarifying questions asked on a few of them. Thanks for your patience!
@SimonAlecks do you believe this change would also solve #21621? If so, could you add a test case for that issue?
return grp_obj.pct_change(periods=periods,
                          fill_method=fill_method,
                          limit=limit)
exp = grp['vals'].shift(0) / grp['vals'].shift(periods) - 1
Do you mind changing exp to expected while you are touching this? The latter is a quasi-standard we've been applying to tests.
I'll change exp to expected, and wait for a response on the other comment before updating. I'm still active on this, apologies for the delay in responding to your helpful comments. Thanks!
OK I think this is in a good enough state with appropriate follow ups documented. @jreback over to you
looks ok. this needs a whatsnew entry. I think it's important enough to warrant a sub-section, with a small example of what it was doing before and the correct result. you can put it in right below the grouping section in Bug Fixes I think is ok.
Added the whatsnew. I checked for previous examples in the docs to follow style guidelines, but please let me know if it's too verbose or I made a stylistic mistake, and I'll quickly fix it. Thanks.
doc/source/whatsnew/v0.23.5.txt
Outdated
@@ -42,7 +42,33 @@ Bug Fixes
**Groupby/Resample/Rolling**

- Bug in :meth:`DataFrame.resample` when resampling ``NaT`` in ``TimeDeltaIndex`` (:issue:`13223`).
-
- Bug where calling :func:`SeriesGroupBy.pct_change` or :func:`DataFrameGroupBy.pct_change` would ignore groups when calculating the percent change (:issue:`21235`)
move this entire note to a sub-section in Backwards incompatible API changes.
doc/source/whatsnew/v0.23.5.txt
Outdated
.. ipython:: python

    df = pd.DataFrame({'a': [1.0, 1.1, 2.2], 'grp': ['a', 'a', 'b']})
show the df, with 'df'
Added
doc/source/whatsnew/v0.23.5.txt
Outdated
Grouping respected

.. code-block:: ipython
use an ipython block (it will run the code)
Fixed. Didn't properly read contributor documentation on how to do documentation -- sorry for the churn.
We actually aren't doing a v0.23.5 release anymore. If you can merge master and move the whatsnew to v0.24, that would be great.
doc/source/whatsnew/v0.24.0.rst
Outdated
@@ -378,6 +378,29 @@ Backwards incompatible API changes
- Passing scalar values to :class:`DatetimeIndex` or :class:`TimedeltaIndex` will now raise ``TypeError`` instead of ``ValueError`` (:issue:`23539`)
- ``max_rows`` and ``max_cols`` parameters removed from :class:`HTMLFormatter` since truncation is handled by :class:`DataFrameFormatter` (:issue:`23818`)
- :meth:`read_csv` will now raise a ``ValueError`` if a column with missing values is declared as having dtype ``bool`` (:issue:`20591`)
- Bug where calling :func:`SeriesGroupBy.pct_change` or :func:`DataFrameGroupBy.pct_change` would ignore groups when calculating the percent change (:issue:`21235`)
move this to a separate subsection
doc/source/whatsnew/v0.24.0.rst
Outdated
@@ -378,6 +378,29 @@ Backwards incompatible API changes
- Passing scalar values to :class:`DatetimeIndex` or :class:`TimedeltaIndex` will now raise ``TypeError`` instead of ``ValueError`` (:issue:`23539`)
- ``max_rows`` and ``max_cols`` parameters removed from :class:`HTMLFormatter` since truncation is handled by :class:`DataFrameFormatter` (:issue:`23818`)
- :meth:`read_csv` will now raise a ``ValueError`` if a column with missing values is declared as having dtype ``bool`` (:issue:`20591`)
- Bug where calling :func:`SeriesGroupBy.pct_change` or :func:`DataFrameGroupBy.pct_change` would ignore groups when calculating the percent change (:issue:`21235`)

**New behavior**: |
follow the formatting of other subsections
doc/source/whatsnew/v0.24.0.rst
Outdated
.. ipython:: python

    import pandas as pd
no import
doc/source/whatsnew/v0.24.0.rst
Outdated
df.groupby('grp').pct_change()

**Previous behavior**: |
previous first
pls merge master as well. small changes. ping on green.
doc/source/whatsnew/v0.24.0.rst
Outdated
@@ -379,6 +379,33 @@ Backwards incompatible API changes
- ``max_rows`` and ``max_cols`` parameters removed from :class:`HTMLFormatter` since truncation is handled by :class:`DataFrameFormatter` (:issue:`23818`)
- :meth:`read_csv` will now raise a ``ValueError`` if a column with missing values is declared as having dtype ``bool`` (:issue:`20591`)

Percentage change called on groupby now returns correct values |
call this 'Percentage change on groupby changes'
doc/source/whatsnew/v0.24.0.rst
Outdated
Percentage change called on groupby now returns correct values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fixed a bug where calling :func:`SeriesGroupBy.pct_change` or :func:`DataFrameGroupBy.pct_change` would ignore groups when calculating the percent change (:issue:`21235`).
Can you add a tiny bit more here. A user reading this will want to know what is changing, IOW the groupby percentage worked across groups, now it correctly works per group.
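For illustration, using the df from the whatsnew snippet quoted earlier, the requested before/after contrast would read roughly as follows (outputs sketched by hand, not captured from the build):

import pandas as pd

df = pd.DataFrame({'a': [1.0, 1.1, 2.2], 'grp': ['a', 'a', 'b']})

# Previous behavior: the shift ran across group boundaries, so the single
# row of group 'b' was compared against the last row of group 'a':
#     a = [NaN, 0.1, 1.0]
# New behavior: each group is computed independently, so the first row of
# every group is NaN:
#     a = [NaN, 0.1, NaN]
df.groupby('grp').pct_change()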
# GH 21200, 21621
if freq == 'D':
    pytest.xfail("'freq' test not necessary until #23918 completed and"
                 "freq is used in the vectorized approach")
you could put the xfail in the parametrize, something like this format

@pytest.mark.parametrize("engine,ext", [
    pytest.param('openpyxl', '.xlsx', marks=pytest.mark.skipif(
        not td.safe_import('openpyxl'), reason='No openpyxl')),

but of course it's an xfail and not a skip and this is not an excel test :<
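A hypothetical adaptation of that pattern to this test's freq parameter, swapping skipif for xfail (parameter values illustrative, reusing the reason string from the diff above):

import pytest

@pytest.mark.parametrize("freq", [
    None,
    pytest.param('D', marks=pytest.mark.xfail(
        reason="'freq' test not necessary until #23918 completed and "
               "freq is used in the vectorized approach")),
])
def test_pct_change(freq):
    ...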
Would this actually fail or be non-performant compared to the other argument combinations? I think the latter and if so probably OK to just remove xfail / skip altogether; we have the TODO in the actual code in regards to the perf
It would be non-performant, but it wouldn't fail (well it technically would fail, but because the index in the test is not a date-time). Personally, I think it makes sense to make the index date-time, and not xfail this. Let me know if you agree, and I'll do that.
Ah OK. No I think that would make the test here more complicated than it needs to be; if you could follow the approach as outlined by @jreback would be preferable
Sure!
I'm trying to understand this xfail and it seems like the behavior is correct bc, as @simonariddell mentioned, it should raise without a DatetimeIndex. what is the xfailed case supposed to be testing?
doc/source/whatsnew/v0.24.0.rst
Outdated
Percentage change called on groupby now returns correct values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fixed a bug where calling :func:`SeriesGroupBy.pct_change` or :func:`DataFrameGroupBy.pct_change` would ignore groups when calculating the percent change (:issue:`21235`).
can you also list both issues here
thanks @SimonAlecks
* upstream/master:
  pct change bug issue 21200 (pandas-dev#21235)
  DOC: Fix summaries not ending with a period (pandas-dev#24190)
  DOC: Add back currentmodule to api.rst (pandas-dev#24232)
  DOC: Fixed implicit imports for whatsnew (v >= version 20.0) (pandas-dev#24199)
  remove enum import for PY2 compat, xref pandas-dev#22802 (pandas-dev#24170)
  CI: Simplify Azure Pipelines configuration files (pandas-dev#23111)
  REF/TST: Add more pytest idiom to indexing/multiindex/test_getitem.py (pandas-dev#24053)
Since it came out of this PR, #23918 would be a logical follow-up.
closes #21200
closes #21621
git diff upstream/master -u -- "*.py" | flake8 --diff
This addresses #21200. When there are different groups in a dataframe, it is expected that using groupby applies the pct_change function to each group separately. However, combining groupby with pct_change does not produce the correct result.
Explanation:
Currently, the pct_change method on groupby objects (for both Series and DataFrame) can use a vectorized solution, rather than calling apply, if certain conditions are met. For the Series method, the vectorized solution is the only option.
This is inappropriate in cases where the groupby object is non-monotonic in its group order. To solve this I've added a check for monotonicity in both the Series and DataFrame implementations, as well as the option to fall back to apply for the Series method.
In addition, I have augmented the UT to accept a parameter that can shuffle the dataframe, in order to ensure that the correct calculation occurs; a sketch of this is shown below.
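A minimal sketch of what that shuffle parameter might look like (my illustration; the actual test parametrizes more dimensions, as seen in the review diffs above):

import numpy as np
import pandas as pd
import pytest

@pytest.mark.parametrize('shuffle', [True, False])
def test_pct_change_respects_groups(shuffle):
    # Shuffling interleaves the groups so that a naive vectorized shift
    # would cross group boundaries.
    df = pd.DataFrame({'key': ['a'] * 4 + ['b'] * 4,
                       'vals': [3.0, 1.0, 2.0, 4.0, 10.0, 5.0, 2.0, 1.0]})
    if shuffle:
        order = np.random.RandomState(0).permutation(len(df))
        df = df.reindex(order).reset_index(drop=True)

    result = df.groupby('key')['vals'].pct_change()
    expected = df.groupby('key')['vals'].apply(lambda s: s.pct_change())
    pd.testing.assert_series_equal(result, expected)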
Concern
One concern I have is that depending on whether the apply or vectorized solution is used within the pct_change method (e.g. depending on whether the groupby object is monotonic or not), the result returned to the user may have a different index structure. While this was the case prior to the PR, it's not clear to me whether (1) this is an acceptable design within the pandas infrastructure, and (2) this is within the scope of a single PR that was originally opened to address a very specific bug.
As this is my first pandas PR, I would certainly appreciate feedback, and will incorporate any constructive feedback into future issues.