
pct change bug issue 21200 #21235


Merged: 37 commits merged into pandas-dev:master on Dec 12, 2018

Conversation

@simonariddell (Contributor) commented May 28, 2018

closes #21200
closes #21621

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This addresses #21200. When a dataframe contains multiple groups, using groupby should cause the pct_change function to be applied within each group. However, combining groupby with pct_change does not currently produce the correct result, as the minimal reproduction below shows.
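
A minimal reproduction, using the same frame as the whatsnew example added later in this PR (the expected output has NaN as the first value of each group):

    import pandas as pd

    df = pd.DataFrame({'a': [1.0, 1.1, 2.2], 'grp': ['a', 'a', 'b']})
    # Buggy behavior: pct_change is computed over the whole column, so the
    # first row of group 'b' is compared with the last row of group 'a'
    # (2.2 / 1.1 - 1 = 1.0) instead of being NaN.
    df.groupby('grp').pct_change()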

Explanation:
Currently, the groupby pct_change implementations for Series and DataFrame can use a vectorized solution, rather than calling apply, when certain conditions are met. For the Series method, the vectorized solution is the only option.
This is incorrect in cases where the groupby object is non-monotonic in its group order. To solve this I've added a check for monotonicity to both the Series and DataFrame implementations, and added the ability for the Series method to fall back to apply.
In addition, I have augmented the UT to accept a parameter that shuffles the dataframe, in order to ensure that the correct calculation occurs even when groups are interleaved.

Concern
One concern I have is that, depending on whether the apply-based or the vectorized solution is used within pct_change (i.e. depending on whether the groupby object is monotonic), the result returned to the user may have a different index structure. While this was also the case prior to this PR, it's not clear to me whether (1) this is an acceptable design within the pandas infrastructure, and (2) addressing it is within the scope of a single PR that was originally opened to fix a very specific bug.

As this is my first pandas PR, I would certainly appreciate feedback, and will incorporate any constructive feedback into future issues.

@pep8speaks commented May 28, 2018

Hello @SimonAlecks! Thanks for updating the PR.

Comment last updated on November 27, 2018 at 06:03 Hours UTC

@WillAyd (Member) commented May 29, 2018

Just eyeballing this, I do not like that the implementation is going to vary depending on the monotonicity of the group, as that has relatively large performance implications that the standard user will not understand, nor should they have to care about.

Is there any way to preserve the vectorized solution and come to the same result?

@simonariddell (Contributor Author)

@WillAyd It's hard for me to think of a scenario where the vectorized solution would work if the data is not in order with respect to groups, although perhaps the monotonicity requirement is overkill? I'm new to Pandas source code, so I'll investigate the existing helper functions that may preserve the intended logic but lessen the requirements.

I'm also a little concerned though that the vectorized solution may be getting the first element of each group wrong, if it's calculating its percentage change with respect to the last element of the previous group. I'll check this when I get home.

@codecov bot commented May 30, 2018

Codecov Report

Merging #21235 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #21235      +/-   ##
==========================================
+ Coverage   92.21%   92.21%   +<.01%     
==========================================
  Files         162      162              
  Lines       51763    51768       +5     
==========================================
+ Hits        47733    47740       +7     
+ Misses       4030     4028       -2
Flag        Coverage Δ
#multiple   90.61% <100%> (ø) ⬆️
#single     43% <0%> (ø) ⬆️

Impacted Files                    Coverage Δ
pandas/core/groupby/groupby.py    96.67% <100%> (+0.16%) ⬆️
pandas/core/groupby/generic.py    87.12% <100%> (+0.04%) ⬆️
pandas/util/testing.py            87.41% <0%> (-0.1%) ⬇️
pandas/io/json/json.py            93.09% <0%> (+0.47%) ⬆️


@simonariddell (Contributor Author) commented May 30, 2018

After changing the UT to put a value in the 0th element of the mock array, the UT begins to fail. The non-apply, vectorized solution doesn't seem to work, as it calculates a pct change between the first element of a group and the last element of the preceding group.

It seems to me the simplest way to solve this issue is to remove the vectorized solution altogether. In addition to solving the original issue, this also solves the other issue I discovered above, with the additional bonus that the user gets the same index structure back whether or not the vectorized path was taken (granted, that inconsistency could easily be fixed on its own).

I suppose we could try preserving the vectorized solution and then write a small algorithm to find the first element of each group and set it to NaN, as sketched below. I have no prior knowledge as to whether this would improve performance vs. calling apply; my intuition says at that point you'd be better off just using apply in the first place, but I could very well be wrong.
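
For concreteness, a rough sketch of that masking idea for periods=1, assuming the frame is already sorted by group (column names follow the test data; this is illustrative, not code from the PR):

    import numpy as np

    # vectorized pct_change over the whole column...
    shifted = df['vals'].shift(1)
    pct = df['vals'] / shifted - 1
    # ...then blank out rows whose shifted value came from a different
    # group, i.e. the first row of each group
    pct[df['key'] != df['key'].shift(1)] = np.nan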

I also left my updated UT in, as it would catch these cases if they occurred again or if a vectorized solution were reimplemented.

Any thoughts @WillAyd ?

PS: Does anyone have a link they can share with me on how to troubleshoot/debug the continuous-integration/travis-ci/pr failure?

@jreback (Contributor) left a comment:

you are killing performance here, this is why it was changed originally. The real fix is to do this in a cythonized version, similar to the way shift/fillna are done.

    df.insert(0, 'A', result)
    result = df.sort_values(by='key')
    result = result.loc[:, 'A']
    result = result.reset_index(drop=True)
Contributor:
don't use inplace in testing

    def test_pct_change(test_series, shuffle, periods, fill_method, limit):
        vals = [3, np.nan, 1, 2, 4, 10, np.nan, np.nan]
        keys = ['a', 'b']
        key_v = [k for j in list(map(lambda x: [x] * len(vals), keys)) for k in j]
Contributor:
if you really want to do this, use np.repeat
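
For example, a sketch of the np.repeat equivalent:

    import numpy as np

    # each key repeated len(vals) times, same as the comprehension above
    key_v = np.repeat(keys, len(vals))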

    df = df.reindex(order).reset_index(drop=True)

    manual_apply = []
    for k in keys:
Contributor:
why aren't you just using this? df.loc[df.key == k, 'vals'], which is a Series

    fill_method=fill_method,
    limit=limit))
    exp_vals = pd.concat(manual_apply).reset_index(drop=True)
    exp = pd.DataFrame(exp_vals, columns=['A'])
Contributor:
concat already returns a frame, use ignore_index=True in the concat
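
That is, something along the lines of:

    # ignore_index=True subsumes the separate reset_index(drop=True) step
    exp_vals = pd.concat(manual_apply, ignore_index=True)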

@WillAyd (Member) commented May 30, 2018

Just looking at this, can't you just change the following line:

    shifted = filled.shift(periods=periods, freq=freq)

to use self.shift instead of filled.shift? I think that emulates exactly what the user is doing in the cited example, so this may just have been a typo in the first place. That would preserve the performance and would be much preferable if it works.
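
That is, the suggested one-line change would look like:

    # current line in pct_change:
    shifted = filled.shift(periods=periods, freq=freq)
    # suggestion: shift via the GroupBy object so the shift is group-aware
    shifted = self.shift(periods=periods, freq=freq)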

@simonariddell (Contributor Author)

In my most recent commit I tried to preserve vectorization, while avoiding costly calls to is_monotonic and preventing group contamination within the vectorized path (e.g. the first element of group B being divided by an element of group A).

I did this by only attempting vectorization if the code can confirm that the series is monotonic and the truth value of is_monotonic already exists in the cache. If those conditions are met, it does a vectorized pct_change operation, followed by a vectorized operation that sets any values that shouldn't exist to np.nan (e.g. for shift=1, the first element of a group would otherwise be divided by the last element of the prior group, which is an invalid value).

I expect this is trying to be too fancy given the problem and my knowledge of the pandas source, so I'm prepared to nix it if that's the case and just keep it simple and/or try to implement it in Cython. Please just let me know; otherwise I will extend it to the DataFrame method as well.

Hi William,
I tried what you suggested, but I don't think it makes sense. I think the current logic is correct insofar as shifted should simply be the filled series shifted once. Put another way, changing the code to your suggestion wouldn't address the fact that non-monotonic group ordering is handled incorrectly, and that even if the ordering were monotonic, the first element of a group may have its pct change computed with respect to the last element of the previous group.

@WillAyd (Member) left a comment:

This isn't the correct approach and is way too convoluted. My previous suggestion worked fine for your tests, save for the tests that newly allow None as an argument.

If however you can find a scenario where what I suggested doesn't work then this needs to be implemented in Cython.

    df = DataFrame({'key': ['a'] * len(vals) + ['b'] * len(vals),
                    'vals': vals * 2})
    (-1, 'bfill', None), (-1, 'bfill', 1),
    (-1, None, None), (-1, None, 1),
Member:
None is not a valid fill_method on master so this is changing the behavior

    shifted = filled.shift(periods=periods, freq=freq)

    return (filled / shifted) - 1
    return self.apply(lambda x: x.pct_change(periods=periods,
Member:
Isn't this essentially eliminating a vectorized solution across the board? How are the ASVs comparing?

Contributor Author:

pct_change is blacklisted in benchmarks/groupby.py.

@simonariddell (Contributor Author)

For the reasons I mentioned above, a vectorized solution will return incorrect values for pct_change under a set of conditions. I attempted to handle those conditions in the CR above, which William pointed out was too convoluted (I agree).

The additional change you suggested (self.shift instead of filled.shift) does not work either, for the reasons I mentioned above; the new test I wrote would fail under that change.

My suggestion is to update the test, remove the vectorized solution, and open a new issue to implement this in Cython, which I'm interested in doing.

@WillAyd (Member) commented Jun 10, 2018

Not sure what you mean by the benchmarks being blacklisted, but for illustration purposes here is how your latest commit compares to master:

       before           after         ratio
     [415012f4]       [eaede342]
+         571±1μs          975±4ms  1708.33  groupby.GroupByMethods.time_dtype_as_group('float', 'pct_change', 'transformation')
+         572±2μs        974±0.6ms  1702.23  groupby.GroupByMethods.time_dtype_as_group('float', 'pct_change', 'direct')
+         592±3μs          635±8ms  1072.93  groupby.GroupByMethods.time_dtype_as_group('int', 'pct_change', 'transformation')
+         606±8μs          628±3ms  1036.87  groupby.GroupByMethods.time_dtype_as_group('int', 'pct_change', 'direct')
+        559±10μs         514±10ms   920.59  groupby.GroupByMethods.time_dtype_as_field('float', 'pct_change', 'direct')
+       529±0.9μs          481±5ms   908.79  groupby.GroupByMethods.time_dtype_as_field('float', 'pct_change', 'transformation')
+         603±8μs         460±20ms   762.64  groupby.GroupByMethods.time_dtype_as_field('int', 'pct_change', 'direct')
+         593±1μs          445±3ms   750.02  groupby.GroupByMethods.time_dtype_as_field('int', 'pct_change', 'transformation')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

As you can see (and I think have already recognized) this isn't a great solution and kills performance.

> The additional change you suggested (self.shift instead of filled.shift) does not work either for the reasons I mentioned above. The new test I wrote would fail under that change.

I'm not entirely clear on what your reasons are. It's not a requirement that grouped elements be monotonic in any of the Cythonized code.

If you are getting an error on the test, can you be explicit about what exactly that error is? IIRC, with the last set of tests what I suggested worked fine. I have a hunch that there may be confusion about the test as it was before (i.e. explicitly passing None would fail, but that was also expected). I could be wrong, but in any case clarifying the failures you are seeing would be very helpful.

Final note: thanks for helping on this. I hope you are taking this feedback constructively and not in any negative fashion (I believe you are, but just want to be explicit). Understanding how and when to interface with the Cythonized code is one of the more complicated aspects of this package.

@simonariddell (Contributor Author)

I'm very grateful for the feedback, and I definitely appreciate your time. I apologize for the delay since the last commit; I've been working on understanding the code more clearly.

I've attempted to solve the problem by incorporating your prior feedback. My previous posts were incorrect, and you rightly pointed out to me how the shift operator works.

In this latest CR I've attempted to fix the problem in line with the spirit of the original implementation. The first step was adding explicit handling of the fill_method, which did not exist before.

I initially attempted to implement your one-line suggestion, but I noticed that I couldn't simply call self.shift(), as the data first needed the proper fill method applied. However, the fill method returns the obj attribute, not the groupby object.

In order to remedy this, without creating a function that changes the state of the code, I used the following strategy:
https://github.com/SimonAlecks/pandas/blob/64097ccd308db909dc99cbb88eeb6206beedb820/pandas/core/groupby/groupby.py#L2079
I first create a new groupby object, but with the fill problem solved as part of constructing it. I remember I then had to reset the cache in order for some of the metadata to align properly. I convinced myself this was necessary at the time, but I didn't write down why, so I'm going to revisit that now to remind myself that I did it for a good reason. More generally, I could see this solution being a hack or not idiomatic within pandas. If so, could you please let me know the correct way to achieve the same result?

Once this is done, the rest is relatively self-explanatory and pretty similar to the original code.

Lastly, I ran an ASV benchmark locally and the performance seemed similar, but I need to spend some more time learning how the benchmarking system and code work, as it's possible I made an error.

@WillAyd (Member) left a comment:

Getting closer

    new = DataFrameGroupBy(self._obj_with_exclusions,
                           grouper=self.grouper)
    new.obj = getattr(new, fill_method)(limit=limit)
    new._reset_cache()
Member:
What's the point of these calls?

Contributor Author:

If a fill method is chosen, I want a groupby instance that has the properly filled data as its obj. When I assign new.obj = getattr(new, fill_method)(limit=limit), new._cache is not updated, so it does not reflect the latest change.
By creating a new groupby object whose data object holds the correctly filled values, I am able to call new.shift(periods=periods, freq=freq) later, so that the shift method is applied to data that has already been properly filled.
The 'trick' is that by assigning the filled data back to the groupby instance, I still have access to the helper methods like shift.
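
Putting the thread together, a rough sketch of the flow being described (the final line mirrors the (filled / shifted) - 1 context shown earlier; this paraphrases the linked commit rather than quoting it):

    # rebuild a GroupBy whose .obj holds the filled data
    new = DataFrameGroupBy(self._obj_with_exclusions,
                           grouper=self.grouper)
    new.obj = getattr(new, fill_method)(limit=limit)  # e.g. new.ffill(limit=limit)
    new._reset_cache()
    # the group-aware shift now operates on already-filled values
    shifted = new.shift(periods=periods, freq=freq)
    return (new.obj / shifted) - 1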

Member:

Can you give me an example of where these three lines differ from just doing self.ffill() (I'm assuming fill_method=='ffill' and limit==1)?

Haven't seen any other aggregation function have to make these types of calls so I'm still confused as to why this is necessary. A concrete example would be immensely helpful

@simonariddell (Contributor Author), Jul 22, 2018:

I can simplify those three lines, but I don't believe I can simply use self.ffill(). Let me try to explain:
The first line can be replaced with new = copy(self). However, the reason I call the fill method and reassign the result to the .obj attribute is that later, on line 2087, I want to call the groupby implementation of the shift method on new, whereas type(self.ffill()) returns a DataFrame. I used this 'trick' of doing the fill and reassigning it back to the .obj attribute in order to retain access to the groupby helper methods. If there is a more idiomatic way to do this, please let me know.

If I were to remove those three lines and replace them with self.ffill(), the code would immediately fail when I tried to call a groupby method on the returned DataFrame object.

Is this sufficiently clear/concrete?
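
In short, a two-line illustration of the type mismatch being described (assuming fill_method == 'ffill'):

    filled = self.ffill(limit=limit)  # returns a DataFrame, not a GroupBy
    filled.shift(periods)             # plain DataFrame.shift: not group-aware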

Member:

Probably just easier if I share code myself at this point :-). Building off of what you have, here's my stab at an implementation:

https://github.com/WillAyd/pandas/blob/ebe4fffd935f2f16b6e5c635cfa4883e6bacc5ea/pandas/core/groupby/groupby.py#L2070

AFAICT it passes all of the tests you have and therefore provides the same behavior. However, it has the advantage of being more terse and of not requiring the construction of a secondary GroupBy object.

Not saying you need to mimic this exactly, but I'm sharing it to highlight how I think this can be done more succinctly, again without constructing entirely new objects. If you know of a case where this would fail, then we probably want to add a test case to cover that difference. Otherwise, if they are the same, I'd prefer you adopt something a little more in line with what I've suggested.

No worries on timing; this is open source, so any and all contributions are welcome at any time.

@simonariddell (Contributor Author), Jul 29, 2018:

Thanks for sharing some code; that was very helpful. I updated my test to highlight a case where your code fails but mine doesn't. Specifically, (1) I added a test case for no fill method, and (2) I added more np.nan values to the test data. This causes your code to fail, as it does not apply the fill_method to the shifted data.

I agree that making a second object seems wrong. However, it's not clear to me how to avoid it, unless there are additional helper methods that I'm just not aware of or finding.

    df = df.reindex(order).reset_index(drop=True)

    manual_apply = []
    for k in keys:
Member:
This loop / concat mechanism is rather non-idiomatic. What's the reasoning for this instead of just doing a comparison to the anonymous function, as was done in the original issue?

@simonariddell (Contributor Author), Jun 24, 2018:

I chose to do this because it means the test ultimately asserts that the result of a groupby apply is the same as looping over the keys and doing a normal apply, as the two should be identical. More specifically, the groupby apply ought to replicate the results of a normal apply looped over the groups.
I'm experimenting now with what an anonymous comparison would look like, but my naive thought is that a comparison to a normal apply is more to the point of what we want to test?

Edit: Fixed in the newest CR. I like your suggestion more now that I see it in code.

@@ -743,14 +754,21 @@ def get_result(grp_obj):
limit=limit)

if test_series:
exp = pd.Series(exp_vals * 2)
exp.name = 'vals'
exp = exp.loc[:, 'A']
grp = grp['vals']
result = get_result(grp)
Member:
Not that your change introduced it, but can we get rid of the get_result function? Not sure why this was set up this way; it adds unnecessary fluff.

Contributor Author:

Removed

    grp = grp['vals']
    result = get_result(grp)
    df.insert(0, 'A', result)
Member:

What is the point of all of the insert / sort calls? Would rather just construct the expected value and simply compare that to the result of a pct_change call

Contributor Author:

Agree. Fixed.

@simonariddell (Contributor Author)

Incorporated most of the requested changes, with clarifying questions asked on a few of them. Thanks for your patience!

@mroeschke (Member) commented Jun 25, 2018

@SimonAlecks do you believe this change would also solve #21621? If so, could you add test cases for that issue?


    return grp_obj.pct_change(periods=periods,
                              fill_method=fill_method,
                              limit=limit)
    exp = grp['vals'].shift(0) / grp['vals'].shift(periods) - 1
Member:
Do you mind changing exp to expected while you are touching this? The latter is a quasi-standard we've been applying to tests

@simonariddell (Contributor Author), Jul 22, 2018:

I'll change exp to expected, and wait for a response on the other comment before updating. I'm still active on this; apologies for the delay in responding to your helpful comments. Thanks!

@WillAyd (Member) left a comment:

OK, I think this is in a good enough state, with appropriate follow-ups documented. @jreback over to you

@jreback (Contributor) left a comment:

looks ok. this needs a whatsnew entry. I think it's important enough to warrant a sub-section, with a small example of what it was doing before and the correct result. you can put it in right below the grouping section in Bug Fixes, I think that is ok.

@simonariddell (Contributor Author)

Added the whatsnew. I checked previous examples in the docs to follow the style guidelines, but please let me know if it's too verbose or I made a stylistic mistake, and I'll quickly fix it. Thanks.

@@ -42,7 +42,33 @@ Bug Fixes
**Groupby/Resample/Rolling**

- Bug in :meth:`DataFrame.resample` when resampling ``NaT`` in ``TimeDeltaIndex`` (:issue:`13223`).
-
- Bug where calling :func:`SeriesGroupBy.pct_change` or :func:`DataFrameGroupBy.pct_change` would ignore groups when calculating the percent change (:issue:`21235`)
Contributor:
move this entire note to a sub-section in Backwards incompatible API changes.

.. ipython:: python

df = pd.DataFrame({'a': [1.0, 1.1, 2.2], 'grp': ['a', 'a', 'b']})

Contributor:
show the df, with 'df'

Contributor Author:

Added


Grouping respected

.. code-block:: ipython
Contributor:

use an ipython block (it will run the code)

Contributor Author:

Fixed. I hadn't properly read the contributor documentation on how to write documentation; sorry for the churn.

@WillAyd (Member) left a comment:

We actually aren't doing a v0.23.5 release anymore. If you can merge master and move the whatsnew to v0.24, that would be great.

@@ -378,6 +378,29 @@ Backwards incompatible API changes
- Passing scalar values to :class:`DatetimeIndex` or :class:`TimedeltaIndex` will now raise ``TypeError`` instead of ``ValueError`` (:issue:`23539`)
- ``max_rows`` and ``max_cols`` parameters removed from :class:`HTMLFormatter` since truncation is handled by :class:`DataFrameFormatter` (:issue:`23818`)
- :meth:`read_csv` will now raise a ``ValueError`` if a column with missing values is declared as having dtype ``bool`` (:issue:`20591`)
- Bug where calling :func:`SeriesGroupBy.pct_change` or :func:`DataFrameGroupBy.pct_change` would ignore groups when calculating the percent change (:issue:`21235`)
Contributor:

move this to a separate subsection

@@ -378,6 +378,29 @@ Backwards incompatible API changes
- Bug where calling :func:`SeriesGroupBy.pct_change` or :func:`DataFrameGroupBy.pct_change` would ignore groups when calculating the percent change (:issue:`21235`)

**New behavior**:
Contributor:
follow the formatting of other subsections


.. ipython:: python

import pandas as pd
Contributor:
no import

df.groupby('grp').pct_change()


**Previous behavior**:
Contributor:
previous first

@jreback (Contributor) left a comment:

pls merge master as well. small changes. ping on green.

@@ -379,6 +379,33 @@ Backwards incompatible API changes
- ``max_rows`` and ``max_cols`` parameters removed from :class:`HTMLFormatter` since truncation is handled by :class:`DataFrameFormatter` (:issue:`23818`)
- :meth:`read_csv` will now raise a ``ValueError`` if a column with missing values is declared as having dtype ``bool`` (:issue:`20591`)

Percentage change called on groupby now returns correct values
Contributor:

call this 'Percentage change on groupby changes'

Percentage change called on groupby now returns correct values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fixed a bug where calling :func:`SeriesGroupBy.pct_change` or :func:`DataFrameGroupBy.pct_change` would ignore groups when calculating the percent change (:issue:`21235`).
Contributor:

Can you add a tiny bit more here? A user reading this will want to know what is changing; in other words, the groupby percentage change previously worked across groups, and now it correctly works per group.

    # GH 21200, 21621
    if freq == 'D':
        pytest.xfail("'freq' test not necessary until #23918 completed and"
                     "freq is used in the vectorized approach")
Contributor:

you could put the xfail in the parametrize, something like this format:

@pytest.mark.parametrize("engine,ext", [
    pytest.param('openpyxl', '.xlsx', marks=pytest.mark.skipif(
        not td.safe_import('openpyxl'), reason='No openpyxl')),

but of course it's an xfail and not a skip, and this is not an excel test :<
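
For illustration, a sketch adapting that pattern to this test's freq parameter with an xfail mark (assumes freq is parametrized as None or 'D'; not the final code):

    import pytest

    @pytest.mark.parametrize("freq", [
        None,
        pytest.param('D', marks=pytest.mark.xfail(
            reason="'freq' test not necessary until #23918 completed and "
                   "freq is used in the vectorized approach")),
    ])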

Member:

Would this actually fail, or just be non-performant compared to the other argument combinations? I think the latter, and if so it's probably OK to just remove the xfail / skip altogether; we have the TODO in the actual code with regard to the perf.

Contributor Author:

It would be non-performant, but it wouldn't fail (well, technically it would fail, but only because the index in the test is not a datetime). Personally, I think it makes sense to make the index a datetime and not xfail this. Let me know if you agree, and I'll do that.

Member:

Ah OK. No, I think that would make the test here more complicated than it needs to be; if you could follow the approach outlined by @jreback, that would be preferable.

Contributor Author:

Sure!

Member:

I'm trying to understand this xfail, and it seems like the behavior is correct because, as @simonariddell mentioned, it should raise without a DatetimeIndex. What is the xfailed case supposed to be testing?

Percentage change called on groupby now returns correct values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fixed a bug where calling :func:`SeriesGroupBy.pct_change` or :func:`DataFrameGroupBy.pct_change` would ignore groups when calculating the percent change (:issue:`21235`).
Contributor:
can you also list both issues here

@jreback added this to the 0.24.0 milestone on Dec 12, 2018
@jreback merged commit 9ea2877 into pandas-dev:master on Dec 12, 2018
@jreback (Contributor) commented Dec 12, 2018

thanks @SimonAlecks

thoo added a commit to thoo/pandas that referenced this pull request Dec 12, 2018
* upstream/master:
  pct change bug issue 21200 (pandas-dev#21235)
  DOC: Fix summaries not ending with a period (pandas-dev#24190)
  DOC: Add back currentmodule to api.rst (pandas-dev#24232)
  DOC: Fixed implicit imports for whatsnew (v >= version 20.0) (pandas-dev#24199)
  remove enum import for PY2 compat, xref pandas-dev#22802 (pandas-dev#24170)
  CI: Simplify Azure Pipelines configuration files (pandas-dev#23111)
  REF/TST: Add more pytest idiom to indexing/multiindex/test_getitem.py (pandas-dev#24053)
@simonariddell (Contributor Author)

@jreback and @WillAyd, thanks so much for your patience on my first PR!
Any suggestions for the next bug I tackle? I'd like to learn the index structure. (I'm perfectly capable of finding a bug to work on by myself; just curious whether you have a preference for anything in particular.)

@WillAyd (Member) commented Dec 13, 2018

Since it came out of this PR, #23918 would be a logical follow-up

@simonariddell deleted the groupby_pctchange_bug branch on December 15, 2018
Pingviinituutti pushed commits to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019