Skip to content

ENH Consistent apply output when grouping with freq #12362

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

stephen-hoover
Copy link
Contributor

The BinGrouper.apply (used by the TimeGrouper) and BaseGrouper.apply (used by the Grouper) have different output types. To make them consistent, remove BinGrouper.apply and let it use the same method as the superclass BaseGrouper. This requires changing BinGrouper.groupings to return an object which looks sufficiently like a Grouping object for BaseGrouper.apply to be happy. Use a namedtuple to create an object with the needed attributes.

I think this change will only affect the output of apply with a custom function, and only when grouping by a TimeGrouper (although a Grouper with a freq specified turns into a TimeGrouper, so that counts).

Closes #11742 .

@jreback
Copy link
Contributor

jreback commented Feb 17, 2016

tests!

@stephen-hoover
Copy link
Contributor Author

I wanted to make sure it wouldn't be outright rejected first. :-) I'll write tests now.

@stephen-hoover stephen-hoover force-pushed the consistent-freq-grouping-11742 branch from 69dc772 to 5571ac7 Compare February 17, 2016 02:25
@stephen-hoover
Copy link
Contributor Author

Tests. On the master branch, TestGroupBy.test_timegrouper_apply_return_type_series fails, and TestGroupBy.test_timegrouper_apply_return_type_value raises a TypeError.

@@ -2027,7 +2009,10 @@ def names(self):
@property
def groupings(self):
# for compat
return None
MockGrouping = collections.namedtuple(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create an actual Grouping object here rather than a mock

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, that looks a lot better. Fixed.

@stephen-hoover stephen-hoover force-pushed the consistent-freq-grouping-11742 branch from 5571ac7 to 1ad4ebe Compare February 17, 2016 03:23
@jreback
Copy link
Contributor

jreback commented Feb 17, 2016

Can u have a look thru online groupby issues (there are a few!) and see if this has some similarity with any

I recall one that had to do with the dtype changing in apply (and had to do with that code u removed) so maybe this would fix another issue

@stephen-hoover
Copy link
Contributor Author

A search through issues tagged "Groupby", or containing "TimeGrouper" and "apply", or "Grouper", "apply", and "freq" doesn't show anything that looks to me like it would be fixed by this.

@jreback
Copy link
Contributor

jreback commented Feb 17, 2016

your PR might close #5839

@jreback jreback added Bug Groupby Dtype Conversions Unexpected or buggy dtype conversions Resample resample method labels Feb 17, 2016
@jreback
Copy link
Contributor

jreback commented Feb 17, 2016

i'll have to look at this more, but pls add a whatsnew

@jreback jreback added this to the 0.18.0 milestone Feb 17, 2016
@jreback jreback changed the title ENH Consistent apply output when grouping with freq ENH Consistent apply output when grouping with freq Feb 17, 2016
@stephen-hoover stephen-hoover force-pushed the consistent-freq-grouping-11742 branch from 1ad4ebe to 1a2bdc5 Compare February 18, 2016 03:54
@stephen-hoover
Copy link
Contributor Author

I added a What's New entry. Not sure the best way to describe / categorize this -- perhaps it belongs under bugs?

It doesn't look like #5839 is affected. I checked to be sure, and the examples there are unchanged by this PR.

This PR only affects apply operations after a single groupby involving grouping times by a freq argument. And even then, something like df.groupby([pd.Grouper(freq='M', key='date')]).apply(lambda x: x.value.sum()) is unchanged. The only thing that changes is the output if you don't put the Grouper in a list (which is now consistent with the same thing using the Grouper in a list).

I scanned through the docs and didn't see any examples or instructions that would be affected by this PR.

def sumfunc_series(x):
return pd.Series([x['value'].sum()], ('sum',))

grp_out = df.groupby(pd.Grouper(key='date')).apply(sumfunc_series)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also tests that these produce the same output when doing a .resample(..).apply(..) (both functions)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this result instead of grp_out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the name of grp_out to expected, since that's the thing left unchanged by this PR.

@jreback
Copy link
Contributor

jreback commented Feb 18, 2016

I am not convinced this is doing the right thing. notice the return types dtypes.

In [25]: df.groupby(pd.Grouper(key='date')).apply(sumfunc_series)
Out[25]: 
            sum
date           
10/10/2000   10
11/10/2000   13

In [26]: df.groupby(pd.Grouper(key='date')).apply(sumfunc_series).dtypes
Out[26]: 
sum    int64
dtype: object

@jreback jreback modified the milestones: 0.18.1, 0.18.0 Feb 18, 2016
@stephen-hoover
Copy link
Contributor Author

I'm not sure if it's doing the right thing, but this PR doesn't change the output of df.groupby(pd.Grouper(key='date')).apply(sumfunc_series), and now that output type is consistent with the output of df_dt.groupby(pd.TimeGrouper(freq='M', key='date')).apply(sumfunc_series).

When you suggest testing the output of .resample(..).apply(..), you mean to compare the result to df_dt.set_index('date').resample('M').apply(sumfunc_series)? I would have expected that to be the same, but it's not. Also, this PR doesn't affect the output of .resample(..).apply(..). I don't know enough about the resample code path to say if it should. Outputs:

In [45]: df_dt.set_index('date').resample('M').apply(sumfunc_series)
Out[45]: 
            value
date             
2000-10-31    NaN
2000-11-30    NaN

In [46]: df_dt.set_index('date').resample('M').apply(sumfunc_value)
Out[46]: 
            value
date             
2000-10-31     10
2000-11-30     13

Corresponding outputs using a resampling groupby, this PR:

In [50]: df_dt.groupby(pd.Grouper(key='date', freq='M')).apply(sumfunc_series)
Out[50]: 
            sum
date           
2000-10-31   10
2000-11-30   13

In [51]: df_dt.groupby(pd.Grouper(key='date', freq='M')).apply(sumfunc_value)
Out[51]: 
date
2000-10-31    10
2000-11-30    13
Freq: M, dtype: int64

Outputs on master branch:

In [16]: df_dt.set_index('date').resample('M').apply(sumfunc_series)
Out[16]: 
            value
date             
2000-10-31    NaN
2000-11-30    NaN

In [17]: df_dt.set_index('date').resample('M').apply(sumfunc_value)
Out[17]: 
            value
date             
2000-10-31     10
2000-11-30     13

In [18]: df_dt.groupby(pd.Grouper(key='date', freq='M')).apply(sumfunc_series)
Out[18]: 
date           
2000-10-31  sum    10
2000-11-30  sum    13
dtype: int64

In [19]: df_dt.groupby(pd.Grouper(key='date', freq='M')).apply(sumfunc_value)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-4ef8c22ce5dd> in <module>()
----> 1 df_dt.groupby(pd.Grouper(key='date', freq='M')).apply(sumfunc_value)
...
TypeError: cannot concatenate a non-NDFrame object

The test failure is in pandas.tests.test_generic.TestDataFrame.test_to_xarray, which doesn't look related to my change in test_groupby.py.

@jreback
Copy link
Contributor

jreback commented Feb 19, 2016

you can rebase on master; the to_xarray issue is fixed

@stephen-hoover stephen-hoover force-pushed the consistent-freq-grouping-11742 branch from c41eb84 to 046a0fb Compare February 19, 2016 02:05
@stephen-hoover
Copy link
Contributor Author

Rebased.

@jreback
Copy link
Contributor

jreback commented Mar 23, 2016

can you rebase/update (and move note to 0.18.1). and i'll take a look.

@stephen-hoover stephen-hoover force-pushed the consistent-freq-grouping-11742 branch from 046a0fb to ad087fc Compare March 24, 2016 21:45
@stephen-hoover
Copy link
Contributor Author

@jreback , done.

@@ -73,8 +73,7 @@ API changes

- ``CParserError`` is now a ``ValueError`` instead of just an ``Exception`` (:issue:`12551`)



- Using ``apply`` on resampling groupby operations now has the same output types as similar ``apply``s on other groupby operations. (:issue:`11742`, :issue:`12362`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't list the PR number, only the issue number. give some more context, like: .groupby(...).apply(...). I think a brief example might be waranted here (e.g. show old and new)

@stephen-hoover stephen-hoover force-pushed the consistent-freq-grouping-11742 branch 3 times, most recently from c5db074 to c7b5b3e Compare March 29, 2016 20:52
The `BinGrouper.apply` and `BaseGrouper.apply` have different output types. To make them consistent, remove `BinGrouper.apply` and let it use the same method as the superclass `BaseGrouper`. This requires changing `BinGrouper.groupings` to return a list of `Grouping` objects (there will always only be one) instead of `None`.
@stephen-hoover stephen-hoover force-pushed the consistent-freq-grouping-11742 branch from c7b5b3e to d73f332 Compare March 30, 2016 21:42
@stephen-hoover
Copy link
Contributor Author

@jreback , I added extra detail and examples to the What's New, and rebased. Sorry for the delay.


.. code-block:: python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the new an ipython block (so the code runs)

@jreback
Copy link
Contributor

jreback commented Mar 31, 2016

thanks @stephen-hoover

some doc-comments. ping when pushed and i can look.

@stephen-hoover stephen-hoover force-pushed the consistent-freq-grouping-11742 branch from 3d43dde to 3b9c4f7 Compare March 31, 2016 18:42
@stephen-hoover stephen-hoover force-pushed the consistent-freq-grouping-11742 branch from 3b9c4f7 to 8cf618f Compare April 1, 2016 02:06
@stephen-hoover
Copy link
Contributor Author

@jreback , modified the docs. Let me know if this looks good.

@jreback jreback closed this in 2c79a50 Apr 1, 2016
@jreback
Copy link
Contributor

jreback commented Apr 1, 2016

thanks @stephen-hoover

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby Resample resample method
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants