df.groupby().apply() with only one group returns wrong shape! #5839
you can try playing around with … Since you have a complicated apply function, consider just doing something like …
@dragoljub try on 0.13 as well
@dragoljub I think returning a MI series itself from the applied function is pretty odd (I know we talked about this before), but inferring what to do with this is pretty non-trivial. Best to just iterate over the groups and concat yourself.
Is 0.13 master out? I didn't see an announcement. Thanks!
not officially, quite yet (binaries are up in the dev section on pydata still)...just a day or 2 until the release
I think the apply method is very powerful on groupbys and DataFrames and would love it to handle this corner case. For now I'm checking whether len(grp) == 1, and if so the result of apply gets "stacked" to recover the proper shape.

The reason I'm using apply this way is that it follows the df.apply() semantics quite nicely, except that instead of processing individual rows/cols we process groups of rows or columns. The only thing I want to do is preserve the multi-index so that assignment back to the original DataFrame works nicely.

df['new_col'] = df.apply(my_func) --> Works well; input is 1 row, output is 1 value.

For clean and consistent code I want to continue using apply on DataFrame groupbys with a multi-index:

df['new_col'] = df.groupby(level=my_levels).apply(my_func) --> Also works quite well; input is one DataFrame per group and the output should be a series. Regardless of the index, it should not return a DataFrame when the applied func returns a series.

To me this is a very important operation: being able to apply multivariate machine learning functions (multiple columns as input) on various groupings of data to yield another series column with a consistent multi-index.
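The stack-if-needed workaround described above can be sketched as follows. This is a minimal sketch, not the poster's exact code: the example frame, the `easy_func` stand-in, and the type check are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small example frame with a 2-level index (values 1-5 so no group mean is zero).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 6, size=(100, 4)),
                  columns=['A0', 'A1', 'A2', 'A3']).set_index(['A0', 'A1'])

def easy_func(g):
    """Sum each row and normalize by the group's overall mean."""
    return g.sum(axis=1) / g.mean().mean()

grp = df.groupby(level=['A0', 'A1'])
result = grp.apply(easy_func)

# Workaround: when there is only one group, apply unstacks the last index
# level into columns; stack the DataFrame back into a Series in that case.
if isinstance(result, pd.DataFrame):
    result = result.stack()
```

With more than one group `result` already comes back as a long-form Series, so the guard is a no-op; with a single group it undoes the unexpected unstacking.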
sure....can you put up an example that doesn't have an external dep so it can be included in the test suite? i'll mark it as a bug
Here is the same example without the clustering code. All this does is sum each group's rows, normalize them by the mean of all the group's values, and return that as a series with the same full multi-index. Hopefully this can be used as a test that checks for the corner case of a groupby with only one group. 😄 P.S. Really really really looking forward to 0.13.0! In [1]: import numpy as np
...: import pandas as pd
...:
...: print pd.__version__
...:
...: # Generate Test DataFrame
...: NUM_ROWS = 1000
...: NUM_COLS = 10
...: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
...: index_cols = col_names[:5]
...:
...: # Set DataFrame to have 5 level Hierarchical Index.
...: # Sort the index!
...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
...: df = df.set_index(index_cols).sort_index()
...: df
...:
...: # Group by first 4 index columns.
...: grp = df.groupby(level=index_cols[:4])
...:
...: # Find index of largest group.
...: big_loc = grp.size().idxmax()
...:
...: # Create function to apply on groups
...: def easy_func(df):
...: """Sum rows and return series with same multi-index."""
...: row_sum_data = df.sum(axis=1).values
...: group_mean = df.mean().mean()
...: return pd.Series(row_sum_data/group_mean, name='row_sum', index=df.index.get_level_values(4))
...:
0.12.0
In [2]: # Apply easy_func on each subgroup of DataFrame
...: out_good = grp.apply(easy_func)
...: out_good
Out[2]:
A0 A1 A2 A3 A4
0 0 0 0 0 5.000000
4 5.000000
1 1 4.545455
2 5.454545
2 3 5.000000
3 3 5.000000
4 2 5.000000
1 0 0 5.217391
2 5.652174
3 3.913043
3 5.217391
1 0 4.400000
2 5.600000
3 0 5.000000
4 1 4.615385
...
4 4 3 0 3 5.357143
4 7.500000
1 1 5.000000
2 1 6.521739
1 3.913043
2 4.565217
3 2 5.000000
4 2 5.000000
4 0 0 4.200000
0 5.400000
1 5.400000
2 1 5.000000
3 3 5.000000
4 3 5.238095
3 4.761905
Name: row_sum, Length: 1000, dtype: float64
In [3]: out_good.shape
Out[3]: (1000L,)
In [4]: # Select out biggest group while keeping index levels and try same apply
...: out_bad = df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4]).apply(easy_func)
...: out_bad
Out[4]:
A4 1 2 2 2 4 4 \
A0 A1 A2 A3
4 0 4 4 4.803922 3.431373 7.54902 6.176471 4.803922 2.745098
A4 4
A0 A1 A2 A3
4 0 4 4 5.490196
In [5]: out_bad.shape
Out[5]: (1, 7)
In [6]: out_bad.stack()
Out[6]:
A0 A1 A2 A3 A4
4 0 4 4 1 4.803922
2 3.431373
2 7.549020
2 6.176471
4 4.803922
4 2.745098
4 5.490196
dtype: float64 |
@dragoljub you are doing a filter followed by a groupby here (you are sort of doing the filter manually, but same idea). This is quite difficult to guess the output shape in all circumstances. It is possible we could change the kw …
The index/filter is just for illustration purposes, to show what happens when a groupby() finds only one grouping. I'll have to look into the squeeze parameter to see whether it returns the long-form DataFrame apply() result rather than the wide one that needs to be stacked after the apply returns. From what I can tell, the squeeze parameter is False by default, which is actually what I want: I want to preserve all multi-index levels regardless of whether they are constant after grouping. I'm still not sure why df.groupby().apply() returns an unstacked DataFrame (of the wrong shape) for a grouping of size one, while the same apply on groupbys with more than one grouping returns the expected DataFrame shapes, just concatenated. I thought pd.concat([df]) just returns the same df?
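The last question above is easy to check directly: concatenating a one-element list does return a frame equal to the input (the tiny example frame here is an assumption for illustration).

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# pd.concat of a single-frame list returns a frame equal to the input.
same = pd.concat([df])
```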
no, what I mean is do an unrolled groupby, something like:
rather than use apply. This offers a very fine level of control (you can also use a list instead of a dict).
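The code snippet itself is missing from this transcript; an unrolled groupby along these lines is presumably what was meant. This is a sketch under assumptions (small example frame, hypothetical normalization), not the maintainer's actual snippet:

```python
import pandas as pd

df = pd.DataFrame({'key': ['x', 'x', 'y', 'y', 'y'],
                   'val': [1.0, 2.0, 3.0, 4.0, 5.0]}).set_index('key')

# Unrolled groupby: iterate over the groups yourself, collect the per-group
# results in a dict keyed by group label, and concat at the end. This gives
# fine control over the output index (a list of results works as well).
results = {}
for name, group in df.groupby(level='key'):
    results[name] = group['val'] / group['val'].mean()

out = pd.concat(results)
```

Because you build the result index yourself, the shape is the same whether there is one group or many, which sidesteps the single-group corner case entirely.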
I see. Thanks for the suggestion; I'll keep that in my toolbox for future multivariate computations on groupings of data. The thing is, I'm never actually changing the index/keys, just trying to keep the original index from the grouping. 😄 I have a fairly generic use case where I want to group data and compute a resultant column for each grouping while keeping the original index unchanged. This is a groupby without any aggregation, which lets me simply insert the resultant column back into the original DataFrame that was grouped. I really like the df.groupby().apply(f) syntax. 👻 I hoped that under the hood, with the default do-not-reduce-dimensionality flag set, the only thing apply would do with the grouping results was concatenate them, keeping the multi-index values untouched. It just seems inconsistent that with two groups the apply results in a long-form concatenated DataFrame (as expected), but with only one grouping we get a wide-form frame with the lowest dimension unstacked. If all apply results were consistently either wide form or long form that would make more sense than having this corner case. Do you know what apply feature is causing this, so that I can better understand how apply works?
this is what squeeze does. The idea is that sometimes you want a dimension reduction if you have all 1-sized groups (e.g. to a series)
I just gave squeeze=True a try on 0.13.1, and as I expected all this did was drop the first levels of my grouping, exactly what I wanted to avoid. The issue remains: when there is a groupby with only one group and apply returns a series with a multi-index, the last level gets unstacked. Squeeze does not make any difference when the groupby has more than one grouping. Also, if I return a single-column DataFrame from the apply (instead of a series), the unstacking does not happen. Still seems like a corner case to me. Here is what happens if I change the return code in the apply function to convert the series to a DataFrame: def easy_func(df):
"""Sum rows and return series with same multi-index."""
row_sum_data = df.sum(axis=1).values
group_mean = df.mean().mean()
return pd.Series(row_sum_data/group_mean, name='row_sum', index=df.index.get_level_values(4)).to_frame() # groupby apply on only one group results in unstacked last level for a multi-index series
In [13]: df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4], squeeze=False).apply(easy_func)
Out[13]:
A4 0 0 0 1 2 4
A0 A1 A2 A3
1 0 1 0 6.315789 3.684211 4.736842 6.842105 3.157895 5.263158
[1 rows x 6 columns]
# squeeze=True removes all the index levels that I'm trying to keep
In [14]: df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4], squeeze=True).apply(easy_func)
Out[14]:
A4
0 6.315789
0 3.684211
0 4.736842
1 6.842105
2 3.157895
4 5.263158
Name: row_sum, dtype: float64
# Squeeze on groupings with more than one group does not seem to have any effect
In [15]: df.groupby(level=index_cols[:4], squeeze=True).apply(easy_func)
Out[15]:
A0 A1 A2 A3 A4
0 0 1 0 1 5.000000
1 2 5.000000
2 1 5.000000
4 3 4.736842
4 5.263158
2 0 0 4.761905
3 5.238095
1 2 5.000000
2 0 5.000000
3 1 5.192308
2 4.615385
3 5.192308
4 2 5.000000
3 0 1 7.065217
1 6.521739
...
4 4 3 0 1 5.000000
2 2 5.000000
3 0 6.451613
1 7.096774
3 3.870968
3 2.580645
4 1 4.500000
3 5.500000
4 1 1 6.346154
2 4.615385
3 4.038462
2 0 4.500000
3 6.000000
3 5.000000
4 4.500000
Name: row_sum, Length: 1000, dtype: float64
# squeeze=False gives the same result as squeeze=True above
In [16]: df.groupby(level=index_cols[:4], squeeze=False).apply(easy_func)
Out[16]:
A0 A1 A2 A3 A4
0 0 1 0 1 5.000000
1 2 5.000000
2 1 5.000000
4 3 4.736842
4 5.263158
2 0 0 4.761905
3 5.238095
1 2 5.000000
2 0 5.000000
3 1 5.192308
2 4.615385
3 5.192308
4 2 5.000000
3 0 1 7.065217
1 6.521739
...
4 4 3 0 1 5.000000
2 2 5.000000
3 0 6.451613
1 7.096774
3 3.870968
3 2.580645
4 1 4.500000
3 5.500000
4 1 1 6.346154
2 4.615385
3 4.038462
2 0 4.500000
3 6.000000
3 5.000000
4 4.500000
Name: row_sum, Length: 1000, dtype: float64
# Using the updated apply function that returns a DataFrame, a single group does not unstack the last level, as expected :)
In [4]: df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4]).apply(easy_func)
Out[4]:
row_sum
A0 A1 A2 A3 A4
0 3 0 0 0 5.492958
1 4.647887
1 5.070423
3 4.225352
3 4.225352
4 6.338028
[6 rows x 1 columns] |
ok...will reopen and see if someone can address in next release |
This issue is very similar to a behavior that I consider a bug or misfeature. Because …
+1 for fixing this. |
tracked in #13056 |
I have just bumped into the same problem (pandas 0.22). It would be great to handle the case of a single group consistently with multiple groups. In large-scale data analysis it can easily happen that some input has only one group, and such cases crash my analysis right now.
same issue here with pandas==1.5. If I group by a column with multiple distinct categorical values, the output is a series; if I group by a column with only one categorical value, the output is a dataframe.
I have reached a corner case in the wonderful groupby().apply() method: a groupby with only one group causes the apply method to return the wrong output shape. Instead of a series with a multi-index, the result is returned as a DataFrame with the last index level as columns. 😦