df.groupby().apply() with only one group returns wrong shape! #5839


Closed
Tracked by #13056
dragoljub opened this issue Jan 3, 2014 · 20 comments

@dragoljub

I have reached a corner case in the wonderful groupby().apply() method. A groupby with only one group causes the apply method to return the wrong output shape: instead of a Series with a MultiIndex, the result is returned as a DataFrame with the last row-index level unstacked into the columns. 😦

In [1]: import numpy as np
   ...: import pandas as pd
   ...: from sklearn.cluster import DBSCAN as DBSCAN
   ...: 
   ...: print pd.__version__
   ...: 
   ...: # Generate Test DataFrame
   ...: NUM_ROWS = 1000
   ...: NUM_COLS = 10
   ...: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
   ...: index_cols = col_names[:5] 
   ...: 
   ...: # Set DataFrame to have 5 level Hierarchical Index.
   ...: # Sort the index!
   ...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
   ...: df = df.set_index(index_cols).sort_index()
   ...: df
   ...: 
   ...: # Group by first 4 index columns.
   ...: grp = df.groupby(level=index_cols[:4])
   ...: 
   ...: # Find index of largest group.
   ...: big_loc = grp.size().idxmax()
   ...: 
   ...: # Create function to apply clustering on groups
   ...: def grp_func(df):
   ...:     """Run clustering on subgroup and return series of results."""
   ...:     db = DBSCAN(eps=1, min_samples=1, metric='euclidean').fit(df.values)
   ...:     return pd.Series(db.labels_, name='cluster_id', index=df.index.get_level_values(4))
   ...: 
0.12.0

In [2]: # Apply clustering on each subgroup of DataFrame
   ...: out_good = grp.apply(grp_func)
   ...: out_good
   ...: out_good.shape
Out[2]: (1000L,)

In [3]: # Select out biggest group while keeping index levels and try same apply
   ...: out_bad = df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4]).apply(grp_func)
   ...: out_bad
   ...: out_bad.shape
Out[3]: (1, 7)

In [4]: out_good
Out[4]: 
A0  A1  A2  A3  A4
0   0   0   0   0     1
                3     0
            1   3     0
            2   1     1
                3     2
                3     0
            3   3     1
                3     0
        1   1   1     0
            2   0     0
                2     1
                4     2
            3   4     0
                4     1
            4   0     0
...
4   4   3   0   2     0
            1   1     1
                3     2
                4     0
            3   1     0
                2     1
            4   4     0
        4   0   4     0
            1   1     3
                1     1
                2     2
                2     0
            3   1     1
                3     0
            4   4     0
Name: cluster_id, Length: 1000, dtype: float64

In [5]: out_bad

Out[5]: 
A4           1  1  2  3  3  3  4
A0 A1 A2 A3                     
3  0  0  3   6  5  3  0  1  4  2

# If you stack the bad result it comes out looking OK, but now I need a workaround for this corner case to use apply.
In [17]: out_bad.stack()

Out[17]: A0  A1  A2  A3  A4
3   0   0   3   1     3
                1     6
                2     5
                3     1
                3     4
                3     0
                4     2
dtype: float64
@jreback
Contributor

jreback commented Jan 3, 2014

you can try playing around with the squeeze kw to groupby (though it's off by default).

Since you have a complicated apply function, consider just doing something like

pd.concat([ function(g) for _, g in grouped_frame ])

@jreback
Contributor

jreback commented Jan 3, 2014

@dragoljub try on 0.13 as well

@jreback
Contributor

jreback commented Jan 3, 2014

@dragoljub I think returning a MI series itself from the applied function is pretty odd (I know we talked about this before), but inferring what to do with this is pretty non-trivial. Best to just iterate over the groups and concat yourself.
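
For example, a minimal sketch of that iterate-and-concat pattern (apply_and_concat is an illustrative helper, not a pandas API; it assumes the applied function returns a Series indexed by the innermost level, as grp_func above does):

import pandas as pd

def apply_and_concat(grouped, func):
    # Iterating a GroupBy yields (key, group) pairs; passing the keys to
    # pd.concat rebuilds the outer index levels that GroupBy.apply would
    # normally prepend, with no shape inference involved.
    return pd.concat({key: func(group) for key, group in grouped})

# e.g. result = apply_and_concat(df.groupby(level=index_cols[:4]), grp_func)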

@dragoljub
Author

Is 0.13 out? I didn't see an announcement.

Thanks,
-Gagi


@jreback
Contributor

jreback commented Jan 3, 2014

not officially, quite yet (binaries are still up in the dev section on pydata)... just a day or two until the release

@dragoljub
Author

I think the apply method is very powerful on groupbys and DataFrames and would love for it to handle this corner case. For now I'm checking if len(grp) == 1 and, if so, "stacking" the result of apply to get back the proper shape (sketched below).

The reason I'm using apply this way is that it follows the df.apply() semantics quite nicely, except that instead of processing individual rows/cols we process groups of rows or columns. The only thing I want to do is preserve the multi-index so assignment back to the original DataFrame works nicely.

df['new_col'] = df.apply(my_func) --> Works well: input is one row, output is one value.

For clean and consistent code I want to continue using apply on DataFrame groupbys with a multi-index:

df['new_col'] = df.groupby(level=my_levels).apply(my_func) --> Also works quite well: input is one DataFrame per group and the output should be a Series. Regardless of the index, it should not return a DataFrame if the applied func returns a Series...

To me this is a very important operation: being able to apply multivariate machine learning functions (multiple columns as input) to various groupings of data to yield another Series column with a consistent multi-index.
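
A minimal sketch of that workaround (assuming, as above, that the applied function returns a Series indexed by the innermost level):

grp = df.groupby(level=index_cols[:4])
result = grp.apply(grp_func)
if len(grp) == 1 and isinstance(result, pd.DataFrame):
    # With a single group the innermost index level comes back unstacked
    # into columns; stack it to restore the long (Series) shape.
    result = result.stack()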

@jreback
Contributor

jreback commented Jan 3, 2014

sure... can you put up an example that doesn't have an external dep, so it can be included in the test suite? I'll mark it as a bug

@dragoljub
Author

Here is the same example without the clustering code. All this does is sum each group's rows, normalize them by the mean of all the group's values, and return that as a Series with the same full multi-index. Hopefully this can be used as a test that checks for the corner case of only one group in the groupby. 😄 P.S. Really really really looking forward to 0.13.0!

In [1]: import numpy as np
   ...: import pandas as pd
   ...: 
   ...: print pd.__version__
   ...: 
   ...: # Generate Test DataFrame
   ...: NUM_ROWS = 1000
   ...: NUM_COLS = 10
   ...: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
   ...: index_cols = col_names[:5] 
   ...: 
   ...: # Set DataFrame to have 5 level Hierarchical Index.
   ...: # Sort the index!
   ...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
   ...: df = df.set_index(index_cols).sort_index()
   ...: df
   ...: 
   ...: # Group by first 4 index columns.
   ...: grp = df.groupby(level=index_cols[:4])
   ...: 
   ...: # Find index of largest group.
   ...: big_loc = grp.size().idxmax()
   ...: 
   ...: # Create function to apply on groups
   ...: def easy_func(df):
   ...:     """Sum rows and return series with same multi-index."""
   ...:     row_sum_data = df.sum(axis=1).values
   ...:     group_mean = df.mean().mean()
   ...:     return pd.Series(row_sum_data/group_mean, name='row_sum', index=df.index.get_level_values(4))
   ...: 
0.12.0

In [2]: # Apply easy_func on each subgroup of DataFrame
   ...: out_good = grp.apply(easy_func)
   ...: out_good
Out[2]: 
A0  A1  A2  A3  A4
0   0   0   0   0     5.000000
                4     5.000000
            1   1     4.545455
                2     5.454545
            2   3     5.000000
            3   3     5.000000
            4   2     5.000000
        1   0   0     5.217391
                2     5.652174
                3     3.913043
                3     5.217391
            1   0     4.400000
                2     5.600000
            3   0     5.000000
            4   1     4.615385
...
4   4   3   0   3     5.357143
                4     7.500000
            1   1     5.000000
            2   1     6.521739
                1     3.913043
                2     4.565217
            3   2     5.000000
            4   2     5.000000
        4   0   0     4.200000
                0     5.400000
                1     5.400000
            2   1     5.000000
            3   3     5.000000
            4   3     5.238095
                3     4.761905
Name: row_sum, Length: 1000, dtype: float64

In [3]: out_good.shape
Out[3]: (1000L,)

In [4]: # Select out biggest group while keeping index levels and try same apply
   ...: out_bad = df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4]).apply(easy_func)
   ...: out_bad

Out[4]: 
A4                  1         2        2         2         4         4  \
A0 A1 A2 A3                                                              
4  0  4  4   4.803922  3.431373  7.54902  6.176471  4.803922  2.745098   

A4                  4  
A0 A1 A2 A3            
4  0  4  4   5.490196  

In [5]: out_bad.shape
Out[5]: (1, 7)

In [6]: out_bad.stack()
Out[6]: 
A0  A1  A2  A3  A4
4   0   4   4   1     4.803922
                2     3.431373
                2     7.549020
                2     6.176471
                4     4.803922
                4     2.745098
                4     5.490196
dtype: float64

@jreback
Contributor

jreback commented Apr 21, 2014

@dragoljub you are doing a filter followed by a groupby here (you are sort of doing the filter manually, but it's the same idea).

It is quite difficult to guess the output shape in all circumstances.

It is possible we could change the squeeze kw to handle this, but I would need a compelling reason. (IMHO you should just do this via concat anyhow; this seems a bit too magical to do otherwise.)

@dragoljub
Author

The index/filter is just for illustration purposes, to show what happens when there is a groupby() with only one group found.

I'll have to look into the squeeze parameter to see if it returns the long-form apply() result rather than the wide one that needs to be stacked after apply returns. From what I can tell the squeeze parameter is False by default, which is actually what I want. I want to preserve all multi-index levels regardless of whether they are constant after grouping.

I'm still not sure why df.groupby().apply() returns an unstacked DataFrame (of the wrong shape) for a groupby of size one, while the same apply on groupbys with more than one group returns the expected shapes, just concatenated. I thought pd.concat([df]) just returned the same df?

@jreback
Contributor

jreback commented Apr 21, 2014

no, what I mean is to do an unrolled groupby, something like:

pd.concat(dict([ (func_to_compute_keys(g), func_for_value(grp))
                 for g, grp in df.groupby(...) ]))

rather than use apply.

This offers a very fine level of control (you can also use a list instead of a dict)

what I mean is that df.groupby(...).apply(...) works for most situations, but occasionally it's not perfect, as it has to do some inference, so try the above.
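
A concrete, runnable version of that pattern, using the easy_func posted earlier in this thread (just a sketch; the group key itself serves as the concat key, so the grouping levels are restored no matter how many groups there are):

pieces = dict([ (key, easy_func(group))
                for key, group in df.groupby(level=index_cols[:4]) ])
out = pd.concat(pieces)  # long shape whether there is one group or many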

@dragoljub
Author

I see. Thanks for the suggestion, I'll keep that in my toolbox for future multivariate computations on groupings of data.

The thing is I'm never actually changing the index/keys, just trying to keep the original index from the grouping. 😄 I have a fairly generic use case where I want to group data, and compute a resultant column for each grouping while keeping the original index unchanged. This is a groupby without any aggregation that allows me to simply insert the resultant column back into the original DataFrame that was grouped.

I really like the df.groupby().apply(f) syntax. 👻 I hoped that under the hood, with the default do-not-reduce-dimensionality flag set, the only thing apply would do with each group's result was concatenate them, keeping the multi-index values untouched. It just seems inconsistent that with two groups apply results in a long-form concatenated DataFrame (as expected), but with only one group we get a wide-form frame with the lowest level unstacked. If all apply results were either wide form or long form that would make more sense than having this corner case.

Do you know what part of apply is causing this, so that I can better understand how apply works?

@jreback
Contributor

jreback commented Apr 21, 2014

this is what squeeze does
try that and see if it works (it could still be a bug and be missing this case)

the idea is that sometimes you want a dimension reduction if you have all 1-sized groups (e.g. to a Series)

@dragoljub
Author

I just gave squeeze=True a try on 0.13.1 and, as I expected, all this did was drop the first levels of my grouping, exactly what I wanted to avoid. The issue remains: when there is a groupby with only one group and apply returns a Series with a multi-index, the last level gets unstacked.

Squeeze does not make any difference when the groupby has more than one group. Also, if I return a single-column DataFrame from the apply (instead of a Series), the unstacking does not happen. Still seems like a corner case to me.

If I change the return in the apply function to convert the Series to a DataFrame with .to_frame(), it works as expected without unstacking a level:

def easy_func(df):
    """Sum rows and return series with same multi-index."""
    row_sum_data = df.sum(axis=1).values
    group_mean = df.mean().mean()
    return pd.Series(row_sum_data/group_mean, name='row_sum', index=df.index.get_level_values(4)).to_frame()
# groupby apply on only one group unstacks the last level (output below is from the original Series-returning easy_func)
In [13]: df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4], squeeze=False).apply(easy_func)
Out[13]:
A4                  0         0         0         1         2         4
A0 A1 A2 A3
1  0  1  0   6.315789  3.684211  4.736842  6.842105  3.157895  5.263158

[1 rows x 6 columns]

# squeeze=True removes all the index levels that I'm trying to keep
In [14]: df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4], squeeze=True).apply(easy_func)
Out[14]:
A4
0     6.315789
0     3.684211
0     4.736842
1     6.842105
2     3.157895
4     5.263158
Name: row_sum, dtype: float64

# squeeze on groupbys with more than one group does not seem to have any effect
In [15]: df.groupby(level=index_cols[:4], squeeze=True).apply(easy_func)
Out[15]:
A0  A1  A2  A3  A4
0   0   1   0   1     5.000000
            1   2     5.000000
            2   1     5.000000
            4   3     4.736842
                4     5.263158
        2   0   0     4.761905
                3     5.238095
            1   2     5.000000
            2   0     5.000000
            3   1     5.192308
                2     4.615385
                3     5.192308
            4   2     5.000000
        3   0   1     7.065217
                1     6.521739
...
4   4   3   0   1     5.000000
            2   2     5.000000
            3   0     6.451613
                1     7.096774
                3     3.870968
                3     2.580645
            4   1     4.500000
                3     5.500000
        4   1   1     6.346154
                2     4.615385
                3     4.038462
            2   0     4.500000
                3     6.000000
                3     5.000000
                4     4.500000
Name: row_sum, Length: 1000, dtype: float64

# squeeze=False gives the same result as squeeze=True above
In [16]: df.groupby(level=index_cols[:4], squeeze=False).apply(easy_func)
Out[16]:
A0  A1  A2  A3  A4
0   0   1   0   1     5.000000
            1   2     5.000000
            2   1     5.000000
            4   3     4.736842
                4     5.263158
        2   0   0     4.761905
                3     5.238095
            1   2     5.000000
            2   0     5.000000
            3   1     5.192308
                2     4.615385
                3     5.192308
            4   2     5.000000
        3   0   1     7.065217
                1     6.521739
...
4   4   3   0   1     5.000000
            2   2     5.000000
            3   0     6.451613
                1     7.096774
                3     3.870968
                3     2.580645
            4   1     4.500000
                3     5.500000
        4   1   1     6.346154
                2     4.615385
                3     4.038462
            2   0     4.500000
                3     6.000000
                3     5.000000
                4     4.500000
Name: row_sum, Length: 1000, dtype: float64

# using the updated easy_func that returns a DataFrame: apply on one group does not unstack the last level, as expected :)
In [4]: df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4]).apply(easy_func)
Out[4]:
                 row_sum
A0 A1 A2 A3 A4
0  3  0  0  0   5.492958
            1   4.647887
            1   5.070423
            3   4.225352
            3   4.225352
            4   6.338028

[6 rows x 1 columns]

@jreback
Contributor

jreback commented Apr 23, 2014

ok... will reopen and see if someone can address this in the next release

@jreback jreback reopened this Apr 23, 2014
@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 23, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jerryatmda

This issue is very similar to a behavior that I consider a bug or misfeature. Because I have a different use case and a small package for reproducing it, I have submitted it as a separate issue: #11224

@gte620v
Contributor

gte620v commented Jan 28, 2016

+1 for fixing this.

@jreback
Contributor

jreback commented May 9, 2016

tracked in #13056

@jreback jreback closed this as completed May 9, 2016
@vss888

vss888 commented May 25, 2018

I have just bumped into the same problem (pandas 0.22). It would be great to handle the case of a single group consistently with multiple groups. In large-scale data analysis it can easily happen that some input has only one group, and such cases currently break my analysis.

@MoritzLaurer

MoritzLaurer commented Oct 27, 2022

Same issue here with pandas==1.5: if I group by a column with multiple distinct categorical values, the output is a Series; if I group by a column with only one categorical value, the output is a DataFrame.
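
A minimal reproducer of that inconsistency (behavior observed up through at least pandas 1.5; the helper f is illustrative). The trigger is an applied function that returns a Series whose index differs from the group's own index:

import pandas as pd

def f(g):
    # Return a Series indexed by values rather than by the group's own index.
    return pd.Series([1.0] * len(g), index=g['x'])

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'x': [10, 20, 30]})

print(type(df.groupby('key').apply(f)))                    # Series (MultiIndex)
print(type(df[df['key'] == 'a'].groupby('key').apply(f)))  # DataFrame (unstacked)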
