df.groupby().apply() with only one group returns wrong shape! #5839


Closed
Tracked by #13056
dragoljub opened this issue Jan 3, 2014 · 20 comments

@dragoljub

I have reached a corner case in the wonderful groupby().apply() method. A groupby with only one group causes the apply method to return the wrong output shape: instead of a Series with a MultiIndex, the result is returned as a DataFrame with the last row-index level unstacked into the columns. 😦

In [1]: import numpy as np
   ...: import pandas as pd
   ...: from sklearn.cluster import DBSCAN as DBSCAN
   ...: 
   ...: print pd.__version__
   ...: 
   ...: # Generate Test DataFrame
   ...: NUM_ROWS = 1000
   ...: NUM_COLS = 10
   ...: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
   ...: index_cols = col_names[:5] 
   ...: 
   ...: # Set DataFrame to have 5 level Hierarchical Index.
   ...: # Sort the index!
   ...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
   ...: df = df.set_index(index_cols).sort_index()
   ...: df
   ...: 
   ...: # Group by first 4 index columns.
   ...: grp = df.groupby(level=index_cols[:4])
   ...: 
   ...: # Find index of largest group.
   ...: big_loc = grp.size().idxmax()
   ...: 
   ...: # Create function to apply clustering on groups
   ...: def grp_func(df):
   ...:     """Run clustering on subgroup and return series of results."""
   ...:     db = DBSCAN(eps=1, min_samples=1, metric='euclidean').fit(df.values)
   ...:     return pd.Series(db.labels_, name='cluster_id', index=df.index.get_level_values(4))
   ...: 
0.12.0

In [2]: # Apply clustering on each subgroup of DataFrame
   ...: out_good = grp.apply(grp_func)
   ...: out_good
   ...: out_good.shape
Out[2]: (1000L,)

In [3]: # Select out biggest group while keeping index levels and try same apply
   ...: out_bad = df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4]).apply(grp_func)
   ...: out_bad
   ...: out_bad.shape
Out[3]: (1, 7)

In [4]: out_good
Out[4]: 
A0  A1  A2  A3  A4
0   0   0   0   0     1
                3     0
            1   3     0
            2   1     1
                3     2
                3     0
            3   3     1
                3     0
        1   1   1     0
            2   0     0
                2     1
                4     2
            3   4     0
                4     1
            4   0     0
...
4   4   3   0   2     0
            1   1     1
                3     2
                4     0
            3   1     0
                2     1
            4   4     0
        4   0   4     0
            1   1     3
                1     1
                2     2
                2     0
            3   1     1
                3     0
            4   4     0
Name: cluster_id, Length: 1000, dtype: float64

In [5]: out_bad

Out[5]: 
A4           1  1  2  3  3  3  4
A0 A1 A2 A3                     
3  0  0  3   6  5  3  0  1  4  2

# If you stack the bad result it comes out looking OK, but now I need a workaround for this corner case to use apply.
In [17]: out_bad.stack()

Out[17]: A0  A1  A2  A3  A4
3   0   0   3   1     3
                1     6
                2     5
                3     1
                3     4
                3     0
                4     2
dtype: float64
@jreback
Contributor

jreback commented Jan 3, 2014

you can try playing around with the squeeze kw to groupby (though it's off by default).

Since you have a complicated apply function, consider just doing something like

pd.concat([ function(g) for _, g in grouped_frame ])

@jreback
Contributor

jreback commented Jan 3, 2014

@dragoljub try on 0.13 as well

@jreback
Contributor

jreback commented Jan 3, 2014

@dragoljub I think returning a MI series itself from the applied function is pretty odd (I know we talked about this before), but inferring what to do with this is pretty non-trivial. Best to just iterate over the groups and concat yourself.
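
For example, a minimal sketch of that iterate-and-concat pattern (apply_and_concat is an illustrative helper, not a pandas API; it assumes the applied function returns a Series indexed by the innermost level, as grp_func above does):

import pandas as pd

def apply_and_concat(grouped, func):
    # Iterating a GroupBy yields (key, group) pairs; passing the keys to
    # pd.concat rebuilds the outer index levels that GroupBy.apply would
    # normally prepend, with no shape inference involved.
    return pd.concat({key: func(group) for key, group in grouped})

# e.g. result = apply_and_concat(df.groupby(level=index_cols[:4]), grp_func)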

@dragoljub
Author

Is 0.13 out? I didn't see an announcement.

Thanks,
-Gagi


@jreback
Contributor

jreback commented Jan 3, 2014

not officially, quite yet (binaries are still up in the dev section on pydata)... just a day or two until the release

@dragoljub
Author

I think the apply method is very powerful on groupbys and DataFrames and would love for it to handle this corner case. For now I'm checking if len(grp) == 1 and, if so, "stacking" the result of apply to get back the proper shape (sketched below).

The reason I'm using apply this way is that it follows the df.apply() semantics quite nicely, except that instead of processing individual rows/cols we process groups of rows or columns. The only thing I want to do is preserve the multi-index so assignment back to the original DataFrame works nicely.

df['new_col'] = df.apply(my_func) --> Works well: input is one row, output is one value.

For clean and consistent code I want to continue using apply on DataFrame groupbys with a multi-index:

df['new_col'] = df.groupby(level=my_levels).apply(my_func) --> Also works quite well: input is one DataFrame per group and the output should be a Series. Regardless of the index, it should not return a DataFrame if the applied func returns a Series...

To me this is a very important operation: being able to apply multivariate machine learning functions (multiple columns as input) to various groupings of data to yield another Series column with a consistent multi-index.
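
A minimal sketch of that workaround (assuming, as above, that the applied function returns a Series indexed by the innermost level):

grp = df.groupby(level=index_cols[:4])
result = grp.apply(grp_func)
if len(grp) == 1 and isinstance(result, pd.DataFrame):
    # With a single group the innermost index level comes back unstacked
    # into columns; stack it to restore the long (Series) shape.
    result = result.stack()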

@jreback
Contributor

jreback commented Jan 3, 2014

sure... can you put up an example that doesn't have an external dep, so it can be included in the test suite? I'll mark it as a bug

@dragoljub
Author

Here is the same example without the clustering code. All this does is sum each group's rows, normalize them by the mean of all the group's values, and return that as a Series with the same full multi-index. Hopefully this can be used as a test that checks for the corner case of only one group in the groupby. 😄 P.S. Really really really looking forward to 0.13.0!

In [1]: import numpy as np
   ...: import pandas as pd
   ...: 
   ...: print pd.__version__
   ...: 
   ...: # Generate Test DataFrame
   ...: NUM_ROWS = 1000
   ...: NUM_COLS = 10
   ...: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
   ...: index_cols = col_names[:5] 
   ...: 
   ...: # Set DataFrame to have 5 level Hierarchical Index.
   ...: # Sort the index!
   ...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
   ...: df = df.set_index(index_cols).sort_index()
   ...: df
   ...: 
   ...: # Group by first 4 index columns.
   ...: grp = df.groupby(level=index_cols[:4])
   ...: 
   ...: # Find index of largest group.
   ...: big_loc = grp.size().idxmax()
   ...: 
   ...: # Create function to apply on groups
   ...: def easy_func(df):
   ...:     """Sum rows and return series with same multi-index."""
   ...:     row_sum_data = df.sum(axis=1).values
   ...:     group_mean = df.mean().mean()
   ...:     return pd.Series(row_sum_data/group_mean, name='row_sum', index=df.index.get_level_values(4))
   ...: 
0.12.0

In [2]: # Apply easy_func on each subgroup of DataFrame
   ...: out_good = grp.apply(easy_func)
   ...: out_good
Out[2]: 
A0  A1  A2  A3  A4
0   0   0   0   0     5.000000
                4     5.000000
            1   1     4.545455
                2     5.454545
            2   3     5.000000
            3   3     5.000000
            4   2     5.000000
        1   0   0     5.217391
                2     5.652174
                3     3.913043
                3     5.217391
            1   0     4.400000
                2     5.600000
            3   0     5.000000
            4   1     4.615385
...
4   4   3   0   3     5.357143
                4     7.500000
            1   1     5.000000
            2   1     6.521739
                1     3.913043
                2     4.565217
            3   2     5.000000
            4   2     5.000000
        4   0   0     4.200000
                0     5.400000
                1     5.400000
            2   1     5.000000
            3   3     5.000000
            4   3     5.238095
                3     4.761905
Name: row_sum, Length: 1000, dtype: float64

In [3]: out_good.shape
Out[3]: (1000L,)

In [4]: # Select out biggest group while keeping index levels and try same apply
   ...: out_bad = df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4]).apply(easy_func)
   ...: out_bad

Out[4]: 
A4                  1         2        2         2         4         4  \
A0 A1 A2 A3                                                              
4  0  4  4   4.803922  3.431373  7.54902  6.176471  4.803922  2.745098   

A4                  4  
A0 A1 A2 A3            
4  0  4  4   5.490196  

In [5]: out_bad.shape
Out[5]: (1, 7)

In [6]: out_bad.stack()
Out[6]: 
A0  A1  A2  A3  A4
4   0   4   4   1     4.803922
                2     3.431373
                2     7.549020
                2     6.176471
                4     4.803922
                4     2.745098
                4     5.490196
dtype: float64

@jreback
Contributor

jreback commented Apr 21, 2014

@dragoljub you are doing a filter followed by a groupby here (you are sort of doing the filter manually, but it's the same idea).

It is quite difficult to guess the output shape in all circumstances.

It is possible we could change the squeeze kw to handle this, but I would need a compelling reason. (IMHO you should just do this via concat anyhow; this seems a bit too magical to do otherwise.)

@dragoljub
Author

The index/filter is just for illustration purposes, to show what happens when there is a groupby() with only one group found.

I'll have to look into the squeeze parameter to see if it returns the long-form apply() result rather than the wide one that needs to be stacked after apply returns. From what I can tell the squeeze parameter is False by default, which is actually what I want. I want to preserve all multi-index levels regardless of whether they are constant after grouping.

I'm still not sure why df.groupby().apply() returns an unstacked DataFrame (of the wrong shape) for a groupby of size one, while the same apply on groupbys with more than one group returns the expected shapes, just concatenated. I thought pd.concat([df]) just returned the same df?

@jreback
Contributor

jreback commented Apr 21, 2014

no, what I mean is to do an unrolled groupby, something like:

pd.concat(dict([ (func_to_compute_keys(g), func_for_value(grp))
                 for g, grp in df.groupby(...) ]))

rather than use apply.

This offers a very fine level of control (you can also use a list instead of a dict)

what I mean is that df.groupby(...).apply(...) works for most situations, but occasionally it's not perfect, as it has to do some inference, so try the above.
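
A concrete, runnable version of that pattern, using the easy_func posted earlier in this thread (just a sketch; the group key itself serves as the concat key, so the grouping levels are restored no matter how many groups there are):

pieces = dict([ (key, easy_func(group))
                for key, group in df.groupby(level=index_cols[:4]) ])
out = pd.concat(pieces)  # long shape whether there is one group or many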

@dragoljub
Author

I see. Thanks for the suggestion, I'll keep that in my toolbox for future multivariate computations on groupings of data.

The thing is I'm never actually changing the index/keys, just trying to keep the original index from the grouping. 😄 I have a fairly generic use case where I want to group data, and compute a resultant column for each grouping while keeping the original index unchanged. This is a groupby without any aggregation that allows me to simply insert the resultant column back into the original DataFrame that was grouped.

I really like the df.groupby().apply(f) syntax. 👻 I hoped that under the hood, with the default do-not-reduce-dimensionality flag set, the only thing apply would do with each group's result was concatenate them, keeping the multi-index values untouched. It just seems inconsistent that with two groups apply results in a long-form concatenated DataFrame (as expected), but with only one group we get a wide-form frame with the lowest level unstacked. If all apply results were either wide form or long form that would make more sense than having this corner case.

Do you know what part of apply is causing this, so that I can better understand how apply works?

@jreback
Contributor

jreback commented Apr 21, 2014

this is what squeeze does
try that and see if it works (it could still be a bug and be missing this case)

the idea is that sometimes you want a dimension reduction if you have all 1-sized groups (e.g. to a Series)

@dragoljub
Author

I just gave squeeze=True a try on 0.13.1 and, as I expected, all this did was drop the first levels of my grouping, exactly what I wanted to avoid. The issue remains: when there is a groupby with only one group and apply returns a Series with a multi-index, the last level gets unstacked.

Squeeze does not make any difference when the groupby has more than one group. Also, if I return a single-column DataFrame from the apply (instead of a Series), the unstacking does not happen. Still seems like a corner case to me.

If I change the return in the apply function to convert the Series to a DataFrame with .to_frame(), it works as expected without unstacking a level:

def easy_func(df):
    """Sum rows and return series with same multi-index."""
    row_sum_data = df.sum(axis=1).values
    group_mean = df.mean().mean()
    return pd.Series(row_sum_data/group_mean, name='row_sum', index=df.index.get_level_values(4)).to_frame()
# groupby apply on only one group unstacks the last level (output below is from the original Series-returning easy_func)
In [13]: df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4], squeeze=False).apply(easy_func)
Out[13]:
A4                  0         0         0         1         2         4
A0 A1 A2 A3
1  0  1  0   6.315789  3.684211  4.736842  6.842105  3.157895  5.263158

[1 rows x 6 columns]

# squeeze=True removes all the index levels that I'm trying to keep
In [14]: df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4], squeeze=True).apply(easy_func)
Out[14]:
A4
0     6.315789
0     3.684211
0     4.736842
1     6.842105
2     3.157895
4     5.263158
Name: row_sum, dtype: float64

# squeeze on groupbys with more than one group does not seem to have any effect
In [15]: df.groupby(level=index_cols[:4], squeeze=True).apply(easy_func)
Out[15]:
A0  A1  A2  A3  A4
0   0   1   0   1     5.000000
            1   2     5.000000
            2   1     5.000000
            4   3     4.736842
                4     5.263158
        2   0   0     4.761905
                3     5.238095
            1   2     5.000000
            2   0     5.000000
            3   1     5.192308
                2     4.615385
                3     5.192308
            4   2     5.000000
        3   0   1     7.065217
                1     6.521739
...
4   4   3   0   1     5.000000
            2   2     5.000000
            3   0     6.451613
                1     7.096774
                3     3.870968
                3     2.580645
            4   1     4.500000
                3     5.500000
        4   1   1     6.346154
                2     4.615385
                3     4.038462
            2   0     4.500000
                3     6.000000
                3     5.000000
                4     4.500000
Name: row_sum, Length: 1000, dtype: float64

# squeeze=False gives the same result as squeeze=True above
In [16]: df.groupby(level=index_cols[:4], squeeze=False).apply(easy_func)
Out[16]:
A0  A1  A2  A3  A4
0   0   1   0   1     5.000000
            1   2     5.000000
            2   1     5.000000
            4   3     4.736842
                4     5.263158
        2   0   0     4.761905
                3     5.238095
            1   2     5.000000
            2   0     5.000000
            3   1     5.192308
                2     4.615385
                3     5.192308
            4   2     5.000000
        3   0   1     7.065217
                1     6.521739
...
4   4   3   0   1     5.000000
            2   2     5.000000
            3   0     6.451613
                1     7.096774
                3     3.870968
                3     2.580645
            4   1     4.500000
                3     5.500000
        4   1   1     6.346154
                2     4.615385
                3     4.038462
            2   0     4.500000
                3     6.000000
                3     5.000000
                4     4.500000
Name: row_sum, Length: 1000, dtype: float64

# using the updated easy_func that returns a DataFrame: apply on one group does not unstack the last level, as expected :)
In [4]: df[[big_loc == a[:4] for a in df.index.values]].groupby(level=index_cols[:4]).apply(easy_func)
Out[4]:
                 row_sum
A0 A1 A2 A3 A4
0  3  0  0  0   5.492958
            1   4.647887
            1   5.070423
            3   4.225352
            3   4.225352
            4   6.338028

[6 rows x 1 columns]

@jreback
Contributor

jreback commented Apr 23, 2014

ok... will reopen and see if someone can address this in the next release

@jreback jreback reopened this Apr 23, 2014
@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 23, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jerryatmda

This issue is very similar to a behavior that I consider a bug or misfeature. Because I have a different use case and a small package for reproducing it, I have submitted it as a separate issue: #11224

@gte620v
Contributor

gte620v commented Jan 28, 2016

+1 for fixing this.

@jreback
Contributor

jreback commented May 9, 2016

tracked in #13056

@jreback jreback closed this as completed May 9, 2016
@vss888

vss888 commented May 25, 2018

I have just bumped into the same problem (pandas 0.22). It would be great to handle the case of a single group consistently with multiple groups. In large-scale data analysis it can easily happen that some input has only one group, and such cases currently break my analysis.

@MoritzLaurer

MoritzLaurer commented Oct 27, 2022

Same issue here with pandas==1.5: if I group by a column with multiple distinct categorical values, the output is a Series; if I group by a column with only one categorical value, the output is a DataFrame.
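
A minimal reproducer of that inconsistency (behavior observed up through at least pandas 1.5; the helper f is illustrative). The trigger is an applied function that returns a Series whose index differs from the group's own index:

import pandas as pd

def f(g):
    # Return a Series indexed by values rather than by the group's own index.
    return pd.Series([1.0] * len(g), index=g['x'])

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'x': [10, 20, 30]})

print(type(df.groupby('key').apply(f)))                    # Series (MultiIndex)
print(type(df[df['key'] == 'a'].groupby('key').apply(f)))  # DataFrame (unstacked)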
