GroupBy aggregate: Must produce aggregated value #24016

digital-thinking · 2018-11-30T14:59:10Z

I want to aggregate my dataframe by id. This is not the first time I have come across this error, which seems to happen because the aggregation function is not allowed to return a numpy array.
There are dozens of use cases where rows are aggregated to numpy arrays, and wrapping the array in a list leads to several other problems. For example, the datatype of the whole dataframe is “object” after the transformation, even if every column had the same type. When feeding such a dataframe into keras, it is a mess to get the data into the correct format.

groupedDf = df[['id', 'vec', 'label']].groupby('id', as_index=False).agg(
    {'label': 'mean',
     'vec': lambda x: return_some_numpy_array(x)})

Using list() as a wrapper works, but it would be more convenient to use the numpy array directly without having to unwrap the list.
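For readers who want to see the list workaround spelled out, here is a minimal sketch. The body of return_some_numpy_array (an element-wise mean) and the sample data are assumptions made up for illustration; the issue does not say what the real function does.

```python
import numpy as np
import pandas as pd


def return_some_numpy_array(vecs):
    # Hypothetical stand-in: element-wise mean of all vectors in the group.
    return np.stack(list(vecs)).mean(axis=0)


# Made-up sample data: one vector and one label per row.
df = pd.DataFrame({
    "id": [1, 1, 2],
    "vec": [np.array([10.0, 20.0, 30.0]),
            np.array([40.0, 50.0, 60.0]),
            np.array([20.0, 30.0, 40.0])],
    "label": [0.0, 1.0, 1.0],
})

# Workaround: wrap the array in a one-element list inside the aggregation.
grouped = df[["id", "vec", "label"]].groupby("id", as_index=False).agg(
    {"label": "mean",
     "vec": lambda x: [return_some_numpy_array(x)]})

# Unwrap again: pull the array out of each list and stack the results
# into one dense float array instead of an object column.
vectors = np.stack([wrapped[0] for wrapped in grouped["vec"]])
```

The extra unwrap step at the end is exactly the inconvenience described above: the "vec" column stays of type object until the arrays are stacked by hand.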

Related code:
generic.py[908]

if isinstance(output, (Series, Index, np.ndarray)):
   raise Exception('Must produce aggregated value')

Does anyone have another idea for getting around this issue?


TomAugspurger · 2018-11-30T15:03:00Z

Can you provide a minimal reproducible example and the output of show_versions?

WillAyd · 2018-11-30T15:06:22Z

I would say that generally we don't have first-class support for doing this, since outside the confines of apply the desired shape of the returned value would be ambiguous.

What type would you be expecting besides object here?

digital-thinking · 2018-11-30T15:09:53Z

Thanks for your feedback:
@WillAyd As far as I can see only object is a valid datatype here, sorry for that.

Example ('0.23.0'):

import numpy as np
import pandas as pd

def makeNumpyArray(rows):
    array = np.zeros((2, 2))
    # do something with rows
    return array

df = pd.DataFrame([[1, np.array([10, 20, 30])],
                   [1, np.array([40, 50, 60])],
                   [2, np.array([20, 30, 40])]], columns=['category', 'data'])

g = df.groupby('category', as_index=False).agg({'data': lambda x: makeNumpyArray(x)})

To give you some more background:
I am dealing with large data sets, where every wrapping/unwrapping step costs a lot of memory and CPU time.
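One way to avoid the wrapping altogether might be to skip agg and iterate over the groups directly, collecting the arrays outside the dataframe. This is only a sketch: make_numpy_array below is a hypothetical aggregation (per-position minimum and maximum) invented for illustration, and whether the approach is fast enough for large data would need measuring.

```python
import numpy as np
import pandas as pd


def make_numpy_array(rows):
    # Hypothetical aggregation: per-position minimum and maximum of the group.
    stacked = np.stack(list(rows))  # shape (n_rows_in_group, 3)
    return np.stack([stacked.min(axis=0), stacked.max(axis=0)])  # shape (2, 3)


df = pd.DataFrame({
    "category": [1, 1, 2],
    "data": [np.array([10, 20, 30]),
             np.array([40, 50, 60]),
             np.array([20, 30, 40])],
})

# Iterate over the groups instead of calling agg, so no array ever has to
# be stored in (or unwrapped from) a dataframe cell.
keys = []
blocks = []
for key, group in df.groupby("category"):
    keys.append(key)
    blocks.append(make_numpy_array(group["data"]))

# One dense array of shape (n_groups, 2, 3); keys[i] labels result[i].
result = np.stack(blocks)
```

The group labels end up in keys and the numeric data in a single numpy array, which is usually closer to what a model input expects than an object column.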

WillAyd · 2018-11-30T16:54:55Z

Closing as I don't think this is something we really do or want to support. You'd be better off leveraging a MultiIndex to store your data rather than trying to place a NumPy array in a particular location of the frame
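The comment above does not say what the MultiIndex layout would look like, so the following is only one possible reading: store each vector element in its own row, indexed by category, source row, and position. The level names and sample data are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data: three vectors of length 3 and their categories.
categories = np.array([1, 1, 2])
vectors = np.array([[10, 20, 30],
                    [40, 50, 60],
                    [20, 30, 40]])

n_rows, n_pos = vectors.shape

# Long format: one scalar per (category, row, position) instead of one
# array per dataframe cell.
index = pd.MultiIndex.from_arrays(
    [np.repeat(categories, n_pos),
     np.repeat(np.arange(n_rows), n_pos),
     np.tile(np.arange(n_pos), n_rows)],
    names=["category", "row", "position"],
)
long = pd.Series(vectors.ravel(), index=index, name="value")

# Ordinary numeric aggregation: element-wise mean per category.
means = long.groupby(level=["category", "position"]).mean()

# Back to a dense array with one row per category.
dense = means.unstack("position").to_numpy()
```

With this layout every value keeps a numeric dtype, so the built-in aggregations apply directly. The trade-off is a longer series and an extra reshape at the end.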

WillAyd added the Needs Info label (Clarification about behavior needed to assess issue) Nov 30, 2018

WillAyd closed this as completed Nov 30, 2018

WillAyd added the Groupby label and removed the Needs Info label (Clarification about behavior needed to assess issue) Nov 30, 2018

jointfull mentioned this issue Mar 7, 2019

Inconsistent behavior when using GroupBy and pandas.Series.mode #25581

Closed
