GroupBy aggregate: Must produce aggregated value #24016

Closed
digital-thinking opened this issue Nov 30, 2018 · 4 comments
@digital-thinking

digital-thinking commented Nov 30, 2018

I want to aggregate my dataframe by id. This is not the first time I have come across this error, which seems to occur because an aggregation function is not allowed to return a numpy array.
There are dozens of use cases where rows are aggregated to numpy arrays, and wrapping the array in a list leads to several other problems. For example, the dtype of the whole dataframe is “object” after the transformation, even if every column had the same type. When feeding such a dataframe into keras, it is a mess to get the data into the correct format.

groupedDf = df[['id', 'vec', 'label']].groupby('id', as_index=False).agg(
    {'label': 'mean',
     'vec': lambda x: return_some_numpy_array(x)})

When using list() as a wrapper it works, but it would be more convenient to use the numpy array directly without having to unwrap the list.

Related code:
generic.py[908]

if isinstance(output, (Series, Index, np.ndarray)):
   raise Exception('Must produce aggregated value')

Does anyone have another idea for getting around this issue?
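One possible way around the restriction, sketched here under assumptions, is to skip agg for the vector column and iterate over the groups directly, stacking the per-group results into a single numeric array. The frame contents and the aggregate_vectors helper (an element-wise mean) are hypothetical stand-ins for the real data and for return_some_numpy_array.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one numpy vector per row, grouped by 'id'.
df = pd.DataFrame(
    [[1, np.array([10, 20, 30])],
     [1, np.array([40, 50, 60])],
     [2, np.array([20, 30, 40])]],
    columns=["id", "vec"],
)


def aggregate_vectors(vectors):
    # Hypothetical aggregation: element-wise mean of all vectors in the group.
    return np.stack(vectors.to_numpy()).mean(axis=0)


ids = []
arrays = []
for key, group in df.groupby("id"):
    ids.append(key)
    arrays.append(aggregate_vectors(group["vec"]))

# One row per group, with a numeric dtype rather than object.
features = np.stack(arrays)
print(ids)
print(features)
```

Because the result never passes through an object column, features keeps a numeric dtype and can be handed to a model without unwrapping lists. The trade-off is a Python-level loop over groups, which may be slow when there are very many groups.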

@TomAugspurger
Contributor

Can you provide a minimal reproducible example and the output of show_versions?

@WillAyd
Member

WillAyd commented Nov 30, 2018

I would say generally we don't really have first class support for doing this, as outside the confines of apply it would be ambiguous as to the desired shape of the returned value.

What type would you be expecting besides object here?

WillAyd added the "Needs Info" label (Clarification about behavior needed to assess issue) Nov 30, 2018
@digital-thinking
Author

digital-thinking commented Nov 30, 2018

Thanks for your feedback.
@WillAyd As far as I can see, only object is a valid datatype here, sorry for that.

Example ('0.23.0'):

import numpy as np
import pandas as pd

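def makeNumpyArray(rows):
    # Placeholder aggregation: always returns a 2x2 array of zeros
    array = np.zeros((2, 2))
    # do something
    return array


df = pd.DataFrame([[1, np.array([10, 20, 30])],
                   [1, np.array([40, 50, 60])],
                   [2, np.array([20, 30, 40])]], columns=['category', 'data'])

# Reported on 0.23.0 to raise "Must produce aggregated value", since the lambda returns an ndarray
g = df.groupby('category', as_index=False).agg({'data': lambda x: makeNumpyArray(x)})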

To give you some more background:
I am dealing with large data sets, where every wrapping/unwrapping step costs a lot of memory and CPU time.

@WillAyd
Member

WillAyd commented Nov 30, 2018

Closing, as I don't think this is something we really do or want to support. You'd be better off leveraging a MultiIndex to store your data rather than trying to place a NumPy array in a particular location of the frame.
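A minimal sketch of what the MultiIndex suggestion could look like, under assumptions: each vector becomes one row of a numeric frame, indexed by (category, sample), with one column per vector component. The level name sample, the column names v0..v2, and the choice of mean as the aggregation are illustrative and not taken from the thread.

```python
import numpy as np
import pandas as pd

# The three example vectors, stored as rows of a plain 2-D array.
vectors = np.array([[10, 20, 30],
                    [40, 50, 60],
                    [20, 30, 40]])

# Hypothetical index layout: 'category' is the group key and 'sample'
# numbers the vectors within each category.
index = pd.MultiIndex.from_arrays(
    [[1, 1, 2], [0, 1, 0]], names=["category", "sample"]
)
wide = pd.DataFrame(vectors, index=index, columns=["v0", "v1", "v2"])

# Built-in aggregations now operate column by column on numeric data.
per_category = wide.groupby(level="category").mean()

# A 2-D numeric array with one row per category.
result = per_category.to_numpy()
print(per_category)
```

With this layout no cell holds an array, so the frame stays numeric and to_numpy() returns a regular 2-D array. It assumes all vectors have the same length.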

WillAyd closed this as completed Nov 30, 2018
WillAyd added the "Groupby" label and removed the "Needs Info" label (Clarification about behavior needed to assess issue) Nov 30, 2018
3 participants