
ENH: .pipe() on DataFrameGroupBy #46655

Open
kwhkim opened this issue Apr 6, 2022 · 2 comments
Labels
Apply (Apply, Aggregate, Transform, Map), Enhancement, Needs Discussion (Requires discussion from core team before further action)

Comments

@kwhkim
Contributor

kwhkim commented Apr 6, 2022

Is your feature request related to a problem?

DataFrameGroupBy.pipe() cannot be used with a user-defined function (UDF).

This is related to the higher-order methods API and an inconsistency in it.

From the docs I see

we have five methods: .pipe(), .apply(), .agg(), .transform(), and .applymap().

For DataFrame, .pipe() applies a function to a DataFrame, whereas DataFrameGroupBy.pipe(func) is (maybe) just syntactic sugar for DataFrameGroupBy.func(). We can't use a UDF unless it's a proper method of DataFrameGroupBy (this is what the docs say, and I experimented a little and it looks that way).

For DataFrameGroupBy, .apply() is applying a function to a grouped DataFrame, whereas DataFrameGroupBy.apply(func) is for applying func to the DataFrame's columns.

Describe the solution you'd like

I propose for consistency, using .pipe() for both DataFrame and DataFrameGroupBy to apply a function to a (grouped or not) DataFrame.

One other thought: do we really need .apply() for what is essentially .applymap()? For consistency, I think .apply() would be better reserved for applying a function to columns (axis=0) or rows (axis=1). And if we think of .apply() as a rather free-form method (compared with .agg(), whose function should be a reducer, and .transform(), whose function should be a transformer), we might better distinguish what the function's input is, for example by naming the methods .apply_df() and .apply_ser().

API breaking implications

.apply() would no longer accept functions that operate on a whole DataFrame and would be specialized for applying functions to columns.

Describe alternatives you've considered

Leave .apply() as it is and add more specific methods such as .apply_df() and .apply_ser().

Additional context

Here is an example illustrating my point.

import functools

import numpy as np
import pandas as pd
from scipy.stats import trim_mean  # trim_mean lives in scipy.stats, not the top-level scipy namespace

n = 1000
df = pd.DataFrame(
    {
        "Store": np.random.choice(["Store_1", "Store_2"], n),
        "Product": np.random.choice(["Product_1", "Product_2"], n),
        "Revenue": (np.random.random(n) * 50 + 10).round(2),
        "Quantity": np.random.randint(1, 10, size=n),
    }
)

f1 = functools.partial(trim_mean, proportiontocut=0.2)

df[['Revenue', 'Quantity']].pipe(f1)
## array([34.6658    ,  5.09666667])

df.groupby(['Store', 'Product']).pipe(f1)
## ValueError: Can only compare identically-labeled DataFrame objects

kwhkim added the Enhancement and Needs Triage labels on Apr 6, 2022

@rhshadrach
Member

> whereas DataFrameGroupBy.pipe(func) is just a syntactic sugar(maybe) for DataFrameGroupBy.func().

This is not correct; pipe is used to support method chaining. E.g.

def mean_median_diff(gb):
    return gb.mean() - gb.median()

df = pd.DataFrame({'a': [1, 1, 1, 2], 'b': [3, 5, 6, 5]})
result = df.groupby('a').pipe(mean_median_diff)
print(result)

          b
a          
1 -0.333333
2  0.000000

Without pipe, one would have to call mean_median_diff(df.groupby...).
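
To make the equivalence concrete, here is a minimal sketch reusing the example above. It assumes nothing beyond pandas itself; the renamed column `b_mean_minus_median` is only an illustrative name. The function receives the DataFrameGroupBy object itself, so calling it directly and calling it through pipe should give the same result, and pipe keeps the call inside a method chain.

```python
import pandas as pd


def mean_median_diff(gb):
    # gb is the DataFrameGroupBy object itself, not one group's DataFrame
    return gb.mean() - gb.median()


df = pd.DataFrame({"a": [1, 1, 1, 2], "b": [3, 5, 6, 5]})

# Two spellings of the same call
piped = df.groupby("a").pipe(mean_median_diff)
direct = mean_median_diff(df.groupby("a"))
print(piped.equals(direct))

# With pipe, further steps can follow in the same chain
chained = (
    df.groupby("a")
    .pipe(mean_median_diff)
    .rename(columns={"b": "b_mean_minus_median"})  # illustrative column name
)
print(chained)
```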

> For DataFrameGroupBy, .apply() is applying a function to a grouped DataFrame, whereas DataFrameGroupBy.apply(func) is for applying func to the DataFrame's columns.

Not sure what you mean here; you seem to be talking about the same method (DataFrameGroupBy.apply) twice.

> I propose for consistency, using .pipe() for both DataFrame and DataFrameGroupBy to apply a function to a (grouped or not) DataFrame.

I don't understand how this differs from the current behavior.

rhshadrach added the Apply and Needs Discussion labels and removed the Needs Triage label on Apr 9, 2022
@kwhkim
Contributor Author

kwhkim commented Apr 10, 2022

It seems @rhshadrach is right on .pipe.

Anyway I think .pipe() is somewhat limited in its use.

Take a look at the code below.

df.skew()
df.kurt()

def diff_mean_skew(gb):
    return gb.mean() - gb.skew()

def diff_mean_kurt(gb):
    return gb.mean() - gb.kurt()

df[['Revenue', 'Quantity']].pipe(diff_mean_skew)
# Revenue     37.392213
# Quantity     5.686070
# dtype: float64

df[['Revenue', 'Quantity']].pipe(diff_mean_kurt)
df.groupby(['Store', 'Product']).pipe(diff_mean_skew)
#                      Revenue  Quantity
# Store   Product
# Store_1 Product_1  35.516486  4.593809
#         Product_2  39.556072  3.717556
# Store_2 Product_1  36.599322  4.729909
#         Product_2  33.351800  5.146413

df.groupby(['Store', 'Product']).pipe(diff_mean_kurt) 
# AttributeError

So a function passed to DataFrameGroupBy.pipe() can only rely on the methods that DataFrameGroupBy itself provides.

At the very least, I hope there can be a guide on how to define a function that can be used with DataFrameGroupBy.pipe().

Or maybe I am missing something?
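
One possible pattern, sketched here under assumptions rather than taken from the pandas docs: route any statistic that the groupby object may lack through .agg(), which exists on both DataFrame and DataFrameGroupBy and hands the callable one column at a time. The toy data and the function name `diff_mean_kurt_via_agg` are made up for illustration.

```python
import pandas as pd


def diff_mean_kurt_via_agg(obj):
    # .mean() and .agg() exist on both DataFrame and DataFrameGroupBy;
    # the lambda receives one column (a Series) at a time
    return obj.mean() - obj.agg(lambda s: s.kurt())


# Small illustrative data set (not the random data used elsewhere in this thread)
df = pd.DataFrame(
    {
        "Store": ["Store_1"] * 5 + ["Store_2"] * 5,
        "Revenue": [10.0, 12.5, 30.0, 41.0, 55.5, 11.0, 19.0, 23.5, 48.0, 60.0],
        "Quantity": [1, 3, 4, 7, 9, 2, 2, 5, 6, 8],
    }
)

# Same function, piped through a plain DataFrame ...
whole = df[["Revenue", "Quantity"]].pipe(diff_mean_kurt_via_agg)

# ... and through a DataFrameGroupBy
per_store = df.groupby("Store")[["Revenue", "Quantity"]].pipe(diff_mean_kurt_via_agg)

print(whole)
print(per_store)
```

The trade-off is that the callable given to .agg() runs in Python once per group and column, so it will likely be slower than a built-in groupby reduction where one exists.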

from scipy.stats import kurtosis

kurtosis(df[['Revenue', 'Quantity']])

def diff_mean_kurt2(gb):
    return gb.mean() - kurtosis(gb)

df[['Revenue', 'Quantity']].pipe(diff_mean_kurt2)
# Revenue     37.392053
# Quantity     5.692420
# dtype: float64

df.groupby(['Store', 'Product']).pipe(diff_mean_kurt2) 
# ValueError

But df.groupby().apply() works fine.

Also, df.apply() seems to work on columns (at least at first),
but df.groupby().apply() works on each split DataFrame.
(Isn't that inconsistent?)
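
The difference in what each method hands to the function can be checked directly. The sketch below is only an illustration: the `recorder` helper and the toy data are hypothetical, and it simply records the type name of whatever object each method passes in.

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y"], "a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Type names of the objects each method passes to the function
seen = {"df_apply": set(), "gb_apply": set(), "gb_pipe": set()}


def recorder(name):
    # Hypothetical helper: builds a function that notes the type of its input
    def func(obj):
        seen[name].add(type(obj).__name__)
        return 0

    return func


df[["a", "b"]].apply(recorder("df_apply"))                 # called once per column
df.groupby("key")[["a", "b"]].apply(recorder("gb_apply"))  # called once per group
df.groupby("key")[["a", "b"]].pipe(recorder("gb_pipe"))    # called once, with the groupby

print(seen)
```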

df[['Revenue', 'Quantity']].apply(lambda x: functools.partial(trim_mean, proportiontocut=0.2)(x))
# Revenue     36.602167
# Quantity     4.533333
# dtype: float64

df.groupby(['Store', 'Product']).apply(functools.partial(trim_mean, proportiontocut=0.2))
# Store    Product
# Store_1  Product_1                              [36.003, 4.6]
#          Product_2                            [40.07, 3.8125]
# Store_2  Product_1    [36.77153846153846, 4.6923076923076925]
#          Product_2                   [32.78333333333333, 5.2]
# dtype: object

df.groupby(['Store', 'Product']).apply(
    lambda s: pd.Series(
        functools.partial(trim_mean, proportiontocut=0.2)(s),
        index=s.columns,
    )
)
#                      Revenue  Quantity
# Store   Product
# Store_1 Product_1  36.003000  4.600000
#         Product_2  40.070000  3.812500
# Store_2 Product_1  36.771538  4.692308
#         Product_2  32.783333  5.200000
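
A possibly simpler route to the same tabular shape is .agg(), which applies a reducer column by column within each group and returns a DataFrame without the pd.Series wrapper. The sketch below is an assumption-laden illustration: `trimmed_mean` is a small NumPy stand-in written for this example so that it runs without SciPy (it is meant to mirror the cut used by scipy.stats.trim_mean for 1-D input), and the data is made up.

```python
import numpy as np
import pandas as pd


def trimmed_mean(values, proportiontocut=0.2):
    # Illustrative stand-in for scipy.stats.trim_mean on 1-D input:
    # drop the lowest and highest `proportiontocut` share of values, then average
    arr = np.sort(np.asarray(values, dtype=float))
    k = int(proportiontocut * arr.size)
    return arr[k : arr.size - k].mean()


# Made-up data with one outlier per store
df = pd.DataFrame(
    {
        "Store": ["Store_1"] * 5 + ["Store_2"] * 5,
        "Revenue": [10.0, 20.0, 30.0, 40.0, 500.0, 1.0, 11.0, 12.0, 13.0, 99.0],
        "Quantity": [1, 2, 3, 4, 50, 5, 5, 5, 5, 5],
    }
)

# One row per group, one column per input column
per_store = df.groupby("Store")[["Revenue", "Quantity"]].agg(trimmed_mean)
print(per_store)
```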
