ENH: `.pipe()` on `DataFrameGroupBy` #46655

kwhkim · 2022-04-06T06:01:11Z

Is your feature request related to a problem?

DataFrameGroupBy.pipe() cannot be used with a user-defined function (UDF).

This is related to the higher-order methods API and an inconsistency in it.

From the doc I see

we have five methods: .pipe(), .apply(), .agg(), .transform(), and .applymap().

For DataFrame, .pipe() applies a function to the DataFrame, whereas DataFrameGroupBy.pipe(func) seems to be just syntactic sugar for DataFrameGroupBy.func(). We cannot use a UDF unless it is already a proper method of DataFrameGroupBy. (This is what the docs say, and I experimented a little and it looks that way.)

For DataFrameGroupBy, .apply() is applying a function to a grouped DataFrame, whereas DataFrameGroupBy.apply(func) is for applying func to the DataFrame's columns.

Describe the solution you'd like

I propose for consistency, using .pipe() for both DataFrame and DataFrameGroupBy to apply a function to a (grouped or not) DataFrame.

One other thought: do we really need .apply() to essentially do what .applymap() does? For consistency, I think .apply() had better be reserved for applying a function to columns (axis=0) or rows (axis=1). And if we think of .apply() as a rather free-form method (in comparison to .agg(), where the function should be a reducer, and .transform(), where the function should be a transformer), we might better distinguish what the function input is, for example by naming the methods .apply_df() and .apply_ser().

API breaking implications

.apply() would no longer accept functions that operate on a whole DataFrame and would be specialized for applying functions to columns.

Describe alternatives you've considered

Leave .apply() as it is and add more specific methods such as .apply_df() and .apply_ser().

Additional context

Here is an example illustrating my point.

import numpy as np
import pandas as pd
from scipy.stats import trim_mean  # trim_mean lives in scipy.stats, not the top-level scipy package
import functools

n = 1000
df = pd.DataFrame(
    {
        "Store": np.random.choice(["Store_1", "Store_2"], n),
        "Product": np.random.choice(["Product_1", "Product_2"], n),
        "Revenue": (np.random.random(n) * 50 + 10).round(2),
        "Quantity": np.random.randint(1, 10, size=n),
    }
)

f1 = functools.partial(trim_mean, proportiontocut=0.2)

df[['Revenue', 'Quantity']].pipe(f1)
## array([34.6658    ,  5.09666667])

df.groupby(['Store', 'Product']).pipe(f1)
## ValueError: Can only compare identically-labeled DataFrame objects


rhshadrach · 2022-04-09T02:21:36Z

whereas DataFrameGroupBy.pipe(func) is just a syntactic sugar(maybe) for DataFrameGroupBy.func().

This is not correct; pipe is used to support method chaining. E.g.

def mean_median_diff(gb):
    return gb.mean() - gb.median()

df = pd.DataFrame({'a': [1, 1, 1, 2], 'b': [3, 5, 6, 5]})
result = df.groupby('a').pipe(mean_median_diff)
print(result)

          b
a          
1 -0.333333
2  0.000000

Without pipe, one would have to call mean_median_diff(df.groupby...).
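To make the equivalence concrete, here is a minimal sketch reusing the example above. It only illustrates that the function passed to pipe receives the DataFrameGroupBy object itself (not each group's rows); the variable names piped, direct, and same are introduced here for illustration.

```python
import pandas as pd


def mean_median_diff(gb):
    # gb is the whole DataFrameGroupBy object, not one group's rows
    return gb.mean() - gb.median()


df = pd.DataFrame({"a": [1, 1, 1, 2], "b": [3, 5, 6, 5]})

# Chained form: the groupby object is passed to the function by pipe
piped = df.groupby("a").pipe(mean_median_diff)

# Equivalent direct call, which breaks the left-to-right chain
direct = mean_median_diff(df.groupby("a"))

same = piped.equals(direct)
print(same)
```

Both spellings compute the same thing; pipe only changes how the call reads inside a longer chain.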

For DataFrameGroupBy, .apply() is applying a function to a grouped DataFrame, whereas DataFrameGroupBy.apply(func) is for applying func to the DataFrame's columns.

Not sure what you mean here; you seem to be talking about the same method (DataFrameGroupBy.apply) twice.

I propose for consistency, using .pipe() for both DataFrame and DataFrameGroupBy to apply a function to a (grouped or not) DataFrame.

I don't understand how this differs from the current behavior.

kwhkim · 2022-04-10T11:53:23Z

It seems @rhshadrach is right about .pipe().

Anyway, I think .pipe() is somewhat limited in its use.

Take a look at the code below.

df.skew()
df.kurt()

def diff_mean_skew(gb):
    return gb.mean() - gb.skew()

def diff_mean_kurt(gb):
    return gb.mean() - gb.kurt()

df[['Revenue', 'Quantity']].pipe(diff_mean_skew)
# Revenue     37.392213
# Quantity     5.686070
# dtype: float64

df[['Revenue', 'Quantity']].pipe(diff_mean_kurt)
df.groupby(['Store', 'Product']).pipe(diff_mean_skew)
#                      Revenue  Quantity
# Store   Product
# Store_1 Product_1  35.516486  4.593809
#         Product_2  39.556072  3.717556
# Store_2 Product_1  36.599322  4.729909
#         Product_2  33.351800  5.146413

df.groupby(['Store', 'Product']).pipe(diff_mean_kurt) 
# AttributeError

So inside a function passed to DataFrameGroupBy.pipe(), we can only call the methods that DataFrameGroupBy already provides.

I hope there could at least be a guide on how to define a function that can be used with DataFrameGroupBy.

Or maybe I am missing something?
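One possible pattern, offered as a sketch rather than as official guidance: write the function against the groupby object, and route any statistic that the groupby object does not expose through .agg(), which calls the supplied function on each group's column as a Series. The helper name diff_mean_kurt_agg and the small data set are hypothetical.

```python
import pandas as pd


def diff_mean_kurt_agg(gb):
    # Hypothetical helper: gb.mean() is a built-in groupby method, while
    # kurtosis is computed per group and per column through .agg(), so the
    # function does not depend on the groupby object having a kurt method.
    return gb.mean() - gb.agg(lambda s: s.kurt())


df = pd.DataFrame(
    {
        "key": ["x"] * 5 + ["y"] * 5,
        "val": [1.0, 2.0, 4.0, 7.0, 11.0, 3.0, 3.5, 5.0, 8.0, 8.5],
    }
)

result = df.groupby("key").pipe(diff_mean_kurt_agg)
print(result)
```

The trade-off is that a Python callable passed to .agg() is generally slower than a built-in groupby method, so this is a fallback for statistics the groupby object lacks.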

from scipy.stats import kurtosis

kurtosis(df[['Revenue', 'Quantity']])

def diff_mean_kurt2(gb):
    return gb.mean() - kurtosis(gb)

df[['Revenue', 'Quantity']].pipe(diff_mean_kurt2)
# Revenue     37.392053
# Quantity     5.692420
# dtype: float64

df.groupby(['Store', 'Product']).pipe(diff_mean_kurt2) 
# ValueError

But df.groupby().apply() works fine.

Also, df.apply() seems to work on columns (at least at first), but df.groupby().apply() works on each split DataFrame. (No consistency?)
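The difference can be checked by recording the type of object each method hands to the function. This is a minimal sketch on a made-up frame; the record helper and the column names are hypothetical, and the value columns are selected explicitly after groupby so the grouping column is not part of what the function sees.

```python
import pandas as pd


def record(store):
    # Hypothetical helper: builds a function that notes the type of its input
    def func(obj):
        store.append(type(obj).__name__)
        return 0
    return func


df = pd.DataFrame({"key": ["x", "x", "y"], "u": [1, 2, 3], "v": [4, 5, 6]})

seen_by_df_apply = []
df[["u", "v"]].apply(record(seen_by_df_apply))  # default axis=0: one call per column

seen_by_gb_apply = []
df.groupby("key")[["u", "v"]].apply(record(seen_by_gb_apply))  # one call per group

print(set(seen_by_df_apply), set(seen_by_gb_apply))
```

So a function written for one column (a Series) fits DataFrame.apply, while a function passed to groupby(...).apply on a multi-column selection has to accept a sub-DataFrame.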

df[['Revenue', 'Quantity']].apply(lambda x: functools.partial(trim_mean, proportiontocut=0.2)(x))
# Revenue     36.602167
# Quantity     4.533333
# dtype: float64

df.groupby(['Store', 'Product']).apply(functools.partial(trim_mean, proportiontocut=0.2))
# Store    Product  
# Store_1  Product_1                              [36.003, 4.6]
#              Product_2                            [40.07, 3.8125]
# Store_2  Product_1    [36.77153846153846, 4.6923076923076925]
#               Product_2                   [32.78333333333333, 5.2]
# dtype: object

df.groupby(['Store', 'Product']).apply(lambda s: 
                                       pd.Series(functools.partial(trim_mean, proportiontocut = 0.2)(s),
                                                index = s.columns))
#                      Revenue  Quantity
# Store   Product
# Store_1 Product_1  36.003000  4.600000
#         Product_2  40.070000  3.812500
# Store_2 Product_1  36.771538  4.692308
#         Product_2  32.783333  5.200000

kwhkim added the Enhancement and Needs Triage labels Apr 6, 2022

rhshadrach added the Apply and Needs Discussion labels and removed the Needs Triage label Apr 9, 2022
