ENH: .pipe() on DataFrameGroupBy #46655
Comments
This is not correct; pipe is used to support method chaining. E.g.
Without pipe, one would have to call
Not sure what you mean here; you seem to be talking about the same method (DataFrameGroupBy.apply) twice.
I don't understand how this differs from the current behavior.
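To illustrate the method-chaining point, here is a minimal sketch. The data and the `value_range` helper are hypothetical and not taken from this thread. It assumes only that `DataFrameGroupBy.pipe(func)` calls `func` with the GroupBy object itself, so a user-defined function works as long as it only uses methods that the GroupBy object provides.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Store": ["Store_1", "Store_1", "Store_2", "Store_2"],
    "Revenue": [10.0, 20.0, 30.0, 50.0],
    "Quantity": [1, 3, 5, 7],
})

def value_range(gb):
    # gb is the GroupBy object, not an individual group
    return gb.max() - gb.min()

# Chained form using pipe
piped = df.groupby("Store")[["Revenue", "Quantity"]].pipe(value_range)

# Equivalent nested call without pipe
direct = value_range(df.groupby("Store")[["Revenue", "Quantity"]])
```

Both forms should give the same per-group result; `pipe` only changes how the call is written.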
It seems @rhshadrach is right on Anyway I think Take a look at the code below.
df.skew()
df.kurt()
def diff_mean_skew(gb):
    return gb.mean() - gb.skew()
def diff_mean_kurt(gb):
    return gb.mean() - gb.kurt()
df[['Revenue', 'Quantity']].pipe(diff_mean_skew)
# Revenue 37.392213
# Quantity 5.686070
# dtype: float64
df[['Revenue', 'Quantity']].pipe(diff_mean_kurt)
df.groupby(['Store', 'Product']).pipe(diff_mean_skew)
# Revenue Quantity
# Store Product
# Store_1 Product_1 35.516486 4.593809
# Product_2 39.556072 3.717556
# Store_2 Product_1 36.599322 4.729909
# Product_2 33.351800 5.146413
df.groupby(['Store', 'Product']).pipe(diff_mean_kurt)
# AttributeError
So to use I hope at least there should be a guide how to define a function that can be used for Or maybe I might be missing something?
from scipy.stats import kurtosis
kurtosis(df[['Revenue', 'Quantity']])
def diff_mean_kurt2(gb):
    return gb.mean() - kurtosis(gb)
df[['Revenue', 'Quantity']].pipe(diff_mean_kurt2)
# Revenue 37.392053
# Quantity 5.692420
# dtype: float64
df.groupby(['Store', 'Product']).pipe(diff_mean_kurt2)
# ValueError
But and
import functools
from scipy.stats import trim_mean
df[['Revenue', 'Quantity']].apply(lambda x: functools.partial(trim_mean, proportiontocut=0.2)(x))
# Revenue 36.602167
# Quantity 4.533333
# dtype: float64
df.groupby(['Store', 'Product']).apply(functools.partial(trim_mean, proportiontocut=0.2))
# Store Product
# Store_1 Product_1 [36.003, 4.6]
# Product_2 [40.07, 3.8125]
# Store_2 Product_1 [36.77153846153846, 4.6923076923076925]
# Product_2 [32.78333333333333, 5.2]
# dtype: object
df.groupby(['Store', 'Product']).apply(
    lambda s: pd.Series(
        functools.partial(trim_mean, proportiontocut=0.2)(s),
        index=s.columns,
    )
)
# Revenue Quantity
# Store Product
# Store_1 Product_1 36.003000 4.600000
# Product_2 40.070000 3.812500
# Store_2 Product_1 36.771538 4.692308
# Product_2 32.783333 5.200000
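One possible way to write a function that `DataFrameGroupBy.pipe()` accepts, sketched under assumptions: the `AttributeError` above is taken to mean that the GroupBy object in that pandas version has no `kurt` method, so the sketch routes `Series.kurt` through `.agg()` instead. The data is hypothetical, and this is not necessarily the approach the maintainers would recommend.

```python
import pandas as pd

# Hypothetical data, for illustration only (at least 4 rows per group,
# since the sample kurtosis needs 4 observations)
df = pd.DataFrame({
    "Store": ["Store_1"] * 5 + ["Store_2"] * 5,
    "Revenue": [10.0, 12.0, 15.0, 11.0, 30.0, 20.0, 22.0, 21.0, 25.0, 40.0],
    "Quantity": [1.0, 2.0, 2.0, 3.0, 9.0, 4.0, 5.0, 5.0, 6.0, 12.0],
})

def diff_mean_kurt(gb):
    # .agg calls the lambda on each column of each group,
    # so Series.kurt is available even if the GroupBy object has no kurt()
    return gb.mean() - gb.agg(lambda s: s.kurt())

result = df.groupby("Store")[["Revenue", "Quantity"]].pipe(diff_mean_kurt)
```

The function still receives the GroupBy object; it just delegates the per-group computation to `.agg()` rather than relying on a method named after the statistic.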
Is your feature request related to a problem?
DataFrameGroupBy.pipe() cannot be used with a UDF. This is related to the higher-order methods API and its inconsistency.
From the doc I see we have these methods: .pipe(), .apply(), .agg(), .transform(), .applymap().
For DataFrame, .pipe() applies a function to a DataFrame, whereas DataFrameGroupBy.pipe(func) is just syntactic sugar (maybe) for DataFrameGroupBy.func(). We can't use a UDF unless it's a proper method of DataFrameGroupBy (this is what the doc says, and I experimented a little and it looks that way).
For DataFrameGroupBy, .apply() applies a function to a grouped DataFrame, whereas DataFrameGroupBy.apply(func) is for applying func to the DataFrame's columns.
Describe the solution you'd like
I propose, for consistency, using .pipe() for both DataFrame and DataFrameGroupBy to apply a function to a (grouped or not) DataFrame.
One other thought: do we really need .apply() for essentially doing .applymap()? For consistency I think .apply() had better be reserved for applying a function to columns (axis=0) or rows (axis=1). And if we think of .apply() as a rather free method (in comparison to .agg(), where the function should be a reducer, and .transform(), where the function should be a transformer), we might better distinguish what the function input would be, for example by naming the methods .apply_df() and .apply_ser().
API breaking implications
.apply() should be banned from applying functions to a DataFrame and be specialized in applying functions to columns.
Describe alternatives you've considered
Let .apply() be as it is and adopt more specific methods like .apply_df() and .apply_ser().
Additional context
Here is an example illustrating my point.
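As a hedged illustration of the distinction discussed in this issue, the sketch below records what kind of object each method passes to the function. The data and the `record` helper are hypothetical and not part of the original report.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Store": ["Store_1", "Store_1", "Store_2"],
    "Revenue": [10.0, 20.0, 30.0],
    "Quantity": [1, 2, 3],
})
cols = ["Revenue", "Quantity"]
seen = {}

def record(name):
    # Build a function that remembers the type of the first object it receives
    def func(obj):
        seen.setdefault(name, type(obj).__name__)
        return obj.sum()
    return func

df[cols].pipe(record("DataFrame.pipe"))
df[cols].apply(record("DataFrame.apply"))
df.groupby("Store")[cols].pipe(record("DataFrameGroupBy.pipe"))
df.groupby("Store")[cols].apply(record("DataFrameGroupBy.apply"))

for name, received in seen.items():
    print(f"{name} passes a {received}")
```

In this sketch, `DataFrame.pipe` and `DataFrameGroupBy.apply` hand the function a DataFrame, `DataFrame.apply` hands it one column at a time as a Series, and `DataFrameGroupBy.pipe` hands it the GroupBy object.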