You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I think the current group_by design is too limited. For example, trying to implement tpc-h queries, how are we supposed to write
SELECT l_returnflag,
l_linestatus,
Sum(l_extendedprice * ( 1- l_discount )) AS sum_disc_price,
Avg(l_quantity) AS avg_qty,
Count(*) AS count_order
FROM lineitem
GROUP BY l_returnflag,
l_linestatus
?
There's two problems with the current API:
we can't mix and match sum and avg within the same groupby
we can't transform columns within the group by aggregation. For example, tpc-h would require l_extendedprice*(1-l_discount) to be written inside the group by, rather than outside it
There's a couple of solutions I can think of
1. Expressions
Once again, the expressions API may help us here (#249)
We could introduce GroupBy.aggregate, which you can pass an expression to and which must reduce itself to a 1-row column per group
So, for example, the above could be expressed with:
Alternatively, .aggregate could take a tuple of (output_column_name, input_column_name, aggregation_name), and in the case above then (col('l_extendedprice')*(1-col('discount'))) would have had to be computed before entering the group_by. E.g.:
Maybe we could also define our own dataclass to make the above more readable
Technically this might not be totally in accordance to tpc-h rules on not reordering, but it would avoid requiring lambda functions for the pandas implementation
Which solution is best?
if we're going for "how should the API of a new dataframe library look like?", then I prefer option 1
if we're going for "a minimal API which all major dataframe libraries today can agree on" then I prefer option 2
All things considered, I might be leaning more towards option 2
The text was updated successfully, but these errors were encountered:
I think the current group_by design is too limited. For example, trying to implement tpc-h queries, how are we supposed to write
?
There's two problems with the current API:
sum
andavg
within the same groupbyl_extendedprice*(1-l_discount)
to be written inside the group by, rather than outside itThere's a couple of solutions I can think of
1. Expressions
Once again, the expressions API may help us here (#249)
We could introduce
GroupBy.aggregate
, which you can pass an expression to and which must reduce itself to a 1-row column per groupSo, for example, the above could be expressed with:
In the pandas implementation:
col('l_quantity').mean().rename('avg_qty')
could get mapped toavg_qty=pd.NamedAgg(column="l_quantity", aggfunc="mean")
(col('l_extendedprice')*(1-col('discount'))).sum().rename('sum_disc_price')
would probably require a custom lambda functionThis would be maximally flexible, but it would mean that people would write code which could be inefficient in the pandas implementation
2. (output_column_name, input_column_name, aggregation_name, **kwargs) tuples
Alternatively,
.aggregate
could take a tuple of (output_column_name, input_column_name, aggregation_name), and in the case above then(col('l_extendedprice')*(1-col('discount')))
would have had to be computed before entering thegroup_by
. E.g.:Maybe we could also define our own dataclass to make the above more readable
Technically this might not be totally in accordance to tpc-h rules on not reordering, but it would avoid requiring lambda functions for the pandas implementation
Which solution is best?
All things considered, I might be leaning more towards option 2
The text was updated successfully, but these errors were encountered: