Performance regression in 0.24+ on GroupBy.apply #25883
Can you isolate the frame operation from the groupby? Curious if the regression is noticeable in the former.
You mean like so? Well, there is a slight difference (~30 % slower), but I wouldn't trust my "benchmark" here too much. I ran the code a few times and the numbers varied between 0.5 and 1 second.

```python
import time

import numpy as np
import pandas as pd

nrows, ncols = 1000, 1000
df = pd.DataFrame(np.random.rand(nrows, ncols))

start = time.time()
for _ in range(100):
    df.apply(lambda x: x - x.mean())
end = time.time()

print("[pandas=={}] execution time: {:.4f} seconds".format(pd.__version__, end - start))
# [pandas==0.23.4] execution time: 25.8880 seconds
# [pandas==0.24.0] execution time: 36.0216 seconds
# [pandas==0.24.2] execution time: 34.6180 seconds
```

Additionally, I tested the original code sample with only one group. It is still about 2 times slower in 0.24+ compared to 0.23.4, but not as drastic as with multiple groups.

```python
nrows, ncols = 10000, 10000
df["key"] = [1] * nrows
# [pandas==0.23.4] execution time: 5.5250 seconds
# [pandas==0.24.0] execution time: 12.1590 seconds
# [pandas==0.24.2] execution time: 12.1540 seconds
```
Right, I'm just trying to isolate potential regressions in GroupBy versus Frame operations. Given that you don't see the same regression with scalars, I'm inclined to believe it's the latter that may be at fault here. Can you try your last example on master? I think those results might be misleading, as the apply operation would still get called twice even with only one group in 0.24.2 (see #24748, which just changed this behavior), so it might not be a clean comparison to make. @TomAugspurger we were never able to get the ASV site back up and running, were we?
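(To make the double-evaluation remark above concrete, a minimal sketch; the exact call count depends on the pandas version, so treat the printed number as illustrative:)

```python
import numpy as np
import pandas as pd

calls = 0

def demean(group):
    global calls
    calls += 1
    return group - group.mean()

df = pd.DataFrame({"key": [1] * 4, "value": range(4)})
df.groupby("key").apply(demean)

# On 0.24.x, apply evaluates the function on the first group twice
# (once to decide between the fast and slow paths), so even a single
# group can print 2 here.
print(calls)
```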
I guess I got closer to the problem. It really does seem to be related to data frame operations, i.e. subtracting a scalar from a data frame already shows the regression. And it's huge: on column-heavy shapes, this simple operation is 153 times slower ô.O

```python
import timeit

import numpy as np  # 1.16.2
import pandas as pd

def benchmark():
    nrows, ncols = 100, 100
    df = pd.DataFrame(np.random.rand(nrows, ncols))
    _ = df - 1

time = timeit.timeit(benchmark, number=100)
print("# {:>8.4f} sec pandas=={}".format(time, pd.__version__))
```

Here are my benchmarking results of the `df - 1` operation.
Concerning `GroupBy.apply`, here is the benchmark I use:

```python
import timeit

import numpy as np  # 1.16.2
import pandas as pd

def benchmark():
    nrows, ncols = 1000, 100
    df = pd.DataFrame(np.random.rand(nrows, ncols))
    df["key"] = range(nrows)
    numeric_columns = list(range(ncols))
    grouping = df.groupby(by="key")
    grouping[numeric_columns].apply(lambda x: x.mean())

time = timeit.timeit(benchmark, number=10)
print("# {:>8.4f} sec pandas=={}".format(time, pd.__version__))
```
Note that if this is your actual function, you can/should instead do the following, which has always been faster:

```python
df[numeric_columns] - grouping[numeric_columns].transform('mean')
```
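(A self-contained sketch of that suggestion, reusing the names from the benchmark above; the shapes and group layout are arbitrary assumptions:)

```python
import numpy as np
import pandas as pd

nrows, ncols = 1000, 100
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = np.arange(nrows) % 10
numeric_columns = list(range(ncols))
grouping = df.groupby("key")

# slow path: apply calls the lambda once per group and reassembles the pieces
via_apply = grouping[numeric_columns].apply(lambda x: x - x.mean())

# fast path: transform('mean') runs a vectorized group aggregation broadcast
# back to the original shape, followed by a single frame-level subtraction;
# the values should agree with the apply result up to index alignment
via_transform = df[numeric_columns] - grouping[numeric_columns].transform("mean")
```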
Can you check for duplicates? We have another issue for ops that were previously blockwise but are now columnwise.
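(To make the blockwise-vs-columnwise distinction concrete, a hypothetical micro-benchmark: if the scalar op is dispatched once per column instead of once per block, runtime should grow with the column count even when the total element count is held fixed:)

```python
import timeit

import numpy as np
import pandas as pd

# total element count stays at 100,000; only the shape changes
for ncols in (10, 100, 1000):
    df = pd.DataFrame(np.random.rand(100_000 // ncols, ncols))
    elapsed = timeit.timeit(lambda: df - 1, number=100)
    print("ncols={:>5}: {:.4f} sec".format(ncols, elapsed))
```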
Local ASV result comparing current HEAD to the last commit on 0.23.4 confirms a regression for the frame ops:

```
      before           after         ratio
    [af7b0ba4]       [95c78d65]
    <master>
+     3.30±0.2ms        249±3ms     75.37  binary_ops.Ops2.time_frame_float_div_by_zero
+     8.08±0.4ms       256±10ms     31.75  binary_ops.Ops2.time_frame_int_div_by_zero
+     12.6±0.2ms        261±7ms     20.78  binary_ops.Ops2.time_frame_float_floor_by_zero
+     30.2±0.4ms        106±1ms      3.52  binary_ops.Ops.time_frame_multi_and(False, 1)
+     29.9±0.7ms        102±4ms      3.43  binary_ops.Ops.time_frame_multi_and(False, 'default')
+     34.1±0.4ms        110±2ms      3.24  binary_ops.Ops.time_frame_multi_and(True, 1)
+     39.2±0.4ms      111±0.9ms      2.83  binary_ops.Ops.time_frame_multi_and(True, 'default')
+    3.85±0.06ms     5.18±0.1ms      1.35  binary_ops.Ops.time_frame_add(True, 1)
+     28.6±0.1μs     37.4±0.2μs      1.31  binary_ops.Ops2.time_series_dot
+    3.54±0.06ms     4.29±0.3ms      1.21  binary_ops.Ops.time_frame_add(False, 1)
+        512±1μs        600±3μs      1.17  binary_ops.Ops2.time_frame_series_dot
-        107±2ms       61.2±1ms      0.57  binary_ops.Ops.time_frame_comparison(False, 'default')
-        108±1ms     61.1±0.9ms      0.57  binary_ops.Ops.time_frame_comparison(False, 1)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
```

I've updated the title to reflect this, as I think that is the larger issue. @blu3r4y if you can run ASVs for `GroupBy.apply` to confirm the regression there, that could be helpful as another issue.
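(For reference, a sketch of the usual asv invocation for such a comparison, run from the pandas `asv_bench` directory; the `-b` selector regex here is an assumption about which benchmarks to match:)

```
asv continuous -f 1.1 v0.23.4 v0.24.0 -b ^groupby
```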
Yes, the regression with DataFrame + scalar ops in 0.24+ has already been reported in #24990. @WillAyd I will run ASV for `GroupBy.apply` soon, so that we keep this issue isolated on `GroupBy.apply`, right? Or should I make a new issue then?
That makes sense - thanks Mario!
As suggested by @WillAyd, I ran ASV for `GroupBy.apply`. These findings must be taken with a grain of salt, since the benchmarks do not show stable results (maybe interesting for the discussion in #23412), although I guess the impact on `GroupBy.apply` is clear. I renamed this issue to focus on the performance regression in `GroupBy.apply`.

v0.23.4 0409521 vs. v0.24.0 83eb242 (5 warm-up runs + 5 reported runs)

v0.23.4 0409521 vs. HEAD 437efa6 (5 warm-up runs + 5 reported runs)
FWIW, on a desktop computer the benchmark numbers are fairly stable. If the accuracy is not sufficient, you can add more repetitions.
I'm seeing 0.2614 seconds on my machine on main, but that means relatively little in isolation. However, I think this issue is too old; comparisons with 0.23 performance are no longer useful.
Code Sample
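(A minimal sketch of the operation being reported; the shapes, the per-row `key` column, and the timing wrapper are assumptions pieced together from the comments in this thread:)

```python
import time

import numpy as np
import pandas as pd

nrows, ncols = 1000, 1000
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = range(nrows)  # every row is its own group

start = time.time()
# slow in 0.24+: the lambda returns a DataFrame for every group;
# scalar return values such as lambda x: x.mean() do not regress
df.groupby("key").apply(lambda x: x - x.mean())
end = time.time()

print("[pandas=={}] execution time: {:.4f} seconds".format(pd.__version__, end - start))
```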
Problem description
The function `GroupBy.apply` is a lot slower (~25 times) with version 0.24.0 compared to the 0.23.4 release. The problem still persists in the latest 0.24.2 release. The code sample above shows this performance regression.

The purpose of the sample is to subtract the group mean from all elements in this group.

The problem only occurs when the lambda for `apply()` returns a data frame. There are no performance issues with scalar return values, e.g. `lambda x: x.mean()`.

Output of `pd.show_versions()`