PERF: performance regression in 1.0 compared to 0.25 #30790
FYI, the benchmark machine should be slowly filling in some of those values. It was hanging for a couple of weeks on an xarray benchmark.
Looking at the indexing ones like
For the GroupBy regressions, I can't really replicate them locally in my default environments (using Python 3.7); there I only see a slight slowdown (on the order of 10% instead of 2x). But creating a Python 3.6 env for both 0.25.3 and master, I do see a clearer slowdown (although more like 1.5x than 2x). Example code from one of the benchmarks (groupby.GroupByMethods):
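The example code itself is not preserved in this excerpt; a minimal sketch in the spirit of the groupby.GroupByMethods benchmark (the column names, sizes, and the choice of the all method are assumptions, not the actual benchmark code) would be:

```python
# Sketch of a groupby micro-benchmark; not the exact ASV benchmark code.
import timeit

import numpy as np
import pandas as pd

ngroups, size = 1_000, 2_000
df = pd.DataFrame(
    {
        "key": np.random.randint(0, ngroups, size=size),
        "values": np.random.randint(0, 100, size=size),
    }
)

# Time a cheap aggregation many times so that per-call groupby overhead
# (where the 0.25 -> 1.0 slowdown shows up) dominates the measurement.
elapsed = timeit.timeit(lambda: df.groupby("key")["values"].all(), number=200)
print(f"{elapsed / 200 * 1e3:.3f} ms per groupby-all call")
```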
Profiling that example, one of the factors contributing to the slowdown is more time spent in the
Another one at the top of the list is this very simple benchmark:
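Which benchmark is meant is not clear from this excerpt; purely as a hypothetical illustration of such a "very simple" micro-benchmark, timing bare Index construction (one of the regressions listed in the issue description, Index.__new__) could look like:

```python
# Hypothetical stand-in, not the benchmark referenced above: time bare
# Index construction from an existing ndarray.
import timeit

import numpy as np
import pandas as pd

data = np.arange(10_000)
elapsed = timeit.timeit(lambda: pd.Index(data), number=1_000)
print(f"Index construction: {elapsed / 1_000 * 1e6:.1f} µs per call")
```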
This one I can reproduce locally, and it gives more than a 2x slowdown. From a quick profile, on master the total amount of time for
@TomAugspurger is the benchmark machine still filling in values? It would be helpful to know where to bisect if there was more; if not, no big deal, just wanted to ask.
I see you closed #18532; is there nothing there that we should be concerned with?
Possibly, if you have time to look into them.
Short reproducer for the indexing case:
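The reproducer code itself is not preserved here; a hypothetical stand-in (the Series shape and the use of scalar .loc lookups are assumptions) might look like:

```python
# Hypothetical reproducer: many scalar .loc lookups on a small Series,
# so that per-call overhead (such as extract_array) dominates.
import timeit

import numpy as np
import pandas as pd

ser = pd.Series(np.arange(1_000))

elapsed = timeit.timeit(lambda: ser.loc[500], number=10_000)
print(f"scalar .loc lookup: {elapsed / 10_000 * 1e6:.2f} µs per call")
```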
And as Tom mentioned, this seems mostly due to additional time taken by
Are the extract_array dtype checks inside .array? Could be avoided by using ._values?
Yes, I think the dtype checks are for a large part there. Using _values to avoid this might be a bit tricky, though. For example, _values for a Series with datetime64 (tz-naive) data returns a numpy array and not a DatetimeArray. I don't know what depends on _values returning a numpy array in that case. We could define .array directly on Series and rely on the block to return the correct thing (right now Series._values calls internal_values() on the block; we could add something similar for .array, although it is yet another "values definition").
Yeah, I'd like to avoid that if possible.
Does it break any tests if we change .array -> ._values? All of the places that come to mind for using extract_array should be unaffected.
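For reference, a small snippet illustrating the .array vs ._values difference being discussed (made-up data; behavior as described above for pandas around 1.0, which may differ in later versions):

```python
import pandas as pd

# tz-naive datetime64 Series, the tricky case mentioned above
ser = pd.Series(pd.date_range("2020-01-01", periods=3))

# .array returns a DatetimeArray (an ExtensionArray), which is where the
# dtype handling discussed above comes into play
print(type(ser.array))    # DatetimeArray

# the private ._values returns a plain numpy datetime64 array here (as of
# pandas 1.0), so swapping .array -> ._values changes the returned type
print(type(ser._values))  # numpy.ndarray, dtype datetime64[ns]
```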
@TomAugspurger is it possible the benchmark machine hasn't been running the last few days? (I don't see any updates since I posted this issue.)
See these in the logs when running
Not sure what's up.
Can't reproduce this. I just ran locally a
Yeah, I'm guessing something got messed up on a run and asv can't recover. Fixing things manually now.
Seems like conda got into a bad state on the benchmark machine. Restarting helped, and so the benchmarks should start running again now.
Any other regressions we should focus on in the short term?
Indexing a MultiIndex seemingly went from O(1) to O(N). Code to reproduce:
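The reproducer itself is not preserved in this excerpt; a hedged sketch of the kind of timing that would expose O(N) scaling of a single-key lookup (sizes and index layout are assumptions):

```python
# Hypothetical sketch: time a single .loc lookup on MultiIndexed frames of
# increasing size; roughly constant timings suggest O(1), linear growth O(N).
import timeit

import numpy as np
import pandas as pd

for n in (10_000, 100_000, 1_000_000):
    idx = pd.MultiIndex.from_product([range(n // 100), range(100)])
    df = pd.DataFrame({"x": np.arange(n)}, index=idx)
    df.loc[(0, 0)]  # warm-up lookup so index-engine construction is not timed
    t = timeit.timeit(lambda: df.loc[(n // 200, 50)], number=100)
    print(f"n={n:>9,}: {t / 100 * 1e6:8.1f} µs per lookup")
```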
(Edit: previously said O(N^2) because I was counting the number of elements indexed too)
@valtron thanks for reporting this. Any chance you can bisect to figure out when this was introduced?
@jbrockmendel introduced in the
@valtron can you open a new dedicated issue for this?
The issue is here: #31648
Moved off the 1.1.1 milestone (scheduled for this week) as there are no PRs to fix this in the pipeline.
Moved off the 1.1.2 milestone (scheduled for this week) as there are no PRs to fix this in the pipeline.
Moved off the 1.1.3 milestone (overdue) as there are no PRs to fix this in the pipeline.
Closing this; these are old. They may still be here but would need new measurements. Please open a new, specific issue.
I ran a full benchmark on a separate machine locally, comparing current master against 0.25.3.
Some identified cases:
- extract_array, reproducer below at PERF: performance regression in 1.0 compared to 0.25 #30790 (comment)
- Index.__new__, reproducer below at PERF: performance regression in 1.0 compared to 0.25 #30790 (comment)
- asof, due to additional copy / take: DEPR: is_copy arg of take #30615

Full results:
I already commented on a few PRs; for the rest, I would need to take a further look. Help is certainly welcome to check certain cases.
One recurrent theme seems to be a rather consistent slowdown of a bunch of groupby methods. This can also be seen on the benchmark machine (eg https://pandas.pydata.org/speed/pandas/index.html#groupby.GroupByMethods.time_dtype_as_group?p-dtype='int'&p-method='all'&p-method='any'&odfpy=)