Skip to content

Performance regression in DataFrame reduction ops #38592

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Dec 20, 2020 · 13 comments · Fixed by #42475
Closed

Performance regression in DataFrame reduction ops #38592

jorisvandenbossche opened this issue Dec 20, 2020 · 13 comments · Fixed by #42475
Labels
Performance Memory or execution speed performance Reduction Operations sum, mean, min, max, etc. Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@jorisvandenbossche
Copy link
Member

The stat_ops.FrameOps are generally showing a slowdown since a few days, see eg https://pandas.pydata.org/speed/pandas/#stat_ops.FrameOps.time_op?python=3.8&Cython=0.29.21&p-op='mean'&p-dtype='int'&p-axis=0&commits=8dbb593d-7043f8fa

The indicated range is 8dbb593...7043f8f, but don't directly see which commit might be the cause cc @jbrockmendel

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Reduction Operations sum, mean, min, max, etc. labels Dec 20, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Dec 20, 2020
@IngErnestoAlvarez
Copy link

I think it's in this commit: 7043f8f.
Where astype function was deleted and replaced with astype_nansafe on "pandas/core/indexes/base.py::Index::astype" method.
Before, it only calls astype_nansafe if this condition was true is_integer_dtype(dtype) and not is_extension_array_dtype(dtype).
Now it calls it every time.
That can be the issue. "pandas/core/indexes/numeric.py::NumericIndex::astype" method did manage to know when to call astype_nansafe and when to call directly super().astype. Now that it's deleted, it always call astype_nansafe.

@jorisvandenbossche
Copy link
Member Author

That commit has been reverted, but that didn't seem to have solved the performance regression.

@jbrockmendel can you look into this?

@jbrockmendel
Copy link
Member

That commit has been reverted, but that didn't seem to have solved the performance regression.

does this mean we can un-revert? its a nice little cleanup

can you look into this?

cant make any promises at the moment, but ill keep a tab open for this so it doesnt fall off the radar screen entirely.

@jorisvandenbossche
Copy link
Member Author

does this mean we can un-revert?

No, it was reverted because of a behaviour regression -> #38607

cant make any promises at the moment, but ill keep a tab open for this so it doesnt fall off the radar screen entirely.

If you can't look at it on the short term, I would propose to revert the change for now, until you have time (otherwise later it might become much more difficult to do that, if it turns out to be needed).
From quickly testing the different potential commits in the range, it seems #38507 is the cause of the slowdown.

@jorisvandenbossche
Copy link
Member Author

Ah, I think the reason is that it is now doing a non-copy ravel, while before the copying ravel ensured we would have good layout for a performant reduction (while this is not the case when the DataFrame is created from a 2D numpy array).

@jbrockmendel
Copy link
Member

Huh, and to think I made a non-copying ravel specifically with perf in mind. you learn something every day

@simonjayhawkins
Copy link
Member

@jbrockmendel @jorisvandenbossche Any idea what the status is here?

@jbrockmendel
Copy link
Member

Fixed by #41911/#41924

@simonjayhawkins
Copy link
Member

Fixed by #41911/#41924

These are for the recent perf regression? This issue is quite old.

@jbrockmendel
Copy link
Member

my mistake, i thought this was the one discussed on the call on wednesday. no idea off the top of my head what the status is here.

@simonjayhawkins
Copy link
Member

yeah, no idea either. will leave the 1.3 milestone for now since this was a regression on master early on when we started 1.3. need further investigation

@jorisvandenbossche
Copy link
Member Author

This is still causing a regression in the reductions (for integer dtypes) compared to 1.2.x.
Just re-ran some benchmarks, and it's giving a 2.5 to 3.5 slowdown for sum and mean.

@jorisvandenbossche
Copy link
Member Author

Now, based on my comment above (#38592 (comment)), this might depend on the memory layout of the DataFrame, and thus on how the DataFrame was created.

But so we maybe should remove some of the extra code added in #38507 to ensure the memory layout is preserved in the ravel() call, since that doesn't seem to be helpful (at least based on this single benchmark ..)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Reduction Operations sum, mean, min, max, etc. Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants