PERF: avoid creating many Series in apply_standard #34909
Conversation
```python
mgr = ser._mgr
blk = mgr.blocks[0]

for (arr, name) in zip(values, self.index):
```
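Pieced together from the diff excerpt above, the trick is roughly the following (a minimal, self-contained sketch, not the PR's actual code; it reaches into private internals like `_mgr` and `.blocks`, which are not stable API, and the `values`/`results` names here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 3), columns=["a", "b", "c"])
values = df.values.T  # one 1-D array per column

# Create a single template Series once, then swap out the array backing
# its block on each iteration, instead of constructing a brand-new
# Series per column. The applied function still sees an ordinary Series.
ser = df.iloc[:, 0].copy()
mgr = ser._mgr
blk = mgr.blocks[0]

results = {}
for arr, name in zip(values, df.columns):
    blk.values = arr  # re-point the block at the next column's data
    ser.name = name
    results[name] = ser.sum()
```

The flip side of reusing one object is that the yielded Series is mutated between calls, so a user function that holds on to it (or its values) will observe later columns' data.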
Can you push this to an internals method instead?
I'm looking at that now. The other place where this pattern could be really useful is in groupby.ops, but it's tougher there.
Sure, exposing an API for this would be OK as well (e.g. another internals method).
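For reference, such an internals method might look something like this (a hedged sketch with a hypothetical name, `set_values`; nothing in this thread pins down an actual signature):

```python
import numpy as np


class SingleBlockManagerSketch:
    """Toy stand-in for pandas' SingleBlockManager, for illustration only."""

    def __init__(self, block_values: np.ndarray):
        self.block_values = block_values

    def set_values(self, values: np.ndarray) -> None:
        # Swap the backing array in place, so callers like apply_standard
        # never have to reach into `.blocks` themselves.
        self.block_values = values
```

Routing the swap through a method like this would keep apply code from depending on block layout directly.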
I'm still troubleshooting the groupby.ops usage; I'd like to punt on making this an internals method for the time being.
Does this fix #34506?
The groupby.ops analogue of this is continuing to be a PITA, so I'm exploring other avenues there (xref #34982). This should be good to go, though, and as a follow-up we can rip out libreduction.Reducer.
Thanks, yeah, incremental refactors are good.
This commit caused some regressions, e.g. https://pandas.pydata.org/speed/pandas/#frame_methods.Apply.time_apply_np_mean. Was that expected? (Above you mention this "achieves parity on that asv.")
Ah, I suppose this is just fixed by #35166.
- `black pandas`
- `git diff upstream/master -u -- "*.py" | flake8 --diff`
This avoids going through perennial-problem-causing libreduction code (xref #34014, #34080) and instead does the same trick in python-space by re-assigning block.values instead of creating new Series objects.
If we avoid libreduction but don't do this optimization, the most-affected asv is `time_apply_ref_by_name`, which clocks in at 6.92x slower. This branch achieves parity on that asv.

ATM I'm still getting 4 test failures locally and need to troubleshoot.

Update: passing.
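For anyone trying to reproduce the numbers, here is a rough analogue of the benchmarks discussed (the shapes and applied functions are assumptions, not the actual asv source):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 100))

# Analogue of frame_methods.Apply.time_apply_np_mean: a cheap reduction
# per column, where per-Series construction overhead dominates.
df.apply(np.mean)

# Analogue of time_apply_ref_by_name: the applied function accesses
# values by label, one call per row.
df2 = pd.DataFrame({"A": np.random.randn(1000), "B": np.random.randn(1000)})
df2.apply(lambda row: row["A"] + row["B"], axis=1)
```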