Skip to content

PERF/REF: groupby sample #42233

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jul 2, 2021
Merged

PERF/REF: groupby sample #42233

merged 15 commits into from
Jul 2, 2021

Conversation

mzeitlin11
Copy link
Member

Benchmarks:


       before           after         ratio
     [d61ace50]       [7fb839ce]
     <master~4>       <gb_sample>
-         116±5ms         29.8±3ms     0.26  groupby.Sample.time_sample
-        808±20ms         75.9±1ms     0.09  groupby.Sample.time_sample_weights

@mzeitlin11 mzeitlin11 added Groupby Performance Memory or execution speed performance Refactor Internal refactoring of code labels Jun 25, 2021
# ------ #


def preprocess_weights(obj: FrameOrSeries, weights, axis: int) -> np.ndarray:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was pretty much moved as is, with the only change being to convert to an ndarray earlier for better performance on later validation steps

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im really not a fan of Series/DataFrame methods living in this file. are there any other natural homes for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

util/_validators is almost a good fit, but while this validates, it also returns a modified input. Could go in core/common? But still learning where everything lives, happy to move if anyone has a better location

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this function is the only one that really depends on Series/DataFrame; could it just stay inside the NDFrame method? the others could go in e.g. core.array_algos.sample

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be nicer, but the issue is that the weights processing also needs to be called from groupby sample as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to implement something like a core.methods directory for Series/DataFrame methods that have been refactored to their own files (e.g. describe). I think algos.SelectN might make sense in something like that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea - have moved to core/sample.py (on same level as describe) for now. If others like this organization, I can follow up moving describe and sample (and maybe others like you mention) to core/methods.

return weights


def process_sampling_size(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditionals here were refactored for mypy (but IMO clearer to follow as well)

@jreback jreback added this to the 1.4 milestone Jul 2, 2021
@jreback jreback merged commit fee2b87 into pandas-dev:master Jul 2, 2021
@jreback
Copy link
Contributor

jreback commented Jul 2, 2021

very nice @mzeitlin11

@mzeitlin11 mzeitlin11 deleted the gb_sample branch July 2, 2021 00:07
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Performance Memory or execution speed performance Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REF: Move sampling logic into algorithms.py
3 participants