-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF/REF: groupby sample #42233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF/REF: groupby sample #42233
Conversation
pandas/core/algorithms.py
Outdated
# ------ # | ||
|
||
|
||
def preprocess_weights(obj: FrameOrSeries, weights, axis: int) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was pretty much moved as is, with the only change being to convert to an ndarray
earlier for better performance on later validation steps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im really not a fan of Series/DataFrame methods living in this file. are there any other natural homes for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
util/_validators
is almost a good fit, but while this validates, it also returns a modified input. Could go in core/common
? But still learning where everything lives, happy to move if anyone has a better location
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this function is the only one that really depends on Series/DataFrame; could it just stay inside the NDFrame method? the others could go in e.g. core.array_algos.sample
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be nicer, but the issue is that the weights processing also needs to be called from groupby sample as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be to implement something like a core.methods directory for Series/DataFrame methods that have been refactored to their own files (e.g. describe). I think algos.SelectN might make sense in something like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that idea - have moved to core/sample.py (on same level as describe) for now. If others like this organization, I can follow up moving describe and sample (and maybe others like you mention) to core/methods.
pandas/core/algorithms.py
Outdated
return weights | ||
|
||
|
||
def process_sampling_size( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conditionals here were refactored for mypy (but IMO clearer to follow as well)
very nice @mzeitlin11 |
Benchmarks: