PERF: GroupBy.quantile #51722
Conversation
qs : ndarray[float64_t]
    The quantile values to search for.
starts : ndarray[int64]
Do starts and ends need to be passed separately? On initial review I'm expecting that the groups are all monotonic in each array, so does starts[i] always equal ends[i-1] - 1?
lib.generate_slices returns a pair of ndarrays; I'd have to look at that more closely to comment on the why.
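A minimal numpy illustration of the question above, assuming contiguous sorted labels (this is not pandas internals, and the inclusive/exclusive convention of the real starts/ends is not asserted here):

```python
import numpy as np

# hypothetical sorted group labels, analogous to the ids in the sorted data
labels = np.array([0, 0, 1, 1, 1, 2])
ngroups = 3

# [start, end) offsets of each group within the sorted labels
starts = np.searchsorted(labels, np.arange(ngroups), side="left")   # [0 2 5]
ends = np.searchsorted(labels, np.arange(ngroups), side="right")    # [2 5 6]

# with no empty groups, starts[i] == ends[i - 1]; an empty group breaks that
# invariant, which would be one reason to keep both arrays around
assert (starts[1:] == ends[:-1]).all()
```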
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
I'd lean away from the pure-Python/Cython hybrid approach; I think this looks great as is. Does the Boolean issue (#51424) hold this up with the Cython approach?
I don't think so, no.
@jbrockmendel - I'm good here if you're okay with the trade-offs you mentioned in the OP. If that's the case, can you merge main?
lgtm - could use a line in the whatsnew
lgtm
Thanks @jbrockmendel
The existing implementation does a lexsort over (values, ids) for each column, which gets very expensive. By using grouper._get_splitter()._sorted_data, we only sort by ids once, then cheaply iterate over groups and do group-by-group argsorts. This is roughly equivalent to the sketch below.
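A rough stand-in for the idea (not the PR's actual code), assuming plain numpy arrays of values and already-factorized integer group ids:

```python
import numpy as np

def group_quantile_sketch(values, ids, qs):
    # sort by group ids once (the analogue of _sorted_data)
    order = np.argsort(ids, kind="stable")
    sorted_vals = values[order]
    sorted_ids = ids[order]

    ngroups = int(ids.max()) + 1
    # [start, end) offsets of each group within the sorted data
    bounds = np.arange(ngroups)
    starts = np.searchsorted(sorted_ids, bounds, side="left")
    ends = np.searchsorted(sorted_ids, bounds, side="right")

    out = np.empty((ngroups, len(qs)), dtype=np.float64)
    for i in range(ngroups):
        grp = sorted_vals[starts[i]:ends[i]]
        # a cheap per-group sort/quantile instead of one big lexsort over (values, ids)
        out[i] = np.quantile(grp, qs) if len(grp) else np.nan
    return out
```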
I tried an implementation that used _python_apply_general and ripped out the cython group_quantile entirely, and found it had some rough edges with a) axis=1 and b) dtypes where GroupBy.quantile behaves differently from DataFrame.quantile (xref #51424); it also performed poorly as ngroups becomes large. With a large number of rows, though, it performs better than this PR. I speculate that np.percentile is doing something more efficiently than our cython group_quantile, but I haven't figured it out yet.
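For reference, a simplified sketch of that pure-Python flavor (hypothetical; it ignores the axis=1 and dtype edge cases mentioned above and assumes numeric columns):

```python
import pandas as pd

def quantile_via_apply(df, by, qs):
    # compute each group's quantiles with DataFrame.quantile instead of the
    # cython group_quantile kernel; `by` is a single column label here
    return df.groupby(by).apply(lambda g: g.drop(columns=by).quantile(qs))
```

Here qs would be a list of quantiles, e.g. [0.25, 0.5, 0.75], and the result comes back indexed by (group key, quantile).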
Some timings!
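The exact benchmark script isn't reproduced here; a minimal harness along these lines (the sizes and quantiles are assumptions) gives the flavor:

```python
import numpy as np
import pandas as pd
from time import perf_counter

# hypothetical grid; the PR's actual sizes are not reproduced here
for nrows in (10_000, 1_000_000):
    for ngroups in (100, 10_000):
        df = pd.DataFrame(
            {
                "key": np.random.randint(0, ngroups, size=nrows),
                "val": np.random.randn(nrows),
            }
        )
        gb = df.groupby("key")
        t0 = perf_counter()
        gb.quantile([0.25, 0.5, 0.75])
        print(f"nrows={nrows:>9,} ngroups={ngroups:>6,} {perf_counter() - t0:.3f}s")
```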
I did this for each of main, this PR, and the pure-Python implementation described above. The results:
On main I had to interrupt (with ctrl-z, not ctrl-c!) the largest cases.
This PR outperforms main in all cases. The pure-Python version outperforms this PR in small-ngroups and large-nrows cases, but suffers pretty dramatically in the opposite cases.
Potential caveats:
Potential downsides vs main:
We could plausibly keep multiple implementations and dispatch based on sizes, but I'm not sure groupby.quantile is important enough to really merit that. So which version to use really comes down to which sizes of cases we think are the most common.
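Purely as an illustration of that idea (nothing like this is in the PR), a size-based dispatch could look like:

```python
def choose_quantile_impl(nrows: int, ngroups: int) -> str:
    # hypothetical thresholds; real cutoffs would have to come from benchmarking
    if ngroups <= 100 and nrows >= 1_000_000:
        # few, large groups: the pure-Python / np.percentile path tends to win
        return "python_apply"
    # many groups: the cython group_quantile path scales better
    return "cython_group_quantile"
```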