Skip to content

CLN: group_quantile #43489

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

mzeitlin11
Copy link
Member

No description provided.

@mzeitlin11 mzeitlin11 added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Clean Groupby labels Sep 10, 2021
# with the calculations.
labels_for_lexsort = np.where(labels == -1, labels.max() + 1, labels)
else:
labels_for_lexsort = labels
Copy link
Member Author

@mzeitlin11 mzeitlin11 Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This labels.any() conditional is strange - special casing the all 0 label case (which seems rare and specific) when we really care about -1 labels. Regardless, since we already check for -1 in loop above, no need to check with any and where here

@@ -1420,7 +1424,7 @@ cdef cummin_max(groupby_t[:, ::1] out,
cdef:
Py_ssize_t i, j, N, K
groupby_t val, mval, na_val
uint8_t[:, ::1] seen_na
uint8_t[:, ::1] seen_na = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this because previously seen_na was treated as potentially uninitialized, leading to an emission of extra code whenever it was accessed, for example
if (unlikely(!__pyx_v_seen_na.memview)) { __Pyx_RaiseUnboundMemoryviewSliceNogil("seen_na"); __PYX_ERR(0, 1462, __pyx_L7_error)

order = (values, labels_for_lexsort)
sort_arr = np.lexsort(order).astype(np.int64, copy=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW the big perf gain comes from moving this call out of cython and not repeating it for each column

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, makes sense. If you're planning on doing that, I'll just close this for now and collect the other cleanups elsewhere

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually suggesting you do that in this PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sounds good :) I'll take a look at that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Clean Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants