[POC] PERF: 1d version of the cython grouped aggregation algorithms #39861
Conversation
this is a really strange result - i wouldn't expect any difference here. my guess is that the 2d result array memory layout is unfavorable compared to the 1d
The inner part of the function for the 2D version is:
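A simplified sketch of that inner part (following `group_mean` in `groupby.pyx`; `N` is the number of rows, `K` the number of columns, and `counts`/`nobs`/`sumx` the typed accumulator arrays of the function):

```cython
for i in range(N):
    lab = labels[i]
    if lab < 0:
        continue
    counts[lab] += 1
    for j in range(K):
        val = values[i, j]
        if val == val:  # skip NaN
            nobs[lab, j] += 1
            sumx[lab, j] += val
```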
while for the 1D version it is:
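Again a sketch; the column loop is gone and all indexing is 1D:

```cython
for i in range(N):
    lab = labels[i]
    if lab < 0:
        continue
    counts[lab] += 1
    val = values[i]
    if val == val:  # skip NaN
        nobs[lab] += 1
        sumx[lab] += val
```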
So all that differs is dropping the inner loop over the columns and indexing into 1D instead of 2D arrays.
That was my first guess as well (as, for example, for the plain (non-grouped) reductions this matters), but from testing it with the example above, it doesn't seem to matter in this case. Explicitly testing the different memory layouts with the 2D algorithm:
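Something along these lines (a sketch; `df`, `codes` and `uniques` as set up in the top post):

```python
result = np.empty((len(uniques), 1), dtype="float64")
result.fill(np.nan)
counts = np.zeros(len(uniques), dtype=np.int64)
values = df[['a']]._mgr.blocks[0].values.T  # shape (N, 1)

# same algorithm, C-ordered vs Fortran-ordered input
%timeit pd._libs.groupby.group_mean_float64(result, counts, np.ascontiguousarray(values), codes, min_count=-1)
%timeit pd._libs.groupby.group_mean_float64(result, counts, np.asfortranarray(values), codes, min_count=-1)
```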
And additionally, explicitly testing how it would work for a 1D column that needs to be reshaped to pass into the algorithm:
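Again a sketch:

```python
values_1d = df['a'].values  # shape (N,)
%timeit pd._libs.groupby.group_mean_float64(result, counts, values_1d.reshape(-1, 1), codes, min_count=-1)
```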
So those all don't really show any significant difference.
And to be explicit: I am testing the 2D (vs 1D) algorithm here for a single column (so an array of shape (N, 1)).
i think if u reverse the ordering, eg `j, lab`, this will have the same perf (need to transpose when creating the memory) as the indexing is the costly part here
Exactly, and I reduce the amount of indexing (indexing into 2D vs 1D)
Not fully sure what you mean, but I did verify the result for all cases to be identical with the actual groupby result.
Ah, I think I understand now. You didn't mean something was wrong in the current example, but that if the 2D algorithm were written to work on data transposed compared to what it is now, it would be faster? Currently, the values are passed in as an array of shape (N, 1). (I don't directly see how that would make a difference, though, as we are still indexing into a 2D array to get a single scalar.)
yea exactly - try to fill in reversed indexing order to see if that makes the 2d case better
the issue is the cost of the indexing is dependent on the memory layout
and i think the 2d case would perform better if we allocate 1d contiguous blocks, as these will be cache performant, and since you're then aggregating you don't need to concat
Will try that. Maybe that it indeed has significant enough impact for the 1D-as-2D array case.
I gave it a try (see the last commit, please check if this is what you meant), but unfortunately I am not seeing any improvement. Using the same example from above, first the 2D version as currently on master (testing on a single column as (N, 1) array):

```python
out_shape = (len(uniques), 1)
out_dtype = "float64"
values = df[['a']]._mgr.blocks[0].values.T
result = np.empty(out_shape, out_dtype)
result.fill(np.nan)
counts = np.zeros(len(uniques), dtype=np.int64)
pd._libs.groupby.group_mean_float64(result, counts, values, codes, min_count=-1)
```

and then the "transposed" version (so without transposing the block values, and thus the array passed into the function has shape (1, N) instead of (N, 1)):

```python
out_shape_transposed = (1, len(uniques))
values_transposed = df[['a']]._mgr.blocks[0].values
result_transposed = np.empty(out_shape_transposed, out_dtype)
result_transposed.fill(np.nan)
counts = np.zeros(len(uniques), dtype=np.int64)
pd._libs.groupby.group_mean_transposed_float64(result_transposed, counts, values_transposed, codes, min_count=-1)
```

Ensuring we still get the same correct result:
And now timing it:
So the 2D version with reversed indexing (transposed input array) is giving more or less the same speed as the 2D version currently in pandas.
@jreback my suspicion is that we don't see improvement because it's mostly the array setitem that is costly (at least relative to the array getitem), and for setitem it seems to make less difference what the memory layout is of the array we are setting values to.
ok my point is that there is something fundamentally different here in indexing - i'm this is the speed up - just have to repro
> i'm this is the speed up

Can you clarify this part of your sentence ;)
A review comment on this hunk of the transposed version:

```python
counts[lab] += 1
for j in range(K):
    val = values[j, i]
```
would it make a difference if we defined `temp_values = values[:, i]` (or `values[i]` in the non-transposed version) outside the j-loop and then set `val = temp_values[j]` inside the loop?
definitely seems weird that this would have a big perf impact
what I mean (and this is a theory) is that the 2d array (even though it's a single column) is created in a strided way, and so indexing into it is more expensive than into a single contiguous block. IOW i would expect that creating the 2D array in a transposed way (and changing the indexing to access the last dim first) would negate any indexing penalty vs indexing a single contiguous array.
another possibility: the memoryview declarations don't include anything about the strides, might be optimizations available there
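For illustration, the kind of declaration change meant here (a sketch; the real `groupby.pyx` signatures take more arguments):

```cython
from numpy cimport float64_t, intp_t

# plain 2D memoryviews: strides are fully dynamic, so every values[i, j]
# computes i * strides[0] + j * strides[1] at runtime
def group_mean(float64_t[:, :] out, float64_t[:, :] values,
               const intp_t[:] labels):
    ...

# with `::1` the innermost stride is known to be 1 at compile time,
# so Cython can emit simpler index arithmetic
def group_mean_contig(float64_t[:, ::1] out, const float64_t[:, ::1] values,
                      const intp_t[:] labels):
    ...
```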
Just tried adding the contiguity to the declaration, and the times from the OP improved:
is this on the original?
The exact code being profiled is based on the OP, where the branch's diff from master is the contiguity change to the memoryview declarations described above.
great, so this is a memory layout issue - happy to take this improvement then
Thanks Brock! Indeed, specifying the strides makes Cython generate simpler (faster) code (and in your diff it's only specified for some of the arguments). Now, for me, adding the same trick to the 1D version of the algorithm also gives a speed-up. And so the 1D version still gives a similar speed-up as before, for me (around 30%).
@jorisvandenbossche i still don't understand where this benefit is coming from - can u be more specific? again, you are now indexing 1d vs 2d right? is that the difference?
All the code is here: the algorithm is in the diff, the code that I ran is above. If you don't believe me, you can test it yourself. It's quite simple: indexing a 1D array is cheaper than indexing a 2D array (i.e. `values[i]` vs `values[i, 0]`).
@scoder the timings here seem very weird to 2/3 of this thread's participants. do they make sense to you?
I don't think it's worth @scoder's time to go through our long discussion and the complex example. So here is a simplified reproducer that shows the difference:
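Something like the following (a reconstruction of the kind of reproducer described; the function names are made up):

```cython
# cython: boundscheck=False, wraparound=False
from numpy cimport float64_t

def getset_2d(float64_t[:, :] values, float64_t[:, :] out):
    cdef Py_ssize_t i
    for i in range(values.shape[0]):
        out[i, 0] = values[i, 0]

def getset_1d(float64_t[:] values, float64_t[:] out):
    cdef Py_ssize_t i
    for i in range(values.shape[0]):
        out[i] = values[i]
```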
and then testing and timing it with:
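Also a sketch, e.g. in IPython after compiling the above:

```python
import numpy as np

arr = np.random.randn(1_000_000)
out = np.empty_like(arr)

# same data and same number of elements; only the dimensionality differs
%timeit getset_2d(arr.reshape(-1, 1), out.reshape(-1, 1))
%timeit getset_1d(arr, out)
```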
So indexing into a 1D array is faster than indexing into a 2D array (although it is in practice just a 1D array reshaped as (N, 1), and so the same number of elements are get/set). Is this to be expected?
One thing that jumps out is that you're mixing the old buffer syntax (`np.ndarray[...]`) with the newer typed memoryviews.
Also seems worth testing if PGO makes a difference here. If the buffer index calculation code is the issue, PGO tuning might be able to make up for its disadvantages. At least somewhat. |
Yes, in my small example I used the old buffer syntax. But, surprisingly, quickly changing that to typed memoryviews shows the same difference:
https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/groupby.pyx#L650

With the exception of the …, #39861 (comment) tries making the memoryview declarations more specific and gets a decent speedup. Broadly consistent with Joris's results, I'm finding that if I further change the signature (#39861 (comment)) to replace the plain declarations with contiguous (`::1`) ones (using …)
Trying on a 2-column case (groupby.pyx edits the same as #39861 (comment)):
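For example, with a setup like (hypothetical sizes):

```python
import numpy as np
import pandas as pd

N = 1_000_000
df = pd.DataFrame({
    "key": np.random.randint(0, 1000, N),
    "a": np.random.randn(N),
    "b": np.random.randn(N),
})
%timeit df.groupby("key").mean()
```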
2x speedup, definitely worth implementing. Still not clear why the K=1 case has the perf penalty vs the 1D version.
btw why is the cast necessary? i thought memoryviews have a shape attribute (searching for "shape" in https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html and looking at the last result)
Also with a 100x larger array, I see a similar difference: 130ms for the 2D function vs 110ms for the 1D function (using the dummy function getting/setting values of an array from #39861 (comment)).
Is specifying … Can someone try to explain why you don't expect the 1D function to be faster than the 2D function?
incrementing the index should be the same number of strides in both cases
How is that the same number of strides? To be very explicit, this is what cython generates for `values[i]` vs `values[i, 0]`:
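Roughly (a sketch of the emitted C, shown as comments; the real generated code has extra casts and checks):

```cython
# 1D memoryview getitem, val = values[i], becomes roughly:
#   val = *(double *)(values.data + i * values.strides[0])
#
# 2D memoryview getitem, val = values[i, 0], becomes roughly:
#   val = *(double *)(values.data + i * values.strides[0]
#                                 + 0 * values.strides[1])
```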
Because that information is only available at runtime, the C compiler can't figure it out statically.
OK, I was under the impression the compiler could figure that out. |
would it make sense to have the 2D functions check if they are single-column and if so dispatch to the 1D version?
yep!
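A sketch of what that dispatch could look like (hypothetical names; the 1D kernel stands for the experimental one from this PR):

```cython
from numpy cimport float64_t, int64_t, intp_t

def group_mean(float64_t[:, :] out, int64_t[:] counts,
               float64_t[:, :] values, const intp_t[:] labels):
    if values.shape[1] == 1:
        # single column: use the cheaper 1D kernel on 1D views
        group_mean_1d(out[:, 0], counts, values[:, 0], labels)
    else:
        group_mean_2d(out, counts, values, labels)
```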
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
@jorisvandenbossche gentle ping
Do we actually want to add such 1D versions in the short term? As I said in the top post, this would give a lot of duplication (I don't think it is easily template-able to still share the algorithm), so I mainly did this as an experiment to check what the benefit would be / open the discussion. (I am personally fine with the duplication in the idea that it is only temporary, but I have the impression that you both want to limit that as much as possible.)
yea would rather limit code duplication like this
Agreed
For now, this is just a small example to open discussion about how to tackle this topic (not meant for merging).
Context: for a columnar dataframe (with ArrayManager), we would also perform the groupby operations column by column (instead of on the 2D blocks). Most of our custom cython aggregation algorithms are currently written for 2D arrays (e.g. `group_add`, `group_mean`, etc.; I think only the fill/shift/any/all are 1D). But by being 2D, they carry some performance penalty when used on a single column.

Experiment: I copied the implementation for "mean" and made a 1D version of it (exactly the same code, but removing one `for` loop over the columns, and simplifying the indices to index into 1D arrays instead of 2D arrays). And with that I did some timings to see the impact:
Using the existing 2D algorithm on a single column vs the new 1D version, ensuring that both give the same result, and timing both:
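Roughly, the kind of session behind this (a sketch; `group_mean_1d_float64` is a made-up name for the experimental 1D function in this branch):

```python
import numpy as np
import pandas as pd

N = 1_000_000
df = pd.DataFrame({"key": np.random.randint(0, 1000, N), "a": np.random.randn(N)})
codes, uniques = pd.factorize(df["key"])

# 2D setup: the single column as an (N, 1) array
values = df[["a"]]._mgr.blocks[0].values.T
result = np.full((len(uniques), 1), np.nan)
counts = np.zeros(len(uniques), dtype=np.int64)

# 1D setup: the same column as an (N,) array
values_1d = df["a"].values
result_1d = np.full(len(uniques), np.nan)
counts_1d = np.zeros(len(uniques), dtype=np.int64)

pd._libs.groupby.group_mean_float64(result, counts, values, codes, min_count=-1)
pd._libs.groupby.group_mean_1d_float64(result_1d, counts_1d, values_1d, codes, min_count=-1)
np.testing.assert_allclose(result[:, 0], result_1d)

%timeit pd._libs.groupby.group_mean_float64(result, counts, values, codes, min_count=-1)
%timeit pd._libs.groupby.group_mean_1d_float64(result_1d, counts_1d, values_1d, codes, min_count=-1)
```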
So for a single column, the 1D version is ca 30-40% faster (increasing the size of the array 10x, I also get 40% faster).
Some notes:

- In the `df.groupby("key").mean()` example above, the `group_mean` step takes ca 1.5x the time of the factorization step (that's for multiple columns, though; when profiling it for grouping a single column, the ratio is the other way around: ca 38% in group_mean and ca 62% in factorization).

xref #39146