-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[ArrayManager] Add libreduction frame Slider for ArrayManager #40171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ArrayManager] Add libreduction frame Slider for ArrayManager #40171
Conversation
@@ -491,3 +505,83 @@ cdef class BlockSlider: | |||
mgr.blocks = self.blocks | |||
mgr._blklocs = self.orig_blklocs | |||
mgr._blknos = self.orig_blknos | |||
|
|||
|
|||
cdef class ArraySlider(FrameSlider): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i doubt it will be big, but there might be some speedups available py doing @cython.final
on both of BlockSlider and ArraySlider
# GH#35417 attributes we need to restore at each step in case | ||
# the function modified them. | ||
self.orig_arrays = self.dummy._mgr.arrays | ||
self.arrays = list(self.dummy._mgr.arrays) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BlockSlider defines self.blk_values = [block.values for block in self.dummy._mgr.blocks]
which i think can now use the arrays property to match self.arrays here
for i, arr in enumerate(self.arrays): | ||
self.base_ptrs[i] = (<ndarray>arr).data | ||
|
||
def __dealloc__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this is the same in both classes; put it in the base class?
skipping a bunch of commenting in-line: i think a bunch more of this can be shared between the classes |
Have you happened to profile this? IIRC from when ive tried to remove fast_apply the big perf hit came in the BlockManager/Block/Index constructors, at least one of which I'd expect ArrayManager to have an easier time with. |
Yes, I profiled it (because I was first trying to avoid doing this PR ;)), and indeed a significant part of the overhead comes from the constructors. But I am not sure this can be much improved. Eg the Additionally, part of the overhead comes from slicing the Index, which turns out to be quite expensive. This is something that can be improved (my idea was to add an Another part of overhead comes from checking whether the result is "like indexed" (getting the axes, comparing them) at: pandas/pandas/core/groupby/ops.py Lines 260 to 263 in 7c59101
There are certainly some percentages here and there to shave off in the python implementation, but I think the cython version will always be faster. But we should maybe discuss how much slowdown we want to accept to get rid of the libreduction fast_apply altogether (but would like to keep that separate from ArrayManager performance evaluation). |
Mind posting the results?
I like this idea. IIUC it would allow us to go through the fastpath independent of what type of Index we have (and get rid of _index_data)
Sure. I'll throw out a number: if we could get worst-cast down to within about 3x and most-cases within 1.5x, I'd be open to removing the cython paths. They are a disproportionate producer of headaches. (From #36459 (possibly out of date) "entirely disabling the cython path leads to 4 currently-xfailed tests passing") Besides which, if/when cython3 becomes available, we may have to get rid of these anyway. |
I didn't yet find an easy way to post flamegraphs from snakeviz... But so it's just a profile of the groupby call in this snippet (taken from the benchmark case that was affected by using the python fallback for ArrayManager): N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
{
"key": labels,
"key2": labels2,
"value1": np.random.randn(N),
"value2": ["foo", "bar", "baz", "qux"] * (N // 4),
}
)
df_am = df._as_manager("array")
df_am.groupby("key").apply(lambda x: 1) I should note that this specific benchmark has a resulting (aggregated) frame with about 2000 rows, starting from a dataframe with And with this rather special (IMO) case, I only see a 3-4x slowdown (this is comparing BM with fast_apply vs AM with fallback, but my guess is that BM with fallback will be similar):
but when increasing
(so that's already close to the numbers you cite ;)) |
I was thinking |
Actually, in this second example with bigger data (
|
yeah ideally if you can create a common base class here. of course if we can blow awa all the cython even better :-> |
@jorisvandenbossche this is just perf, not a blocker for getting AM fully functional? if so, mind putting it on the backburner for a day or two while i try out my get-rid-of-libreduction idea? |
Using the example #40171 (comment) (which IIRC is based on the asv that was most-affected by disabling fast_apply last time I looked getting rid of libreduction)
In master, where this goes through fast_apply:
If I just disable fast_apply but leave everything else as in master:
3.552 seconds in BlockManager.get_slice, which I think we can cut down significantly by, among other things, making the BlockManager/Block signatures stricter and skip validation in their So the question: how much overhead are we willing to trade for getting rid of libreduction? |
Indeed, just performance. No problem in first exploring the "get rid of reduce". Can you move your comment above to a new issue to discuss this? |
sure |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
@jorisvandenbossche can we use the Mothballed pattern here? |
Yes, closing this awaiting any decision on #40263 |
xref #39146 and follow-up on #40050 (comment)
While getting groupby working for ArrayManager in #40050, I didn't yet use
fast_apply
/libreduction.apply_frame_axis0
because that relies on BlockManager internals. But,libreduction
is faster than the python fallback (luckily, otherwise we would have it for nothing ;)).So this PR is adding a version of the
BlockSlider
but then for a frame backed by an ArrayManager.I didn't combine
BlockSlider
/ArraySlider
in a single class for now (first wanted to get it working). I assume it would be "possible" with some gymnastics (there are several details that are different, such as different strides/shape index being used to move, access toblklocs
/blklocs
, etc).So personally I would prefer keeping it as a separate class (certainly given that we don't intend to keep both long term, I think)
For benchmarks, we have
groupby.Apply.time_scalar_function_single/multi_col
that exercises this fast path, and this PR brings the ArrayManager code path back on par with the BlockManager version (before this PR, that benchmark saw a 3x slowdown because of using the python fallback).