
[ArrayManager] Add libreduction frame Slider for ArrayManager #40171


Conversation

@jorisvandenbossche (Member) commented Mar 2, 2021

xref #39146 and follow-up on #40050 (comment)

While getting groupby working for ArrayManager in #40050, I didn't yet use fast_apply / libreduction.apply_frame_axis0 because that relies on BlockManager internals. But, libreduction is faster than the python fallback (luckily, otherwise we would have it for nothing ;)).

So this PR is adding a version of the BlockSlider but then for a frame backed by an ArrayManager.
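The sliding trick itself can be illustrated with plain numpy (a minimal sketch of the idea only, not the actual Cython implementation):

import numpy as np

# Each group's rows can be exposed as zero-copy views into the parent
# arrays instead of materializing a fresh copy per group.
values = np.arange(10, dtype=np.float64)

chunk = values[2:5]           # a slice of an ndarray is a view ...
assert chunk.base is values   # ... sharing memory with the parent

# The Cython sliders go further: they keep one dummy frame alive and, for
# every group, move the underlying data pointers and patch the shape in
# place, so not even new view objects get created per group.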

I didn't combine BlockSlider / ArraySlider into a single class for now (I first wanted to get it working). I assume it would be "possible" with some gymnastics (several details differ, such as the strides/shape index used to move, access to blklocs/blknos, etc.).
So personally I would prefer keeping it as a separate class (certainly given that we don't intend to keep both long term, I think).

For benchmarks, we have groupby.Apply.time_scalar_function_single/multi_col that exercises this fast path, and this PR brings the ArrayManager code path back on par with the BlockManager version (before this PR, that benchmark saw a 3x slowdown because of using the python fallback).

@jorisvandenbossche added the Groupby and Internals labels Mar 2, 2021
@@ -491,3 +505,83 @@ cdef class BlockSlider:
mgr.blocks = self.blocks
mgr._blklocs = self.orig_blklocs
mgr._blknos = self.orig_blknos


cdef class ArraySlider(FrameSlider):
Member (review comment):
i doubt it will be big, but there might be some speedups available by doing @cython.final on both BlockSlider and ArraySlider
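For reference, a minimal sketch of what that would look like (class bodies elided; @cython.final tells Cython the type won't be subclassed, so method calls on it can be devirtualized):

cimport cython

@cython.final
cdef class BlockSlider:
    # ... existing body unchanged ...
    pass

@cython.final
cdef class ArraySlider(FrameSlider):
    # ... existing body unchanged ...
    pass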

# GH#35417 attributes we need to restore at each step in case
# the function modified them.
self.orig_arrays = self.dummy._mgr.arrays
self.arrays = list(self.dummy._mgr.arrays)
Member (review comment):
BlockSlider defines self.blk_values = [block.values for block in self.dummy._mgr.blocks] which i think can now use the arrays property to match self.arrays here
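I.e., roughly (a sketch of the suggested change, assuming BlockManager.arrays returns the block values in the same order):

# before (current BlockSlider.__init__)
self.blk_values = [block.values for block in self.dummy._mgr.blocks]

# after, mirroring self.arrays in ArraySlider
self.blk_values = list(self.dummy._mgr.arrays)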

for i, arr in enumerate(self.arrays):
    self.base_ptrs[i] = (<ndarray>arr).data

def __dealloc__(self):
Member (review comment):
looks like this is the same in both classes; put it in the base class?
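E.g. a sketch of what hoisting it might look like (hypothetical; assumes both subclasses only need to release the malloc'd pointer array):

from libc.stdlib cimport free

cdef class FrameSlider:
    cdef:
        char **base_ptrs

    def __dealloc__(self):
        # shared teardown: release the pointer array that each
        # subclass's __init__ allocated
        free(self.base_ptrs)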

@jbrockmendel (Member):
skipping a bunch of commenting in-line: i think a bunch more of this can be shared between the classes

@jbrockmendel (Member):
> For benchmarks, we have groupby.Apply.time_scalar_function_single/multi_col that exercises this fast path, and this PR brings the ArrayManager code path back on par with the BlockManager version (before this PR, that benchmark saw a 3x slowdown because of using the python fallback).

Have you happened to profile this? IIRC from when I tried to remove fast_apply, the big perf hit came in the BlockManager/Block/Index constructors, at least one of which I'd expect ArrayManager to have an easier time with.

@jorisvandenbossche (Member, Author):
Yes, I profiled it (because I was first trying to avoid doing this PR ;)), and indeed a significant part of the overhead comes from the constructors. But I am not sure this can be improved much. E.g. the _chop method in groupby/ops.py is already constructing a DataFrame from a manager, which is already a fastpath.

Additionally, part of the overhead comes from slicing the Index, which turns out to be quite expensive. This is something that can be improved (my idea was to add an Index._slice() method that can be used in Manager.get_slice(), because the main Index.__getitem__ is harder to optimize in general).
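A rough sketch of that idea (hypothetical method, not in pandas; assumes _simple_new can rewrap the sliced backing data without re-validation):

def _slice(self, slobj: slice):
    # Fastpath: slice the backing ndarray directly and rewrap it via
    # _simple_new, skipping the general __getitem__ dispatch/validation.
    res = self._data[slobj]
    return type(self)._simple_new(res, name=self._name)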

Another part of overhead comes from checking whether the result is "like indexed" (getting the axes, comparing them) at:

group_axes = group.axes
res = f(group)
if not _is_indexed_like(res, group_axes, axis):
    mutated = True

There are certainly some percentages here and there to shave off in the python implementation, but I think the cython version will always be faster. We should maybe discuss how much slowdown we want to accept to get rid of the libreduction fast_apply altogether (but I would like to keep that separate from the ArrayManager performance evaluation).

@jbrockmendel (Member):
> Yes, I profiled it

Mind posting the results?

> my idea was to add an Index._slice() method that can be used in Manager.get_slice(), because the main Index.__getitem__ is harder to optimize in general

I like this idea. IIUC it would allow us to go through the fastpath independent of what type of Index we have (and get rid of _index_data)

> But we should maybe discuss how much slowdown we want to accept to get rid of the libreduction fast_apply altogether

Sure. I'll throw out a number: if we could get the worst case down to within about 3x and most cases within 1.5x, I'd be open to removing the cython paths. They are a disproportionate producer of headaches. (From #36459 (possibly out of date): "entirely disabling the cython path leads to 4 currently-xfailed tests passing".)

Besides which, if/when cython3 becomes available, we may have to get rid of these anyway.

@jorisvandenbossche (Member, Author):
> Mind posting the results?

I didn't yet find an easy way to post flamegraphs from snakeviz... But so it's just a profile of the groupby call in this snippet (taken from the benchmark case that was affected by using the python fallback for ArrayManager):

import numpy as np
from pandas import DataFrame

N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)
df_am = df._as_manager("array")

df_am.groupby("key").apply(lambda x: 1)

I should note that this specific benchmark produces a resulting (aggregated) frame with about 2000 rows, starting from a dataframe with N = 10 ** 4, which means that on average the intermediate DataFrame that gets constructed has only 5 rows. That's pretty small (and in addition the benchmark uses a dummy aggregation function that takes hardly any time itself).

And with this rather special (IMO) case, I only see a 3-4x slowdown (this is comparing BM with fast_apply vs AM with fallback, but my guess is that BM with fallback will be similar):

In [2]: %timeit df.groupby("key").apply(lambda x: 1)
6.65 ms ± 854 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit df_am.groupby("key").apply(lambda x: 1)
21.8 ms ± 998 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

but when increasing N to 10 ** 6, it becomes less than a 2x slowdown:

In [6]: %timeit df.groupby("key").apply(lambda x: 1)
162 ms ± 5.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_am.groupby("key").apply(lambda x: 1)
263 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(so that's already close to the numbers you cite ;))

@jbrockmendel (Member):
> But so it's just a profile

I was thinking %prun -s cumtime for i in range(100): do_thing()

@jorisvandenbossche (Member, Author):
Actually, in this second example with bigger data (N = 10 ** 6), the slowdown seems to come from the fact that with the python fallback, we call splitter._get_sorted_data() twice.
Removing this duplication gives:

In [2]: %timeit df.groupby("key").apply(lambda x: 1)
158 ms ± 4.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit df_am.groupby("key").apply(lambda x: 1)
159 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
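For illustration, one way to avoid the duplicate sort is to cache the sorted frame on the splitter (a hypothetical sketch; stdlib cached_property standing in for pandas' cache_readonly):

from functools import cached_property

class DataSplitter:
    @cached_property
    def _sorted_data(self):
        # sort once per apply and reuse the result everywhere
        return self.data.take(self.sort_idx, axis=self.axis)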

@jreback (Contributor) commented Mar 2, 2021

yeah ideally if you can create a common base class here. of course if we can blow away all the cython, even better :->

@jbrockmendel (Member):
@jorisvandenbossche this is just perf, not a blocker for getting AM fully functional? if so, mind putting it on the backburner for a day or two while i try out my get-rid-of-libreduction idea?

@jbrockmendel (Member):
Using the example #40171 (comment) (which IIRC is based on the asv that was most affected by disabling fast_apply, last time I looked at getting rid of libreduction):

import numpy as np
import pandas as pd

N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = pd.DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)

%prun -s cumtime for i in range(100): df.groupby("key").apply(lambda x: 1)

In master, where this goes through fast_apply:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100    0.001    0.000    1.027    0.010 groupby.py:913(apply)
      100    0.002    0.000    1.023    0.010 groupby.py:962(_python_apply_general)
      100    0.002    0.000    0.975    0.010 ops.py:263(apply)
      100    0.000    0.000    0.838    0.008 ops.py:1020(fast_apply)
      100    0.310    0.003    0.829    0.008 {pandas._libs.reduction.apply_frame_axis0}
   199001    0.124    0.000    0.287    0.000 base.py:725(_engine)
   199101    0.125    0.000    0.206    0.000 base.py:4511(__getitem__)
408416/207611    0.060    0.000    0.100    0.000 {built-in method builtins.len}
   199001    0.045    0.000    0.066    0.000 base.py:4345(_get_engine_target)
   199002    0.047    0.000    0.063    0.000 common.py:157(cast_scalar_indexer)
      401    0.003    0.000    0.058    0.000 series.py:277(__init__)
   200503    0.040    0.000    0.056    0.000 base.py:751(__len__)
      100    0.000    0.000    0.048    0.000 ops.py:232(_get_splitter)
      100    0.000    0.000    0.047    0.000 ops.py:380(group_info)
      100    0.000    0.000    0.047    0.000 ops.py:398(_get_compressed_codes)
      100    0.000    0.000    0.046    0.000 ops.py:339(codes)
      100    0.000    0.000    0.046    0.000 ops.py:341(<listcomp>)
      200    0.000    0.000    0.046    0.000 grouper.py:582(codes)
      100    0.000    0.000    0.046    0.000 grouper.py:603(_make_codes)
      100    0.002    0.000    0.046    0.000 generic.py:1246(_wrap_applied_output)
      100    0.000    0.000    0.045    0.000 ops.py:1003(sorted_data)
      100    0.002    0.000    0.042    0.000 algorithms.py:559(factorize)
      300    0.002    0.000    0.042    0.000 construction.py:455(sanitize_array)
      100    0.000    0.000    0.040    0.000 generic.py:3533(take)
      100    0.001    0.000    0.037    0.000 managers.py:1448(take)
      600    0.001    0.000    0.035    0.000 take.py:23(take_nd)
      100    0.001    0.000    0.032    0.000 cast.py:121(maybe_convert_platform)
      600    0.003    0.000    0.031    0.000 take.py:75(_take_nd_ndarray)

If I just disable fast_apply but leave everything else as in master:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100    0.001    0.000    5.545    0.055 groupby.py:913(apply)
      100    0.002    0.000    5.540    0.055 groupby.py:962(_python_apply_general)
      100    0.528    0.005    5.494    0.055 ops.py:263(apply)
   198400    0.168    0.000    4.660    0.000 ops.py:990(__iter__)
   198300    0.302    0.000    4.438    0.000 ops.py:1025(_chop)
   198300    0.350    0.000    3.552    0.000 managers.py:788(get_slice)
   198300    0.179    0.000    1.513    0.000 managers.py:794(<listcomp>)
   594900    0.742    0.000    1.334    0.000 blocks.py:358(getitem_block)
   198300    0.206    0.000    1.006    0.000 base.py:4511(__getitem__)
   198400    0.275    0.000    0.684    0.000 managers.py:149(__init__)
   198300    0.162    0.000    0.666    0.000 numeric.py:125(_shallow_copy)
   198400    0.158    0.000    0.562    0.000 frame.py:550(__init__)
   198300    0.131    0.000    0.504    0.000 base.py:648(_shallow_copy)
   198800    0.276    0.000    0.363    0.000 generic.py:235(__init__)
   198600    0.189    0.000    0.341    0.000 base.py:563(_simple_new)
   594900    0.191    0.000    0.280    0.000 blocks.py:138(_simple_new)
   198400    0.114    0.000    0.234    0.000 managers.py:155(<listcomp>)
  2217700    0.191    0.000    0.194    0.000 {built-in method builtins.isinstance}
   594900    0.192    0.000    0.192    0.000 blocks.py:353(_slice)
   595400    0.122    0.000    0.174    0.000 managers.py:233(ndim)
   793500    0.122    0.000    0.122    0.000 {built-in method __new__ of type object at 0x103788410}
   397100    0.087    0.000    0.120    0.000 base.py:6112(ensure_index)
   199800    0.120    0.000    0.120    0.000 {pandas._libs.lib.is_scalar}
   198300    0.067    0.000    0.108    0.000 ops.py:956(_is_indexed_like)
   198600    0.085    0.000    0.105    0.000 base.py:714(_reset_identity)
   198800    0.087    0.000    0.087    0.000 flags.py:47(__init__)
   596000    0.079    0.000    0.079    0.000 blocks.py:295(mgr_locs)
   198400    0.071    0.000    0.071    0.000 frame.py:692(axes)
      400    0.003    0.000    0.056    0.000 series.py:277(__init__)
603600/602300    0.053    0.000    0.054    0.000 {built-in method builtins.len}

3.552 seconds in BlockManager.get_slice, which I think we can cut down significantly by, among other things, making the BlockManager/Block signatures stricter and skipping validation in their __init__ methods. For this to be safe, we need something like #40182 to keep that validation in place for downstream libraries.
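The profile above already shows Block._simple_new; the idea would be to route more construction through trusted fastpaths of that shape (a sketch with assumed attribute names, not the actual pandas code):

@classmethod
def _simple_new(cls, values, placement, ndim):
    # Trusted fastpath: bypass __init__ validation entirely; internal
    # callers guarantee values/placement/ndim are already consistent.
    obj = object.__new__(cls)
    obj.values = values
    obj._mgr_locs = placement
    obj.ndim = ndim
    return obj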

So the question: how much overhead are we willing to trade for getting rid of libreduction?

@jorisvandenbossche (Member, Author):
> this is just perf, not a blocker for getting AM fully functional?

Indeed, just performance. No problem in first exploring the "get rid of reduce".

Can you move your comment above to a new issue to discuss this?

@jbrockmendel (Member):
> Can you move your comment above to a new issue to discuss this?

sure

github-actions bot (Contributor) commented Apr 4, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

github-actions bot added the Stale label Apr 4, 2021
@jbrockmendel (Member):
@jorisvandenbossche can we use the Mothballed pattern here?

@jorisvandenbossche (Member, Author):
Yes, closing this awaiting any decision on #40263
