PERF: operate on arrays instead of Series in DataFrame/DataFrame ops #33561

jorisvandenbossche · 2020-04-15T08:53:22Z

jbrockmendel · 2020-04-15T13:58:38Z

While you're doing this, want to get the other .iloc[:, i] cases within dispatch_to_series?

jorisvandenbossche · 2020-04-16T11:54:59Z

Yes, will do.

(note, I would also be fine with you including this in #33561, if we decide to have a switch between column/blockwise. But it's mainly because you seemed quite hesitant to include something like this, that I thought to do this PR separately)
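For readers following along, here is a rough sketch of the idea in the PR title: apply the op column by column on the underlying arrays instead of on per-column Series. It uses only public API (`to_numpy`, the dict constructor); the helper names `op_via_series` and `op_via_arrays` are made up for illustration, and the actual PR relies on pandas internals that are not reproduced here.

```python
import operator

import pandas as pd


def op_via_series(func, left, right):
    """Column-wise op that goes through a Series for every column."""
    return {
        i: func(left.iloc[:, i], right.iloc[:, i])
        for i in range(len(left.columns))
    }


def op_via_arrays(func, left, right):
    """Column-wise op applied to the plain arrays behind each column."""
    return [
        func(left.iloc[:, i].to_numpy(), right.iloc[:, i].to_numpy())
        for i in range(len(left.columns))
    ]


left = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5]})
right = pd.DataFrame({"a": [3, 2, 1], "b": [1.5, 0.0, 9.0]})

# Series-based path: dict keys are positions, so restore the labels afterwards.
series_result = pd.DataFrame(op_via_series(operator.lt, left, right))
series_result.columns = left.columns

# Array-based path: pair each result array with its column label.
array_result = pd.DataFrame(
    dict(zip(left.columns, op_via_arrays(operator.lt, left, right))),
    index=left.index,
)
```

Both paths should give the same boolean frame; the array path simply avoids wrapping every intermediate result in a Series.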

jbrockmendel · 2020-04-16T14:03:04Z

pandas/core/ops/__init__.py


    else:
        # Remaining cases have less-obvious dispatch rules
        raise NotImplementedError(right)

-    new_data = expressions.evaluate(column_op, str_rep, left, right)


does this have a perf impact? IIRC a while back I tried removing it because I thought it shouldn't, but found that it did

As far as I understand, the get_array_op should take care of this?

For example, the array_ops.py::na_arithmetic_op also calls expressions.evaluate to perform the actual op.

That was what I thought too, and I could be remembering incorrectly. But this is worth double-checking.

jbrockmendel · 2020-04-16T14:11:18Z

pandas/core/ops/__init__.py


-            def column_op(a, b):
-                return {i: func(a.iloc[:, i], b.iloc[i]) for i in range(len(a.columns))}


This doesn't need to be in this PR, but since your attention is already focused on this line: this can be incorrect if b is e.g. Categorical, since b.iloc[i] will have the scalar behavior instead of the Categorical (usually raising) behavior. (This is also a place where NaT causes issues, and the idea of only-one-NA causes me nightmares)

The main idea I've had here is to use b.iloc[[i]] instead of b.iloc[i], haven't gotten around to implementing it. Does this change make it easier to address the issue?

when using something like b.iloc[[i]], you would need to handle broadcasting here as well (since the array ops expect either same length array or scalar). So since that is a pre-existing issue, I would leave that for a dedicated PR.
Is there an issue for this?

I don't think there's an issue for this. If I had found one, I would have added the Numeric label
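To make the b.iloc[i] vs b.iloc[[i]] distinction in this thread concrete, here is a small sketch using only public indexing. The variable names are illustrative and this is not the fix itself, just the behavior being discussed: positional scalar indexing drops the dtype, while list indexing keeps a length-1 Series of the original dtype.

```python
import pandas as pd

cat = pd.Series(pd.Categorical(["a", "b", "c"]))

scalar = cat.iloc[0]     # plain Python str, the Categorical dtype is gone
one_row = cat.iloc[[0]]  # length-1 Series that still has category dtype

# The same loss of information happens with missing datetime-like values:
# both lookups return the single pd.NaT object, so the scalar no longer
# says whether it came from datetime or timedelta data.
dt = pd.Series(pd.to_datetime(["2020-01-01", None]))
td = pd.Series(pd.to_timedelta(["1 day", None]))

dt_missing = dt.iloc[1]
td_missing = td.iloc[1]

# List indexing keeps the distinction via the dtype.
dt_kind = dt.iloc[[1]].dtype.kind  # datetime64 kind code
td_kind = td.iloc[[1]].dtype.kind  # timedelta64 kind code
```

As noted above, switching to the length-1 form would also require handling broadcasting, since the array ops expect either a same-length array or a scalar.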

jreback · 2020-04-17T21:48:32Z

can you point to some asv's that change with this? (nice cleanup anyhow)

jorisvandenbossche · 2020-05-22T13:10:36Z

can you point to some asv's that change with this? (nice cleanup anyhow)

There is already a MixedFrameWithSeriesAxis benchmark that captures this partly:

Running that benchmark specifically, on master:

[100.00%] ··· arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1    ok
[100.00%] ··· ========== ==========
                opname             
              ---------- ----------
                  eq      194±4ms  
                  ne      191±3ms  
                  lt      189±3ms  
                  le      190±4ms  
                  ge      209±50ms 
                  gt      225±20ms 
                 add      4.31±1ms 
                 sub      4.68±1ms 
               truediv    4.64±1ms 
               floordiv   54.8±2ms 
                 mul      4.73±1ms 
                 pow      15.4±2ms 
              ========== ==========

and with this branch:

[100.00%] ··· arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1    ok
[100.00%] ··· ========== ============
                opname               
              ---------- ------------
                  eq       68.5±9ms  
                  ne       70.3±9ms  
                  lt       62.7±6ms  
                  le      68.1±10ms  
                  ge       70.5±8ms  
                  gt       70.7±9ms  
                 add       4.44±1ms  
                 sub       5.52±1ms  
               truediv    5.45±0.8ms 
               floordiv    60.7±3ms  
                 mul       4.42±1ms  
                 pow       15.1±4ms  
              ========== ============

The comparison ops take the code path I changed here, and those show a consistent speed-up (the arithmetic ones actually take a different code path)
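For anyone who wants a quick local check without running asv, something along these lines might work. This is a hypothetical stand-in, not the actual MixedFrameWithSeriesAxis setup: the frame size, the column names, and the use of the first row as the Series operand are all assumptions, and absolute timings will differ from the numbers above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_cols = 2_000, 50  # arbitrary sizes, chosen to keep the run short

# Mixed dtypes: half float columns, half integer columns.
data = {f"f{i}": rng.standard_normal(n_rows) for i in range(n_cols // 2)}
data.update({f"i{i}": rng.integers(0, 10, n_rows) for i in range(n_cols // 2)})
df = pd.DataFrame(data)

# A Series indexed by the column labels, so it aligns with the columns.
row = df.iloc[0]


def compare():
    # Comparison op between a frame and a Series matched on the columns.
    return df < row


def add():
    # Arithmetic op with the same operands, for contrast.
    return df + row


t_cmp = timeit.timeit(compare, number=5)
t_add = timeit.timeit(add, number=5)
print(f"lt:  {t_cmp / 5 * 1e3:.2f} ms per call")
print(f"add: {t_add / 5 * 1e3:.2f} ms per call")
```

Running the same script on master and on the branch should show whether the comparison timing moves while the arithmetic timing stays roughly flat.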

jbrockmendel · 2020-05-22T18:21:53Z

pandas/core/ops/__init__.py


    else:
        # Remaining cases have less-obvious dispatch rules
        raise NotImplementedError(right)

-    new_data = expressions.evaluate(column_op, str_rep, left, right)
-    return new_data
+    return type(left)._from_arrays(


I think you can just return arrays here, since that will be passed to left._construct_result (though the docstring does say DataFrame, I guess that's wrong ATM)

I want to explicitly use _from_arrays, as that is part of the reason that makes this PR a perf improvement (the _from_arrays constructor specifically handles the case of a list of arrays corresponding to columns). Also, the default DataFrame constructor handles a list of arrays differently.

sounds good
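The point about the default constructor treating a list of arrays differently can be seen with public API alone. The sketch below does not call the private `_from_arrays`; the dict-based construction is only a public stand-in for "one array per column", and the column labels are made up.

```python
import numpy as np
import pandas as pd

arrays = [np.array([1, 2, 3]), np.array([4.0, 5.0, 6.0])]
columns = ["a", "b"]  # illustrative labels

# Default constructor: each array in the list becomes a row.
as_rows = pd.DataFrame(arrays)

# One array per column, each keeping its own dtype.
as_columns = pd.DataFrame(dict(zip(columns, arrays)))

print(as_rows.shape)     # two rows, three columns
print(as_columns.shape)  # three rows, two columns
print(as_columns.dtypes)
```

So passing the per-column result arrays straight to the plain constructor would transpose the result, which is one reason to go through a column-oriented constructor here.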

PERF: operate on arrays instead of Series in DataFrame/DataFrame ops

1102f0d

jbrockmendel mentioned this pull request Apr 15, 2020

REF: put EA concat logic in _concat_arrays #33535

Closed

jorisvandenbossche added 2 commits April 16, 2020 13:47

Merge remote-tracking branch 'upstream/master' into perf-frame-op-array

cd0218c

also for frame/series cases

44ec4f4

jbrockmendel reviewed Apr 16, 2020

jbrockmendel reviewed Apr 16, 2020

cleanup

9ff61c4

jorisvandenbossche added the Performance (memory or execution speed performance) label Apr 17, 2020

jorisvandenbossche added this to the 1.1 milestone Apr 17, 2020

jreback requested changes Apr 17, 2020

jreback requested changes May 17, 2020

jorisvandenbossche added 2 commits May 22, 2020 14:43

Merge remote-tracking branch 'upstream/master' into perf-frame-op-array

c4b7823

use list comprehension

b8de50e

jbrockmendel reviewed May 22, 2020

jbrockmendel reviewed May 22, 2020

jorisvandenbossche added 2 commits May 22, 2020 20:36

fixup

103db6c

Merge remote-tracking branch 'upstream/master' into perf-frame-op-array

d16ced8

jorisvandenbossche merged commit 64859ec into pandas-dev:master May 25, 2020

jorisvandenbossche deleted the perf-frame-op-array branch May 25, 2020 14:57
