PERF: DataFrame.clip / Series.clip #51472

lukemanley · 2023-02-18T00:35:40Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v2.0.0.rst file if fixing a bug or adding a new feature.

Perf improvement in DataFrame.clip and Series.clip:

asv continuous -f 1.1 upstream/main perf-dataframe-clip -b frame_methods.Clip

       before           after         ratio
     [2070bb8d]       [39c44771]
     <main>           <perf-dataframe-clip>
-      14.1±0.4ms       11.2±0.4ms     0.79  frame_methods.Clip.time_clip('float64')
-         147±7ms         98.6±5ms     0.67  frame_methods.Clip.time_clip('Float64')
-       133±0.7ms         78.4±1ms     0.59  frame_methods.Clip.time_clip('float64[pyarrow]')

asv continuous -f 1.1 upstream/main perf-dataframe-clip -b series_methods.Clip

       before           after         ratio
     [2070bb8d]       [39c44771]
     <main>           <perf-dataframe-clip>
-        613±30μs         533±30μs     0.87  series_methods.Clip.time_clip(1000)
-        635±60μs          509±8μs     0.80  series_methods.Clip.time_clip(50)

phofl

The copy on write failures are legit. I guess you’ll have to do self.copy(deep=None) instead of self.copy()

lukemanley · 2023-02-18T12:40:19Z

> The copy on write failures are legit. I guess you’ll have to do self.copy(deep=None) instead of self.copy()

I just tried that, still seems to fail. I made a slightly different change that seems to work, let me know if this looks ok.

phofl

This moves away from CoW semantics now.

Both objects should share memory if clip is a no-op:

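import numpy as np
from pandas import DataFrame
from pandas.tests.copy_view.util import get_array  # pandas-internal test helper

df = DataFrame({"a": [1.5, 2, 3]})
df_copy = df.copy()
arr_a = get_array(df, "a")
view = df[:]  # keeps a second reference to the same data
df.clip(lower=1, inplace=True)  # no-op: every value is already >= 1
print(np.shares_memory(get_array(df, "a"), arr_a))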

Also with inplace and without references, this can still modify the existing array inplace.

Your current PR avoids this (we don't have tests for this yet because we did not get to it, and we covered where extensively :))

Additionally, this introduces bugs when setting other dtypes:

df = DataFrame({"a": [1, 2, 3]})
df = df.clip(lower=1.5)

This is a no-op on this branch but works on main (should probably add a test)

In general, I'd prefer doing this on the block level. This keeps CoW handling more or less in one place.
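The int/float case above can be sketched as a small regression check. This is only an illustration: the function name is hypothetical, and the expected frame assumes that clip upcasts an integer column when a float bound actually changes a value, which is the behavior described as working on main.

```python
import pandas as pd


def check_clip_int_data_with_float_bound():
    # Integer column clipped with a float lower bound.
    df = pd.DataFrame({"a": [1, 2, 3]})
    result = df.clip(lower=1.5)
    # 1 is below the bound, so the column has to become float64 to hold 1.5.
    expected = pd.DataFrame({"a": [1.5, 2.0, 3.0]})
    pd.testing.assert_frame_equal(result, expected)
    return result


result = check_clip_int_data_with_float_bound()
```

If clip silently became a no-op for this input, assert_frame_equal would fail on both the values and the dtype.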

lukemanley · 2023-02-19T03:15:04Z

Thanks for the explanation. I've updated the PR to use BlockManager.where and added a test for the int/float case you highlighted. I've also updated the timings in the OP.

phofl · 2023-02-19T21:42:47Z

Opened #51492 that covers possible cases. Would like to merge this first to ensure that we don't miss anything. Do you know why where is so much slower? Meaning the block method?

phofl · 2023-02-19T21:43:16Z

You could look into putmask for inplace ops, that avoids copying if no reference is found

lukemanley · 2023-02-20T00:55:53Z

> You could look into putmask for inplace ops, that avoids copying if no reference is found

Added inplace via putmask.

> Do you know why where is so much slower? Meaning the block method?

Yes, I think I'll open a separate PR for this. When we compute a boolean mask on an EA-backed frame, we get a "boolean" (EA) mask. This gets converted to an object dtype ndarray within BaseBlockManager.apply which is quite slow.

As an example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10**6, 2), dtype="Float64")

mask = df > 0

%timeit mask._values                                 # -> dtype object (SLOW...)
%timeit mask.to_numpy(dtype=bool, na_value=False)    # -> dtype bool

# 25.2 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 539 µs ± 9.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
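To make the dtype difference concrete without timing anything, here is a small sketch using only public API. The tiny frame and its values are made up for illustration, and it deliberately contains no missing values.

```python
import numpy as np
import pandas as pd

# Small Float64 (nullable extension dtype) frame without missing values.
df = pd.DataFrame({"a": [-1.0, 0.5], "b": [2.0, -3.0]}, dtype="Float64")

mask = df > 0
# Each column of the comparison result uses the nullable "boolean" dtype.
mask_dtypes = [str(dtype) for dtype in mask.dtypes]

# Explicit conversion to a plain NumPy bool array; na_value=False is the
# fill requested for missing entries (there are none in this frame).
np_mask = mask.to_numpy(dtype=bool, na_value=False)
```

The explicit to_numpy call is the fast path timed above; the sketch only shows what it returns, not how BaseBlockManager.apply converts the mask internally.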

phofl · 2023-02-20T10:14:12Z

Merged the clip tests, could you rebase?

Ah that makes sense, got it.

phofl

small comments, otherwise lgtm

lukemanley · 2023-02-21T22:30:27Z

> small comments, otherwise lgtm

@phofl - sorry, which small comments are you referring to?

phofl · 2023-02-21T23:19:10Z

pandas/core/generic.py

+                cond = mask | (self <= upper)
+                mgr = mgr.where(other=upper, cond=cond, align=False)
+            if lower is not None:
+                cond = mask | (self >= lower)


phofl

Can we use > or >= in both cases? Also why don’t we need a mask in the inplace case?

lukemanley

The condition in putmask identifies the values to be updated so we look for values outside the threshold (e.g. ">")

The condition in where identifies the values to be left as is which are NA values (mask) and anything within the threshold (e.g. "<=")

So I think the difference is consistent with the meaning of those different methods. Using <= and >= in putmask would actually turn no-ops at the boundary into unnecessary ops I think.

phofl

Thx, makes sense. Did not consider that, could you add a really short comment to that effect?
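The two condition styles discussed above can be illustrated with plain NumPy. This is a sketch of the semantics only, not the pandas block-level code: np.where and np.putmask stand in for the manager methods, and the sample values are made up.

```python
import numpy as np

values = np.array([0.5, 1.0, 2.0, np.nan, 3.5])
upper = 2.0
na_mask = np.isnan(values)

# where-style: the condition marks values to KEEP, so it must include the
# missing values and everything within the threshold (<=).
keep = na_mask | (values <= upper)
clipped_where = np.where(keep, values, upper)

# putmask-style: the condition marks values to REPLACE, so it selects only
# values strictly outside the threshold (>). In NumPy, NaN > upper is False,
# so the missing value is left untouched here without an extra mask.
clipped_putmask = values.copy()
replace = clipped_putmask > upper
np.putmask(clipped_putmask, replace, upper)

# With >= instead of >, the boundary value 2.0 would also be rewritten,
# although writing 2.0 over 2.0 changes nothing.
replace_inclusive = values >= upper
```

Both variants produce the same clipped array; they differ only in which elements the condition selects, which is why the strict comparison is the natural choice for putmask.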

phofl · 2023-02-21T23:19:46Z

Sorry, looks like they did not get posted. Had some connection trouble today, so I reposted them.

phofl · 2023-02-22T09:05:56Z

doc/source/whatsnew/v2.0.0.rst

@@ -1122,6 +1122,7 @@ Performance improvements
 - Performance improvement in :func:`merge` and :meth:`DataFrame.join` when joining on a sorted :class:`MultiIndex` (:issue:`48504`)
 - Performance improvement in :func:`to_datetime` when parsing strings with timezone offsets (:issue:`50107`)
 - Performance improvement in :meth:`DataFrame.loc` and :meth:`Series.loc` for tuple-based indexing of a :class:`MultiIndex` (:issue:`48384`)
+- Performance improvement in :meth:`DataFrame.clip` and :meth:`Series.clip` (:issue:`51472`)


One last comment. Can you move to 2.1?

phofl · 2023-02-22T21:23:38Z

thx @lukemanley

PERF: DataFrame.clip

47dcdf7

lukemanley added the Performance (memory or execution speed performance) label Feb 18, 2023

lukemanley added 2 commits February 17, 2023 19:36

whatsnew

74d94dc

finalize

cde3a95

phofl reviewed Feb 18, 2023

fix COW test failure

c13fd1a

phofl reviewed Feb 18, 2023

lukemanley added 2 commits February 18, 2023 21:28

use BlockManager.where

ff9467e

update asv

39c4477

add inplace

6eb43d0

lukemanley added 4 commits February 20, 2023 06:26

Merge remote-tracking branch 'upstream/main' into perf-dataframe-clip

126ea51

update inplace tests

70fbf81

Merge remote-tracking branch 'upstream/main' into perf-dataframe-clip

c482024

fix

1026357

phofl reviewed Feb 21, 2023

lukemanley added 2 commits February 21, 2023 20:04

add comments

3dfa04d

Merge remote-tracking branch 'upstream/main' into perf-dataframe-clip

393e2ae

phofl approved these changes Feb 22, 2023

phofl reviewed Feb 22, 2023

move whatsnew to 2.1.0

7dc0df4

phofl added this to the 2.1 milestone Feb 22, 2023

phofl merged commit 14a315c into pandas-dev:main Feb 22, 2023

lukemanley deleted the perf-dataframe-clip branch February 23, 2023 01:39

lukemanley mentioned this pull request Feb 24, 2023

PERF: DataFrame.where for EA dtype mask #51574

Merged

4 tasks

Raghav-Bell mentioned this pull request Aug 28, 2023

clip op: TypeError: list indices must be integers or slices, not NoneType #54817

Closed
