-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: DataFrame.clip / Series.clip #51472
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The copy on write failures are legit. I guess you’ll have to do self.copy(deep=None) instead of self.copy()
I just tried that, still seems to fail. I made a slightly different change that seems to work, let me know if this looks ok. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This moves away from CoW syntax now:
Both objects should share memory, if clip is a no-op:
df = DataFrame({"a": [1.5, 2, 3]})
df_copy = df.copy()
arr_a = get_array(df, "a")
view = df[:]
df.clip(lower=1, inplace=True)
print(np.shares_memory(get_array(df, "a"), arr_a))
Also with inplace and without references, this can still modify the existing array inplace.
Your current pr avoids this (we don't have tests for this yet, because we did not get to it and we covered where extensively :))
Additionally, this introduces bugs when setting other dtypes:
df = DataFrame({"a": [1, 2, 3]})
df = df.clip(lower=1.5)
This is a no-op on this branch but works on main (should probably add a test)
In general, I'd prefer doing this on the block level. This keeps CoW handling in more or less on place.
Thanks for the explanation. I've updated the PR to use |
Opened #51492 that covers possible cases. Would like to merge this first to ensure that we don't miss anything. Do you know why where is so much slower? Meaning the block method? |
You could look into putmask for inplace ops, that avoids copying if no reference is found |
Added inplace via putmask.
Yes, I think I'll open a separate PR for this. When we compute a boolean mask on an EA-backed frame, we get a "boolean" (EA) mask. This gets converted to an object dtype ndarray within As an example:
|
Merged the clip tests, could you rebase? Ah that makes sense, got it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments, otherwise lgtm
@phofl - sorry, which small comments are you referring to? |
cond = mask | (self <= upper) | ||
mgr = mgr.where(other=upper, cond=cond, align=False) | ||
if lower is not None: | ||
cond = mask | (self >= lower) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use > or >= in both cases? Also why don’t we need a mask in the inplace case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition in putmask
identifies the values to be updated so we look for values outside the threshold (e.g. ">")
The condition in where
identifies the values to be left as is which are NA values (mask) and anything within the threshold (e.g. "<=")
So I think the difference is consistent with the meaning of those different methods. Using <= and >= in putmask would actually turn no-ops at the boundary into unnecessary ops I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, makes sense. Did not consider that, could you add a really short comment to that effect?
Sorry looks like the did not get posted. Had some connection troubles today, reposted them |
doc/source/whatsnew/v2.0.0.rst
Outdated
@@ -1122,6 +1122,7 @@ Performance improvements | |||
- Performance improvement in :func:`merge` and :meth:`DataFrame.join` when joining on a sorted :class:`MultiIndex` (:issue:`48504`) | |||
- Performance improvement in :func:`to_datetime` when parsing strings with timezone offsets (:issue:`50107`) | |||
- Performance improvement in :meth:`DataFrame.loc` and :meth:`Series.loc` for tuple-based indexing of a :class:`MultiIndex` (:issue:`48384`) | |||
- Performance improvement in :meth:`DataFrame.clip` and :meth:`Series.clip` (:issue:`51472`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last comment. Can you move to 2.1?
thx @lukemanley |
doc/source/whatsnew/v2.0.0.rst
file if fixing a bug or adding a new feature.Perf improvement in
DataFrame.clip
andSeries.clip
: