-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: improved clip performance #16364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #16364 +/- ##
==========================================
- Coverage 90.38% 90.36% -0.02%
==========================================
Files 161 161
Lines 50916 50933 +17
==========================================
+ Hits 46021 46028 +7
- Misses 4895 4905 +10
Continue to review full report at Codecov.
|
Additional test cases for pandas-dev#16364 when upper and / or lower is nan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a slight change in behaviour in that the new implementation does not preserve heterogeneous data types (eg int/float).
Not that this should hold back these perf improvements, but we might consider to keep it for 0.21 (this is also no regression)
result = self.values | ||
mask = isnull(result) | ||
if upper is not None: | ||
result = np.where(result >= upper, upper, result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs a with np.errstate
, as we are working with raw array
In [8]: pd.Series([0, np.nan, 2]).clip(0, 1)
/home/joris/scipy/pandas/pandas/core/generic.py:4117: RuntimeWarning: invalid value encountered in greater_equal
result = np.where(result >= upper, upper, result)
/home/joris/scipy/pandas/pandas/core/generic.py:4119: RuntimeWarning: invalid value encountered in less_equal
result = np.where(result <= lower, lower, result)
Out[8]:
0 0.0
1 NaN
2 1.0
dtype: float64
def _clip_with_scalar(self, lower, upper): | ||
|
||
if ((lower is not None and np.any(isnull(lower))) or | ||
(upper is not None and np.any(isnull(upper)))): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the np.any
needed here? As lower/upper are already confirmed to be a scalar?
In principle could add a check for that (at the if statement to decide to take this path or not), but not sure that is worth it .. |
yeah this is a limitation of the current methodology let me see what i can do |
Additional test cases for #16364 when upper and / or lower is nan.
* upstream/master: (48 commits) BUG: Categorical comparison with unordered (pandas-dev#16339) ENH: Adding 'protocol' parameter to 'to_pickle'. PERF: improve MultiIndex get_loc performance (pandas-dev#16346) TST: remove pandas-datareader xfail as 0.4.0 works (pandas-dev#16374) TST: followup to pandas-dev#16364, catch errstate warnings (pandas-dev#16373) DOC: new oauth token TST: Add test for clip-na (pandas-dev#16369) ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315) MAINT: Add .iml to .gitignore (pandas-dev#16368) BUG/API: Categorical constructor scalar categories (pandas-dev#16340) ENH: Provide dict object for to_dict() pandas-dev#16122 (pandas-dev#16220) PERF: improved clip performance (pandas-dev#16364) DOC: try new token for docs DOC: try with new secure token DOC: add developer section to the docs DEPS: Drop Python 3.4 support (pandas-dev#16303) DOC: remove credential helper DOC: force fetch on build docs DOC: redo dev docs access token DOC: add dataframe construction in merge_asof example (pandas-dev#16348) ...
Additional test cases for pandas-dev#16364 when upper and / or lower is nan.
closes pandas-dev#15400 (cherry picked from commit 42e2a87)
(cherry picked from commit e97865e)
Additional test cases for pandas-dev#16364 when upper and / or lower is nan.
closes #15400
In [1]: np.random.seed(1234)
In [2]: s = pd.Series(np.random.randn(50))
master
PR
prob as good as can do for now as we still have 2 where ops (numpy does this in a single loop), and we have a mask check and fill (and final construction).
but about 15x better