PERF: define is_all_dates to shortcut inadvertent copy when slicing an IntervalIndex #23591
Hello @qwhelan! Thanks for submitting the PR.
Codecov Report
@@            Coverage Diff            @@
##           master   #23591     +/-   ##
==========================================
+ Coverage   92.25%   92.25%   +<.01%
==========================================
  Files         161      161
  Lines       51237    51239       +2
==========================================
+ Hits        47269    47271       +2
  Misses       3968     3968
@@ -1061,6 +1061,14 @@ def func(self, other):
name=result_name)
return func

@property
def is_all_dates(self):
Note that we should actually change the default for Index subclasses, I think.
Thanks @TomAugspurger!
…fixed

* upstream/master: (47 commits)
  CLN: remove values attribute from datetimelike EAs (pandas-dev#23603)
  DOC/CI: Add linting to rst files, and fix issues (pandas-dev#23381)
  PERF: Speeds up creation of Period, PeriodArray, with Offset freq (pandas-dev#23589)
  PERF: define is_all_dates to shortcut inadvertent copy when slicing an IntervalIndex (pandas-dev#23591)
  TST: Tests and Helpers for Datetime/Period Arrays (pandas-dev#23502)
  Update description of Index._values/values/ndarray_values (pandas-dev#23507)
  Fixes to make validate_docstrings.py not generate warnings or unwanted output (pandas-dev#23552)
  DOC: Added note about groupby excluding Decimal columns by default (pandas-dev#18953)
  ENH: Support writing timestamps with timezones with to_sql (pandas-dev#22654)
  CI: Auto-cancel redundant builds (pandas-dev#23523)
  Preserve EA dtype in DataFrame.stack (pandas-dev#23285)
  TST: Fix dtype mismatch on 32bit in IntervalTree get_indexer test (pandas-dev#23468)
  BUG: raise if invalid freq is passed (pandas-dev#23546)
  remove uses of (ts)?lib.(NaT|iNaT|Timestamp) (pandas-dev#23562)
  BUG: Fix error message for invalid HTML flavor (pandas-dev#23550)
  ENH: Support EAs in Series.unstack (pandas-dev#23284)
  DOC: Updating DataFrame.join docstring (pandas-dev#23471)
  TST: coverage for skipped tests in io/formats/test_to_html.py (pandas-dev#22888)
  BUG: Return KeyError for invalid string key (pandas-dev#23540)
  BUG: DatetimeIndex slicing with boolean Index raises TypeError (pandas-dev#22852)
  ...
We get a few orders of magnitude speedup in IntervalIndex slicing by simply overriding the base class definition of is_all_dates, as all other Index derivatives already do. The root cause of the performance degradation is as follows:

1. When a Series is sliced, a new Series is created for the result.
2. Series.__init__() calls Series._set_axis(), which in turn calls .is_all_dates on the new Index.
3. The base class Index.is_all_dates seems harmless at first glance. However, it eventually invokes IntervalArray.__array__, which is a pure Python for-loop creating Interval objects, and that loop is what causes the performance regression here.

As the value of IntervalIndex.is_all_dates appears to always be False, even in the case of datetime-like left/right values, we simply override it to return that value and shortcut the inadvertent copy described above.

Benchmarks

Speed up of ~10704x for time_loc_list
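The mechanism described in this PR can be illustrated with a minimal, pandas-free sketch. Everything in it is hypothetical: SketchIndex, SketchIntervalIndex, FastIntervalIndex, to_objects and the conversions counter are stand-ins invented for illustration, not the pandas classes or the actual patch. It only shows how a generic property that materializes every element can be short-circuited by a subclass that already knows the answer.

```python
from datetime import datetime


class SketchIndex:
    """Hypothetical base index; NOT the pandas implementation."""

    def __init__(self, values):
        self._values = list(values)
        self.conversions = 0  # how many times values were materialized

    def to_objects(self):
        # Stand-in for an expensive per-element conversion.
        self.conversions += 1
        return list(self._values)

    @property
    def is_all_dates(self):
        # Generic check: materialize every element, then inspect each one.
        objs = self.to_objects()
        return bool(objs) and all(isinstance(o, datetime) for o in objs)


class SketchIntervalIndex(SketchIndex):
    """Hypothetical interval index stored as separate left/right bounds."""

    def __init__(self, left, right):
        super().__init__([])
        self._left, self._right = list(left), list(right)

    def to_objects(self):
        # Builds one (left, right) pair per element in a Python loop,
        # analogous to the slow path the PR description points at.
        self.conversions += 1
        return [(lo, hi) for lo, hi in zip(self._left, self._right)]


class FastIntervalIndex(SketchIntervalIndex):
    """Same data, but the property is overridden to skip the conversion."""

    @property
    def is_all_dates(self):
        # Interval pairs are never datetimes, so answer without converting.
        return False


slow = SketchIntervalIndex(range(1000), range(1, 1001))
fast = FastIntervalIndex(range(1000), range(1, 1001))
print(slow.is_all_dates, slow.conversions)
print(fast.is_all_dates, fast.conversions)
```

Both variants report the same value for is_all_dates; the difference is that the overriding class never triggers the per-element conversion, which is the shortcut this PR applies to IntervalIndex.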
git diff upstream/master -u -- "*.py" | flake8 --diff