-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: PeriodIndex.dropna, difference, take #23095
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @jbrockmendel! Thanks for submitting the PR.
|
@@ -96,6 +96,8 @@ def time_value_counts(self, typ): | |||
class Indexing(object): | |||
|
|||
goal_time = 0.2 | |||
pi_with_nas = PeriodIndex(['1985Q1', 'NaT', '1985Q2'] * 1000, freq='Q') | |||
pi_diff = PeriodIndex(['1985Q1', 'NaT', '1985Q3'] * 1000, freq='Q') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setup
gets called with every round of time_foo
. Doing this here is analogous to calling setup_class
once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is what setup_cache
is for: https://asv.readthedocs.io/en/stable/writing_benchmarks.html#setup-and-teardown-functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setup_cache pickles and then loads in each step of the loop
if is_period_dtype(self): | ||
# Avoid casting to object-dtype | ||
return self._ndarray_values | ||
return self.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger @jorisvandenbossche is there a better name for this that might fit with EA naming conventions?
@jschendel should IntervalIndex be special-cased here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we have all EAs, this will always dispatch to the EA, and then we don't need a _take_values
anymore?
(if that is the case, I wouldn't care too much about the name now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jschendel should IntervalIndex be special-cased here?
The implementation should be good as-is for IntervalIndex
; IntervalIndex.values
is an IntervalArray
, which has a special-cased take
method. IntervalIndex._shallow_copy/_simple_new
should accept an IntervalArray
(which we'd get from take
), so should be good on that front.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is basically .asi8 on Indexes now, yes? Is there a reason we are not aligning on that? or changing .asi8 to this?
if is_period_dtype(self): | ||
# Avoid casting to object-dtype | ||
return self._ndarray_values | ||
return self.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we have all EAs, this will always dispatch to the EA, and then we don't need a _take_values
anymore?
(if that is the case, I wouldn't care too much about the name now)
@@ -519,7 +519,7 @@ def _simple_new(cls, values, name=None, dtype=None, **kwargs): | |||
@Appender(_index_shared_docs['_shallow_copy']) | |||
def _shallow_copy(self, values=None, **kwargs): | |||
if values is None: | |||
values = self.values | |||
values = self._take_values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this change? What has take to do with shallow_copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about passing i8 instead of object to _simple_new
@@ -2158,7 +2169,7 @@ def take(self, indices, axis=0, allow_fill=True, | |||
if allow_fill and fill_value is not None: | |||
msg = 'Unable to fill values because {0} cannot contain NA' | |||
raise ValueError(msg.format(self.__class__.__name__)) | |||
taken = self.values.take(indices) | |||
taken = self._take_values.take(indices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PeriodIndex can hold NA values, so how is influencing PeriodIndex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_take_values will be iNaT in locations where values are NaT, after passing to shallow_copy/simple_new, we get NaT back
@@ -2929,7 +2940,8 @@ def difference(self, other): | |||
self._assert_can_do_setop(other) | |||
|
|||
if self.equals(other): | |||
return self._shallow_copy([]) | |||
# pass an empty array with the appropriate dtype | |||
return self._shallow_copy(self[:0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you add a test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or this was needed due to the change in shallow_copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just because _shallow_copy
calls _simple_new
, and being "simple" should not be getting a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I did not add a test, but can confirm this line is hit in the tests.
And the same question as on the other PR: to what extent is this conflicting with the PeriodArray PR of @TomAugspurger ? Shouldn't we rather focus on finishing that one first, and then doing some perf improvements / clean-ups afterwards? |
@jorisvandenbossche @TomAugspurger responding to many of joris's comments/questions (here and in #23093) in one place. The PeriodIndex/PeriodArray constructors are a pain point in large part because its i.e. following this PR, we'll be able to clean up the |
@jbrockmendel to be clear, I fully agree with your analysis about |
That's fair. I am definitely viewing the situation as "#22862 is a big PR and simplifying things before that will make that PR smaller, review easier, etc" |
Codecov Report
@@ Coverage Diff @@
## master #23095 +/- ##
==========================================
- Coverage 92.2% 92.19% -0.01%
==========================================
Files 169 169
Lines 50924 50933 +9
==========================================
+ Hits 46952 46960 +8
- Misses 3972 3973 +1
Continue to review full report at Codecov.
|
if is_period_dtype(values): | ||
take_vals = values._ndarray_values | ||
else: | ||
# TODO: Why is this _values where elsewhere it is values? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably responsible for using values._values
here instead of .values
. AFAIK, they should be equivalent for a MultiIndex (an ndarray of tuples).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll clear this out in the next pass, assuming this goes through.
Sorry, I still don't see how introducing this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree this will be short lived. This is a new notion, meaning we now have actually 3 ways of representing datetimelike things. The high level EA notion (e.g. an DatetimeArray), a numpy m8 array and a numpy i8 array. You are introducing common nomenclature for the 3rd.
if is_period_dtype(self): | ||
# Avoid casting to object-dtype | ||
return self._ndarray_values | ||
return self.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is basically .asi8 on Indexes now, yes? Is there a reason we are not aligning on that? or changing .asi8 to this?
I think this idea is that they'll all just be `.values` once we have
PeriodArray / DatetimeArray.
…On Fri, Oct 12, 2018 at 7:50 AM Jeff Reback ***@***.***> wrote:
***@***.**** requested changes on this pull request.
I don't agree this will be short lived. This is a new notion, meaning we
now have
------------------------------
In pandas/core/indexes/base.py
<#23095 (comment)>:
> @@ -763,6 +763,17 @@ def get_values(self):
"""
return self.values
+ @Property
+ def _take_values(self):
+ """
+ Values to use in `take` operation, suitable to being passed back to
+ _shallow_copy/_simple_new.
+ """
+ if is_period_dtype(self):
+ # Avoid casting to object-dtype
+ return self._ndarray_values
+ return self.values
this is basically .asi8 on Indexes now, yes? Is there a reason we are not
aligning on that? or changing .asi8 to this?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23095 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIgYuFBTeDDuieFgaOcWO-fKYWR-cks5ukJAegaJpZM4XYCM1>
.
|
I tend to agree with Tom on this one, but the more important point in my book is that this fixes the weird arguments being passed to PeriodIndex._shallow_copy which allows us to clean up the constructor mess, simplifying the PeriodArray transition. |
if they will be .values then don’t introduce something new |
Yah, I'm beginning to agree. Plenty of other related stuff to focus on. Closing. |
Related #23083
Besides the performance boost, this goes a long way towards making
PeriodIndex._simple_new
actually simple; avoids passing object-dtype to_simple_new