Skip to content

PERF: PeriodIndex.dropna, difference, take #23095

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Oct 11, 2018

Related #23083

Besides the performance boost, this goes a long way towards making PeriodIndex._simple_new actually simple; avoids passing object-dtype to _simple_new

In [2]: pi_with_nas = pd.PeriodIndex(['1985Q1', 'NaT', '1985Q2'] * 1000, freq='Q')
In [3]: %timeit pi_with_nas.dropna()
The slowest run took 13.93 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.3 µs per loop  <-- PR
100 loops, best of 3: 4.96 ms per loop    <-- master

In [4]: pi_diff = pd.PeriodIndex(['1985Q1', 'NaT', '1985Q3'] * 1000, freq='Q')
In [5]: %timeit pi_with_nas.difference(pi_diff)
1000 loops, best of 3: 553 µs per loop  <-- PR
100 loops, best of 3: 5.29 ms per loop  <-- master

In [6]: %timeit pi_with_nas.symmetric_difference(pi_diff)
1000 loops, best of 3: 787 µs per loop  <-- PR
100 loops, best of 3: 10.3 ms per loop  <-- master

@pep8speaks
Copy link

Hello @jbrockmendel! Thanks for submitting the PR.

file_to_check.py:2454:-258: W605 invalid escape sequence '('
file_to_check.py:2454:-255: W605 invalid escape sequence ')'


@@ -96,6 +96,8 @@ def time_value_counts(self, typ):
class Indexing(object):

goal_time = 0.2
pi_with_nas = PeriodIndex(['1985Q1', 'NaT', '1985Q2'] * 1000, freq='Q')
pi_diff = PeriodIndex(['1985Q1', 'NaT', '1985Q3'] * 1000, freq='Q')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setup gets called with every round of time_foo. Doing this here is analogous to calling setup_class once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setup_cache pickles and then loads in each step of the loop

if is_period_dtype(self):
# Avoid casting to object-dtype
return self._ndarray_values
return self.values
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger @jorisvandenbossche is there a better name for this that might fit with EA naming conventions?

@jschendel should IntervalIndex be special-cased here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we have all EAs, this will always dispatch to the EA, and then we don't need a _take_values anymore?
(if that is the case, I wouldn't care too much about the name now)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jschendel should IntervalIndex be special-cased here?

The implementation should be good as-is for IntervalIndex; IntervalIndex.values is an IntervalArray, which has a special-cased take method. IntervalIndex._shallow_copy/_simple_new should accept an IntervalArray (which we'd get from take), so should be good on that front.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is basically .asi8 on Indexes now, yes? Is there a reason we are not aligning on that? or changing .asi8 to this?

if is_period_dtype(self):
# Avoid casting to object-dtype
return self._ndarray_values
return self.values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we have all EAs, this will always dispatch to the EA, and then we don't need a _take_values anymore?
(if that is the case, I wouldn't care too much about the name now)

@@ -519,7 +519,7 @@ def _simple_new(cls, values, name=None, dtype=None, **kwargs):
@Appender(_index_shared_docs['_shallow_copy'])
def _shallow_copy(self, values=None, **kwargs):
if values is None:
values = self.values
values = self._take_values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this change? What has take to do with shallow_copy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about passing i8 instead of object to _simple_new

@@ -2158,7 +2169,7 @@ def take(self, indices, axis=0, allow_fill=True,
if allow_fill and fill_value is not None:
msg = 'Unable to fill values because {0} cannot contain NA'
raise ValueError(msg.format(self.__class__.__name__))
taken = self.values.take(indices)
taken = self._take_values.take(indices)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PeriodIndex can hold NA values, so how is influencing PeriodIndex?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_take_values will be iNaT in locations where values are NaT, after passing to shallow_copy/simple_new, we get NaT back

@@ -2929,7 +2940,8 @@ def difference(self, other):
self._assert_can_do_setop(other)

if self.equals(other):
return self._shallow_copy([])
# pass an empty array with the appropriate dtype
return self._shallow_copy(self[:0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you add a test for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or this was needed due to the change in shallow_copy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just because _shallow_copy calls _simple_new, and being "simple" should not be getting a list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I did not add a test, but can confirm this line is hit in the tests.

@jorisvandenbossche
Copy link
Member

And the same question as on the other PR: to what extent is this conflicting with the PeriodArray PR of @TomAugspurger ? Shouldn't we rather focus on finishing that one first, and then doing some perf improvements / clean-ups afterwards?

@jbrockmendel
Copy link
Member Author

Shouldn't we rather focus on finishing that one first, and then doing some perf improvements

@jorisvandenbossche @TomAugspurger responding to many of joris's comments/questions (here and in #23093) in one place.

The PeriodIndex/PeriodArray constructors are a pain point in large part because its _simple_new is not very simple, which is why there is the lower-level _from_ordinals. Why the complication in _simple_new? Because the values being passed to it are sometimes i8 (what we actually want), sometimes ndarray[object], and sometimes PeriodIndex. The latter two cases, as it turns out, are coming from calls to _shallow_copy, exactly the ones this PR affects.

i.e. following this PR, we'll be able to clean up the PeriodIndex constructors (on top of whats in #23093)

@jorisvandenbossche
Copy link
Member

@jbrockmendel to be clear, I fully agree with your analysis about _simple_new.
But the thing is that the PeriodArray PR already solves this partly (at least for PeriodIndex, complexity is mainly moved to PeriodArray), because then, it will be much more PeriodArray that will be passed around in the Index implementation, instead of sometimes object array / sometimes ordinals. And of course, we can then still try to simplify the different constructors in PeriodArray.
As another example, The PeriodArray currently does not yet solve the take issue you try to solve here, but once that PR would land, the solution for take would be completely different (implementing it on the PeriodArray, dispatching to that, and properly wrapping it again), and the other changes you have here might not be needed.

@jbrockmendel
Copy link
Member Author

And of course, we can then still try to simplify the different constructors in PeriodArray.

That's fair. I am definitely viewing the situation as "#22862 is a big PR and simplifying things before that will make that PR smaller, review easier, etc"

@codecov
Copy link

codecov bot commented Oct 11, 2018

Codecov Report

Merging #23095 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23095      +/-   ##
==========================================
- Coverage    92.2%   92.19%   -0.01%     
==========================================
  Files         169      169              
  Lines       50924    50933       +9     
==========================================
+ Hits        46952    46960       +8     
- Misses       3972     3973       +1
Flag Coverage Δ
#multiple 90.62% <100%> (-0.01%) ⬇️
#single 42.3% <52.17%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.46% <100%> (+0.01%) ⬆️
pandas/core/indexes/period.py 93.44% <100%> (ø) ⬆️
pandas/util/testing.py 86.18% <100%> (ø) ⬆️
pandas/core/indexes/base.py 96.56% <100%> (+0.01%) ⬆️
pandas/core/indexes/datetimes.py 95.61% <0%> (-0.14%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c8ce3d0...62f0f1f. Read the comment docs.

if is_period_dtype(values):
take_vals = values._ndarray_values
else:
# TODO: Why is this _values where elsewhere it is values?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably responsible for using values._values here instead of .values. AFAIK, they should be equivalent for a MultiIndex (an ndarray of tuples).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll clear this out in the next pass, assuming this goes through.

@jorisvandenbossche
Copy link
Member

Sorry, I still don't see how introducing this _take_values brings us closer to our goal. The PeriodArray PR will remove that again, no?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree this will be short lived. This is a new notion, meaning we now have actually 3 ways of representing datetimelike things. The high level EA notion (e.g. an DatetimeArray), a numpy m8 array and a numpy i8 array. You are introducing common nomenclature for the 3rd.

if is_period_dtype(self):
# Avoid casting to object-dtype
return self._ndarray_values
return self.values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is basically .asi8 on Indexes now, yes? Is there a reason we are not aligning on that? or changing .asi8 to this?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 12, 2018 via email

@jbrockmendel
Copy link
Member Author

I don't agree this will be short lived.

I think this idea is that they'll all just be .values once we have PeriodArray / DatetimeArray.

I tend to agree with Tom on this one, but the more important point in my book is that this fixes the weird arguments being passed to PeriodIndex._shallow_copy which allows us to clean up the constructor mess, simplifying the PeriodArray transition.

@jreback
Copy link
Contributor

jreback commented Oct 14, 2018

if they will be .values then don’t introduce something new

@jbrockmendel
Copy link
Member Author

Yah, I'm beginning to agree. Plenty of other related stuff to focus on. Closing.

@jbrockmendel jbrockmendel deleted the take_vals branch April 5, 2020 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants