Skip to content

PERF: PeriodIndex.dropna, difference, take #23095

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions asv_bench/benchmarks/period.py
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,8 @@ def time_value_counts(self, typ):
class Indexing(object):

goal_time = 0.2
pi_with_nas = PeriodIndex(['1985Q1', 'NaT', '1985Q2'] * 1000, freq='Q')
pi_diff = PeriodIndex(['1985Q1', 'NaT', '1985Q3'] * 1000, freq='Q')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setup gets called with every round of time_foo. Doing this here is analogous to calling setup_class once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setup_cache pickles and then loads in each step of the loop


def setup(self):
self.index = PeriodIndex(start='1985', periods=1000, freq='D')
Expand All @@ -121,4 +123,17 @@ def time_intersection(self):
self.index[:750].intersection(self.index[250:])

def time_unique(self):
# GH#23083
self.index.unique()

def time_dropna(self):
# GH#23095
self.pi_with_nas.dropna()

def time_difference(self):
# GH#23095
self.pi_with_nas.difference(self.pi_diff)

def time_symmetric_difference(self):
# GH#23095
self.pi_with_nas.symmetric_difference(self.pi_diff)
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.24.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -701,7 +701,7 @@ Performance Improvements
(:issue:`21372`)
- Improved the performance of :func:`pandas.get_dummies` with ``sparse=True`` (:issue:`21997`)
- Improved performance of :func:`IndexEngine.get_indexer_non_unique` for sorted, non-unique indexes (:issue:`9466`)
- Improved performance of :func:`PeriodIndex.unique` (:issue:`23083`)
- Improved performance of :func:`PeriodIndex.unique`, :func:`PeriodIndex.difference`, :func:`PeriodIndex.symmetric_difference`, and :func:`PeriodIndex.dropna`, (:issue:`23083`, :issue:`23095`)


.. _whatsnew_0240.docs:
Expand Down
32 changes: 23 additions & 9 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -519,7 +519,7 @@ def _simple_new(cls, values, name=None, dtype=None, **kwargs):
@Appender(_index_shared_docs['_shallow_copy'])
def _shallow_copy(self, values=None, **kwargs):
if values is None:
values = self.values
values = self._take_values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this change? What has take to do with shallow_copy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about passing i8 instead of object to _simple_new

attributes = self._get_attributes_dict()
attributes.update(kwargs)
if not len(values) and 'dtype' not in kwargs:
Expand Down Expand Up @@ -763,6 +763,17 @@ def get_values(self):
"""
return self.values

@property
def _take_values(self):
"""
Values to use in `take` operation, suitable to being passed back to
_shallow_copy/_simple_new.
"""
if is_period_dtype(self):
# Avoid casting to object-dtype
return self._ndarray_values
return self.values
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger @jorisvandenbossche is there a better name for this that might fit with EA naming conventions?

@jschendel should IntervalIndex be special-cased here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we have all EAs, this will always dispatch to the EA, and then we don't need a _take_values anymore?
(if that is the case, I wouldn't care too much about the name now)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jschendel should IntervalIndex be special-cased here?

The implementation should be good as-is for IntervalIndex; IntervalIndex.values is an IntervalArray, which has a special-cased take method. IntervalIndex._shallow_copy/_simple_new should accept an IntervalArray (which we'd get from take), so should be good on that front.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is basically .asi8 on Indexes now, yes? Is there a reason we are not aligning on that? or changing .asi8 to this?


@Appender(IndexOpsMixin.memory_usage.__doc__)
def memory_usage(self, deep=False):
result = super(Index, self).memory_usage(deep=deep)
Expand Down Expand Up @@ -2158,7 +2169,7 @@ def take(self, indices, axis=0, allow_fill=True,
if allow_fill and fill_value is not None:
msg = 'Unable to fill values because {0} cannot contain NA'
raise ValueError(msg.format(self.__class__.__name__))
taken = self.values.take(indices)
taken = self._take_values.take(indices)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PeriodIndex can hold NA values, so how is influencing PeriodIndex?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_take_values will be iNaT in locations where values are NaT, after passing to shallow_copy/simple_new, we get NaT back

return self._shallow_copy(taken)

def _assert_take_fillable(self, values, indices, allow_fill=True,
Expand Down Expand Up @@ -2929,7 +2940,8 @@ def difference(self, other):
self._assert_can_do_setop(other)

if self.equals(other):
return self._shallow_copy([])
# pass an empty array with the appropriate dtype
return self._shallow_copy(self[:0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you add a test for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or this was needed due to the change in shallow_copy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just because _shallow_copy calls _simple_new, and being "simple" should not be getting a list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I did not add a test, but can confirm this line is hit in the tests.


other, result_name = self._convert_can_do_setop(other)

Expand All @@ -2940,12 +2952,14 @@ def difference(self, other):

label_diff = np.setdiff1d(np.arange(this.size), indexer,
assume_unique=True)
the_diff = this.values.take(label_diff)
the_diff = this._take_values.take(label_diff)
try:
the_diff = sorting.safe_sort(the_diff)
except TypeError:
pass

if is_period_dtype(this):
return this._shallow_copy(the_diff, name=result_name)
return this._shallow_copy(the_diff, name=result_name, freq=None)

def symmetric_difference(self, other, result_name=None):
Expand Down Expand Up @@ -2994,11 +3008,11 @@ def symmetric_difference(self, other, result_name=None):
common_indexer = indexer.take((indexer != -1).nonzero()[0])
left_indexer = np.setdiff1d(np.arange(this.size), common_indexer,
assume_unique=True)
left_diff = this.values.take(left_indexer)
left_diff = this._take_values.take(left_indexer)

# {other} minus {this}
right_indexer = (indexer == -1).nonzero()[0]
right_diff = other.values.take(right_indexer)
right_diff = other._take_values.take(right_indexer)

the_diff = _concat._concat_compat([left_diff, right_diff])
try:
Expand All @@ -3008,7 +3022,7 @@ def symmetric_difference(self, other, result_name=None):

attribs = self._get_attributes_dict()
attribs['name'] = result_name
if 'freq' in attribs:
if 'freq' in attribs and not is_period_dtype(self):
attribs['freq'] = None
return self._shallow_copy_with_infer(the_diff, **attribs)

Expand All @@ -3028,7 +3042,7 @@ def _get_unique_index(self, dropna=False):
if self.is_unique and not dropna:
return self

values = self.values
values = self._take_values

if not self.is_unique:
values = self.unique()
Expand Down Expand Up @@ -4678,7 +4692,7 @@ def dropna(self, how='any'):
raise ValueError("invalid how option: {0}".format(how))

if self.hasnans:
return self._shallow_copy(self.values[~self._isnan])
return self._shallow_copy(self._take_values[~self._isnan])
return self._shallow_copy()

def _evaluate_with_timedelta_like(self, other, op):
Expand Down
10 changes: 9 additions & 1 deletion pandas/core/indexes/multi.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
from pandas.core.dtypes.common import (
ensure_int64,
ensure_platform_int,
is_period_dtype,
is_categorical_dtype,
is_object_dtype,
is_hashable,
Expand Down Expand Up @@ -1036,7 +1037,14 @@ def _get_level_values(self, level, unique=False):
labels = self.labels[level]
if unique:
labels = algos.unique(labels)
filled = algos.take_1d(values._values, labels,

if is_period_dtype(values):
take_vals = values._ndarray_values
else:
# TODO: Why is this _values where elsewhere it is values?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably responsible for using values._values here instead of .values. AFAIK, they should be equivalent for a MultiIndex (an ndarray of tuples).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll clear this out in the next pass, assuming this goes through.

# if it were values here we could use _take_values
take_vals = values._values
filled = algos.take_1d(take_vals, labels,
fill_value=values._na_value)
values = values._shallow_copy(filled)
return values
Expand Down
6 changes: 3 additions & 3 deletions pandas/core/indexes/period.py
Original file line number Diff line number Diff line change
Expand Up @@ -327,9 +327,9 @@ def __array_wrap__(self, result, context=None):
"""
if isinstance(context, tuple) and len(context) > 0:
func = context[0]
if (func is np.add):
if func is np.add:
pass
elif (func is np.subtract):
elif func is np.subtract:
name = self.name
left = context[1][0]
right = context[1][1]
Expand All @@ -350,7 +350,7 @@ def __array_wrap__(self, result, context=None):
return result
# the result is object dtype array of Period
# cannot pass _simple_new as it is
return self._shallow_copy(result, freq=self.freq, name=self.name)
return type(self)(result, freq=self.freq, name=self.name)

@property
def size(self):
Expand Down
3 changes: 2 additions & 1 deletion pandas/util/testing.py
Original file line number Diff line number Diff line change
Expand Up @@ -824,7 +824,8 @@ def _get_ilevel_values(index, level):
# accept level number only
unique = index.levels[level]
labels = index.labels[level]
filled = take_1d(unique.values, labels, fill_value=unique._na_value)
filled = take_1d(unique._take_values, labels,
fill_value=unique._na_value)
values = unique._shallow_copy(filled, name=index.names[level])
return values

Expand Down