Skip to content

BUG: names on union and intersection for Index were inconsistent (GH9943 GH9862) #19849

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Nov 6, 2018
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 1 addition & 2 deletions doc/source/whatsnew/v0.24.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -411,7 +411,7 @@ Reshaping
- Bug in :func:`pandas.concat` when joining resampled DataFrames with timezone aware index (:issue:`13783`)
- Bug in :meth:`Series.combine_first` with ``datetime64[ns, tz]`` dtype which would return tz-naive result (:issue:`21469`)
- Bug in :meth:`Series.where` and :meth:`DataFrame.where` with ``datetime64[ns, tz]`` dtype (:issue:`21546`)
-
- Bug in :func:`Index.union` and :func:`Index.intersection` where name of the ``Index`` of the result was not computed correctly for certain cases (:issue:`9943`, :issue:`9862`)
-

Build Changes
Expand All @@ -427,4 +427,3 @@ Other
- Require at least 0.28.2 version of ``cython`` to support read-only memoryviews (:issue:`21688`)
-
-
-
39 changes: 17 additions & 22 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,7 @@
import pandas.core.sorting as sorting
from pandas.io.formats.printing import (
pprint_thing, default_pprint, format_object_summary, format_object_attrs)
from pandas.core.ops import make_invalid_op
from pandas.core.ops import make_invalid_op, get_op_result_name
from pandas.core.strings import StringMethods

__all__ = ['Index']
Expand Down Expand Up @@ -1200,7 +1200,7 @@ def _convert_can_do_setop(self, other):
other = Index(other, name=self.name)
result_name = self.name
else:
result_name = self.name if self.name == other.name else None
result_name = get_op_result_name(self, other)
return other, result_name

def _convert_for_op(self, value):
Expand Down Expand Up @@ -2643,19 +2643,15 @@ def __or__(self, other):
def __xor__(self, other):
return self.symmetric_difference(other)

def _get_consensus_name(self, other):
def _get_reconciled_name_object(self, other):
"""
Given 2 indexes, give a consensus name meaning
we take the not None one, or None if the names differ.
Return a new object if we are resetting the name
If the result of a set operation will be self,
return self, unless the name changes, in which
case make a shallow copy of self.
"""
if self.name != other.name:
if self.name is None or other.name is None:
name = self.name or other.name
else:
name = None
if self.name != name:
return self._shallow_copy(name=name)
name = get_op_result_name(self, other)
if self.name != name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think might be able to dispense with this entirely, and just always do

from pandas.core.ops import get_op_result_name

....

return self._shallow_copy(name=get_op_result_name(self, other))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this in _wrap_setup_result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I eliminate self.get_setop_name_object and use _wrap_setop_result, other tests fail, because those tests assume the corner cases return the same object.

return self._shallow_copy(name=name)
return self

def union(self, other):
Expand Down Expand Up @@ -2683,10 +2679,10 @@ def union(self, other):
other = _ensure_index(other)

if len(other) == 0 or self.equals(other):
return self._get_consensus_name(other)
return self._get_reconciled_name_object(other)

if len(self) == 0:
return other._get_consensus_name(self)
return other._get_reconciled_name_object(self)

# TODO: is_dtype_union_equal is a hack around
# 1. buggy set ops with duplicates (GH #13432)
Expand Down Expand Up @@ -2749,11 +2745,10 @@ def union(self, other):
stacklevel=3)

# for subclasses
return self._wrap_union_result(other, result)
return self._wrap_setop_result(other, result)

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
return self.__class__(result, name=name)
def _wrap_setop_result(self, other, result):
return self._constructor(result, name=get_op_result_name(self, other))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would shallowcopy work here? Constructor isn’t used very often in the indexes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel repeating (part of) reply from a few months ago:

the issue comes down to whether we use self._shallow_copy() to return the result of set operations. When I tried using self._shallow_copy() in Index._wrap_setop_result(), and looked at what was happening, then CategoricalIndex was the problem. So it was easier to leave the existing implementation of Index._wrap_setop_result. If we were to handle #10186, then I could experiment more with changing the implementation of Index._wrap_setop_result.


def intersection(self, other):
"""
Expand Down Expand Up @@ -2783,7 +2778,7 @@ def intersection(self, other):
other = _ensure_index(other)

if self.equals(other):
return self._get_consensus_name(other)
return self._get_reconciled_name_object(other)

if not is_dtype_equal(self.dtype, other.dtype):
this = self.astype('O')
Expand All @@ -2803,7 +2798,7 @@ def intersection(self, other):
if self.is_monotonic and other.is_monotonic:
try:
result = self._inner_indexer(lvals, rvals)[0]
return self._wrap_union_result(other, result)
return self._wrap_setop_result(other, result)
except TypeError:
pass

Expand Down Expand Up @@ -4072,7 +4067,7 @@ def _join_monotonic(self, other, how='left', return_indexers=False):
return join_index

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
return Index(joined, name=name)

def _get_string_slice(self, key, use_lhs=True, use_rhs=True):
Expand Down
5 changes: 5 additions & 0 deletions pandas/core/indexes/category.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
import pandas.core.common as com
import pandas.core.missing as missing
import pandas.core.indexes.base as ibase
from pandas.core.ops import get_op_result_name
from pandas.core.arrays.categorical import Categorical, contains

_index_doc_kwargs = dict(ibase._index_doc_kwargs)
Expand Down Expand Up @@ -298,6 +299,10 @@ def itemsize(self):
# Size of the items in categories, not codes.
return self.values.itemsize

def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use _shallow_copy here, we never call _simple_new directly. (another reason why the super method should use _shallow_copy, but that's for new PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this method to Index.base? does it break anything?

return self._shallow_copy(result, name=name)

def get_values(self):
""" return the underlying data as an ndarray """
return self._data.get_values()
Expand Down
17 changes: 13 additions & 4 deletions pandas/core/indexes/datetimes.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@

from pandas.core.indexes.base import Index, _index_shared_docs
from pandas.core.indexes.numeric import Int64Index, Float64Index
from pandas.core.ops import get_op_result_name
import pandas.compat as compat
from pandas.tseries.frequencies import to_offset, get_period_alias, Resolution
from pandas.core.indexes.datetimelike import (
Expand Down Expand Up @@ -1018,6 +1019,10 @@ def union(self, other):
y : Index or DatetimeIndex
"""
self._assert_can_do_setop(other)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just call the super .union() here (after any conversion isneeded)? then you can inline the union_corner bizness

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if len(other) == 0 or self.equals(other) or len(self) == 0:
return super(DatetimeIndex, self).union(other)

if not isinstance(other, DatetimeIndex):
try:
other = DatetimeIndex(other)
Expand Down Expand Up @@ -1119,7 +1124,7 @@ def _maybe_utc_convert(self, other):
return this, other

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
if (isinstance(other, DatetimeIndex) and
self.freq == other.freq and
self._can_fast_union(other)):
Expand Down Expand Up @@ -1193,11 +1198,11 @@ def _fast_union(self, other):
end=max(left_end, right_end),
freq=left.freq)

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
if not timezones.tz_compare(self.tz, other.tz):
raise ValueError('Passed item and index have different timezone')
return self._simple_new(result, name=name, freq=None, tz=self.tz)
return self._shallow_copy(result, name=name, freq=None, tz=self.tz)

def intersection(self, other):
"""
Expand All @@ -1213,6 +1218,10 @@ def intersection(self, other):
y : Index or DatetimeIndex
"""
self._assert_can_do_setop(other)

if self.equals(other):
return self._get_reconciled_name_object(other)

if not isinstance(other, DatetimeIndex):
try:
other = DatetimeIndex(other)
Expand Down
3 changes: 2 additions & 1 deletion pandas/core/indexes/interval.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@
from pandas.core.indexes.base import (
Index, _ensure_index,
default_pprint, _index_shared_docs)
from pandas.core.ops import get_op_result_name

from pandas._libs import Timestamp, Timedelta
from pandas._libs.interval import (
Expand Down Expand Up @@ -1432,7 +1433,7 @@ def func(self, other):
raise TypeError(msg.format(op=op_name))

result = getattr(self._multiindex, op_name)(other._multiindex)
result_name = self.name if self.name == other.name else None
result_name = get_op_result_name(self, other)

# GH 19101: ensure empty results have correct dtype
if result.empty:
Expand Down
6 changes: 3 additions & 3 deletions pandas/core/indexes/numeric.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
from pandas.util._decorators import Appender, cache_readonly
import pandas.core.dtypes.concat as _concat
import pandas.core.indexes.base as ibase

from pandas.core.ops import get_op_result_name

_num_index_shared_docs = dict()

Expand Down Expand Up @@ -187,7 +187,7 @@ def _convert_scalar_indexer(self, key, kind=None):
._convert_scalar_indexer(key, kind=kind))

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
return Int64Index(joined, name=name)

@classmethod
Expand Down Expand Up @@ -264,7 +264,7 @@ def _convert_index_indexer(self, keyarr):
return keyarr

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
return UInt64Index(joined, name=name)

@classmethod
Expand Down
5 changes: 3 additions & 2 deletions pandas/core/indexes/period.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
pandas_dtype,
_ensure_object)
from pandas.core.dtypes.generic import ABCSeries
from pandas.core.ops import get_op_result_name

import pandas.tseries.frequencies as frequencies
from pandas.tseries.frequencies import get_freq_code as _gfc
Expand Down Expand Up @@ -767,8 +768,8 @@ def _assert_can_do_setop(self, other):
msg = DIFFERENT_FREQ_INDEX.format(self.freqstr, other.freqstr)
raise IncompatibleFrequency(msg)

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think could just remove this entirely here if _wrap_setop_result is defined in Index.base AND you move _apply_meta functionaily to inside of _shallow_copy for PI. (could also do this as a followup), but want to try to avoid duplicating _wrap_setup_result everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we can't move it to base from category is because in the case of category, we need to preserve the categories, and in the case of period, we need to preserve frequency.

So we could investigate in a separate PR about moving the _apply_meta() functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's fine

result = self._apply_meta(result)
result.name = name
return result
Expand Down
14 changes: 9 additions & 5 deletions pandas/core/indexes/range.py
Original file line number Diff line number Diff line change
Expand Up @@ -260,7 +260,8 @@ def tolist(self):
@Appender(_index_shared_docs['_shallow_copy'])
def _shallow_copy(self, values=None, **kwargs):
if values is None:
return RangeIndex(name=self.name, fastpath=True,
name = kwargs.get("name", self.name)
return RangeIndex(name=name, fastpath=True,
**dict(self._get_data_as_items()))
else:
kwargs.setdefault('name', self.name)
Expand Down Expand Up @@ -341,6 +342,10 @@ def intersection(self, other):
-------
intersection : Index
"""

if self.equals(other):
return self._get_reconciled_name_object(other)

if not isinstance(other, RangeIndex):
return super(RangeIndex, self).intersection(other)

Expand Down Expand Up @@ -421,10 +426,9 @@ def union(self, other):
union : Index
"""
self._assert_can_do_setop(other)
if len(other) == 0 or self.equals(other):
return self
if len(self) == 0:
return other
if len(other) == 0 or self.equals(other) or len(self) == 0:
return super(RangeIndex, self).union(other)

if isinstance(other, RangeIndex):
start_s, step_s = self._start, self._step
end_s = self._start + self._step * (len(self) - 1)
Expand Down
15 changes: 10 additions & 5 deletions pandas/core/indexes/timedeltas.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@
from pandas.core.base import _shared_docs
from pandas.core.indexes.base import _index_shared_docs
import pandas.core.common as com
from pandas.core.ops import get_op_result_name
import pandas.core.dtypes.concat as _concat
from pandas.util._decorators import Appender, Substitution, deprecate_kwarg
from pandas.core.indexes.datetimelike import (
Expand Down Expand Up @@ -497,6 +498,10 @@ def union(self, other):
y : Index or TimedeltaIndex
"""
self._assert_can_do_setop(other)

if len(other) == 0 or self.equals(other) or len(self) == 0:
return super(TimedeltaIndex, self).union(other)

if not isinstance(other, TimedeltaIndex):
try:
other = TimedeltaIndex(other)
Expand Down Expand Up @@ -529,7 +534,7 @@ def join(self, other, how='left', level=None, return_indexers=False,
sort=sort)

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
if (isinstance(other, TimedeltaIndex) and self.freq == other.freq and
self._can_fast_union(other)):
joined = self._shallow_copy(joined, name=name)
Expand Down Expand Up @@ -589,10 +594,6 @@ def _fast_union(self, other):
else:
return left

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
return self._simple_new(result, name=name, freq=None)

def intersection(self, other):
"""
Specialized intersection for TimedeltaIndex objects. May be much faster
Expand All @@ -607,6 +608,10 @@ def intersection(self, other):
y : Index or TimedeltaIndex
"""
self._assert_can_do_setop(other)

if self.equals(other):
return self._get_reconciled_name_object(other)

if not isinstance(other, TimedeltaIndex):
try:
other = TimedeltaIndex(other)
Expand Down
1 change: 1 addition & 0 deletions pandas/tests/indexes/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
tm.makeTimedeltaIndex(100),
tm.makeIntIndex(100),
tm.makeUIntIndex(100),
tm.makeRangeIndex(100),
tm.makeFloatIndex(100),
Index([True, False]),
tm.makeCategoricalIndex(100),
Expand Down
Loading