Skip to content

BUG: names on union and intersection for Index were inconsistent (GH9943 GH9862) #19849

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Nov 6, 2018
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -837,6 +837,7 @@ Indexing
- Bug in :func:`IntervalIndex.symmetric_difference` where the symmetric difference with a non-``IntervalIndex`` did not raise (:issue:`18475`)
- Bug in :class:`IntervalIndex` where set operations that returned an empty ``IntervalIndex`` had the wrong dtype (:issue:`19101`)
- Bug in :meth:`DataFrame.drop_duplicates` where no ``KeyError`` is raised when passing in columns that don't exist on the ``DataFrame`` (issue:`19726`)
- Bug in :func:`Index.union` and :func:`Index.intersection` where name of the ``Index`` of the result was not computed correctly for certain cases (:issue:`9943`, :issue:`9862`)


MultiIndex
Expand Down
67 changes: 40 additions & 27 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,7 @@
import pandas.core.algorithms as algos
import pandas.core.sorting as sorting
from pandas.io.formats.printing import pprint_thing
from pandas.core.ops import make_invalid_op
from pandas.core.ops import make_invalid_op, get_op_result_name
from pandas.core.config import get_option
from pandas.core.strings import StringMethods

Expand Down Expand Up @@ -1191,7 +1191,7 @@ def _convert_can_do_setop(self, other):
other = Index(other, name=self.name)
result_name = self.name
else:
result_name = self.name if self.name == other.name else None
result_name = get_op_result_name(self, other)
return other, result_name

def _convert_for_op(self, value):
Expand Down Expand Up @@ -1240,9 +1240,9 @@ def set_names(self, names, level=None, inplace=False):
Examples
--------
>>> Index([1, 2, 3, 4]).set_names('foo')
Int64Index([1, 2, 3, 4], dtype='int64')
Int64Index([1, 2, 3, 4], dtype='int64', name='foo')
>>> Index([1, 2, 3, 4]).set_names(['foo'])
Int64Index([1, 2, 3, 4], dtype='int64')
Int64Index([1, 2, 3, 4], dtype='int64', name='foo')
>>> idx = MultiIndex.from_tuples([(1, u'one'), (1, u'two'),
(2, u'one'), (2, u'two')],
names=['foo', 'bar'])
Expand Down Expand Up @@ -2263,21 +2263,37 @@ def __or__(self, other):
def __xor__(self, other):
return self.symmetric_difference(other)

def _get_consensus_name(self, other):
def _get_reconciled_name_object(self, other):
"""
Given 2 indexes, give a consensus name meaning
we take the not None one, or None if the names differ.
Return a new object if we are resetting the name
If the result of a set operation will be self,
return self, unless the name changes, in which
case make a shallow copy of self.
"""
if self.name != other.name:
if self.name is None or other.name is None:
name = self.name or other.name
else:
name = None
if self.name != name:
return self._shallow_copy(name=name)
name = get_op_result_name(self, other)
if self.name != name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think might be able to dispense with this entirely, and just always do

from pandas.core.ops import get_op_result_name

....

return self._shallow_copy(name=get_op_result_name(self, other))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this in _wrap_setup_result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I eliminate self.get_setop_name_object and use _wrap_setop_result, other tests fail, because those tests assume the corner cases return the same object.

return self._shallow_copy(name=name)
return self

def _union_corner_case(self, other):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you are creatijng this as a separate method? why not inline it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for inlining it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback @TomAugspurger The subclasses for DateTimeIndex, RangeIndex and TimeDeltaIndex need to call this method from their union implementations. Should I inline it there too and have repeated code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this PR looks a lot better. I still don't like this hlper function. can we refactor to not use it

"""
If self or other have no length, or self and other
are the same, then return self, after reconciling names

Returns
-------
Tuple of (is_corner, result), where is_corner is True if
it is a corner case, and result is the reconciled result

"""
# GH 9943 9862
if len(other) == 0 or self.equals(other):
return (True, self._get_reconciled_name_object(other))

if len(self) == 0:
return (True, other._get_reconciled_name_object(self))

return (False, None)

def union(self, other):
"""
Form the union of two Index objects and sorts if possible.
Expand All @@ -2302,11 +2318,9 @@ def union(self, other):
self._assert_can_do_setop(other)
other = _ensure_index(other)

if len(other) == 0 or self.equals(other):
return self._get_consensus_name(other)

if len(self) == 0:
return other._get_consensus_name(self)
is_corner_case, corner_result = self._union_corner_case(other)
if is_corner_case:
return corner_result

# TODO: is_dtype_union_equal is a hack around
# 1. buggy set ops with duplicates (GH #13432)
Expand Down Expand Up @@ -2369,11 +2383,10 @@ def union(self, other):
stacklevel=3)

# for subclasses
return self._wrap_union_result(other, result)
return self._wrap_setop_result(other, result)

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
return self.__class__(result, name=name)
def _wrap_setop_result(self, other, result):
return self.__class__(result, name=get_op_result_name(self, other))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self._shallow_copy this propagates meta-data, allows you to remove some sublcass implementations of this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I've done some testing with self.shallow_copy() and it broke things with respect to CategoricalIndex, and then the discussion in #10186 becomes relevant, as we'd need to decide what to do when taking the union of two categorical indexes that have different categories. So I'm going to leave this implementation as it was.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats fine

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we consistently use self.__class__ or type(self) in this module? (I prefer the latter.

can you use ._shallow_copy() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no other usage of self.__class__ in this module. There is a _constructor() method that is the same as type(self), so I have switched to using that.

We had discussion 4 months ago about why _shallow_copy() doesn't work here.


def intersection(self, other):
"""
Expand Down Expand Up @@ -2403,7 +2416,7 @@ def intersection(self, other):
other = _ensure_index(other)

if self.equals(other):
return self._get_consensus_name(other)
return self._get_reconciled_name_object(other)

if not is_dtype_equal(self.dtype, other.dtype):
this = self.astype('O')
Expand All @@ -2423,7 +2436,7 @@ def intersection(self, other):
if self.is_monotonic and other.is_monotonic:
try:
result = self._inner_indexer(lvals, rvals)[0]
return self._wrap_union_result(other, result)
return self._wrap_setop_result(other, result)
except TypeError:
pass

Expand Down Expand Up @@ -3552,7 +3565,7 @@ def _join_monotonic(self, other, how='left', return_indexers=False):
return join_index

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
return Index(joined, name=name)

def _get_string_slice(self, key, use_lhs=True, use_rhs=True):
Expand Down
7 changes: 7 additions & 0 deletions pandas/core/indexes/category.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
import pandas.core.base as base
import pandas.core.missing as missing
import pandas.core.indexes.base as ibase
from pandas.core.ops import get_op_result_name

_index_doc_kwargs = dict(ibase._index_doc_kwargs)
_index_doc_kwargs.update(dict(target_klass='CategoricalIndex'))
Expand Down Expand Up @@ -300,6 +301,12 @@ def itemsize(self):
# Size of the items in categories, not codes.
return self.values.itemsize

def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use _shallow_copy here, we never call _simple_new directly. (another reason why the super method should use _shallow_copy, but that's for new PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this method to Index.base? does it break anything?

return self._simple_new(result, name=name,
categories=self.categories,
ordered=self.ordered)

def get_values(self):
""" return the underlying data as an ndarray """
return self._data.get_values()
Expand Down
16 changes: 13 additions & 3 deletions pandas/core/indexes/datetimes.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@

from pandas.core.indexes.base import Index, _index_shared_docs
from pandas.core.indexes.numeric import Int64Index, Float64Index
from pandas.core.ops import get_op_result_name
import pandas.compat as compat
from pandas.tseries.frequencies import to_offset, get_period_alias, Resolution
from pandas.core.indexes.datetimelike import (
Expand Down Expand Up @@ -1136,6 +1137,11 @@ def union(self, other):
y : Index or DatetimeIndex
"""
self._assert_can_do_setop(other)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just call the super .union() here (after any conversion isneeded)? then you can inline the union_corner bizness

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

is_corner_case, corner_result = self._union_corner_case(other)
if is_corner_case:
return corner_result

if not isinstance(other, DatetimeIndex):
try:
other = DatetimeIndex(other)
Expand Down Expand Up @@ -1237,7 +1243,7 @@ def _maybe_utc_convert(self, other):
return this, other

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
if (isinstance(other, DatetimeIndex) and
self.offset == other.offset and
self._can_fast_union(other)):
Expand Down Expand Up @@ -1333,8 +1339,8 @@ def __iter__(self):
box="timestamp")
return iter(converted)

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
if not timezones.tz_compare(self.tz, other.tz):
raise ValueError('Passed item and index have different timezone')
return self._simple_new(result, name=name, freq=None, tz=self.tz)
Expand All @@ -1353,6 +1359,10 @@ def intersection(self, other):
y : Index or DatetimeIndex
"""
self._assert_can_do_setop(other)

if self.equals(other):
return self._get_reconciled_name_object(other)

if not isinstance(other, DatetimeIndex):
try:
other = DatetimeIndex(other)
Expand Down
3 changes: 2 additions & 1 deletion pandas/core/indexes/interval.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
from pandas.core.indexes.base import (
Index, _ensure_index,
default_pprint, _index_shared_docs)
from pandas.core.ops import get_op_result_name

from pandas._libs import Timestamp, Timedelta
from pandas._libs.interval import (
Expand Down Expand Up @@ -1351,7 +1352,7 @@ def func(self, other):
raise TypeError(msg.format(op=op_name))

result = getattr(self._multiindex, op_name)(other._multiindex)
result_name = self.name if self.name == other.name else None
result_name = get_op_result_name(self, other)

# GH 19101: ensure empty results have correct dtype
if result.empty:
Expand Down
6 changes: 3 additions & 3 deletions pandas/core/indexes/numeric.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
from pandas.util._decorators import Appender, cache_readonly
import pandas.core.dtypes.concat as _concat
import pandas.core.indexes.base as ibase

from pandas.core.ops import get_op_result_name

_num_index_shared_docs = dict()

Expand Down Expand Up @@ -187,7 +187,7 @@ def _convert_scalar_indexer(self, key, kind=None):
._convert_scalar_indexer(key, kind=kind))

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
return Int64Index(joined, name=name)

@classmethod
Expand Down Expand Up @@ -264,7 +264,7 @@ def _convert_index_indexer(self, keyarr):
return keyarr

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
return UInt64Index(joined, name=name)

@classmethod
Expand Down
5 changes: 3 additions & 2 deletions pandas/core/indexes/period.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
_ensure_object)
from pandas.core.dtypes.dtypes import PeriodDtype
from pandas.core.dtypes.generic import ABCSeries
from pandas.core.ops import get_op_result_name

import pandas.tseries.frequencies as frequencies
from pandas.tseries.frequencies import get_freq_code as _gfc
Expand Down Expand Up @@ -996,8 +997,8 @@ def _assert_can_do_setop(self, other):
msg = _DIFFERENT_FREQ_INDEX.format(self.freqstr, other.freqstr)
raise IncompatibleFrequency(msg)

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think could just remove this entirely here if _wrap_setop_result is defined in Index.base AND you move _apply_meta functionaily to inside of _shallow_copy for PI. (could also do this as a followup), but want to try to avoid duplicating _wrap_setup_result everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we can't move it to base from category is because in the case of category, we need to preserve the categories, and in the case of period, we need to preserve frequency.

So we could investigate in a separate PR about moving the _apply_meta() functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's fine

result = self._apply_meta(result)
result.name = name
return result
Expand Down
15 changes: 10 additions & 5 deletions pandas/core/indexes/range.py
Original file line number Diff line number Diff line change
Expand Up @@ -256,7 +256,8 @@ def tolist(self):
@Appender(_index_shared_docs['_shallow_copy'])
def _shallow_copy(self, values=None, **kwargs):
if values is None:
return RangeIndex(name=self.name, fastpath=True,
name = kwargs.get("name", self.name)
return RangeIndex(name=name, fastpath=True,
**dict(self._get_data_as_items()))
else:
kwargs.setdefault('name', self.name)
Expand Down Expand Up @@ -337,6 +338,10 @@ def intersection(self, other):
-------
intersection : Index
"""

if self.equals(other):
return self._get_reconciled_name_object(other)

if not isinstance(other, RangeIndex):
return super(RangeIndex, self).intersection(other)

Expand Down Expand Up @@ -417,10 +422,10 @@ def union(self, other):
union : Index
"""
self._assert_can_do_setop(other)
if len(other) == 0 or self.equals(other):
return self
if len(self) == 0:
return other
is_corner_case, corner_result = self._union_corner_case(other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

if is_corner_case:
return corner_result

if isinstance(other, RangeIndex):
start_s, step_s = self._start, self._step
end_s = self._start + self._step * (len(self) - 1)
Expand Down
16 changes: 11 additions & 5 deletions pandas/core/indexes/timedeltas.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
from pandas.core.base import _shared_docs
from pandas.core.indexes.base import _index_shared_docs
import pandas.core.common as com
from pandas.core.ops import get_op_result_name
import pandas.core.dtypes.concat as _concat
from pandas.util._decorators import Appender, Substitution, deprecate_kwarg
from pandas.core.indexes.datetimelike import TimelikeOps, DatetimeIndexOpsMixin
Expand Down Expand Up @@ -545,6 +546,11 @@ def union(self, other):
y : Index or TimedeltaIndex
"""
self._assert_can_do_setop(other)

is_corner_case, corner_result = self._union_corner_case(other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here

if is_corner_case:
return corner_result

if not isinstance(other, TimedeltaIndex):
try:
other = TimedeltaIndex(other)
Expand Down Expand Up @@ -577,7 +583,7 @@ def join(self, other, how='left', level=None, return_indexers=False,
sort=sort)

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
if (isinstance(other, TimedeltaIndex) and self.freq == other.freq and
self._can_fast_union(other)):
joined = self._shallow_copy(joined, name=name)
Expand Down Expand Up @@ -637,10 +643,6 @@ def _fast_union(self, other):
else:
return left

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
return self._simple_new(result, name=name, freq=None)

def intersection(self, other):
"""
Specialized intersection for TimedeltaIndex objects. May be much faster
Expand All @@ -655,6 +657,10 @@ def intersection(self, other):
y : Index or TimedeltaIndex
"""
self._assert_can_do_setop(other)

if self.equals(other):
return self._get_reconciled_name_object(other)

if not isinstance(other, TimedeltaIndex):
try:
other = TimedeltaIndex(other)
Expand Down
Loading