Skip to content

BUG: names on union and intersection for Index were inconsistent (GH9943 GH9862) #19849

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Nov 6, 2018
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -837,6 +837,7 @@ Indexing
- Bug in :func:`IntervalIndex.symmetric_difference` where the symmetric difference with a non-``IntervalIndex`` did not raise (:issue:`18475`)
- Bug in :class:`IntervalIndex` where set operations that returned an empty ``IntervalIndex`` had the wrong dtype (:issue:`19101`)
- Bug in :meth:`DataFrame.drop_duplicates` where no ``KeyError`` is raised when passing in columns that don't exist on the ``DataFrame`` (issue:`19726`)
- Bug in :func:`Index.union` where resulting names were not computed correctly for certain cases (:issue:`9943`, :issue:`9862`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the only user facing change? e.g. intersection as well? 'resulting names' is not vey clear, maybe 'the name of the Index of the union'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now changed as you suggested. intersection was computing names correctly despite the original example reported in #9943.



MultiIndex
Expand Down
39 changes: 17 additions & 22 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,7 @@
import pandas.core.algorithms as algos
import pandas.core.sorting as sorting
from pandas.io.formats.printing import pprint_thing
from pandas.core.ops import make_invalid_op
from pandas.core.ops import make_invalid_op, get_op_result_name
from pandas.core.config import get_option
from pandas.core.strings import StringMethods

Expand Down Expand Up @@ -1191,7 +1191,7 @@ def _convert_can_do_setop(self, other):
other = Index(other, name=self.name)
result_name = self.name
else:
result_name = self.name if self.name == other.name else None
result_name = get_op_result_name(self, other)
return other, result_name

def _convert_for_op(self, value):
Expand Down Expand Up @@ -2263,19 +2263,15 @@ def __or__(self, other):
def __xor__(self, other):
return self.symmetric_difference(other)

def _get_consensus_name(self, other):
def _get_setop_name_object(self, other):
"""
Given 2 indexes, give a consensus name meaning
we take the not None one, or None if the names differ.
Return a new object if we are resetting the name
Given 2 indexes, give a setop name and object, meaning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead why can't we just use _wrap_setop_result? instead of having 2 nearly identical funtions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I think this is needed (and I'll rename it and document it accordingly) in the case you are computing i1.union(i2), where i1.equals(i2)==True. In this case, you don't want to use _shallow_copy() because you want to return i1, unless the names of i1 and i2 are different. There are some tests for joins where if you are doing a join and the index is the same, then you can just return the index, rather than returning a brand new index, which is what _shallow_copy() would do.

I'll also check the other Index subclasses to see if they are doing this right. For example, for RangeIndex, it is not handling the names correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think we do this elsewhere as well (e.g. when you reindex to yourself). can you add docmentation about this to the setops themselves. @TomAugspurger @jorisvandenbossche what do you think about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference, but it'd be good to settle on consistent behavior between union and intersection (I think). What's the proposal here?

  1. named & same name -> name
  2. named & other name -> None?
  3. named & unnamed -> name?
  4. unnamed & named -> name?
  5. unamed & unnamed -> None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger not what I was talking about, the current logic is correct (and shouldn't change), what you have is right. This was more about whether if we have identical objects we should shallow copy or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad.

I don't have a strong preference, but I would tend towards always shallow copying? I hope people aren't relying on (idx | idx) is idx being true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger With respect to your naming, that is not what will happen. In the cases of named and unnamed, the result will be None. That's because if you do

j1 = pd.Index([1,2], name='j1')
j2 = pd.Index([], name='j2')
j3 = pd.Index([], name='j3')

then j1.union(j2).union(j3) returns Int64Index([1, 2], dtype='int64', name='j3'), while just changing the order to j3.union(j1).union(j2) returns Int64Index([1, 2], dtype='int64', name='j2'). So the behavior will be to only preserve the names if they are equal, and the result of these two union examples will have None as the name.

@jreback I can change to shallow copying, but I'll have to modify some other tests that are depending on the result not being a shallow copy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think its better to _shallow_copy these, yes I understand some tests may change because of this (IOW they will no longer be is, just .equals)

we use get_op_result_name to return the name, and then
return a new object if we are resetting the name
"""
if self.name != other.name:
if self.name is None or other.name is None:
name = self.name or other.name
else:
name = None
if self.name != name:
return self._shallow_copy(name=name)
name = get_op_result_name(self, other)
if self.name != name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think might be able to dispense with this entirely, and just always do

from pandas.core.ops import get_op_result_name

....

return self._shallow_copy(name=get_op_result_name(self, other))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this in _wrap_setup_result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I eliminate self.get_setop_name_object and use _wrap_setop_result, other tests fail, because those tests assume the corner cases return the same object.

return self._shallow_copy(name=name)
return self

def union(self, other):
Expand Down Expand Up @@ -2303,10 +2299,10 @@ def union(self, other):
other = _ensure_index(other)

if len(other) == 0 or self.equals(other):
return self._get_consensus_name(other)
return self._get_setop_name_object(other)

if len(self) == 0:
return other._get_consensus_name(self)
return other._get_setop_name_object(self)

# TODO: is_dtype_union_equal is a hack around
# 1. buggy set ops with duplicates (GH #13432)
Expand Down Expand Up @@ -2369,11 +2365,10 @@ def union(self, other):
stacklevel=3)

# for subclasses
return self._wrap_union_result(other, result)
return self._wrap_setop_result(other, result)

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
return self.__class__(result, name=name)
def _wrap_setop_result(self, other, result):
return self.__class__(result, name=get_op_result_name(self, other))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self._shallow_copy this propagates meta-data, allows you to remove some sublcass implementations of this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I've done some testing with self.shallow_copy() and it broke things with respect to CategoricalIndex, and then the discussion in #10186 becomes relevant, as we'd need to decide what to do when taking the union of two categorical indexes that have different categories. So I'm going to leave this implementation as it was.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats fine

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we consistently use self.__class__ or type(self) in this module? (I prefer the latter.

can you use ._shallow_copy() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no other usage of self.__class__ in this module. There is a _constructor() method that is the same as type(self), so I have switched to using that.

We had discussion 4 months ago about why _shallow_copy() doesn't work here.


def intersection(self, other):
"""
Expand Down Expand Up @@ -2403,7 +2398,7 @@ def intersection(self, other):
other = _ensure_index(other)

if self.equals(other):
return self._get_consensus_name(other)
return self._get_setop_name_object(other)

if not is_dtype_equal(self.dtype, other.dtype):
this = self.astype('O')
Expand All @@ -2423,7 +2418,7 @@ def intersection(self, other):
if self.is_monotonic and other.is_monotonic:
try:
result = self._inner_indexer(lvals, rvals)[0]
return self._wrap_union_result(other, result)
return self._wrap_setop_result(other, result)
except TypeError:
pass

Expand Down Expand Up @@ -3552,7 +3547,7 @@ def _join_monotonic(self, other, how='left', return_indexers=False):
return join_index

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
return Index(joined, name=name)

def _get_string_slice(self, key, use_lhs=True, use_rhs=True):
Expand Down
7 changes: 4 additions & 3 deletions pandas/core/indexes/datetimes.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@

from pandas.core.indexes.base import Index, _index_shared_docs
from pandas.core.indexes.numeric import Int64Index, Float64Index
from pandas.core.ops import get_op_result_name
import pandas.compat as compat
from pandas.tseries.frequencies import to_offset, get_period_alias, Resolution
from pandas.core.indexes.datetimelike import (
Expand Down Expand Up @@ -1237,7 +1238,7 @@ def _maybe_utc_convert(self, other):
return this, other

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
if (isinstance(other, DatetimeIndex) and
self.offset == other.offset and
self._can_fast_union(other)):
Expand Down Expand Up @@ -1333,8 +1334,8 @@ def __iter__(self):
box="timestamp")
return iter(converted)

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
if not timezones.tz_compare(self.tz, other.tz):
raise ValueError('Passed item and index have different timezone')
return self._simple_new(result, name=name, freq=None, tz=self.tz)
Expand Down
3 changes: 2 additions & 1 deletion pandas/core/indexes/interval.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
from pandas.core.indexes.base import (
Index, _ensure_index,
default_pprint, _index_shared_docs)
from pandas.core.ops import get_op_result_name

from pandas._libs import Timestamp, Timedelta
from pandas._libs.interval import (
Expand Down Expand Up @@ -1351,7 +1352,7 @@ def func(self, other):
raise TypeError(msg.format(op=op_name))

result = getattr(self._multiindex, op_name)(other._multiindex)
result_name = self.name if self.name == other.name else None
result_name = get_op_result_name(self, other)

# GH 19101: ensure empty results have correct dtype
if result.empty:
Expand Down
6 changes: 3 additions & 3 deletions pandas/core/indexes/numeric.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
from pandas.util._decorators import Appender, cache_readonly
import pandas.core.dtypes.concat as _concat
import pandas.core.indexes.base as ibase

from pandas.core.ops import get_op_result_name

_num_index_shared_docs = dict()

Expand Down Expand Up @@ -187,7 +187,7 @@ def _convert_scalar_indexer(self, key, kind=None):
._convert_scalar_indexer(key, kind=kind))

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
return Int64Index(joined, name=name)

@classmethod
Expand Down Expand Up @@ -264,7 +264,7 @@ def _convert_index_indexer(self, keyarr):
return keyarr

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
return UInt64Index(joined, name=name)

@classmethod
Expand Down
5 changes: 3 additions & 2 deletions pandas/core/indexes/period.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
_ensure_object)
from pandas.core.dtypes.dtypes import PeriodDtype
from pandas.core.dtypes.generic import ABCSeries
from pandas.core.ops import get_op_result_name

import pandas.tseries.frequencies as frequencies
from pandas.tseries.frequencies import get_freq_code as _gfc
Expand Down Expand Up @@ -996,8 +997,8 @@ def _assert_can_do_setop(self, other):
msg = _DIFFERENT_FREQ_INDEX.format(self.freqstr, other.freqstr)
raise IncompatibleFrequency(msg)

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think could just remove this entirely here if _wrap_setop_result is defined in Index.base AND you move _apply_meta functionaily to inside of _shallow_copy for PI. (could also do this as a followup), but want to try to avoid duplicating _wrap_setup_result everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we can't move it to base from category is because in the case of category, we need to preserve the categories, and in the case of period, we need to preserve frequency.

So we could investigate in a separate PR about moving the _apply_meta() functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's fine

result = self._apply_meta(result)
result.name = name
return result
Expand Down
7 changes: 4 additions & 3 deletions pandas/core/indexes/timedeltas.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
from pandas.core.base import _shared_docs
from pandas.core.indexes.base import _index_shared_docs
import pandas.core.common as com
from pandas.core.ops import get_op_result_name
import pandas.core.dtypes.concat as _concat
from pandas.util._decorators import Appender, Substitution, deprecate_kwarg
from pandas.core.indexes.datetimelike import TimelikeOps, DatetimeIndexOpsMixin
Expand Down Expand Up @@ -577,7 +578,7 @@ def join(self, other, how='left', level=None, return_indexers=False,
sort=sort)

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
if (isinstance(other, TimedeltaIndex) and self.freq == other.freq and
self._can_fast_union(other)):
joined = self._shallow_copy(joined, name=name)
Expand Down Expand Up @@ -637,8 +638,8 @@ def _fast_union(self, other):
else:
return left

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove this one

return self._simple_new(result, name=name, freq=None)

def intersection(self, other):
Expand Down
51 changes: 44 additions & 7 deletions pandas/tests/indexes/test_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -730,7 +730,15 @@ def test_union(self):
union = Index([]).union(first)
assert union is first

# preserve names
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make new tests

# preserve names only when they are the same
# GH 9943 9862

first = Index(list('ab'), name='A')
second = Index(list('abc'), name='A')
union = first.union(second)
expected = Index(list('abc'), name='A')
tm.assert_index_equal(union, expected)

first = Index(list('ab'), name='A')
second = Index(list('ab'), name='B')
union = first.union(second)
Expand All @@ -752,37 +760,66 @@ def test_union(self):
first = Index(list('ab'))
second = Index(list('ab'), name='B')
union = first.union(second)
expected = Index(list('ab'), name='B')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make new tests

expected = Index(list('ab'), name=None)
tm.assert_index_equal(union, expected)

# GH 9943 9862
first = Index(list('abc'))
second = Index(list('ab'), name='B')
union = first.union(second)
expected = Index(list('abc'), name=None)
tm.assert_index_equal(union, expected)

first = Index([])
second = Index(list('ab'), name='B')
union = first.union(second)
expected = Index(list('ab'), name='B')
expected = Index(list('ab'), name=None)
tm.assert_index_equal(union, expected)

first = Index(list('ab'))
second = Index([], name='B')
union = first.union(second)
expected = Index(list('ab'), name='B')
expected = Index(list('ab'), name=None)
tm.assert_index_equal(union, expected)

first = Index(list('ab'), name='A')
second = Index(list('ab'))
union = first.union(second)
expected = Index(list('ab'), name='A')
expected = Index(list('ab'), name=None)
tm.assert_index_equal(union, expected)

# GH 9943 9862
first = Index(list('ab'), name='A')
second = Index(list('abc'))
union = first.union(second)
expected = Index(list('abc'), name=None)
tm.assert_index_equal(union, expected)

first = Index(list('ab'), name='A')
second = Index([])
union = first.union(second)
expected = Index(list('ab'), name='A')
expected = Index(list('ab'), name=None)
tm.assert_index_equal(union, expected)

first = Index([], name='A')
second = Index(list('ab'))
union = first.union(second)
expected = Index(list('ab'), name='A')
expected = Index(list('ab'), name=None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Not sure what you mean by "make new tests". The issue with the existing tests is that they were incorrect. So I had to fix them. For example,

first = Index([], name='A')
second = Index(list('ab'))
union = first.union(second)
expected = Index(list('ab'), name='A')

is incorrect, as if one index has a name, and the other has no name, the result should be None, based on your comment #9943 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that these tests are getting very long, so split up a bit.

We also need more general testing, all of these tests should be using a generically created index (all of this machinery already exists ia .create_index())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've split the test_union test, and now have the corner cases tested for all indexes

tm.assert_index_equal(union, expected)

# Chained unions handles names correctly
i1 = Index([1, 2], name='i1')
i2 = Index([3, 4], name='i2')
i3 = Index([5, 6], name='i3')
union = i1.union(i2.union(i3))
expected = i1.union(i2).union(i3)
tm.assert_index_equal(union, expected)

j1 = Index([1, 2], name='j1')
j2 = Index([], name='j2')
j3 = Index([], name='j3')
union = j1.union(j2.union(j3))
expected = j1.union(j2).union(j3)
tm.assert_index_equal(union, expected)

with tm.assert_produces_warning(RuntimeWarning):
Expand Down