-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
fix inconsistent index naming with union/intersect #35847 #36413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/indexes/base.py
Outdated
|
||
def _wrap_setop_result(self, other, result): | ||
name = get_op_result_name(self, other) | ||
return self._shallow_copy(result, name=name) | ||
if isinstance(result, ABCIndexClass): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check will include Index objects but exclude subclasses. is that intentional? if so, can you add a comment about why
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll admit I don't actually understand this whole ABC business 100%, but I'm getting:
isinstance(pd.MultiIndex, pd.Index) -> False
isinstance(pd.MultiIndex, ABCIndexClass) -> True
issubclass(pd.MultiIndex, pd.Index) -> True
Should I be using issubclass
instead? Wasn't trying to do anything too fancy here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yah its not intuitive. basically isinstance(obj, ABCIndexClass)
is equivalent to type(obj) is Index
, so what you probably want is isinstance(obj, Index)
, which behaves like normal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, but wouldn't we get isinstance(multiindex, ABCIndexClass) == True
and type(multiindex) != pd.Index
here? Either way, agreed that isinstance
will get us what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my bad, i confused ABCIndexClass with ABCIndex, never mind.
pandas/core/indexes/base.py
Outdated
@@ -5926,3 +5931,22 @@ def _maybe_asobject(dtype, klass, data, copy: bool, name: Label, **kwargs): | |||
return index.astype(object) | |||
|
|||
return klass(data, dtype=dtype, copy=copy, name=name, **kwargs) | |||
|
|||
|
|||
def get_unanimous_names(*indexes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we ever get more than two here?
Would indexes
instead of *indexes
work>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, we use this in union_indexes
which operates on a list of indices. but most uses are explicitly two, so I figured this would make the typical syntax easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type the input and output
pandas/core/indexes/multi.py
Outdated
@@ -3415,7 +3420,7 @@ def union(self, other, sort=None): | |||
other, result_names = self._convert_can_do_setop(other) | |||
|
|||
if len(other) == 0 or self.equals(other): | |||
return self | |||
return self._shallow_copy(names=result_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.rename?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, makes sense. Certainly more legible, and seems like equivalent.
pandas/core/indexes/numeric.py
Outdated
@@ -180,7 +180,8 @@ def _union(self, other, sort): | |||
if needs_cast: | |||
first = self.astype("float") | |||
second = other.astype("float") | |||
return first._union(second, sort) | |||
result = first._union(second, sort) | |||
return Float64Index(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should already be a Float64Index. are there cases where it isnt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right. At some point I was allowing _union
to return just an array, so I ended up needing that. But now there's a _shallow_copy
at return, so it's redundant. Will revert.
|
||
the_union = idx.union(idx[:0], sort=sort) | ||
assert the_union is idx | ||
assert tm.equalContents(the_union, idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use tm.assert_index_equal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/core/indexes/base.py
Outdated
@@ -5926,3 +5931,22 @@ def _maybe_asobject(dtype, klass, data, copy: bool, name: Label, **kwargs): | |||
return index.astype(object) | |||
|
|||
return klass(data, dtype=dtype, copy=copy, name=name, **kwargs) | |||
|
|||
|
|||
def get_unanimous_names(*indexes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type the input and output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks really good @iamlemec
only small concern is that we are likely not fully validating whether we are returning a copy for the 0 len cases or where indexes are equal.
but certainly can do as a followup.
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -293,6 +293,7 @@ Indexing | |||
|
|||
- Bug in :meth:`PeriodIndex.get_loc` incorrectly raising ``ValueError`` on non-datelike strings instead of ``KeyError``, causing similar errors in :meth:`Series.__geitem__`, :meth:`Series.__contains__`, and :meth:`Series.loc.__getitem__` (:issue:`34240`) | |||
- Bug in :meth:`Index.sort_values` where, when empty values were passed, the method would break by trying to compare missing values instead of pushing them to the end of the sort order. (:issue:`35584`) | |||
- Harmonize resulting index names from :meth:`Index.union` and :meth:`Index.intersection` across various index types (:issue:`35847`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there are significant changes that a user would want to know can you add that detail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also migth warrant a sub-section if this is something easier to see via code-examples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, added in a bit about name preservation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right add the issue number above and remove this note
@@ -124,6 +124,65 @@ def test_corner_union(self, index, fname, sname, expected_name): | |||
expected = index.drop(index).set_names(expected_name) | |||
tm.assert_index_equal(union, expected) | |||
|
|||
# test copy.union(subset) - need sort for unicode and string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this another test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Ok, added in a code example to whatsnew showing the new/consistent naming behavior when aggregating. Also split off those new tests. Regarding copying in the various empty/equality cases, copy here means a deep copy including values, right? Seems like some of the tests are using "is" which breaks when you add in copying. I can look into that and the nan issues in a subsequent PR if that works. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc comment, ping on green.
Index/column name preservation when aggregating | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
When aggregating using :meth:`concat` or the :class:`DataFrame` constructor, Pandas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue reference here.
Also I would add the same text here in the docs, maybe in the reshaping section, in a note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, i ended up adding it as a note in the "merge, join, concatenate, and compare" section.
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -293,6 +293,7 @@ Indexing | |||
|
|||
- Bug in :meth:`PeriodIndex.get_loc` incorrectly raising ``ValueError`` on non-datelike strings instead of ``KeyError``, causing similar errors in :meth:`Series.__geitem__`, :meth:`Series.__contains__`, and :meth:`Series.loc.__getitem__` (:issue:`34240`) | |||
- Bug in :meth:`Index.sort_values` where, when empty values were passed, the method would break by trying to compare missing values instead of pushing them to the end of the sort order. (:issue:`35584`) | |||
- Harmonize resulting index names from :meth:`Index.union` and :meth:`Index.intersection` across various index types (:issue:`35847`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right add the issue number above and remove this note
pandas/core/indexes/datetimes.py
Outdated
|
||
res_name = get_unanimous_names(self, *others)[0] | ||
if this.name != res_name: | ||
return this._shallow_copy(name=res_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename instead of shallow_copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, indeed. i actually went through and made the same change in base.py
and api.py
. seems like it works fine all around.
4ebc5a6
to
47d3ba1
Compare
@jreback fixed the docs indentation issue, that's green now. i'm having trouble figuring out went wrong with some of the other tests. is it some kind of internal CI error? |
CI failures were unrelated, should be fixed if you merge master |
0ddb3f2
to
4206057
Compare
pandas/core/indexes/base.py
Outdated
@@ -5935,3 +5941,22 @@ def _maybe_asobject(dtype, klass, data, copy: bool, name: Label, **kwargs): | |||
return index.astype(object) | |||
|
|||
return klass(data, dtype=dtype, copy=copy, name=name, **kwargs) | |||
|
|||
|
|||
def get_unanimous_names(*indexes: Type[Index]) -> Tuple[Any, ...]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type[Index]
seems really weird here. Shouldnt it be Tuple[Index]
? And i think the return type should use Label instead of Any
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I was wondering if there was a Label
notion! And now I see there's a _typing.py
, will keep that handy in the future.
Yeah, I don't know why that extraneous ended up Type
there. But actually with an *args situation, you just give the type of the individual elements, so I think it should just be Index
.
|
||
if self.equals(other): | ||
return self._get_reconciled_name_object(other) | ||
|
||
if len(self) == 0: | ||
return self.copy() | ||
return self.copy()._get_reconciled_name_object(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the base class you got rid of the _get_reconciled_name_object
usage. why going the other direction here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still keeping this around for the rare case where you want to return the same values but with a possible name change. In base.py it only shows up in the one intersection
return clause (and not in union
because of the union
/_union
split). It is interesting though that the version of intersection
in datetimelike.py has two more quick return clauses than the base class.
I think we're mostly good here, but the travis-ci build for ARM just kinda bailed without comment. |
pandas/core/reshape/concat.py
Outdated
get_objs_combined_axis, | ||
) | ||
import pandas.core.indexes.base as ibase | ||
from pandas.core.indexes.base import get_unanimous_names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_unanimous_names is in indexes.api, should get it from there
@@ -471,7 +471,7 @@ def test_intersection_bug(self): | |||
def test_intersection_list(self): | |||
# GH#35876 | |||
values = [pd.Timestamp("2020-01-01"), pd.Timestamp("2020-02-01")] | |||
idx = pd.DatetimeIndex(values, name="a") | |||
idx = pd.DatetimeIndex(values) | |||
res = idx.intersection(values) | |||
tm.assert_index_equal(res, idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could keep the name in idx
and test for tm.assert_index_equal(res, idx.rename(None))
?
@@ -46,7 +46,7 @@ def test_join_level_corner_case(idx): | |||
|
|||
def test_join_self(idx, join_type): | |||
joined = idx.join(idx, how=join_type) | |||
assert idx is joined | |||
assert tm.equalContents(joined, idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does assert_index_equal not work here?
@@ -278,7 +278,7 @@ def test_intersection(idx, sort): | |||
|
|||
# corner case, pass self | |||
the_int = idx.intersection(idx, sort=sort) | |||
assert the_int is idx | |||
assert tm.equalContents(the_int, idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert_index_equal?
Thanks @jbrockmendel! Pushed all your suggested changes. |
@iamlemec can you merge master; IIRC this was pretty close to ready |
cedd79c
to
17254c4
Compare
@jbrockmendel sure thing, just pushed a rebase onto master |
thanks @iamlemec very nice! |
thanks for all the help @jreback and @jbrockmendel! |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This takes care of some inconsistency in how names are handled by
Index
functionsunion
andintersection
, as discussed in #35847. I believe this covers all index types, either through the base class or in the subclass when necessary.What I've implemented here actually uses the unanimous convention, wherein all input names must match to get assigned to the output. Originally, I was thinking consensus would be better (assign if there is only one non-None input name), but looking through the existing tests, it seems that unanimous was usually expected. I also had some worries about whether index names would become too "contagious" with consensus. Anyway, it's easy to change between the two if people have strong opinions on this.