-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
TST: Add test for union with duplicates #40967
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
phofl
commented
Apr 15, 2021
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
pandas/core/algorithms.py
Outdated
left values which is ordered in front. | ||
rvals: np.ndarray | ||
rvals: ArrayLike |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we're getting EA here, then the np.append call below is likely going to do something unwanted like cast to object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pd_array casts back in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, but object casting is costly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah got you now.
pandas/core/indexes/base.py
Outdated
# "Union[ExtensionArray, ndarray]"; expected "ndarray" | ||
# error: Argument 2 to "union_with_duplicates" has incompatible type | ||
# "Union[ExtensionArray, ndarray]"; expected "ndarray" | ||
result = algos.union_with_duplicates(lvals, rvals) # type: ignore[arg-type] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we should only get here with ndarray[object] for both. if that can be confirmed, then can just do a cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately with categorical we get there with an EA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so i think the right way to handle this will be to override ExtensionIndex._union (or maybe just CI._union)
|
pandas/core/algorithms.py
Outdated
right values ordered after lvals. | ||
|
||
Returns | ||
------- | ||
np.ndarray containing the unsorted union of both arrays | ||
ArrayLike containing the unsorted union of both arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im finding myself getting here on another branch. i think the pd_array call is causing problems. In the case I'm looking at, lvals and rvals are both all-string Categoricals, so the pd_array call gives a StringArray, which gums up the works later in Index.union
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we say that we should not get here with an extension dtype except categorical, we could remove the pd.array call?
Alternatively if we overwrite the categorical _union, this would be solved too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively if we overwrite the categorical _union, this would be solved too?
I think we do this on ExtensionIndex, that way we can do cast(np.ndarray, lvals)
in Index._union
. All ExtensionIndex subclasses other than CategoricalIndex already override _union, so it should be equivalent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so i think the following works for CategoricalIndex._union
def _union(self, other: CategoricalIndex, sort) -> CategoricalIndex:
# we only get here with matching dtypes
lidx = self.astype(self.categories.dtype)
ridx = other.astype(other.categories.dtype)
result = lidx._union(ridx, sort=sort)
cat = Categorical(result, dtype=self.dtype)
return type(self)._simple_new(cat, name=self.name)
(this might also make it possible to remove a Categorical kludge in Index._wrap_setop_result
)
The downside here is that the astype both makes a copy and loses potentially-cached is_monotonic/has_duplicates/is_unique information. It may be better to make a method just for _union_non_unique and only override that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i take it back, that _union messes up when self.dtype.order is non-standard
…p_union � Conflicts: � pandas/core/algorithms.py � pandas/core/indexes/base.py
Thx @jbrockmendel Could add a test additionally but no typing anymore |
thanks @phofl |