Skip to content

TST: Add test for union with duplicates #40967

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 20, 2021
Merged

Conversation

phofl
Copy link
Member

@phofl phofl commented Apr 15, 2021

  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them

left values which is ordered in front.
rvals: np.ndarray
rvals: ArrayLike
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're getting EA here, then the np.append call below is likely going to do something unwanted like cast to object

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd_array casts back in this case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, but object casting is costly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah got you now.

# "Union[ExtensionArray, ndarray]"; expected "ndarray"
# error: Argument 2 to "union_with_duplicates" has incompatible type
# "Union[ExtensionArray, ndarray]"; expected "ndarray"
result = algos.union_with_duplicates(lvals, rvals) # type: ignore[arg-type]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should only get here with ndarray[object] for both. if that can be confirmed, then can just do a cast.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately with categorical we get there with an EA

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so i think the right way to handle this will be to override ExtensionIndex._union (or maybe just CI._union)

@jreback jreback added the Typing type annotations, mypy/pyright type checking label Apr 16, 2021
@jreback jreback added this to the 1.3 milestone Apr 16, 2021
@jreback
Copy link
Contributor

jreback commented Apr 16, 2021

pandas/core/indexes/base.py:3003: error: unused 'type: ignore' comment

right values ordered after lvals.

Returns
-------
np.ndarray containing the unsorted union of both arrays
ArrayLike containing the unsorted union of both arrays
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im finding myself getting here on another branch. i think the pd_array call is causing problems. In the case I'm looking at, lvals and rvals are both all-string Categoricals, so the pd_array call gives a StringArray, which gums up the works later in Index.union

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we say that we should not get here with an extension dtype except categorical, we could remove the pd.array call?

Alternatively if we overwrite the categorical _union, this would be solved too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively if we overwrite the categorical _union, this would be solved too?

I think we do this on ExtensionIndex, that way we can do cast(np.ndarray, lvals) in Index._union. All ExtensionIndex subclasses other than CategoricalIndex already override _union, so it should be equivalent

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so i think the following works for CategoricalIndex._union

    def _union(self, other: CategoricalIndex, sort) -> CategoricalIndex:
        # we only get here with matching dtypes
        lidx = self.astype(self.categories.dtype)
        ridx = other.astype(other.categories.dtype)
        result = lidx._union(ridx, sort=sort)
        cat = Categorical(result, dtype=self.dtype)
        return type(self)._simple_new(cat, name=self.name)

(this might also make it possible to remove a Categorical kludge in Index._wrap_setop_result)

The downside here is that the astype both makes a copy and loses potentially-cached is_monotonic/has_duplicates/is_unique information. It may be better to make a method just for _union_non_unique and only override that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i take it back, that _union messes up when self.dtype.order is non-standard

phofl added 2 commits April 20, 2021 22:18
…p_union

� Conflicts:
�	pandas/core/algorithms.py
�	pandas/core/indexes/base.py
@phofl
Copy link
Member Author

phofl commented Apr 20, 2021

Thx @jbrockmendel

Could add a test additionally but no typing anymore

@phofl phofl changed the title Fix typing for union_with_duplicates TST: Add test for union with duplicates Apr 20, 2021
@phofl phofl added Testing pandas testing functions or related to the test suite and removed Typing type annotations, mypy/pyright type checking labels Apr 20, 2021
@jreback jreback merged commit 8b1430d into pandas-dev:master Apr 20, 2021
@jreback
Copy link
Contributor

jreback commented Apr 20, 2021

thanks @phofl

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request Apr 21, 2021
@phofl phofl deleted the typ_union branch April 21, 2021 20:40
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants