BUG: Index.union() drops Index component for duplicate Index elements not sorted #36289

Closed
3 tasks done
phofl opened this issue Sep 11, 2020 · 2 comments · Fixed by #36299
Labels
Bug Index Related to the Index class or subclasses
@phofl
Member

phofl commented Sep 11, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

idx1 = pd.Index([0, 0, 1])
idx2 = pd.Index([0, 1])

print(idx1.union(idx2))
# Int64Index([0, 0, 1], dtype='int64')
print(idx2.union(idx1))
# Int64Index([0, 0, 1], dtype='int64')

idx1 = pd.Index([1, 0, 0])

print(idx1.union(idx2))
# Int64Index([0, 0, 1], dtype='int64')
print(idx2.union(idx1))
# Int64Index([0, 1], dtype='int64')

Problem description

In the fourth call, idx2.union(idx1), one of the 0s is dropped. This means the result of union() depends on both the call order and the element order of the index, which looks like a bug.

Expected Output

Int64Index([0, 0, 1], dtype='int64') for every print statement
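
To make the expected output concrete without relying on any particular pandas version, here is a minimal pure-Python sketch. It assumes the intended semantics are a sorted multiset union, where each value appears as many times as its larger count in either input. The helper name `expected_union` is hypothetical and is not part of pandas; this is a reference for the expectation, not the library's implementation.

```python
from collections import Counter


def expected_union(left, right):
    """Hypothetical reference: sorted multiset union of two lists.

    Counter's | operator keeps, for each value, the larger of the two counts.
    """
    merged = Counter(left) | Counter(right)
    return sorted(merged.elements())


unsorted_idx = [1, 0, 0]  # values of idx1 in the second half of the report
short_idx = [0, 1]        # values of idx2

# Under this assumption the result is the same in both call directions.
print(expected_union(unsorted_idx, short_idx))  # [0, 0, 1]
print(expected_union(short_idx, unsorted_idx))  # [0, 0, 1]
```

Under this assumption, neither the call direction nor the element order of the inputs changes the result, which is the property the fourth call in the report violates.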

Output of pd.show_versions()

master

@phofl phofl added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 11, 2020
@dsaxton dsaxton added Index Related to the Index class or subclasses and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 11, 2020
@dsaxton
Member

dsaxton commented Sep 11, 2020

Agreed this seems like a bug @phofl. This may be a duplicate of #31326.

@phofl
Member Author

phofl commented Sep 11, 2020

Similar, but not a duplicate. When both indexes contain the same elements, the error occurs in a different place than in the other issue's example.

@jreback jreback added this to the 1.3 milestone Feb 15, 2021
HyukjinKwon pushed a commit to apache/spark that referenced this issue Aug 9, 2021
### What changes were proposed in this pull request?

This PR proposes fixing the `Index.union` to follow the behavior of pandas 1.3.

Before:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2], dtype='int64')
```

After:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype='int64')
```

This bug is fixed in pandas-dev/pandas#36289.
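
As a quick way to see what the old result was missing, the two outputs quoted above can be compared by per-value counts. This is only an illustrative sketch on the literal outputs from this message; it does not call pandas or pandas-on-Spark.

```python
from collections import Counter

# Outputs quoted in the "Before" and "After" examples of this message.
before = [1, 1, 1, 1, 1, 2, 2]
after = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

# Counter subtraction keeps only positive differences, so this is
# the multiset of elements present in the new result but not the old one.
missing = Counter(after) - Counter(before)
print(dict(missing))  # {2: 3}
```

The old result carried only the counts of the left operand, so the three extra 2s from the right operand were lost.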

### Why are the changes needed?

We should follow the behavior of pandas as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, the result will change for some cases that have duplicate values.

### How was this patch tested?

Unit test.

Closes #33634 from itholic/SPARK-36369.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit to apache/spark that referenced this issue Aug 27, 2021
(cherry picked from commit a9f371c)
Signed-off-by: Hyukjin Kwon <[email protected]>