BUG: Index.union() drops Index component for duplicate Index elements not sorted #36289
Closed
3 tasks done
Comments
Similar, but not a duplicate. When both indexes contain the same elements, the error occurs in a different place than in the other example.
HyukjinKwon pushed a commit to apache/spark that referenced this issue on Aug 9, 2021

### What changes were proposed in this pull request?

This PR proposes fixing `Index.union` to follow the behavior of pandas 1.3.

Before:

```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2], dtype='int64')
```

After:

```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype='int64')
```

This bug is fixed in pandas-dev/pandas#36289.

### Why are the changes needed?

We should follow the behavior of pandas as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, the result for some cases that have duplicate values will change.

### How was this patch tested?

Unit test.

Closes #33634 from itholic/SPARK-36369.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit to apache/spark that referenced this issue on Aug 27, 2021

This PR proposes fixing `Index.union` to follow the behavior of pandas 1.3.

Before:

```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2], dtype='int64')
```

After:

```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype='int64')
```

This bug is fixed in pandas-dev/pandas#36289.

We should follow the behavior of pandas as much as possible.

Yes, the result for some cases that have duplicate values will change.

Unit test.

Closes #33634 from itholic/SPARK-36369.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit a9f371c)
Signed-off-by: Hyukjin Kwon <[email protected]>
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] (optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Problem description
In the fourth call, `idx2.union(idx1)`, one of the `0`s is dropped. This means `union()` depends on the call order and on the index order, which seems buggy.

Expected Output

`Int64Index([0, 0, 1], dtype='int64')` for every print statement.

Output of `pd.show_versions()`
master
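
The expected behavior described above (the same result regardless of call order or element order) can be illustrated without pandas as a sorted multiset union, where each value appears as many times as in whichever input holds more copies of it. This is only a sketch of the expected semantics, not pandas' implementation: `expected_union` is a hypothetical helper, and the inputs `[0, 0, 1]` and `[1, 0]` are assumed for illustration because the original code sample is not preserved here.

```python
from collections import Counter


def expected_union(left, right):
    """Sorted multiset union: each value is kept max(count_left, count_right) times.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    # Counter's `|` operator keeps the maximum count per key.
    merged = Counter(left) | Counter(right)
    return sorted(merged.elements())


# Assumed example inputs (not taken from the original report).
a = [0, 0, 1]
b = [1, 0]

print(expected_union(a, b))  # [0, 0, 1]
print(expected_union(b, a))  # [0, 0, 1]
```

Under this rule the Spark example quoted in the referenced commit, `[1, 1, 1, 1, 1, 2, 2]` and `[1, 1, 2, 2, 2, 2, 2]`, yields five `1`s and five `2`s, matching the "After" output shown there.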