-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[BUG]: Fix ValueError in concat() when at least one Index has duplicates #36290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 8 commits
6105102
7c9b8f5
b4c3b77
071be85
7129ced
9fcf794
9cce3ff
d0c4ea5
a96b262
2448326
e21939f
23344a2
a4e1851
354048d
ab8b03b
6fd368d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2149,3 +2149,21 @@ def _sort_tuples(values: np.ndarray[tuple]): | |
arrays, _ = to_arrays(values, None) | ||
indexer = lexsort_indexer(arrays, orders=True) | ||
return values[indexer] | ||
|
||
|
||
def make_duplicates_of_left_unique_in_right(left, right) -> np.ndarray: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you type the input args. can you add an example here. is this unit tested? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added unittests, forgot them obviously, and typed the inputs. Example is below |
||
""" | ||
Drops all duplicates values from left in right, so that they are | ||
unique in right. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The code itself looks good, but this sentence isn't clear to a reader without context There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Improved it? |
||
|
||
Parameters | ||
---------- | ||
left: ndarray | ||
right: ndarray | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. dtypes unrestricted? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could not think of anything, why they should be restricted. |
||
|
||
Returns | ||
------- | ||
Duplicates of left are unique in right | ||
""" | ||
left_duplicates = unique(left[duplicated(left)]) | ||
return right[~(duplicated(right) & np.isin(right, left_duplicates))] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. any reason to prefer np.isin vs the algos.isin? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not that i remember, changed it. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,6 +13,7 @@ | |
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries | ||
from pandas.core.dtypes.missing import isna | ||
|
||
import pandas.core.algorithms as algos | ||
from pandas.core.arrays.categorical import ( | ||
factorize_from_iterable, | ||
factorize_from_iterables, | ||
|
@@ -501,6 +502,13 @@ def get_result(self): | |
# 1-ax to convert BlockManager axis to DataFrame axis | ||
obj_labels = obj.axes[1 - ax] | ||
if not new_labels.equals(obj_labels): | ||
# We have to remove the duplicates from obj_labels | ||
phofl marked this conversation as resolved.
Show resolved
Hide resolved
|
||
# in new labels to make them unique, otherwise we would | ||
# duplicate or duplicates again | ||
if not obj_labels.is_unique: | ||
new_labels = algos.make_duplicates_of_left_unique_in_right( | ||
obj_labels.values, new_labels.values | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thx, done |
||
) | ||
indexers[ax] = obj_labels.reindex(new_labels)[1] | ||
|
||
mgrs_indexers.append((obj._mgr, indexers)) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -556,3 +556,15 @@ def test_concat_preserves_extension_int64_dtype(): | |
result = pd.concat([df_a, df_b], ignore_index=True) | ||
expected = DataFrame({"a": [-1, None], "b": [None, 1]}, dtype="Int64") | ||
tm.assert_frame_equal(result, expected) | ||
|
||
|
||
def test_concat_duplicate_indexes(): | ||
# GH#36263 ValueError with non unique indexes | ||
df1 = DataFrame([1, 2, 3, 4], index=[0, 1, 1, 4], columns=["a"]) | ||
df2 = DataFrame([6, 7, 8, 9], index=[0, 0, 1, 3], columns=["b"]) | ||
result = concat([df1, df2], axis=1) | ||
expected = DataFrame( | ||
{"a": [1, 1, 2, 3, np.nan, 4], "b": [6, 7, 8, 8, 9, np.nan]}, | ||
index=Index([0, 0, 1, 1, 3, 4]), | ||
) | ||
tm.assert_frame_equal(result, expected) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. extra file added below There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Moved the test if that is what you are referring to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-unique
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done