[BUG]: Fix ValueError in concat() when at least one Index has duplicates #36290
Conversation
also pls merge master

# Conflicts:
#	doc/source/whatsnew/v1.2.0.rst
#	pandas/tests/reshape/test_concat.py

@jreback merged master
Moved it to algorithms and added a benchmark. I looked through indexing and the index-associated modules but could not find anything which would help here. The check for uniqueness is not necessary from a technical standpoint, but it should save some time for unique indices.

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
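The uniqueness fast path described here can be sketched with public `Index` APIs (hypothetical helper name; the PR's actual implementation lives in `pandas/core/algorithms.py`):

```python
import pandas as pd

def drop_left_dups_from_right(left: pd.Index, right: pd.Index) -> pd.Index:
    # Fast path: a unique `left` has no duplicates to reconcile,
    # so the masking work below can be skipped entirely.
    if left.is_unique:
        return right
    # Values occurring more than once in `left` ...
    left_dups = left[left.duplicated()].unique()
    # ... keep only their first occurrence in `right`.
    mask = right.duplicated() & right.isin(left_dups)
    return right[~mask]

print(drop_left_dups_from_right(pd.Index([0, 0, 1]), pd.Index([0, 0, 0, 1])))
```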
@phofl can you resolve merge conflicts |
# Conflicts:
#	doc/source/whatsnew/v1.2.0.rst
#	pandas/core/algorithms.py
#	pandas/tests/reshape/test_concat.py

Done
pandas/core/reshape/concat.py
Outdated
# duplicate or duplicates again
if not obj_labels.is_unique:
    new_labels = algos.make_duplicates_of_left_unique_in_right(
        obj_labels.values, new_labels.values
use np.asarray instead of .values
Thx, done
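For context on the `np.asarray` suggestion: on extension dtypes, `.values` returns an ExtensionArray rather than an ndarray, while `np.asarray` always yields a plain ndarray. A minimal illustration, using a nullable-integer Series as the stand-in:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2], dtype="Int64")  # nullable extension dtype
print(type(s.values))       # a pandas ExtensionArray, not an ndarray
print(type(np.asarray(s)))  # always a numpy.ndarray
```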
pandas/core/algorithms.py
Outdated
@@ -2149,3 +2149,21 @@ def _sort_tuples(values: np.ndarray[tuple]):
    arrays, _ = to_arrays(values, None)
    indexer = lexsort_indexer(arrays, orders=True)
    return values[indexer]


def make_duplicates_of_left_unique_in_right(left, right) -> np.ndarray:
can you type the input args.
can you add an example here.
is this unit tested?
Added unit tests (forgot them, obviously) and typed the inputs. Example is below
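A standalone sketch of the helper's semantics using only public APIs (the PR itself uses pandas' internal `unique`/`duplicated` helpers from `pandas.core.algorithms`):

```python
import numpy as np
import pandas as pd

def make_duplicates_of_left_unique_in_right(
    left: np.ndarray, right: np.ndarray
) -> np.ndarray:
    # values appearing more than once in `left`
    left_dups = pd.unique(left[pd.Series(left).duplicated().to_numpy()])
    # drop repeat occurrences of those values from `right`,
    # keeping only their first appearance
    mask = pd.Series(right).duplicated().to_numpy() & np.isin(right, left_dups)
    return right[~mask]

left = np.array([1, 1, 2, 3])
right = np.array([1, 1, 1, 2, 2, 3])
print(make_duplicates_of_left_unique_in_right(left, right))  # [1 2 2 3]
```

Values that are unique in `left` (like 2 here) keep all their occurrences in `right`; only values duplicated in `left` are collapsed to a single occurrence.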
    {"a": [1, 1, 2, 3, np.nan, 4], "b": [6, 7, 8, 8, 9, np.nan]},
    index=Index([0, 0, 1, 1, 3, 4]),
)
tm.assert_frame_equal(result, expected)
extra file added below
Moved the test if that is what you are referring to
Example: We have the 2 DataFrames
When looping over
while
Simply reindexing would return an indexer which would duplicate the pair
which leads us to the desired reindexing result
pandas/tests/reshape/test_concat.py seems to be added here (i think we moved this), can you remove; pls merge master and ping on green.
cc @jbrockmendel if comments.
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -567,6 +567,7 @@ Reshaping
- Bug in :meth:`DataFrame.combine_first()` caused wrong alignment with dtype ``string`` and one level of ``MultiIndex`` containing only ``NA`` (:issue:`37591`)
- Fixed regression in :func:`merge` on merging DatetimeIndex with empty DataFrame (:issue:`36895`)
- Bug in :meth:`DataFrame.apply` not setting index of return value when ``func`` return type is ``dict`` (:issue:`37544`)
- Bug in :func:`concat` resulted in a ``ValueError`` when at least one of both inputs had a non unique index (:issue:`36263`)
non-unique
Done
pandas/core/algorithms.py
Outdated
Duplicates of left are unique in right
"""
left_duplicates = unique(left[duplicated(left)])
return right[~(duplicated(right) & np.isin(right, left_duplicates))]
any reason to prefer np.isin vs the algos.isin?
Not that I remember, changed it.
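One practical difference, shown here via the public `Series.isin` (which dispatches to `algos.isin`): the hashtable-based pandas version matches NaN against NaN, while `np.isin` compares with elementwise `==` and misses it:

```python
import numpy as np
import pandas as pd

vals = np.array([1.0, np.nan])
print(np.isin(vals, [np.nan]))         # NaN != NaN under elementwise ==
print(pd.Series(vals).isin([np.nan]))  # hashtable lookup treats NaN as equal
```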
Parameters
----------
left: ndarray
right: ndarray
dtypes unrestricted?
Could not think of any reason why they should be restricted.
result = concat([df1, df2], axis=1)
expected = DataFrame(
    {"a": [1, 1, 2, 3, np.nan, 4], "b": [6, 7, 8, 8, 9, np.nan]},
    index=Index([0, 0, 1, 1, 3, 4]),
just to make sure i understand this, for any df1 and df2, we want result.index to always satisfy:

vc = result.index.value_counts()
vc1 = df1.index.value_counts()
vc2 = df2.index.value_counts()
vc1b = vc1.reindex(vc.index, fill_value=0)
vc2b = vc2.reindex(vc.index, fill_value=0)

We expect vc to be the pointwise maximum of vc1b and vc2b?
Yes, exactly. That's perfectly on point.
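Under that reading, an expected concat index can be derived directly as the pointwise maximum of per-label counts (a sketch with illustrative indexes, since the test's df1/df2 aren't fully quoted here; `Index.union` sorts the labels):

```python
import numpy as np
import pandas as pd

idx1 = pd.Index([0, 0, 1, 4])     # illustrative duplicated index
idx2 = pd.Index([0, 0, 1, 1, 3])  # illustrative duplicated index

vc1 = idx1.value_counts()
vc2 = idx2.value_counts()
labels = vc1.index.union(vc2.index)
# pointwise maximum of the per-label counts
vc = np.maximum(vc1.reindex(labels, fill_value=0),
                vc2.reindex(labels, fill_value=0))
expected = vc.index.repeat(vc.to_numpy())
print(list(expected))  # [0, 0, 1, 1, 3, 4]
```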
@@ -2149,3 +2149,23 @@ def _sort_tuples(values: np.ndarray[tuple]):
    arrays, _ = to_arrays(values, None)
    indexer = lexsort_indexer(arrays, orders=True)
    return values[indexer]


def make_duplicates_of_left_unique_in_right(
is this related to or useful for the index.union-with-duplicates stuff?
If you pass in the union as left and right, you would get the distinct result. Have to take a look if we can use this.
pandas/core/algorithms.py
Outdated
) -> np.ndarray:
    """
    Drops all duplicates values from left in right, so that they are
    unique in right.
The code itself looks good, but this sentence isn't clear to a reader without context
Improved it?
Removed the test_concat file; it probably got in through merging master the last time

thanks @phofl very nice

Thx, will try to use this for the Index.union with duplicates problems
… duplicates (pandas-dev#36290)" This reverts commit b32febd.
reverted here: #38654
* Revert "[BUG]: Fix ValueError in concat() when at least one Index has duplicates (pandas-dev#36290)" This reverts commit b32febd.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
If the obj_labels Index has duplicates and they are not removed from new_labels before reindexing, they are multiplied, so we would get an index that is way too big.
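The multiplication described here shows up in how a non-unique Index builds an indexer: each occurrence of a duplicated target label matches every source position of that label (illustrative values):

```python
import pandas as pd

idx = pd.Index([0, 0, 1])
# Each of the two target 0s matches both source positions of label 0,
# so the indexer has 5 entries for a 3-label target.
indexer = idx.get_indexer_for([0, 0, 1])
print(list(indexer))
```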