[BUG]: Fix ValueError in concat() when at least one Index has duplicates #36290

phofl · 2020-09-11T15:27:36Z

closes pd.concat() crashes if dataframe contains duplicate indices but not df.join() #36263
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

If the obj_labels Index has duplicates and they are not removed from new_labels before reindexing, they are multiplied, so we would get an index that is far too big.

pandas/core/reshape/concat.py

jreback · 2020-09-13T23:17:09Z

also pls merge master

Conflicts: doc/source/whatsnew/v1.2.0.rst, pandas/tests/reshape/test_concat.py

phofl · 2020-09-13T23:29:07Z

@jreback merged master

pandas/core/reshape/concat.py

phofl · 2020-10-04T20:23:31Z

Moved it to algorithms and added a benchmark. I looked through indexing and the index-related modules but could not find anything that would help here. The uniqueness check is not technically necessary, but it should save some time for unique indices.

github-actions · 2020-11-04T00:10:28Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

simonjayhawkins · 2020-11-12T14:41:14Z

@phofl can you resolve merge conflicts

Conflicts: doc/source/whatsnew/v1.2.0.rst, pandas/core/algorithms.py, pandas/tests/reshape/test_concat.py

phofl · 2020-11-12T23:59:10Z

Done

jreback · 2020-11-13T13:19:08Z

pandas/core/reshape/concat.py

+                        # duplicate or duplicates again
+                        if not obj_labels.is_unique:
+                            new_labels = algos.make_duplicates_of_left_unique_in_right(
+                                obj_labels.values, new_labels.values


use np.asarray instead of .values

jreback · 2020-11-13T13:19:32Z

pandas/core/algorithms.py

@@ -2149,3 +2149,21 @@ def _sort_tuples(values: np.ndarray[tuple]):
    arrays, _ = to_arrays(values, None)
    indexer = lexsort_indexer(arrays, orders=True)
    return values[indexer]
+
+
+def make_duplicates_of_left_unique_in_right(left, right) -> np.ndarray:


can you type the input args.

can you add an example here.

is this unit tested?

Added unittests, forgot them obviously, and typed the inputs. Example is below

jreback · 2020-11-13T13:19:47Z

pandas/tests/reshape/concat/test_concat.py

+        {"a": [1, 1, 2, 3, np.nan, 4], "b": [6, 7, 8, 8, 9, np.nan]},
+        index=Index([0, 0, 1, 1, 3, 4]),
+    )
+    tm.assert_frame_equal(result, expected)


extra file added below

Moved the test if that is what you are referring to

phofl · 2020-11-13T19:47:46Z

Example:

We have the 2 DataFrames

df1 = DataFrame([1, 2, 3, 4], index=[0, 1, 1, 4], columns=["a"])
df2 = DataFrame([6, 7, 8, 9], index=[0, 0, 1, 3], columns=["b"])

When looping over df1 (the df2 case is similar), new_labels before calling make_duplicates_of_left_unique_in_right will be

new_labels=Index([0, 0, 1, 1, 3, 4])

while obj_labels will be

obj_labels=Index([0, 1, 1, 4])

Simply reindexing would return an indexer that duplicates the pair [1, 1] again, taking every 1 twice: np.array([0, 0, 1, 2, 1, 2, -1, 3]).
The new function removes the repeated 1 from new_labels.

make_duplicates_of_left_unique_in_right returns

new_labels=Index([0, 0, 1, 3, 4])

which leads us to the desired reindexing result np.array([0, 0, 1, 2, -1, 3])
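The indexer arithmetic above can be checked with a small pure-Python sketch. The helper `indexer_non_unique` is hypothetical: it only mimics the behaviour described in this comment (every matching position for each target label, -1 when the label is missing) and is not the pandas implementation.

```python
def indexer_non_unique(labels, target):
    """For each target label, list every position in labels that matches it.

    Hypothetical helper: emits -1 for a label that is missing from labels.
    """
    positions = {}
    for pos, label in enumerate(labels):
        positions.setdefault(label, []).append(pos)

    indexer = []
    for label in target:
        indexer.extend(positions.get(label, [-1]))
    return indexer


obj_labels = [0, 1, 1, 4]

# Union labels before the fix: both 1s in the target match both 1s in df1.
multiplied = indexer_non_unique(obj_labels, [0, 0, 1, 1, 3, 4])
print(multiplied)  # [0, 0, 1, 2, 1, 2, -1, 3]

# Labels after dropping the repeated 1: the rows labelled 1 are taken once.
fixed = indexer_non_unique(obj_labels, [0, 0, 1, 3, 4])
print(fixed)  # [0, 0, 1, 2, -1, 3]
```

The second indexer has six entries, matching the six rows of the expected result.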

jreback

pandas/tests/reshape/test_concat.py seems to be added here (i think we moved this), can you remove; pls merge master and ping on green.

cc @jbrockmendel if comments.

jbrockmendel · 2020-11-18T19:08:22Z

doc/source/whatsnew/v1.2.0.rst

@@ -567,6 +567,7 @@ Reshaping
 - Bug in :meth:`DataFrame.combine_first()` caused wrong alignment with dtype ``string`` and one level of ``MultiIndex`` containing only ``NA`` (:issue:`37591`)
 - Fixed regression in :func:`merge` on merging DatetimeIndex with empty DataFrame (:issue:`36895`)
 - Bug in :meth:`DataFrame.apply` not setting index of return value when ``func`` return type is ``dict`` (:issue:`37544`)
+- Bug in :func:`concat` resulted in a ``ValueError`` when at least one of both inputs had a non unique index (:issue:`36263`)


jbrockmendel · 2020-11-18T19:09:21Z

pandas/core/algorithms.py

+    Duplicates of left are unique in right
+    """
+    left_duplicates = unique(left[duplicated(left)])
+    return right[~(duplicated(right) & np.isin(right, left_duplicates))]


any reason to prefer np.isin vs the algos.isin?

Not that i remember, changed it.
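For readers without the pandas internals at hand, the two lines in the diff can be mirrored in plain Python. This is a sketch under assumptions: `duplicated` here is a hypothetical stand-in that marks every occurrence after the first, and plain lists replace the ndarrays used in the PR.

```python
def duplicated(values):
    """Mark every occurrence of a value after its first one."""
    seen = set()
    mask = []
    for value in values:
        mask.append(value in seen)
        seen.add(value)
    return mask


def make_duplicates_of_left_unique_in_right(left, right):
    """Drop repeats in right of any value that is duplicated in left."""
    left_duplicates = {v for v, dup in zip(left, duplicated(left)) if dup}
    return [
        value
        for value, dup in zip(right, duplicated(right))
        if not (dup and value in left_duplicates)
    ]


result = make_duplicates_of_left_unique_in_right([0, 1, 1, 4], [0, 0, 1, 1, 3, 4])
print(result)  # [0, 0, 1, 3, 4]
```

Note that the repeated 0 survives: it is duplicated only in right, not in left, so it is left alone.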

jbrockmendel · 2020-11-18T19:09:28Z

pandas/core/algorithms.py

+    Parameters
+    ----------
+    left: ndarray
+    right: ndarray


dtypes unrestricted?

Could not think of any reason why they should be restricted.

jbrockmendel · 2020-11-18T19:23:44Z

pandas/tests/reshape/concat/test_dataframe.py

+        result = concat([df1, df2], axis=1)
+        expected = DataFrame(
+            {"a": [1, 1, 2, 3, np.nan, 4], "b": [6, 7, 8, 8, 9, np.nan]},
+            index=Index([0, 0, 1, 1, 3, 4]),


just to make sure i understand this, for any df1 and df2, we want result.index to always satisfy:

vc = result.index.value_counts()
vc1 = df1.index.value_counts()
vc2 = df2.index.value_counts()
vc1b = vc1.reindex(vc.index, fill_value=0)
vc2b = vc2.reindex(vc.index, fill_value=0)

We expect vc to be the pointwise maximum of vc1b and vc2b?

Yes, exactly. That's perfectly on point.
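The invariant agreed on here can be illustrated without pandas, using the index values from the test above. This is only a sketch: `collections.Counter` stands in for `value_counts`, and its `|` operator keeps the maximum count per key, which is the pointwise maximum described.

```python
from collections import Counter

df1_index = [0, 1, 1, 4]
df2_index = [0, 0, 1, 3]
result_index = [0, 0, 1, 1, 3, 4]  # index of the expected frame in the test

vc1 = Counter(df1_index)
vc2 = Counter(df2_index)

# Counter's | operator keeps the larger count for every label.
expected_counts = vc1 | vc2

print(Counter(result_index) == expected_counts)  # True
```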

jbrockmendel · 2020-11-18T19:24:43Z

pandas/core/algorithms.py

@@ -2149,3 +2149,23 @@ def _sort_tuples(values: np.ndarray[tuple]):
    arrays, _ = to_arrays(values, None)
    indexer = lexsort_indexer(arrays, orders=True)
    return values[indexer]
+
+
+def make_duplicates_of_left_unique_in_right(


is this related to or useful for the index.union-with-duplicates stuff?

If you pass in the union as left and right, you would get the distinct result. Have to take a look if we can use this.
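A quick way to see the claim about passing the union as both arguments is the sketch below. The function name and the single-pass formulation are assumptions made for illustration; it follows the same rule as the helper in this PR (drop a repeat in right only when the value is duplicated in left) but is not the pandas code.

```python
def drop_left_duplicates_in_right(left, right):
    """Hypothetical single-pass version of the helper discussed here."""
    left_seen, left_duplicates = set(), set()
    for value in left:
        if value in left_seen:
            left_duplicates.add(value)
        left_seen.add(value)

    right_seen, kept = set(), []
    for value in right:
        # Skip only repeats of values that are duplicated in left.
        if not (value in right_seen and value in left_duplicates):
            kept.append(value)
        right_seen.add(value)
    return kept


union = [0, 0, 1, 1, 3, 4]
distinct = drop_left_duplicates_in_right(union, union)
print(distinct)  # [0, 1, 3, 4]
```

When left and right are the same array, every repeated value is also a left duplicate, so only first occurrences remain.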

jbrockmendel · 2020-11-18T19:26:04Z

pandas/core/algorithms.py

+) -> np.ndarray:
+    """
+    Drops all duplicates values from left in right, so that they are
+    unique in right.


The code itself looks good, but this sentence isn't clear to a reader without context

Improved it?

phofl · 2020-11-18T20:51:39Z

Removed the test_concat file, probably got in through merging master the last time

jreback · 2020-11-19T19:01:13Z

thanks @phofl very nice

phofl · 2020-11-19T19:24:01Z

Thx, will try to use this for the Index.union with duplicates problems

* Revert "[BUG]: Fix ValueError in concat() when at least one Index has duplicates (#36290)" This reverts commit b32febd.

jreback · 2020-12-24T15:09:44Z

reverted here: #38654

Fix crash in concat if non unique index

6105102

phofl mentioned this pull request Sep 11, 2020

BUG: Index.union() inconsistent with non-unique Indexes #36299

Merged

6 tasks

jreback requested changes Sep 13, 2020

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Bug labels Sep 13, 2020

jreback added this to the 1.2 milestone Sep 13, 2020

Merge branch 'master' of https://github.com/pandas-dev/pandas into 36263

7c9b8f5

Conflicts: doc/source/whatsnew/v1.2.0.rst, pandas/tests/reshape/test_concat.py

jreback requested changes Sep 19, 2020

phofl added 3 commits October 4, 2020 21:01

Creta function in algos

b4c3b77

Merge branch 'master' of https://github.com/pandas-dev/pandas into 36263

071be85

Add benchmark

7129ced

github-actions bot added the Stale label Nov 4, 2020

phofl added 2 commits November 13, 2020 00:57

Merge branch 'master' of https://github.com/pandas-dev/pandas into 36263

9fcf794

Conflicts: doc/source/whatsnew/v1.2.0.rst, pandas/core/algorithms.py, pandas/tests/reshape/test_concat.py

Add test in different file

9cce3ff

Fix pattern

d0c4ea5

simonjayhawkins removed the Stale label Nov 13, 2020

jreback requested changes Nov 13, 2020

phofl added 2 commits November 13, 2020 20:29

Use asarray

a96b262

Merge branch 'master' of https://github.com/pandas-dev/pandas into 36263

2448326

phofl added 2 commits November 13, 2020 20:48

Adress review comments

e21939f

Delete brackets

23344a2

jreback requested changes Nov 18, 2020

jbrockmendel reviewed Nov 18, 2020

phofl added 4 commits November 18, 2020 21:39

Change whatsnew

a4e1851

Remove file

354048d

Add comment

ab8b03b

Merge branch 'master' of https://github.com/pandas-dev/pandas into 36263

6fd368d

jreback approved these changes Nov 19, 2020

jreback merged commit b32febd into pandas-dev:master Nov 19, 2020

phofl deleted the 36263 branch November 19, 2020 19:24

ivirshup mentioned this pull request Dec 22, 2020

BUG: concat on axis with both different and duplicate labels raising error #6963

Closed

ivirshup added a commit to ivirshup/pandas that referenced this pull request Dec 23, 2020

Revert "[BUG]: Fix ValueError in concat() when at least one Index has…

a998162

… duplicates (pandas-dev#36290)" This reverts commit b32febd.

jreback mentioned this pull request Dec 24, 2020

[BUG] Concat duplicates errors (or lack there of) #38654

Merged

6 tasks

jreback pushed a commit that referenced this pull request Dec 24, 2020

[BUG] Concat duplicates errors (or lack there of) (#38654)

fa478d3

* Revert "[BUG]: Fix ValueError in concat() when at least one Index has duplicates (#36290)" This reverts commit b32febd.

ivirshup mentioned this pull request Dec 30, 2020

API/ ENH: Unambiguous indexing should be allowed, even if duplicates are present #38797

Open
