REGR: concat on index with duplicate labels fails #35238

Closed
jorisvandenbossche opened this issue Jul 11, 2020 · 10 comments · Fixed by #35277
Labels
Blocker Blocking issue or pull request for an upcoming release Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Comments

@jorisvandenbossche
Member

Noticed from failing geopandas tests:

import pandas as pd

df1 = pd.DataFrame({'a': [0, 1]})
df2 = pd.DataFrame({'b': [0, 1, 2]}, index=[0, 0, 1])
pd.concat([df1, df2], axis=1)

this started failing recently on master:

In [3]: pd.concat([df1, df2], axis=1) 
...
~/scipy/pandas/pandas/core/internals/managers.py in _verify_integrity(self)
    314         for block in self.blocks:
    315             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 316                 raise construction_error(tot_items, block.shape[1:], self.axes)
    317         if len(self.items) != tot_items:
    318             raise AssertionError(

ValueError: Shape of passed values is (3, 2), indices imply (2, 2)

While the expected result is (with automatic reindex of df1 because concat aligns on the index):

In [2]: pd.concat([df1, df2], axis=1)
Out[2]: 
   a  b
0  0  0
0  0  1
1  1  2
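
For reference, the expected frame can be built without going through `pd.concat` at all. This is only a sketch of the alignment described above, assuming that "aligns on the index" means df1 is reindexed onto df2's duplicated labels; the names `aligned` and `expected` are illustrative and not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0, 1]})
df2 = pd.DataFrame({"b": [0, 1, 2]}, index=[0, 0, 1])

# df1 has a unique index, so it can be reindexed onto df2's labels,
# duplicates included: label 0 is simply looked up twice.
aligned = df1.reindex(df2.index)

# Attach df2's column positionally (via a NumPy array) so that no further
# index alignment is involved.
expected = aligned.assign(b=df2["b"].to_numpy())
print(expected)
```

The resulting frame has three rows with index labels 0, 0, 1, which is the shape the failing `concat` call was expected to produce.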
@jorisvandenbossche jorisvandenbossche added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Regression Functionality that used to work in a prior pandas version Blocker Blocking issue or pull request for an upcoming release labels Jul 11, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Jul 11, 2020
@jorisvandenbossche
Member Author

jorisvandenbossche commented Jul 11, 2020

It seems this is caused by #35098, not sure how though (cc @AlexKirko)

@simonjayhawkins
Member

> It seems this is caused by #35098, not sure how though (cc @AlexKirko)

it appears result.union(other) and result.union(other, sort=False) do not return the same Index.

-> if sort:
(Pdb) a
indexes = [RangeIndex(start=0, stop=2, step=1), Int64Index([0, 0, 1], dtype='int64')]
sort = False
(Pdb) result.union(other, sort=sort)
Int64Index([0, 1], dtype='int64')
(Pdb) result.union(other)
Int64Index([0, 0, 1], dtype='int64')
(Pdb) result.union(other, sort=False)
Int64Index([0, 1], dtype='int64')
(Pdb) result.union(other, sort=True)
*** ValueError: The 'sort' keyword only takes the values of None or False; True was passed.
(Pdb)

@simonjayhawkins
Member

> it appears result.union(other) and result.union(other, sort=False) do not return the same Index.

maybe related #31326

When duplicates are present, the size of the result of Index.union() depends on whether sort is None or False.

@simonjayhawkins
Copy link
Member

it appears this is a long-standing issue #13432, maybe better to revert #35098 for 1.1

@TomAugspurger
Contributor

Looking into this now.

@jorisvandenbossche
Member Author

So to summarize, on master we have:

In [11]: pd.Index([0, 1]).union(pd.Index([0, 0, 1]))    
Out[11]: Int64Index([0, 0, 1], dtype='int64')

In [12]: pd.Index([0, 1]).union(pd.Index([0, 0, 1]), sort=False)  
Out[12]: Int64Index([0, 1], dtype='int64')

which indeed doesn't look correct (I would say that sort shouldn't have any effect on the number of elements, only on their order).
However, this bug is actually also present on 1.0 already. And the regression comes from the fact that #35098 started to pass through the sort=False.
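
A quick way to check a given pandas installation for this inconsistency is to compare the result lengths directly. This is a minimal sketch: `union_lengths` is a hypothetical helper, not a pandas function, and the numbers it reports depend on the installed pandas version, so no particular output is assumed here.

```python
import pandas as pd

left = pd.Index([0, 1])
right = pd.Index([0, 0, 1])


def union_lengths(left, right):
    """Return len(left.union(right, sort=...)) for sort=None and sort=False."""
    return {sort: len(left.union(right, sort=sort)) for sort in (None, False)}


lengths = union_lengths(left, right)
print(pd.__version__, lengths)

# If `sort` only affected ordering, both lengths would be equal.
print("consistent:", lengths[None] == lengths[False])
```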

@jorisvandenbossche
Member Author

jorisvandenbossche commented Jul 14, 2020

The actual bug with sort=False is indeed covered by #31326

The question, though, is how to order the result with sort=False when there are duplicates.

In principle it should not sort, but what about the duplicate values that might be present in the right index?

>>> pd.Index([1, 2, 3]).union(pd.Index([1, 2, 2, 4]), sort=False)
## hypothetical results
Int64Index([1, 2, 2, 3, 4], dtype='int64')
# or
Int64Index([1, 2, 3, 2, 4], dtype='int64')
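
The two hypothetical orderings can be made concrete in plain Python. This is only a sketch to pin down the options under discussion, not how pandas implements `Index.union`; both function names are invented for illustration, and both assume each value should appear as many times as in whichever input has more copies of it.

```python
from collections import Counter


def union_grouped(left, right):
    """Hypothetical option 1: first-seen order, duplicates kept adjacent."""
    left_counts, right_counts = Counter(left), Counter(right)
    result, seen = [], set()
    for value in list(left) + list(right):
        if value in seen:
            continue
        seen.add(value)
        # Repeat the value as often as it occurs in the input with more copies.
        result.extend([value] * max(left_counts[value], right_counts[value]))
    return result


def union_appended(left, right):
    """Hypothetical option 2: left unchanged, unmatched right values appended."""
    remaining = Counter(left)
    result = list(left)
    for value in right:
        if remaining[value] > 0:
            remaining[value] -= 1  # matched against an element already in left
        else:
            result.append(value)  # extra duplicate or new value from right
    return result


print(union_grouped([1, 2, 3], [1, 2, 2, 4]))   # [1, 2, 2, 3, 4]
print(union_appended([1, 2, 3], [1, 2, 2, 4]))  # [1, 2, 3, 2, 4]
```

Both variants return the same number of elements and differ only in where the extra duplicate lands, which is the property asked for above: sort should affect order, not size.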

@TomAugspurger
Contributor

TomAugspurger commented Jul 14, 2020

Yeah, the sort=False case with duplicates is tricky enough that it needs some discussion. I think the path forward is to revert #35098, fix duplicate set ops, and then reinstate 35092. (cc @AlexKirko)

@jorisvandenbossche
Member Author

Agreed. And then we can keep the actual discussion about that in #31326

@TomAugspurger
Contributor

I'll do the revert.
