REGR: concat on index with duplicate labels fails #35238

jorisvandenbossche · 2020-07-11T18:22:34Z

Noticed from failing geopandas tests:

import pandas as pd

df1 = pd.DataFrame({'a': [0, 1]})
df2 = pd.DataFrame({'b': [0, 1, 2]}, index=[0, 0, 1])
pd.concat([df1, df2], axis=1)

This started failing recently on master:

In [3]: pd.concat([df1, df2], axis=1) 
...
~/scipy/pandas/pandas/core/internals/managers.py in _verify_integrity(self)
    314         for block in self.blocks:
    315             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 316                 raise construction_error(tot_items, block.shape[1:], self.axes)
    317         if len(self.items) != tot_items:
    318             raise AssertionError(

ValueError: Shape of passed values is (3, 2), indices imply (2, 2)

While the expected result is (with automatic reindex of df1 because concat aligns on the index):

In [2]: pd.concat([df1, df2], axis=1)
Out[2]: 
   a  b
0  0  0
0  0  1
1  1  2
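
For reference, the expected frame can also be built by hand, without going through concat. This is only a minimal sketch of the alignment described above: it assumes that reindexing df1 (which has a unique index) onto df2's duplicated index is the intended behaviour, and the names `aligned` and `expected` are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0, 1]})
df2 = pd.DataFrame({"b": [0, 1, 2]}, index=[0, 0, 1])

# df1 has a unique index, so it can be reindexed onto df2's index even
# though the target contains the label 0 twice: the row for 0 is repeated.
aligned = df1.reindex(df2.index)

# Attach df2's column positionally; to_numpy() sidesteps any further
# label-based alignment on the duplicated index.
expected = aligned.assign(b=df2["b"].to_numpy())

print(expected)
```

Going through `to_numpy()` is a deliberate choice in this sketch: it keeps the example independent of how a given pandas version aligns two objects that both carry duplicate labels.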


jorisvandenbossche · 2020-07-11T18:52:39Z

It seems this is caused by #35098, not sure how though (cc @AlexKirko)

simonjayhawkins · 2020-07-11T19:23:15Z

It seems this is caused by #35098, not sure how though (cc @AlexKirko)

it appears result.union(other) and result.union(other, sort=False) do not return the same Index.

-> if sort:
(Pdb) a
indexes = [RangeIndex(start=0, stop=2, step=1), Int64Index([0, 0, 1], dtype='int64')]
sort = False
(Pdb) result.union(other, sort=sort)
Int64Index([0, 1], dtype='int64')
(Pdb) result.union(other)
Int64Index([0, 0, 1], dtype='int64')
(Pdb) result.union(other, sort=False)
Int64Index([0, 1], dtype='int64')
(Pdb) result.union(other, sort=True)
*** ValueError: The 'sort' keyword only takes the values of None or False; True was passed.
(Pdb)

simonjayhawkins · 2020-07-11T19:36:04Z

it appears result.union(other) and result.union(other, sort=False) do not return the same Index.

maybe related #31326

When duplicates are present, the size of the result of Index.union() depends on whether sort is None or False.

simonjayhawkins · 2020-07-11T20:04:07Z

it appears this is a long-standing issue #13432, maybe better to revert #35098 for 1.1

TomAugspurger · 2020-07-14T19:55:51Z

Looking into this now.

jorisvandenbossche · 2020-07-14T20:07:48Z

So to summarize, on master we have:

In [11]: pd.Index([0, 1]).union(pd.Index([0, 0, 1]))    
Out[11]: Int64Index([0, 0, 1], dtype='int64')

In [12]: pd.Index([0, 1]).union(pd.Index([0, 0, 1]), sort=False)  
Out[12]: Int64Index([0, 1], dtype='int64')

which indeed doesn't look correct (I would say that sort shouldn't have any effect on the number of elements, only on their order).
However, this bug is actually also present on 1.0 already. And the regression comes from the fact that #35098 started to pass through the sort=False.

jorisvandenbossche · 2020-07-14T20:18:25Z

The actual bug with sort=False is indeed covered by #31326

The question is, though: if there are duplicates, how do we order the result with sort=False?

In principle it should not sort, but what about the duplicate values that might be present in the right index?

>>> pd.Index([1, 2, 3]).union(pd.Index([1, 2, 2, 4]), sort=False)
## hypothetical results
Int64Index([1, 2, 2, 3, 4], dtype='int64')
# or
Int64Index([1, 2, 3, 2, 4], dtype='int64')
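
To make the two hypothetical orderings concrete, here is a rough pure-Python sketch of an order-preserving union that keeps duplicates. It is not how pandas implements `Index.union`; the function name `union_keep_duplicates` and its `group_duplicates` flag are made up for this illustration, and it assumes that a label should appear as many times as its maximum count on either side.

```python
from collections import Counter


def union_keep_duplicates(left, right, group_duplicates=True):
    """Hypothetical unsorted union of two label lists that keeps duplicates.

    Each label ends up in the result max(count in left, count in right) times.
    """
    left, right = list(left), list(right)

    if group_duplicates:
        # Option 1: emit all copies of a label at its first appearance.
        left_counts, right_counts = Counter(left), Counter(right)
        result, seen = [], set()
        for label in left + right:
            if label not in seen:
                seen.add(label)
                copies = max(left_counts[label], right_counts[label])
                result.extend([label] * copies)
        return result

    # Option 2: keep left untouched and append only the surplus labels
    # from right, in the order they occur there.
    result = list(left)
    unmatched = Counter(left)
    for label in right:
        if unmatched[label] > 0:
            unmatched[label] -= 1  # already covered by a label from left
        else:
            result.append(label)
    return result


grouped = union_keep_duplicates([1, 2, 3], [1, 2, 2, 4], group_duplicates=True)
appended = union_keep_duplicates([1, 2, 3], [1, 2, 2, 4], group_duplicates=False)
print(grouped)   # [1, 2, 2, 3, 4]
print(appended)  # [1, 2, 3, 2, 4]
```

Both variants return the same number of elements and differ only in where the extra 2 lands, which is exactly the ordering question raised above.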

TomAugspurger · 2020-07-14T20:19:31Z

Yeah, the sort=False case with duplicates is tricky enough that it needs some discussion. I think the path forward is to revert #35098, fix duplicate set ops, and then reinstate 35092. (cc @AlexKirko)

jorisvandenbossche · 2020-07-14T20:21:08Z

Agreed. And then we can keep the actual discussion about that in #31326

TomAugspurger · 2020-07-14T20:22:07Z

I'll do the revert.

jorisvandenbossche added Reshaping, Regression, Blocker labels Jul 11, 2020

jorisvandenbossche added this to the 1.1 milestone Jul 11, 2020

jorisvandenbossche mentioned this issue Jul 11, 2020

CI: failing explode tests on pandas master build geopandas/geopandas#1516

Closed

TomAugspurger mentioned this issue Jul 14, 2020

Revert "BUG: fix union_indexes not supporting sort=False for Index subclasses" #35277

Merged

jreback closed this as completed in #35277 Jul 15, 2020
