BUG: [REGRESSION] concat fails when concating two objects with overlapping MultiIndex IntervalIndex levels #54934

Closed
3 tasks done
johannes-mueller opened this issue Sep 1, 2023 · 1 comment · Fixed by #54945
Labels: Bug, Interval (Interval data type), MultiIndex, Regression (Functionality that used to work in a prior pandas version)

@johannes-mueller
Contributor

johannes-mueller commented Sep 1, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

ivl1 = pd.IntervalIndex.from_breaks([0.0, 1.0, 2.0])
ivl2 = pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5])

mi1 = pd.MultiIndex.from_product([ivl1, ivl1])
mi2 = pd.MultiIndex.from_product([ivl2, ivl2])
s1 = pd.Series(1, index=mi1)
s2 = pd.Series(2, index=mi2)

expected_idx = pd.MultiIndex.from_tuples(
    [
        (pd.Interval(0.0, 1.0), pd.Interval(0.0, 1.0)),
        (pd.Interval(0.0, 1.0), pd.Interval(1.0, 2.0)),
        (pd.Interval(1.0, 2.0), pd.Interval(0.0, 1.0)),
        (pd.Interval(1.0, 2.0), pd.Interval(1.0, 2.0)),
        (pd.Interval(0.5, 1.5), pd.Interval(0.5, 1.5)),
        (pd.Interval(0.5, 1.5), pd.Interval(1.5, 2.5)),
        (pd.Interval(1.5, 2.5), pd.Interval(0.5, 1.5)),
        (pd.Interval(1.5, 2.5), pd.Interval(1.5, 2.5))
    ]
)
expected = pd.Series([1, 1, 1, 1, 2, 2, 2, 2], index=expected_idx)

result = pd.concat([s1, s2])

pd.testing.assert_series_equal(result, expected)

Issue Description

The code crashes with the following traceback:

Traceback (most recent call last):
  File "/tmp/concattest.py", line 26, in <module>
    result = pd.concat([s1, s2])
  File "/home/jmu3si/Devel/pandas/pandas/core/reshape/concat.py", line 393, in concat
    return op.get_result()
  File "/home/jmu3si/Devel/pandas/pandas/core/reshape/concat.py", line 640, in get_result
    new_index = self.new_axes[0]
  File "pandas/_libs/properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
    val = self.fget(obj)
  File "/home/jmu3si/Devel/pandas/pandas/core/reshape/concat.py", line 698, in new_axes
    return [
  File "/home/jmu3si/Devel/pandas/pandas/core/reshape/concat.py", line 699, in <listcomp>
    self._get_concat_axis if i == self.bm_axis else self._get_comb_axis(i)
  File "pandas/_libs/properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
    val = self.fget(obj)
  File "/home/jmu3si/Devel/pandas/pandas/core/reshape/concat.py", line 756, in _get_concat_axis
    concat_axis = _concat_indexes(indexes)
  File "/home/jmu3si/Devel/pandas/pandas/core/reshape/concat.py", line 774, in _concat_indexes
    return indexes[0].append(indexes[1:])
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/multi.py", line 2184, in append
    level_codes = [
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/multi.py", line 2185, in <listcomp>
    recode_for_categories(
  File "/home/jmu3si/Devel/pandas/pandas/core/arrays/categorical.py", line 2951, in recode_for_categories
    new_categories.get_indexer(old_categories), new_categories
  File "/home/jmu3si/Devel/pandas/pandas/core/indexes/base.py", line 3845, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: cannot handle overlapping indices; use IntervalIndex.get_indexer_non_unique

This worked in 2.0.3. Bisecting shows that the performance optimization in f989e1b introduced the regression. @lukemanley, any ideas on a reasonable fix?
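The last frames of the traceback suggest the failure comes from calling get_indexer on an IntervalIndex that contains overlapping intervals. Below is a minimal sketch of that behavior in isolation. It assumes the level values of both operands end up in a single IntervalIndex, which is simulated here with append(). The names combined and raised are illustrative only and are not taken from the pandas source.

```python
import pandas as pd
from pandas.errors import InvalidIndexError

ivl1 = pd.IntervalIndex.from_breaks([0.0, 1.0, 2.0])
ivl2 = pd.IntervalIndex.from_breaks([0.5, 1.5, 2.5])

# On its own, neither level contains overlapping intervals.
print(ivl1.is_overlapping, ivl2.is_overlapping)

# Assumption: the combined level values are simulated with append();
# this is not necessarily how MultiIndex.append builds them internally.
combined = ivl1.append(ivl2)
print(combined.is_overlapping)

# get_indexer refuses to work on an overlapping IntervalIndex.
raised = False
try:
    combined.get_indexer(ivl1)
except InvalidIndexError as err:
    raised = True
    print(f"InvalidIndexError: {err}")
```

If this sketch reflects what happens inside the append path, any code that recodes the levels through get_indexer would hit the same error as soon as the two operands contribute intervals that overlap each other.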

Expected Behavior

The code should finish without error: pd.concat([s1, s2]) should return a Series equal to expected.

Installed Versions

INSTALLED VERSIONS
------------------
commit : c7325d7
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-159-generic
Version : #176-Ubuntu SMP Mon Aug 14 12:04:20 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8

pandas : 2.2.0dev0+155.gc7325d7e7e
numpy : 1.24.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : 0.29.33
pytest : 7.4.0
hypothesis : 6.83.0
sphinx : 6.2.1
blosc : 1.11.1
feather : None
xlsxwriter : 3.1.2
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.7
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.2
numba : 0.57.1
numexpr : 2.8.5
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : 1.2.3
pyxlsb : 1.0.10
s3fs : 2023.6.0
scipy : 1.11.2
sqlalchemy : 2.0.20
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.8.0
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

johannes-mueller added the Bug and Needs Triage labels on Sep 1, 2023
rhshadrach added the Regression, MultiIndex and Interval labels and removed the Needs Triage label on Sep 1, 2023
rhshadrach added this to the 2.1.1 milestone on Sep 1, 2023
@lukemanley
Member

Thanks for the report @johannes-mueller. Just opened #54945 to fix this.
