BUG: Can't mix tuples and strings in column names with Numpy>=1.24 #50372

adrien-berchet · 2022-12-21T09:42:43Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# Use MultiIndex for columns
df = pd.DataFrame(
    [range(0, 4), range(1, 5), range(2, 6)],
    columns=pd.MultiIndex.from_tuples(
        [("a", "aa"), ("a", "ab"), ("b", "ba"), ("b", "bb")]
    ),
)

# Adding a column with only one level
df["single_index"] = 0
print(df.columns)
print(df[[("a", "aa"), ("single_index", "")]])

# Use flat column index
df_flat = df.copy()
df_flat.columns = df_flat.columns.to_flat_index()

# Adding a column with only one level
df_flat["new_single_index"] = 0
print(df_flat.columns)
print(df_flat[[("a", "aa")]])  # This works
print(df_flat[["new_single_index"]])  # This works
print(df_flat[[("a", "aa"), "new_single_index"]])  # This fails

Here is the reported error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 print(df_flat[[("a", "aa"), "new_single_index"]])  # This fails

File /mnt/Data/Work/Perso/pandas/pandas/core/frame.py:3683, in DataFrame.__getitem__(self, key)
   3681     if is_iterator(key):
   3682         key = list(key)
-> 3683     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3685 # take() does not accept boolean indexers
   3686 if getattr(indexer, "dtype", None) == bool:

File /mnt/Data/Work/Perso/pandas/pandas/core/indexes/base.py:5629, in Index._get_indexer_strict(self, key, axis_name)
   5627 keyarr = key
   5628 if not isinstance(keyarr, Index):
-> 5629     keyarr = com.asarray_tuplesafe(keyarr)
   5631 if self._index_as_unique:
   5632     indexer = self.get_indexer_for(keyarr)

File /mnt/Data/Work/Perso/pandas/pandas/core/common.py:238, in asarray_tuplesafe(values, dtype)
    235 if isinstance(values, list) and dtype in [np.object_, object]:
    236     return construct_1d_object_array_from_listlike(values)
--> 238 result = np.asarray(values, dtype=dtype)
    240 if issubclass(result.dtype.type, str):
    241     result = np.asarray(values, dtype=object)

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
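
The last frame of the traceback can be reproduced with NumPy alone. Since NumPy 1.24, creating an array from a ragged sequence without `dtype=object` raises a `ValueError` instead of emitting a deprecation warning. The sketch below is illustrative only (the variable names are not taken from pandas): it shows the failing call and one way to build a 1-D object array from the same keys.

```python
import numpy as np

# The same mixed key list used in the failing selection.
keys = [("a", "aa"), "new_single_index"]

# Without an explicit dtype, NumPy >= 1.24 refuses ragged input.
# Older NumPy versions only warn, so the error is captured rather than assumed.
try:
    np.asarray(keys)
    ragged_error = None
except ValueError as exc:
    ragged_error = exc

# Building the array explicitly as 1-D object dtype keeps each key intact.
arr = np.empty(len(keys), dtype=object)
arr[:] = keys

print(type(ragged_error).__name__)
print(arr.shape, arr.dtype)
```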

Issue Description

When column names mix tuples and strings, it is no longer possible to select columns of both name types in a single list (e.g. df_flat[[("a", "aa"), "new_single_index"]]).

Expected Behavior

Selecting columns with both tuple and string names, i.e. df_flat[[("a", "aa"), "new_single_index"]] in the given example, should work.

Installed Versions

INSTALLED VERSIONS
------------------
commit : ca04349
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-56-generic
Version : #62~20.04.1-Ubuntu SMP Tue Nov 22 21:24:20 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 2.0.0.dev0+975.gca0434994e
numpy : 1.24.0
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : None
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.7.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None


adrien-berchet · 2023-01-03T18:15:41Z

I performed a few tests and it seems that just changing (pandas/core/indexes/base.py:L5632):

            keyarr = com.asarray_tuplesafe(keyarr)

into

            try:
                keyarr = com.asarray_tuplesafe(keyarr)
            except ValueError:
                keyarr = com.asarray_tuplesafe(keyarr, dtype=_dtype_obj)

solves this issue.
However, it might cause a performance regression in some cases, though I am not sure. I guess it would be cleaner to inspect keyarr and pass the relevant dtype only when it is needed.

EDIT: The lines are given for the commit 32a261a39d
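
The "inspect keyarr first" idea could look roughly like the sketch below. `to_key_array` is a hypothetical helper written for illustration, not pandas code and not the fix that eventually landed. It assumes that the presence of any tuple in a list of keys is enough to decide that a 1-D object array is needed.

```python
import numpy as np


def to_key_array(keys):
    """Hypothetical helper: convert column keys to a 1-D array.

    Lists containing at least one tuple are placed in an object array
    element by element, so NumPy never tries to treat the tuples as an
    extra dimension. Everything else takes the ordinary np.asarray path.
    """
    if isinstance(keys, list) and any(isinstance(k, tuple) for k in keys):
        out = np.empty(len(keys), dtype=object)
        out[:] = keys
        return out
    return np.asarray(keys)


mixed = to_key_array([("a", "aa"), "new_single_index"])
plain = to_key_array(["x", "y"])
print(mixed.shape, mixed.dtype)
print(plain.shape)
```

Compared with the try/except version, this avoids raising and catching an exception, at the cost of scanning the key list once.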

MarcoGorelli · 2023-03-27T18:43:33Z

moving off the 2.0 milestone

mroeschke · 2023-08-22T23:43:13Z

This looks to work on main now. Could use a test

DRiXD · 2023-08-23T01:32:20Z

I would love to take this as my first issue @mroeschke
I will try to write a clean test for this

DRiXD · 2023-08-23T06:21:57Z

Hi @mroeschke @adrien-berchet I've tried and uploaded a test for this issue.
I'd appreciate some feedback from your side. :)

adrien-berchet · 2023-08-23T08:00:29Z

Hi @DRiXD
Thanks for this!
The test looks good to me. I just checked and it properly reproduces the issue using the given commit and it works with more recent versions.
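
For readers who want to check their own installation, a regression test along these lines can be sketched as follows. This is not the test merged in #54701; the function names are made up for illustration, and the frame is a shortened version of the reproducible example above.

```python
import pandas as pd


def make_flat_frame():
    """Build a frame whose column names mix tuples and a plain string."""
    df = pd.DataFrame(
        [range(0, 4), range(1, 5), range(2, 6)],
        columns=pd.MultiIndex.from_tuples(
            [("a", "aa"), ("a", "ab"), ("b", "ba"), ("b", "bb")]
        ),
    )
    # Flatten the MultiIndex into an Index of tuples, then add a string name.
    df.columns = df.columns.to_flat_index()
    df["new_single_index"] = 0
    return df


def test_getitem_mixed_tuple_and_str_columns():
    df_flat = make_flat_frame()
    # This is the selection that raised ValueError in the report.
    result = df_flat[[("a", "aa"), "new_single_index"]]
    # Positional selection of the same two columns (first and last).
    expected = df_flat.iloc[:, [0, 4]]
    pd.testing.assert_frame_equal(result, expected)


test_getitem_mixed_tuple_and_str_columns()
```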

adrien-berchet added the Bug and Needs Triage labels Dec 21, 2022

jorisvandenbossche added this to the 1.5.3 milestone Dec 23, 2022

datapythonista modified the milestones: 1.5.3, 1.5.4 Jan 18, 2023

datapythonista modified the milestones: 1.5.4, 2.0 Feb 27, 2023

MarcoGorelli modified the milestones: 2.0, 2.1 Mar 27, 2023

mroeschke removed this from the 2.1 milestone Aug 22, 2023

mroeschke added the good first issue and Needs Tests labels and removed the Needs Triage and Bug labels Aug 22, 2023

DRiXD mentioned this issue Aug 23, 2023

TST: added test for fixed bug gh#50372 #54701

Merged

3 tasks

mroeschke closed this as completed in #54701 Aug 23, 2023
