
BUG: Can't mix tuples and strings in column names with Numpy>=1.24 #50372


Closed
3 tasks done
adrien-berchet opened this issue Dec 21, 2022 · 6 comments · Fixed by #54701
Labels
good first issue, Needs Tests (unit test(s) needed to prevent regressions)

Comments

@adrien-berchet

adrien-berchet commented Dec 21, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# Use MultiIndex for columns
df = pd.DataFrame(
    [range(0, 4), range(1, 5), range(2, 6)],
    columns=pd.MultiIndex.from_tuples(
        [("a", "aa"), ("a", "ab"), ("b", "ba"), ("b", "bb")]
    ),
)

# Adding a column with only one level
df["single_index"] = 0
print(df.columns)
print(df[[("a", "aa"), ("single_index", "")]])

# Use flat column index
df_flat = df.copy()
df_flat.columns = df_flat.columns.to_flat_index()

# Adding a column with only one level
df_flat["new_single_index"] = 0
print(df_flat.columns)
print(df_flat[[("a", "aa")]])  # This works
print(df_flat[["new_single_index"]])  # This works
print(df_flat[[("a", "aa"), "new_single_index"]])  # This fails

Here is the reported error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 print(df_flat[[("a", "aa"), "new_single_index"]])  # This fails

File /mnt/Data/Work/Perso/pandas/pandas/core/frame.py:3683, in DataFrame.__getitem__(self, key)
   3681     if is_iterator(key):
   3682         key = list(key)
-> 3683     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3685 # take() does not accept boolean indexers
   3686 if getattr(indexer, "dtype", None) == bool:

File /mnt/Data/Work/Perso/pandas/pandas/core/indexes/base.py:5629, in Index._get_indexer_strict(self, key, axis_name)
   5627 keyarr = key
   5628 if not isinstance(keyarr, Index):
-> 5629     keyarr = com.asarray_tuplesafe(keyarr)
   5631 if self._index_as_unique:
   5632     indexer = self.get_indexer_for(keyarr)

File /mnt/Data/Work/Perso/pandas/pandas/core/common.py:238, in asarray_tuplesafe(values, dtype)
    235 if isinstance(values, list) and dtype in [np.object_, object]:
    236     return construct_1d_object_array_from_listlike(values)
--> 238 result = np.asarray(values, dtype=dtype)
    240 if issubclass(result.dtype.type, str):
    241     result = np.asarray(values, dtype=object)

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

Issue Description

When column names mix tuples and strings, it is no longer possible to select several columns whose names have different types in a single call (e.g. df_flat[[("a", "aa"), "new_single_index"]]).
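The traceback above ends in np.asarray, so the failure can be seen without pandas. The sketch below assumes NumPy >= 1.24, where a ragged list raises ValueError instead of emitting a deprecation warning; the variable names are only illustrative.

```python
import numpy as np

# The same kind of key as in the report: one tuple and one plain string.
key = [("a", "aa"), "new_single_index"]

# Without an explicit dtype, NumPy tries to infer a regular shape.
# Assumption: NumPy >= 1.24, where this raises ValueError
# (older releases only warned and returned an object array).
try:
    np.asarray(key)
    error = None
except ValueError as exc:
    error = exc

# Asking for dtype=object lets NumPy keep the ragged input as a 1-D array.
arr = np.asarray(key, dtype=object)
print(error)
print(arr.shape, arr.dtype)
```

Note that dtype=object alone is not enough when every element is a tuple of the same length, because NumPy then builds a 2-D array; this is presumably why pandas routes lists through construct_1d_object_array_from_listlike when the object dtype is requested.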

Expected Behavior

Selecting columns with both tuple and string names, i.e. df_flat[[("a", "aa"), "new_single_index"]] in the given example, should work.
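Until the mixed selection works, one possible workaround is to look the labels up one at a time and select by position. This is a sketch that is not taken from the thread; it assumes the column labels are unique, so that get_loc returns a single integer for each label.

```python
import pandas as pd

# Rebuild the flat-column frame from the reproducible example.
df = pd.DataFrame(
    [range(0, 4), range(1, 5), range(2, 6)],
    columns=pd.MultiIndex.from_tuples(
        [("a", "aa"), ("a", "ab"), ("b", "ba"), ("b", "bb")]
    ),
)
df_flat = df.copy()
df_flat.columns = df_flat.columns.to_flat_index()
df_flat["new_single_index"] = 0

# Resolve each label separately, so no mixed list is ever turned
# into a NumPy array, then select the columns by position.
wanted = [("a", "aa"), "new_single_index"]
positions = [df_flat.columns.get_loc(name) for name in wanted]
subset = df_flat.iloc[:, positions]
print(subset)
```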

Installed Versions

INSTALLED VERSIONS
------------------
commit : ca04349
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-56-generic
Version : #62~20.04.1-Ubuntu SMP Tue Nov 22 21:24:20 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 2.0.0.dev0+975.gca0434994e
numpy : 1.24.0
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : None
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.7.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

@adrien-berchet adrien-berchet added the Bug and Needs Triage labels Dec 21, 2022
@jorisvandenbossche jorisvandenbossche added this to the 1.5.3 milestone Dec 23, 2022
@adrien-berchet
Author

adrien-berchet commented Jan 3, 2023

I performed a few tests and it seems that just changing (pandas/core/indexes/base.py:L5632):

            keyarr = com.asarray_tuplesafe(keyarr)

into

            try:
                keyarr = com.asarray_tuplesafe(keyarr)
            except ValueError:
                keyarr = com.asarray_tuplesafe(keyarr, dtype=_dtype_obj)

will solve this issue.
That said, it might cause a performance regression in some cases, though I am not sure. It would probably be cleaner to inspect keyarr first and pass the object dtype only when it is needed.

EDIT: The lines are given for the commit 32a261a39d
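The "inspect first" idea from the comment above could look roughly like the following. This is a standalone sketch, not pandas code: key_to_array is a hypothetical helper, and it uses plain NumPy instead of the private com.asarray_tuplesafe.

```python
import numpy as np


def key_to_array(key):
    """Hypothetical helper: convert an indexing key to a 1-D array.

    Lists that contain at least one tuple are stored element by element
    in an object array, so NumPy never has to infer a shape from them.
    Everything else goes through the ordinary np.asarray path.
    """
    if isinstance(key, list) and any(isinstance(k, tuple) for k in key):
        out = np.empty(len(key), dtype=object)
        for i, k in enumerate(key):
            out[i] = k  # each tuple is stored as a single element
        return out
    return np.asarray(key)


mixed = key_to_array([("a", "aa"), "new_single_index"])
plain = key_to_array(["x", "y"])
tuples = key_to_array([("a", "aa"), ("b", "ba")])
print(mixed.shape, plain.dtype, tuples.shape)
```

Compared with the try/except version, the check costs one pass over the key, but it avoids relying on an exception for control flow and keeps homogeneous keys on the fast path.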

@datapythonista datapythonista modified the milestones: 1.5.3, 1.5.4 Jan 18, 2023
@datapythonista datapythonista modified the milestones: 1.5.4, 2.0 Feb 27, 2023
@MarcoGorelli
Member

moving off the 2.0 milestone

@MarcoGorelli MarcoGorelli modified the milestones: 2.0, 2.1 Mar 27, 2023
@mroeschke
Member

This appears to work on main now. It could use a test.

@mroeschke mroeschke removed this from the 2.1 milestone Aug 22, 2023
@mroeschke mroeschke added the good first issue and Needs Tests labels and removed the Needs Triage and Bug labels Aug 22, 2023
@DRiXD
Contributor

DRiXD commented Aug 23, 2023

I would love to take this as my first issue @mroeschke
I will try to write a clean test for this

@DRiXD
Contributor

DRiXD commented Aug 23, 2023

Hi @mroeschke @adrien-berchet I've tried and uploaded a test for this issue.
I'd appreciate some feedback from your side. :)

@adrien-berchet
Author

Hi @DRiXD
Thanks for this!
The test looks good to me. I just checked and it properly reproduces the issue using the given commit and it works with more recent versions.
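For readers without access to the linked pull request, a regression test for this issue might look something like the sketch below. It is not the test that was actually uploaded; the function name is hypothetical and the data is copied from the reproducible example.

```python
import pandas as pd


def test_getitem_mixed_tuple_and_str_columns():
    # Hypothetical test name; setup mirrors the example in the report.
    df = pd.DataFrame(
        [range(0, 4), range(1, 5), range(2, 6)],
        columns=pd.MultiIndex.from_tuples(
            [("a", "aa"), ("a", "ab"), ("b", "ba"), ("b", "bb")]
        ),
    )
    df.columns = df.columns.to_flat_index()
    df["new_single_index"] = 0

    # This selection raised ValueError in the original report.
    result = df[[("a", "aa"), "new_single_index"]]

    assert list(result.columns) == [("a", "aa"), "new_single_index"]
    assert result.iloc[:, 0].tolist() == [0, 1, 2]
    assert result.iloc[:, 1].tolist() == [0, 0, 0]
```

Run under pytest, this should fail on an affected pandas/NumPy combination and pass once the selection works.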


6 participants