BUG: incorrect type when casting to nullable type in multiindex dataframe #46896

Closed
3 tasks done
rubenvdg opened this issue Apr 29, 2022 · 1 comment · Fixed by #48094
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
@rubenvdg
Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(
    columns=pd.MultiIndex.from_tuples([("a", "c"), ("a", "d")]),
    data=[[1, 2], [3, 4]],
)

df["a"] = df["a"].astype("Int64")

print(df.dtypes)  # results in type "object" for both columns

print(df["a"].astype("Int64").dtypes)  # here the dtypes are correct

Issue Description

When casting to a nullable type (like "Int64") in a multiindex column dataframe, the resulting type is "object" (where it should be "Int64").

If you cast to a numpy type (e.g. .astype(float)), we do see the expected behavior (namely, a dataframe with types float).

It seems that the problem is not with the .astype(..) method, but with the __setitem__ (i.e., with the re-assignment df["a"] = df["a"]...).

Happy to help, but it seems it requires debugging C code which is beyond the capabilities of my tiny brain.
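To separate the two suspects named above, the cast and the assignment can be inspected independently. The following is a minimal sketch that reuses the frame from the reproducible example; the helper name `build_frame` and the two `dtypes_after_*` variables are illustrative and not part of the original report. What the second line prints depends on the pandas version in use.

```python
import pandas as pd


def build_frame() -> pd.DataFrame:
    # Same frame as in the reproducible example (hypothetical helper name).
    return pd.DataFrame(
        columns=pd.MultiIndex.from_tuples([("a", "c"), ("a", "d")]),
        data=[[1, 2], [3, 4]],
    )


df = build_frame()

# Step 1: the cast alone, without writing anything back into df.
converted = df["a"].astype("Int64")
dtypes_after_astype = [str(t) for t in converted.dtypes]

# Step 2: the re-assignment through __setitem__ with a first-level key.
df["a"] = converted
dtypes_after_setitem = [str(t) for t in df.dtypes]

print("after astype: ", dtypes_after_astype)
print("after setitem:", dtypes_after_setitem)
```

If the first list shows Int64 and the second does not, the dtype is being lost during the assignment rather than during the cast.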

Expected Behavior

We expect both columns to have dtype Int64 after the assignment.
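One possible way to reach the expected dtypes is to assign each column under its full tuple key instead of assigning a whole sub-frame under the first-level key "a". This is a sketch under an assumption that is not stated anywhere in this thread: that single-column assignment does not go through the frame-valued `__setitem__` path. It has not been verified here against the affected development build.

```python
import pandas as pd

df = pd.DataFrame(
    columns=pd.MultiIndex.from_tuples([("a", "c"), ("a", "d")]),
    data=[[1, 2], [3, 4]],
)

# Collect the full column keys that sit under the first-level label "a".
keys_under_a = [col for col in df.columns if col[0] == "a"]

# Cast and assign one column at a time, using the full tuple key.
for key in keys_under_a:
    df[key] = df[key].astype("Int64")

result_dtypes = [str(t) for t in df.dtypes]
print(result_dtypes)
```

The loop costs one assignment per column, which is acceptable for a handful of columns but may be slow on very wide frames.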

Installed Versions

INSTALLED VERSIONS

commit : 9f24918
python : 3.8.13.final.0
python-bits : 64
OS : Darwin
OS-release : 21.4.0
Version : Darwin Kernel Version 21.4.0: Fri Mar 18 00:47:26 PDT 2022; root:xnu-8020.101.4~15/RELEASE_ARM64_T8101
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.UTF-8

pandas : 1.5.0.dev0+725.g9f24918759.dirty
numpy : 1.21.6
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 62.1.0
Cython : 0.29.28
pytest : 7.1.2
hypothesis : 6.45.1
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : 8.3.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.4
brotli :
fastparquet : 0.8.0
fsspec : 2021.11.0
gcsfs : 2021.11.0
markupsafe : 2.1.1
matplotlib : 3.5.1
numba : 0.55.1
numexpr : 2.8.0
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : 1.1.5
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.8.0
snappy :
sqlalchemy : 1.4.36
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None

@rubenvdg rubenvdg added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 29, 2022
@simonjayhawkins
Member

Thanks @rubenvdg for the report

When casting to a nullable type (like "Int64") in a multiindex column dataframe, the resulting type is "object" (where it should be "Int64").

If you cast to a numpy type (e.g. .astype(float)), we do see the expected behavior (namely, a dataframe with types float).

It seems that the problem is not with the .astype(..) method, but with the __setitem__ (i.e., with the re-assignment df["a"] = df["a"]...).

This does appear to be related to __setitem__ not preserving dtypes, and it is not limited to Int64 dtypes being cast to object, e.g.

>>> import pandas as pd
>>> 
>>> df = pd.DataFrame(
...     columns=pd.MultiIndex.from_tuples([("a", "c"), ("a", "d")]),
...     data=[[1, 2], [3, 4]],
... )
>>> 
>>> df["a"] = df["a"].astype("category")
>>> print(df.dtypes)
a  c    int64
   d    int64
dtype: object
>>> 
>>> print(df["a"].astype("category").dtypes)
c    category
d    category
dtype: object
>>> 

where the category columns are cast to integer columns.
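Since the problem is not limited to one dtype, the same round trip can be run for several target dtypes at once. This is a rough sketch; the function name `survives_setitem` and the choice of dtypes to probe are assumptions made for illustration, and the values reported for "Int64" and "category" depend on the pandas version.

```python
import pandas as pd


def survives_setitem(target_dtype: str) -> bool:
    """Return True if df["a"] keeps the cast dtypes after re-assignment."""
    df = pd.DataFrame(
        columns=pd.MultiIndex.from_tuples([("a", "c"), ("a", "d")]),
        data=[[1, 2], [3, 4]],
    )
    converted = df["a"].astype(target_dtype)
    df["a"] = converted
    # Compare dtype names column by column, before and after the assignment.
    expected = [str(t) for t in converted.dtypes]
    actual = [str(t) for t in df["a"].dtypes]
    return actual == expected


# "float" is the NumPy case the report describes as working correctly.
report = {name: survives_setitem(name) for name in ("Int64", "category", "float")}
print(report)
```

A False entry marks a dtype that is changed by the assignment, which is what this thread describes for both "Int64" and "category".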

@simonjayhawkins simonjayhawkins added Indexing Related to indexing on series/frames, not to indexes themselves Dtype Conversions Unexpected or buggy dtype conversions MultiIndex and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 4, 2022
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone May 4, 2022
@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.5 Aug 17, 2022
2 participants