BUG: DataFrame construction with columns argument set to a MultiIndex creates an empty DataFrame. #39904
Comments
This has nothing to do with MultiIndex. In:
You are providing data for a column 'a', but explicitly setting the columns to ['b'], for which there is no data. This yields an empty DataFrame. In:
You are creating a DataFrame with column 'a' by default and then renaming the column to 'b'. I suggest this is correct behaviour and requires no warning.
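The code samples belonging to this comment did not survive in the text above. The following is only a sketch of the two cases it describes, assuming a dict with a single key 'a' and a requested column 'b'; the variable names are illustrative and not taken from the thread.

```python
import pandas as pd

# Case 1: the data is keyed by 'a', but only column 'b' is requested.
# For dict input, `columns` selects among the keys, so nothing matches.
selected = pd.DataFrame({"a": [1, 2]}, columns=["b"])
print(selected.empty)          # whether any data was selected
print(list(selected.columns))  # the requested label is still present

# Case 2: build with the default column 'a', then relabel it.
renamed = pd.DataFrame({"a": [1, 2]})
renamed.columns = ["b"]
print(renamed["b"].tolist())   # the data is kept under the new label
```

The first call selects, the second relabels, which is why only the second keeps the values.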
This is similar to #39374. Maybe add something in the docs? This does seem a bit odd without explanation, but I would agree that this is the expected behavior.
+1 on adding to the docs. Also, did some timing with
vs
the latter is much faster. Perhaps there is an opportunity to optimize here.
Yep, I understand now. The thing that really confused me about the API docs was the following case:
>>> df = pd.DataFrame({'a': [1,2]}, columns=RangeIndex(1))
>>> df
Empty DataFrame
Columns: [0]
Index: []
is not the same as
>>> df = pd.DataFrame({'a': [1,2]})
>>> df
   a
0  1
1  2
even though columns However, after reading the user guide here. It becomes clear to me that So yes, expected behavior, but confusing documentation.
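As a sketch of the behaviour discussed here, and assuming dict input, `columns` acts as a selection over the dict keys rather than as a set of new labels. The data and variable names below are made up for illustration.

```python
import pandas as pd

data = {"a": [1, 2], "b": [3, 4]}

# Matching labels are selected, in the order requested.
reordered = pd.DataFrame(data, columns=["b", "a"])

# A label with no matching key becomes an all-missing column;
# the rows still come from the keys that did match.
partial = pd.DataFrame(data, columns=["a", "c"])

# With no matching label at all, there is nothing to derive rows from.
none_match = pd.DataFrame(data, columns=pd.RangeIndex(1))

print(reordered)
print(partial)
print(none_match)
```

The last case mirrors the `RangeIndex(1)` example: the label 0 matches neither 'a' nor 'b', so the result is empty.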
@DriesSchaumont - Thanks for reporting this, a PR to fix/improve the docs here is certainly welcome!
I will have a look and try to provide some clarification to the docs. So if there are no objections, I am going to take it.
take
@rhshadrach, am I understanding the behavior correctly:
>>> import numpy as np
>>> import pandas as pd
>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
>>> pd.DataFrame(data)  # Column labels cannot be inferred from data
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
>>> pd.DataFrame(data, columns=['A', 'B', 'C'])
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
>>> pd.DataFrame(data, columns=['A', 'B'])
ValueError: Shape of passed values is (3, 3), indices imply (3, 2)
>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)], dtype=[("A", "i4"), ("B", "i4"), ("C", "i4")])
>>> pd.DataFrame(data)  # Column labels inferred from data
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
>>> pd.DataFrame(data, columns=['A', 'B', 'C'])
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
>>> pd.DataFrame(data, columns=['A', 'B'])
   A  B
0  1  2
1  4  5
2  7  8
Yep - that looks correct to me.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Problem description
If a MultiIndex is used to set the columns using the columns argument from the constructor, the MultiIndex is set correctly but the whole dataframe becomes empty. If the columns index is applied manually after construction by using df.columns, the dataframe does not lose its data. I think this is either a bug, or a clear warning or error message should be given explaining why the dataframe becomes empty.
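The original code sample is not present in this text. A minimal sketch of the two paths described, assuming dict input with plain string keys and a hypothetical two-level MultiIndex, might look like this:

```python
import pandas as pd

# Hypothetical two-level column labels (not from the original report).
mi = pd.MultiIndex.from_tuples([("x", "a"), ("x", "b")])
data = {"a": [1, 2], "b": [3, 4]}

# Passing the MultiIndex to the constructor: the dict keys 'a' and 'b'
# do not equal the tuples ('x', 'a') and ('x', 'b'), so nothing is selected.
via_constructor = pd.DataFrame(data, columns=mi)

# Assigning the MultiIndex afterwards only relabels the existing columns.
via_assignment = pd.DataFrame(data)
via_assignment.columns = mi

print(via_constructor)
print(via_assignment)
```

Under these assumptions the constructor path comes out empty while the assignment path keeps both columns of data.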
Expected Output
I expect the constructor to return a non-empty dataframe with the MultiIndex applied to the columns. The API reference mentions that it can be an 'Index or array-like', but I don't know if a MultiIndex is allowed.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 7d32926
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-58-generic
Version : #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.2
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None