
BUG: DataFrame construction with columns argument set to a MultiIndex creates an empty DataFrame. #39904


Closed
3 tasks done
DriesSchaumont opened this issue Feb 19, 2021 · 9 comments · Fixed by #40658

@DriesSchaumont
Member

DriesSchaumont commented Feb 19, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

>>> import pandas as pd
>>> from numpy import nan
>>>
>>>
>>> column_index = pd.MultiIndex.from_product([["sample1", "sample2"], [True], [True]],
...                                           names=["Name", "Sample_column", "Write"])
>>> table = pd.DataFrame(data={"sample1": [100.0, nan, 100.0, 10.0],
...                            "sample2": [80.0, 20.0, 100.0, 100.0]})
>>> table.columns = column_index
>>>
>>> table2 = pd.DataFrame(data={"sample1": [100.0, nan, 100.0, 10.0],
...                             "sample2": [80.0, 20.0, 100.0, 100.0]},
...                             columns=column_index)
>>> table
Name          sample1 sample2
Sample_column    True    True
Write            True    True
0               100.0    80.0
1                 NaN    20.0
2               100.0   100.0
3                10.0   100.0
>>> table2
Empty DataFrame
Columns: [(sample1, True, True), (sample2, True, True)]
Index: []

Problem description

When a MultiIndex is passed through the constructor's columns argument, the MultiIndex is set correctly but the resulting DataFrame is empty. If the same MultiIndex is instead assigned after construction through df.columns, the DataFrame keeps its data. I think this is either a bug, or a clear warning or error message should explain why the DataFrame becomes empty.

Expected Output

I expect the constructor to return a non-empty DataFrame with the MultiIndex applied to the columns. The API reference mentions that columns can be an 'Index or array-like', but I don't know whether a MultiIndex is allowed.
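
For illustration, here is a minimal sketch of one way to obtain the expected frame in a single constructor call: pass the values as a 2-D array, so there are no labels to infer from the data and the columns argument only labels it. The name `table3` and the array layout are assumptions made for this example; the values are those from the code sample above.

```python
import numpy as np
import pandas as pd

column_index = pd.MultiIndex.from_product(
    [["sample1", "sample2"], [True], [True]],
    names=["Name", "Sample_column", "Write"],
)

# Rows x columns; a plain ndarray carries no column labels of its own,
# so `columns` is used purely to label the two columns.
values = np.array(
    [
        [100.0, 80.0],
        [np.nan, 20.0],
        [100.0, 100.0],
        [10.0, 100.0],
    ]
)

table3 = pd.DataFrame(values, columns=column_index)
print(table3.shape)
```

Assigning `table.columns = column_index` after construction, as in the code sample, remains the other option.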

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit : 7d32926
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-58-generic
Version : #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.2
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@DriesSchaumont DriesSchaumont added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 19, 2021
@attack68
Contributor

This has nothing to do with MultiIndex.

In:

df = pd.DataFrame({'a': [1,2]}, columns=['b'])

You are providing data for a column 'a', but specifically setting the columns to ['b'], for which there is no data. This yields an empty DataFrame.

In:

df = pd.DataFrame({'a': [1,2]})
df.columns = ['b']

You are creating a DataFrame with column 'a' by default and then renaming the column to 'b'.

I suggest this is correct behaviour and requires no warning.
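
A small sketch of the distinction described above, using flat labels. The variable names `partial`, `empty` and `renamed` are only for this illustration.

```python
import pandas as pd

# 'a' is a key of the dict, 'b' is not: 'a' keeps its data, 'b' is all-NaN.
partial = pd.DataFrame({"a": [1, 2]}, columns=["a", "b"])

# No requested label matches a dict key, so no data is selected.
empty = pd.DataFrame({"a": [1, 2]}, columns=["b"])

# Build first, then relabel: the data is kept and only the label changes.
renamed = pd.DataFrame({"a": [1, 2]}).rename(columns={"a": "b"})

print(partial)
print(empty.empty)
print(renamed)
```

With a dict as input, the columns argument acts as a selection over the dict keys rather than as a renaming.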

@attack68 attack68 added Closing Candidate May be closeable, needs more eyeballs and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 19, 2021
@phofl
Member

phofl commented Feb 19, 2021

This is similar to #39374. Maybe add something to the docs? This does seem a bit odd without explanation, but I agree that this is the expected behavior.

@rhshadrach
Member

+1 on adding to the docs. Also, did some timing with mapping = {k: 10 * [1] for k in range(5000)} and comparing

DataFrame(mapping, columns=[0, 1])

vs

DataFrame({k: mapping[k] for k in [0, 1]})

the latter is much faster. Perhaps there is an opportunity to optimize here.
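
A rough way to reproduce this comparison is sketched below with `timeit`. The helper names `via_columns` and `via_subset` and the repeat count are assumptions for illustration, and the timings will depend on the pandas version and machine.

```python
import timeit

import pandas as pd

mapping = {k: 10 * [1] for k in range(5000)}
wanted = [0, 1]


def via_columns():
    # Pass the full dict and let the constructor select the columns.
    return pd.DataFrame(mapping, columns=wanted)


def via_subset():
    # Pre-filter the dict so the constructor only sees the wanted keys.
    return pd.DataFrame({k: mapping[k] for k in wanted})


# Check that both routes hold the same values before timing them.
same_values = via_columns().values.tolist() == via_subset().values.tolist()

t_columns = timeit.timeit(via_columns, number=20)
t_subset = timeit.timeit(via_subset, number=20)
print(f"same values: {same_values}")
print(f"columns=[0, 1]: {t_columns:.4f}s, pre-filtered dict: {t_subset:.4f}s")
```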

@DriesSchaumont
Member Author

Yep, I understand now. The thing that really confused me about the API docs was the following case:

>>> df = pd.DataFrame({'a': [1,2]}, columns=pd.RangeIndex(1))
>>> df
Empty DataFrame
Columns: [0]
Index: []

is not the same as

>>> df = pd.DataFrame({'a': [1,2]})
>>> df
   a
0  1
1  2

even though columns will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided.

However, after reading the user guide here, it became clear to me that a dict of Series plus a specific index will discard all data that does not match the passed index.

So yes, expected behavior, but confusing documentation.
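
To make the alignment rule concrete, here is a minimal sketch with a dict of Series and an explicit index. The labels "x", "y" and "z" are made up for this example.

```python
import pandas as pd

s = pd.Series([1, 2], index=["x", "y"])

# Only label "y" is shared between the Series and the requested index:
# "x" is dropped, and "z" has no data, so it is filled with NaN.
aligned = pd.DataFrame({"a": s}, index=["y", "z"])
print(aligned)
```

The columns argument appears to follow the same matching logic along the other axis when the data is a dict.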
Thank you for the explanation! I really appreciate it.

@rhshadrach rhshadrach added Docs Enhancement and removed Bug Closing Candidate May be closeable, needs more eyeballs labels Feb 20, 2021
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Feb 20, 2021
@rhshadrach
Member

@DriesSchaumont - Thanks for reporting this, a PR to fix/improve the docs here is certainly welcome!

@DriesSchaumont
Member Author

I will have a look and try to provide some clarification to the docs. So if there are no objections, I am going to take it.

@DriesSchaumont
Member Author

take

@DriesSchaumont
Member Author

@rhshadrach, am I understanding the behavior correctly:

  • When column labels cannot be inferred from the input data, the columns argument sets the labels.
  • When column labels are inferred from the input data, the columns argument selects which columns to include in the resulting frame.

>>> import numpy as np
>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
>>> pd.DataFrame(data) # Column labels cannot be inferred from data
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
>>> pd.DataFrame(data, columns=['A', 'B', 'C'])
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
>>> pd.DataFrame(data, columns=['A', 'B'])
ValueError: Shape of passed values is (3, 3), indices imply (3, 2)

>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)], dtype=[("A", "i4"), ("B", "i4"), ("C", "i4")])
>>> pd.DataFrame(data) # Column labels inferred from data
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
>>> pd.DataFrame(data, columns=['A', 'B', 'C'])
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
>>> pd.DataFrame(data, columns=['A', 'B'])
   A  B
0  1  2
1  4  5
2  7  8

@rhshadrach
Member

Yep - that looks correct to me.

DriesSchaumont added a commit to DriesSchaumont/pandas that referenced this issue Mar 27, 2021
DriesSchaumont added a commit to DriesSchaumont/pandas that referenced this issue Mar 29, 2021
@jorisvandenbossche jorisvandenbossche modified the milestones: Contributions Welcome, 1.3 Apr 1, 2021