
Dataframe constructor misinterprets columns argument if nested list is passed in as the data parameter. #14467


Closed
madphysicist opened this issue Oct 21, 2016 · 8 comments · Fixed by #41493
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

madphysicist commented Oct 21, 2016

This issue is based on Stack Overflow question http://stackoverflow.com/q/40182072/2988730.

A small, complete example of the issue

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=[['gibberish']*2, [0, 1]],
                  columns=[['baldersash']*3, [10, 20, 30]])

The result is

  File "<ipython-input-321-2695882ac68b>", line 3, in <module>
    columns=[['baldersash']*3, [10, 20, 30]])

  File "/home/jfoxrabi/miniconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 263, in __init__
    arrays, columns = _to_arrays(data, columns, dtype=dtype)

  File "/home/jfoxrabi/miniconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 5352, in _to_arrays
    dtype=dtype)

  File "/home/jfoxrabi/miniconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 5431, in _list_to_arrays
    coerce_float=coerce_float)

  File "/home/jfoxrabi/miniconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 5489, in _convert_object_array
    'columns' % (len(columns), len(content)))

AssertionError: 2 columns passed, passed data had 3 columns

Expected Output

            baldersash   
                    10 20 30
gibberish 0          1  2  3
          1          4  5  6

The surprising thing is that all of the following variations work fine:

  1. Supplying a numpy array as data:

    df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]),
                      index=[['gibberish']*2, [0, 1]],
                      columns=[['baldersash']*3, [10, 20, 30]])

    Results in

                baldersash      
                        10 20 30
    gibberish 0          1  2  3
              1          4  5  6
    
  2. Reducing the size of the input array to have two columns:

    df = pd.DataFrame([[1, 2], [3, 4]],
                      index=[['gibberish']*2, [0, 1]],
                      columns=[['baldersash']*2, [10, 20]])

    Results in

                baldersash   
                        10 20
    gibberish 0          1  2
              1          3  4
    
  3. Omitting the columns argument:

    df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]),
                      index=[['gibberish']*2, [0, 1]])

    Results in

                 0  1  2
    gibberish 0  1  2  3
              1  4  5  6
    
  4. Using a single-level list for the columns argument:

    df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]),
                      index=[['gibberish']*2, [0, 1]],
                      columns=[10, 20, 30])

    Results in

                 10  20  30
    gibberish 0   1   2   3
              1   4   5   6
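
The comparison between the failing case and case 1 can be reproduced on whatever pandas version is installed with a minimal sketch like the one below. The helper name `try_build` and the variable names are illustrative and do not come from pandas or from the original report. The outcome for nested-list data depends on the version: the report above shows an `AssertionError` on 0.18.1, while later comments in this thread report that it works.

```python
import numpy as np
import pandas as pd

rows = [[1, 2, 3], [4, 5, 6]]
index = [["gibberish"] * 2, [0, 1]]
columns = [["baldersash"] * 3, [10, 20, 30]]


def try_build(data):
    """Hypothetical helper: return the DataFrame, or the exception raised."""
    try:
        return pd.DataFrame(data, index=index, columns=columns)
    except (AssertionError, ValueError) as exc:
        return exc


# Same index and columns arguments; only the type of `data` differs.
from_list = try_build(rows)
from_array = try_build(np.array(rows))

for label, result in [("nested list", from_list), ("ndarray", from_array)]:
    print(f"{label}: {type(result).__name__}")
```

If the two lines print different type names, the installed version still shows the inconsistency described in this issue.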
    

Output of pd.show_versions()

INSTALLED VERSIONS
------------------

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.29.2.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.5
xlrd: None
xlwt: None
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

jreback (Contributor) commented Oct 21, 2016

You have several issues.

  • Nested lists are not meaningful for indexes. You generally want a MultiIndex, which you need to create explicitly.
  • The data has 3 columns and 2 rows on the index (you have the reverse).

In [13]: pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=list('ABC'), index=['foo', 'bar'])
Out[13]:
     A  B  C
foo  1  2  3
bar  4  5  6
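
One way to follow the suggestion of creating the MultiIndex explicitly is sketched below, using `pd.MultiIndex.from_arrays` with the labels from the original report. This is an illustration under the assumption that explicit MultiIndex objects avoid the nested-list interpretation problem. It has not been checked against pandas 0.18.1.

```python
import pandas as pd

# Build both axes explicitly instead of passing nested lists.
index = pd.MultiIndex.from_arrays([["gibberish"] * 2, [0, 1]])
columns = pd.MultiIndex.from_arrays([["baldersash"] * 3, [10, 20, 30]])

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=index, columns=columns)
print(df)

# Each cell is addressed by one tuple per axis.
cell = df.loc[("gibberish", 1), ("baldersash", 20)]
```

Here `len(columns)` is 3, matching the width of the data, whereas the nested list `[['baldersash']*3, [10, 20, 30]]` has length 2, which is the number quoted in the `AssertionError` message.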

@jreback jreback closed this as completed Oct 21, 2016
@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question labels Oct 21, 2016
@jreback jreback added this to the No action milestone Oct 21, 2016
madphysicist (Author) commented

Then why do cases 1 and 2 work? The problem is that the behavior is inconsistent.

jreback (Contributor) commented Oct 21, 2016

The shapes are correct there.

madphysicist (Author) commented

Case 1 differs from the "broken" case only in that it passes an ndarray for data instead of a nested list. The MultiIndex is still passed in as a nested list, so I do not think the shape matches there either (3 columns vs. a two-element list). My question for that specific case is: how does the exact type of the input affect the way columns gets interpreted?

jorisvandenbossche (Member) commented

@jreback I didn't know you could pass a list of lists to represent a MultiIndex (like you would pass to MultiIndex.from_arrays), but it does work in some cases, so the fact that it fails in @madphysicist's example is inconsistent:

  • Passing a nested list or the corresponding array as the data gives the same result:
In [78]: pd.DataFrame([[1, 2, 3], [4, 5, 6]])
Out[78]: 
   0  1  2
0  1  2  3
1  4  5  6

In [79]: pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
Out[79]: 
   0  1  2
0  1  2  3
1  4  5  6

  • The same as above, but passing a nested list to index=. The nested list is interpreted as a MultiIndex:
In [82]: pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=[['A', 'A'], ['a', 'b']])
Out[82]: 
     0  1  2
A a  1  2  3
  b  4  5  6

In [83]: pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), index=[['A', 'A'], ['a', 'b']])
Out[83]: 
     0  1  2
A a  1  2  3
  b  4  5  6

In [84]: pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), index=[['A', 'A'], ['a', 'b']]).index
Out[84]: 
MultiIndex(levels=[['A'], ['a', 'b']],
           labels=[[0, 0], [0, 1]])

  • The same for columns=, but now there is a difference between passing a list and passing an array for the data:
In [80]: pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=[['A', 'A', 'A'], ['a', 'b', 'c']])
...
AssertionError: 2 columns passed, passed data had 3 columns

In [81]: pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=[['A', 'A', 'A'], ['a', 'b', 'c']])
Out[81]: 
   A      
   a  b  c
0  1  2  3
1  4  5  6
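
The interpretation of nested lists as a MultiIndex in the ndarray case can be checked directly against `MultiIndex.from_arrays`. The following is a small sketch with illustrative variable names. It only covers ndarray data, which is the case reported as working throughout this thread.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2, 3], [4, 5, 6]])
nested_index = [["A", "A"], ["a", "b"]]
nested_columns = [["A", "A", "A"], ["a", "b", "c"]]

df = pd.DataFrame(values, index=nested_index, columns=nested_columns)

# Compare each axis with the MultiIndex built explicitly from the same lists.
index_matches = df.index.equals(pd.MultiIndex.from_arrays(nested_index))
columns_match = df.columns.equals(pd.MultiIndex.from_arrays(nested_columns))
print(index_matches, columns_match)
```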

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, No action Nov 23, 2016
harri471 commented Mar 8, 2020

take

harri471 added a commit to CSCD01-team01/pandas that referenced this issue Mar 11, 2020
harri471 added a commit to CSCD01-team01/pandas that referenced this issue Mar 11, 2020
harri471 added a commit to CSCD01-team01/pandas that referenced this issue Mar 18, 2020
harri471 added a commit to CSCD01-team01/pandas that referenced this issue Mar 18, 2020
mroeschke (Member) commented

The columns case appears to work on master now. It could use a test.

In [16]: pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=[['A', 'A', 'A'], ['a', 'b', 'c']])
Out[16]:
   A
   a  b  c
0  1  2  3
1  4  5  6

In [17]: pd.__version__
Out[17]: '1.3.0.dev0+1485.g6abb567cb1'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 2, 2021
madphysicist (Author) commented

@mroeschke I tried with version 1.2.3 on Arch Linux and it worked fine. I'd be happy to add a test like my original example if you point me to where. It was interesting to see your message pop up. Took me a few minutes to figure out what it was about.
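
A regression test along these lines might look like the sketch below. The function name is hypothetical, and the thread does not say where in the pandas test suite such a test belongs, so this is only an illustration of the check and not the test that was eventually merged. It assumes a pandas version in which the nested-list case works, as reported above for 1.2.3 and master.

```python
import pandas as pd
import pandas.testing as tm


def test_nested_list_data_with_nested_list_columns():
    # GH 14467 (hypothetical test name): nested-list data combined with
    # nested-list columns should yield MultiIndex columns.
    data = [[1, 2, 3], [4, 5, 6]]
    nested_columns = [["A", "A", "A"], ["a", "b", "c"]]

    result = pd.DataFrame(data, columns=nested_columns)
    expected = pd.DataFrame(
        data, columns=pd.MultiIndex.from_arrays(nested_columns)
    )

    tm.assert_frame_equal(result, expected)
```

Building the expected frame from an explicit MultiIndex keeps the test independent of how the constructor interprets nested lists.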

@mroeschke mroeschke removed this from the Contributions Welcome milestone May 16, 2021
@mroeschke mroeschke added this to the 1.3 milestone May 16, 2021