Error Creating DataFrame with Single MultiIndexed Column #12457

Closed
woztheproblem opened this issue Feb 26, 2016 · 22 comments
Labels
Constructors, Error Reporting, good first issue

Comments

@woztheproblem

Attempting to create a DataFrame with a single column that is multiindexed, I get "IndexError: list index out of range".

Code Sample, a copy-pastable example if possible

df = pd.DataFrame(data=zip(range(100)), columns=[['a','b']])

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-9-5b88e876b2de> in <module>()
----> 1 df = pd.DataFrame(data=zip(range(100)), columns=[['a','b']])

/Users/ewozniak/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    273 
    274                     mgr = _arrays_to_mgr(arrays, columns, index, columns,
--> 275                                          dtype=dtype)
    276                 else:
    277                     mgr = self._init_ndarray(data, index, columns, dtype=dtype,

/Users/ewozniak/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5236     axes = [_ensure_index(columns), _ensure_index(index)]
   5237 
-> 5238     return create_block_manager_from_arrays(arrays, arr_names, axes)
   5239 
   5240 

/Users/ewozniak/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in create_block_manager_from_arrays(arrays, names, axes)
   3894 
   3895     try:
-> 3896         blocks = form_blocks(arrays, names, axes)
   3897         mgr = BlockManager(blocks, axes)
   3898         mgr._consolidate_inplace()

/Users/ewozniak/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in form_blocks(arrays, names, axes)
   3929 
   3930         k = names[name_idx]
-> 3931         v = arrays[name_idx]
   3932 
   3933         if is_sparse(v):

IndexError: list index out of range

Note, I use zip() in the example above in order to match what the data would look like when creating one dataframe from the data of another data frame using df.to_json(orient='split') (which is what I'm trying to do). If I don't use zip(), then I get:

df = pd.DataFrame(data=range(100), columns=[['a','b']])
ValueError: Shape of passed values is (1, 100), indices imply (2, 100)

This works fine with two (or more) columns:

df = pd.DataFrame(data=zip(range(100),range(100)), columns=[['a','b'],['c','d']])

output of pd.show_versions()

I have tried with both pandas 0.17.1 and 0.18.0rc1.

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0rc1
nose: 1.3.7
pip: 8.0.3
setuptools: 20.1.1
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
jinja2: 2.8

@jreback
Contributor

jreback commented Feb 27, 2016

So this is a reasonable error message:

In [5]: DataFrame(data=zip(range(100)), columns=['a','b'])
AssertionError: 2 columns passed, passed data had 1 columns

But because of the following, it doesn't error properly.
This should raise, since creating an Index from a list of anything but tuples is invalid:

In [6]: Index([['a','b']])
Out[6]: Index([[u'a', u'b']], dtype='object')

@jreback
Contributor

jreback commented Feb 27, 2016

@woztheproblem want to do a PR for the above and see if you can fix it?

jreback added the Indexing, Reshaping, Difficulty Novice, and Error Reporting labels on Feb 27, 2016
jreback added this to the Next Major Release milestone on Feb 27, 2016
@woztheproblem
Author

I do now see that it works if I use a list of tuples for the columns, but I guess I don't follow why a list of lists wouldn't also work when the outer list is of length 1, since it works fine when the outer list is of length two (or more).

#works!
df = pd.DataFrame(data=zip(range(100), range(100)), columns=[['a','b'],['c','d']])

#List index out of range error
df = pd.DataFrame(data=zip(range(100)), columns=[['a','b']])

The reason that I'd like to use list of lists instead of list of tuples, is that my goal is to do this:

import json
import pandas as pd

#Create some data
df = pd.DataFrame(data=range(100), columns=[('a','b')])

#convert to JSON
myjson = df.to_json(orient='split')

#send over the network to another computer

#load json into dataframe
df2 = pd.DataFrame(**json.loads(myjson))

to_json() converts the columns to a list of lists when converting to JSON, so without the ability to load the columns from a list of lists even when there is only one column, a more complicated workflow is needed. (which is fine if loading columns from a list (of length 1) of lists really isn't feasible for some reason)
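One possible workaround for the workflow above, as a minimal sketch: rebuild the column labels as tuples before calling the DataFrame constructor. The payload here is hand-written in the orient='split' layout (it is not produced by an actual to_json() call), and the variable names are only illustrative.

```python
import json

import pandas as pd

# Hand-written payload in the orient='split' layout: the single
# two-level column label arrives as a list of lists.
payload = '{"columns":[["a","b"]],"index":[0,1],"data":[[1],[2]]}'

parts = json.loads(payload)

# Each inner list is ONE column label, so convert it back to a tuple
# and build the MultiIndex explicitly.
columns = pd.MultiIndex.from_tuples([tuple(label) for label in parts["columns"]])

df2 = pd.DataFrame(parts["data"], index=parts["index"], columns=columns)
print(df2)
```

This keeps the ambiguity out of the constructor: by the time the columns reach DataFrame they are already a MultiIndex, so nothing has to guess whether the inner lists are labels or levels.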

@jreback
Contributor

jreback commented Feb 27, 2016

why are you using json.loads at all?
does pd.read_json(...) not work for you?

@jreback
Contributor

jreback commented Feb 27, 2016

As I said, you are specifying 2 columns but have a single list of single-element values, so it's the wrong shape. You are also passing a list-of-lists for the Index, which is not allowed; you can create a MultiIndex if you want.

@woztheproblem
Author

Thanks for your responses. However, I'm having a hard time connecting what you are saying and what I'm seeing. You said it's the wrong shape, but this works:

df = pd.DataFrame(data=zip(range(100)), columns=[('a','b')])

while this doesn't:

df = pd.DataFrame(data=zip(range(100)), columns=[['a','b']])

Yet they have the same shape for the columns argument.

You also said list of lists isn't allowed, but this works:

df = pd.DataFrame(data=zip(range(100), range(100)), columns=[['a','b'],['c','d']])

while this doesn't:

df = pd.DataFrame(data=zip(range(100)), columns=[['a','b']])

Yet both have list of lists for the columns argument.

So the failure is specific to list of lists for the columns argument when the outer list is of length 1.

I had forgotten about pd.read_json(), but it appears to have the same problem. This works:

#example 1
df = pd.DataFrame(data=zip(range(100), range(100)), columns=[['a','b'],['c','d']])
df2 = pd.read_json(df.to_json(orient='split'), orient='split')

But this doesn't:

#example 2
df = pd.DataFrame(data=zip(range(100)), columns=[('a','b')])
df2 = pd.read_json(df.to_json(orient='split'), orient='split')

The first line in example 2 succeeds due to the use of a list of tuples when the number of columns is 1, but the second line gets "IndexError: list index out of range" because to_json() creates a list of lists for the columns element, where the outer list is of length 1, which fails per the previous examples.

@javpaw

javpaw commented Feb 28, 2016

Hi.

I was trying to make my first contribution to Pandas with this ticket.

This is what I got:

  • The function to_json stores the columns of the DataFrame as a list of lists (LL), but the correct interpretation is as a list of tuples (LT).
  • Given that this is JSON, we have to store tuples as lists, so in our JSON we store an LT as an LL, for example [['a','b']].
  • The function read_json uses the DataFrame constructor. If we pass [['a','b']] to the columns parameter we get an index of length 2, because internally MultiIndex.from_arrays is called with the LL.
  • For this specific case (to_json <-> read_json), the method that should be called is MultiIndex.from_tuples (or plain Index, which does the same thing), to be consistent with how the columns are stored in JSON.
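The difference between the two interpretations can be checked directly on the two MultiIndex constructors. This is only a small illustration of the point above, not the code path pandas itself takes:

```python
import pandas as pd

labels = [["a", "b"]]

# Interpreted as arrays: the inner list is ONE level holding two entries.
as_arrays = pd.MultiIndex.from_arrays(labels)

# Interpreted as tuples: the inner list is ONE entry spanning two levels.
as_tuples = pd.MultiIndex.from_tuples([tuple(label) for label in labels])

print(len(as_arrays), as_arrays.nlevels)  # 2 1
print(len(as_tuples), as_tuples.nlevels)  # 1 2
```

So the same nested list describes either two single-level columns or one two-level column, depending on which constructor reads it.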

One solution

A solution for this specific case is to change read_json to check whether the columns are a list of lists and, if so, create the MultiIndex before the DataFrame constructor is called, something like:

...
if result['columns'] and all(isinstance(i, list) for i in result['columns']):
    result['columns'] = MultiIndex.from_tuples(result['columns'])

self.obj = DataFrame(**result)
...

Hacky solution

A hacky solution is to change the method MultiIndex.from_arrays to MultiIndex.to_arrays in the line:
https://github.com/pydata/pandas/blob/master/pandas/indexes/base.py#L3285-3285

Notice that the logic elsewhere would have to change as well, for example in this line:
https://github.com/pydata/pandas/blob/master/pandas/core/frame.py#L5492-5492

because the len of columns and the len of the data can differ but still the multiindex generated from columns can be consistent with the data, for example:

DataFrame(data= [1], columns=[['a'],['b']] ) 

could generate a DataFrame with one column and a two-level MultiIndex, but in this case len(data) != len(columns), so right now it is not allowed and raises:

ValueError: Shape of passed values is (1, 1), indices imply (2, 1)

A better solution

IMHO we need extra information in the DataFrame constructor to know how to interpret the columns parameter.

This could mean creating new parameters for the DataFrame constructor, for example three parameters called column_from_list, column_from_tuple and columns_from_product (it doesn't feel right to me, but it's an option).

We could also create a single extra parameter, for example columns_info, to indicate the correct interpretation of the columns.

Maybe the best solution is to decide that the index should be created first and then passed to the DataFrame constructor, and to change the code in the places where a new DataFrame is generated.

If you guys agree that there is something to do here I would be happy to contribute, if not, I'll be looking for other issues to start with.

@jreback
Contributor

jreback commented Feb 28, 2016

This is simply invalid input, the Index is not allowed to have a list-of-lists. The point of this issue is to have a more informative error here. This has nothing to do with JSON construction, rather Index construction.

In [9]: DataFrame([[1],[2]],columns=[['a', 'b']])
IndexError: list index out of range

You would need to explicitly construct a list-of-tuples:

In [8]: DataFrame([[1],[2]],columns=pd.MultiIndex.from_tuples([('a', 'b')]))
Out[8]: 
   a
   b
0  1
1  2

@javpaw

javpaw commented Feb 29, 2016

@jreback, the problem is that on one side to_json generates a list-of-lists, and on the other side pd.read_json calls the DataFrame constructor with this list-of-lists; this generates the error.

One solution:

Change the code in the read_json method to do what you say: create the MultiIndex before calling the DataFrame constructor.

Other option:

Be able to send extra information to the DataFrame constructor so it can interpret the parameter columns as expected (list-of-list, list-of-tuples, etc...) , basically add one or more new parameters to the DataFrame constructor.

This is, in summary, what I was talking about in my last comment.

@jreback
Contributor

jreback commented Feb 29, 2016

@javpaw this is a simple issue. the JSON stuff is not relevant here. There is an incorrect construction before anything gets sent to JSON.

@javpaw

javpaw commented Feb 29, 2016

@jreback thanks for your answer.
This code should work out of the box, but it doesn't:

import pandas as pd

#data frame with one row and one multiindex column
df = pd.DataFrame(data = [1], columns= pd.MultiIndex.from_tuples([('a','b')]))

#Store as JSON; notice the list-of-lists generated for columns:
df_json = df.to_json(orient='split') # '{"columns":[["a","b"]],"index":[0],"data":[[1]]}'

#This fails because `columns` is a list-of-lists and is therefore read with MultiIndex.from_arrays
#internally; the correct method in this case would be MultiIndex.from_tuples.
copy_df = pd.read_json(df_json, orient='split')

Am I missing something?

@jreback
Contributor

jreback commented Feb 29, 2016

@javpaw see here: #4889. This is not implemented ATM. You are welcome to submit a PR for that issue.

@woztheproblem
Author

@jreback Can you clarify for me why df = pd.DataFrame(data=zip(range(100), range(100)), columns=[['a','b'],['c','d']]) works with no problem if list of lists for index is not allowed? Is it just that it happens to work despite not really being supported? Also, is there a particular reason why list of lists is not allowed when list of tuples is? (i.e. what issues would it cause if there was an attempt to allow it) Thanks!

@vlfom

vlfom commented Mar 2, 2016

@jreback
Creating an Index with list of lists is also allowed this way:
Index([[1,2]])
Should it be prohibited then, too?

@jreback
Contributor

jreback commented Mar 3, 2016

A list-of-lists is not allowed, and I think we could raise on this. A list-of-tuples is by definition a MultiIndex. (Though for compat we do allow you to create a straight Index rather than an actual MultiIndex, but it's not very useful.)

@jreback
Contributor

jreback commented Mar 3, 2016

@vlfom tests! and you don't need to intercept it there, rather in Index itself as that will catch ALL types of creations.

@vlfom

vlfom commented Mar 3, 2016

@jreback
I found the appropriate way to deal with this problem in Index (if I got you right): https://github.com/vlfom/pandas/blob/fix-index/pandas/indexes/base.py#L3275
Here are the tests: https://github.com/vlfom/pandas/blob/fix-index/pandas/tests/frame/test_constructors.py#L2003

However, lots of tests contain list-of-lists and fail after the fix: https://github.com/pydata/pandas/blob/master/pandas/tests/test_indexing.py#L1916 , so should lists be changed to MultiIndex'es there?

@simonjayhawkins
Member

This issue is about a better error message for incorrect DataFrame construction.

as @jreback said:

as I said you are specifying 2 columns but have a single list of single element values, its the wrong shape.

pd.DataFrame([(0,)], columns=[['a','b']])  # IndexError: list index out of range
pd.DataFrame([[0]], columns=[['a','b']])  # IndexError: list index out of range
(0,)[1]  # IndexError: tuple index out of range
[0][1]  # IndexError: list index out of range
len((0,))  # 1
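To make the shape mismatch concrete, here is a pure-Python sketch of the kind of up-front check that would turn the IndexError into a readable message. check_column_count is a hypothetical helper written for this illustration; it is not a pandas function and does not mirror the pandas internals.

```python
def check_column_count(rows, columns):
    """Hypothetical pre-check (not part of pandas): compare the number of
    columns implied by `columns` with the width of the first data row."""
    width = len(rows[0]) if rows else 0
    # A list of lists is read level-wise, so the number of columns it
    # implies is the length of the first inner list.
    is_list_of_lists = bool(columns) and all(isinstance(c, list) for c in columns)
    implied = len(columns[0]) if is_list_of_lists else len(columns)
    if implied != width:
        raise ValueError(
            f"{implied} columns passed, passed data had {width} columns"
        )


try:
    check_column_count([(0,)], [["a", "b"]])
except ValueError as err:
    message = str(err)
    print(message)  # 2 columns passed, passed data had 1 columns

# A list of tuples implies one column per tuple, so this passes silently.
check_column_count([(0,)], [("a", "b")])
```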

simonjayhawkins removed the Indexing and Reshaping labels on Jun 21, 2019
jbrockmendel added the Constructors label on Jul 23, 2019
@mannawar

mannawar commented Sep 9, 2019

In Python, as we know, lists are mutable. If we try to access an element that is out of range, we get an "index out of range" error.
Example: given 5 items, items = [1, "apple", 2, "boy", 3], indexing starts at zero:
items[0] will give 1
items[1] will give "apple"
items[2] will give 2
items[3] will give "boy"
items[4] will give 3
items[5] will raise "IndexError: list index out of range"
Hope it helps!

@fokoid
Contributor

fokoid commented Sep 25, 2020

Found this issue while looking for things to do for Hacktoberfest. As of #32202 the behavior has changed. Passing a list of lists for columns parameter no longer raises an exception at all, but does behave differently to a list of tuples:

  • list of tuples: tuples interpreted as multi-index values
  • list of lists: each inner list interpreted as the values of one level of the multi-index

Tests were already added in that PR for the list of lists behavior but I actually couldn't find a test for the standard list of tuples -> multi-index behavior. I can add one.

I don't think anything else still needs to be done here, but let me know if you disagree and I can give it a go.
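A small sketch of the two interpretations described above, assuming a reasonably recent pandas. The data and labels are made up for illustration:

```python
import pandas as pd

data = [[1, 2], [3, 4]]

# List of tuples: each tuple is the full label of one column.
from_tuples = pd.DataFrame(data, columns=[("a", "b"), ("c", "d")])

# List of lists: each inner list holds the values of one level,
# so the column labels are read "vertically" across the lists.
from_lists = pd.DataFrame(data, columns=[["a", "b"], ["c", "d"]])

print(list(from_tuples.columns))  # [('a', 'b'), ('c', 'd')]
print(list(from_lists.columns))   # [('a', 'c'), ('b', 'd')]
```

Note that both calls succeed, but they produce different column labels from inputs that look almost identical.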

@mroeschke
Member

I think the original issue raises an appropriate error message now and is tested in test_constructor_error_msgs, so I think this issue can be closed

In [32]: df = pd.DataFrame(data=zip(range(100)), columns=[['a','b']])
ValueError: 2 columns passed, passed data had 1 columns
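For anyone who wants to check this on their own installation, a hedged sketch: the exception type has changed across pandas versions (IndexError in the original report, ValueError in the output above), so the snippet catches several types and simply reports what it got. zip() is wrapped in list() here so the data is a concrete list of 1-tuples.

```python
import pandas as pd

caught = None
try:
    # One column of data, but a list of lists that implies two columns.
    pd.DataFrame(data=list(zip(range(100))), columns=[["a", "b"]])
except (IndexError, ValueError, AssertionError) as err:
    caught = err

# Report whichever exception this pandas version raised.
print(type(caught).__name__, caught)
```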
