Error Creating DataFrame with Single MultiIndexed Column #12457
Comments
So this is a reasonable error message
but because of this, it doesn't error properly
|
@woztheproblem want to do a PR for the above and see if can fix? |
I do now see that it works if I use a list of tuples for the columns, but I guess I don't follow why a list of lists wouldn't also work when the outer list is of length 1, since it works fine when the outer list is of length two (or more).
The reason that I'd like to use list of lists instead of list of tuples, is that my goal is to do this:
to_json() converts the columns to a list of lists when converting to JSON, so without the ability to load the columns from a list of lists even when there is only one column, a more complicated workflow is needed. (which is fine if loading columns from a list (of length 1) of lists really isn't feasible for some reason) |
why are you using |
As I said, you are specifying 2 columns but have a single list of single-element values, so it's the wrong shape. You are passing a list-of-lists for the Index, which is not allowed; you can create a multi-index if you want. |
Thanks for your responses. However, I'm having a hard time connecting what you are saying and what I'm seeing. You said it's the wrong shape, but this works:
while this doesn't:
Yet they have the same shape for the columns argument. You also said list of lists isn't allowed, but this works:
while this doesn't:
Yet both have list of lists for the columns argument. So the failure is specific to list of lists for the columns argument when the outer list is of length 1. I had forgotten about
But this doesn't:
The first line in example 2 succeeds due to the use of a list of tuples when the number of columns is 1, but the second line gets "IndexError: list index out of range" because to_json() creates a list of lists for the columns element, where the outer list is of length 1, which fails per the previous examples. |
Hi. I was trying to make my first contribution to Pandas with this ticket. This is what I got:
- For this specific case (to_json <-> read_json), the method that should be called is MultiIndex.from_tuples (or Index alone, which is the same thing) to be consistent with how the columns are stored in JSON.
One solution
A solution for this specific case is to change the method read_json to check whether the columns contain a list of lists and, if that's the case, create the MultiIndex before the DataFrame constructor is called, something like:
...
if result['columns'] and all(isinstance(i, list) for i in result['columns']):
    result['columns'] = MultiIndex.from_tuples(result['columns'])
self.obj = DataFrame(**result)
...
Hacky solution
A hacky solution is to change the method MultiIndex.from_arrays to MultiIndex.to_arrays in the line:
Notice that logic elsewhere has to change as well, for example in this line:
because the length of columns and the length of the data can differ while the multiindex generated from columns is still consistent with the data. For example, DataFrame(data=[1], columns=[['a'],['b']]) can generate a dataframe of one column with a multiindex of two levels, but in this case len(data) != len(columns), so right now this is not allowed and throws the error:
ValueError: Shape of passed values is (1, 1), indices imply (2, 1)
A better solution
IMHO we need extra information in the DataFrame constructor to know how to interpret the columns parameter. This could mean creating new params for the DataFrame constructor, for example three params called column_from_list, column_from_tuple and columns_from_product (it doesn't feel right to me, but it's an option). We could also create an extra parameter, for example colums_info, to indicate the correct interpretation of the columns. Maybe the best solution is to decide that it's better to create the index first and then pass it to the DataFrame constructor, and change the code in the places where a new DataFrame is generated.
If you guys agree that there is something to do here I would be happy to contribute; if not, I'll be looking for other issues to start with. |
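To make the ambiguity discussed above concrete, here is a small sketch (not taken from the pandas source) of how the same nested list yields different column labels depending on whether it is read with MultiIndex.from_arrays or MultiIndex.from_tuples. The variable names are only for illustration.

```python
import pandas as pd

nested = [["a", "b"], ["c", "d"]]

# from_arrays: each inner list is one LEVEL of the index,
# so labels are built by zipping the inner lists together.
by_arrays = pd.MultiIndex.from_arrays(nested)

# from_tuples: each inner list is one LABEL (one column),
# so the inner lists are kept as they are.
by_tuples = pd.MultiIndex.from_tuples([tuple(item) for item in nested])

print(list(by_arrays))  # [('a', 'c'), ('b', 'd')]
print(list(by_tuples))  # [('a', 'b'), ('c', 'd')]
```

With an outer list of length 1, the from_arrays reading gives a single level whose entries are the elements of the inner list, which is why it cannot line up with one column of data.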
This is simply invalid input, the
You would need to explicitly construct a list-of-tuples
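A minimal sketch of the explicit construction suggested here, assuming the goal is one row of data under a single two-level column. Building the MultiIndex first with MultiIndex.from_tuples removes any guesswork about how the columns argument is interpreted.

```python
import pandas as pd

# One column label with two levels: ('a', 'b')
columns = pd.MultiIndex.from_tuples([("a", "b")])

# One row, one column of data
df = pd.DataFrame([[1]], columns=columns)

print(df.shape)            # (1, 1)
print(df.columns.nlevels)  # 2
```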
|
@jreback, the problem is that in one side
One solution: Change the code in
Other option: Be able to send extra information to the DataFrame constructor so it can interpret the parameter
This is, in summary, what I was talking about in my last comment. |
@javpaw this is a simple issue. the JSON stuff is not relevant here. There is an incorrect construction before anything gets sent to JSON. |
@jreback thanks for your answer.
import pandas as pd
# DataFrame with one row and one MultiIndex column
df = pd.DataFrame(data=[1], columns=pd.MultiIndex.from_tuples([('a', 'b')]))
# Store as JSON; notice the list-of-lists generated for columns:
df_json = df.to_json(orient='split')  # '{"columns":[["a","b"]],"index":[0],"data":[[1]]}'
# This fails because `columns` is a list-of-lists and hence is read with MultiIndex.from_arrays
# internally; the correct method in this case would be MultiIndex.from_tuples.
copy_df = pd.read_json(df_json, orient='split')
Am I missing something? |
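For anyone hitting the same round-trip problem, one possible user-side workaround is to parse the JSON manually and build the MultiIndex before calling the DataFrame constructor. This is only a sketch: read_split_json_multiindex is a hypothetical helper name, not a pandas function, and it assumes orient='split' output where every column label is a list.

```python
import json

import pandas as pd


def read_split_json_multiindex(text):
    """Hypothetical helper: load orient='split' JSON, treating each
    inner list in "columns" as one multi-level column label."""
    payload = json.loads(text)
    columns = payload["columns"]
    if columns and all(isinstance(label, list) for label in columns):
        # Convert each inner list to a tuple so it is read as one label.
        columns = pd.MultiIndex.from_tuples([tuple(label) for label in columns])
    return pd.DataFrame(payload["data"], index=payload["index"], columns=columns)


# The JSON string quoted in the comment above.
df_json = '{"columns":[["a","b"]],"index":[0],"data":[[1]]}'
restored = read_split_json_multiindex(df_json)

print(list(restored.columns))  # [('a', 'b')]
```

This keeps the to_json output unchanged and only alters how the columns are interpreted on the way back in.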
@jreback Can you clarify for me why |
@jreback |
a list-of-list is not allowed, and I think we could raise on this. A list-of-tuples is by-definition a |
@vlfom tests! and you don't need to intercept it there, rather in |
@jreback However, lots of tests contain list-of-lists and fail after the fix: https://github.com/pydata/pandas/blob/master/pandas/tests/test_indexing.py#L1916 , so should lists be changed to MultiIndex'es there? |
This issue is about a better error message for incorrect DataFrame construction. as @jreback said:
pd.DataFrame([(0,)], columns=[['a','b']]) # IndexError: list index out of range
pd.DataFrame([[0]], columns=[['a','b']]) # IndexError: list index out of range
(0,)[1] # IndexError: tuple index out of range
[0][1] # IndexError: list index out of range
len((0,)) # 1 |
In Python, as we know, lists are mutable. If we try to access an element that is out of range, or one we are not allowed to access, we get an "index out of range" error. |
Found this issue while looking for things to do for Hacktoberfest. As of #32202 the behavior has changed. Passing a list of lists for columns parameter no longer raises an exception at all, but does behave differently to a list of tuples:
Tests were already added in that PR for the list of lists behavior but I actually couldn't find a test for the standard list of tuples -> multi-index behavior. I can add one. I don't think anything else still needs to be done here, but let me know if you disagree and I can give it a go. |
I think the original issue raises an appropriate error message now and is tested in
|
Attempting to create a DataFrame with a single column that is multiindexed, I get "IndexError: list index out of range".
Code Sample, a copy-pastable example if possible
Note, I use zip() in the example above in order to match what the data would look like when creating one dataframe from the data of another data frame using
df.to_json(orient='split')
(which is what I'm trying to do). If I don't use zip(), then I get:
This works fine with two (or more) columns:
output of
pd.show_versions()
I have tried with both pandas 0.17.1 and 0.18.0rc1.
INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.0rc1
nose: 1.3.7
pip: 8.0.3
setuptools: 20.1.1
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.2.6
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
jinja2: 2.8