Skip to content

BUG: read_excel with multi-indexed column ignores index_col=None #11733

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Dec 1, 2015 · 18 comments
Closed
Labels
Bug IO Excel read_excel, to_excel MultiIndex
Milestone

Comments

@jorisvandenbossche
Copy link
Member

From SO: http://stackoverflow.com/questions/34020061/excel-to-pandas-dataframe-using-first-column-as-index

@chris-b1 another one on the multi-index excel issues .. :-)

Small test case: content of excel file:

A A B B
key val key val
1 2 3 4
1 2 3 4

gives:

In [2]: pd.read_excel("test_excel_index_col.xlsx", header=[0,1], index_col=None)
Out[2]:
A     A    B
key val key  val
1     2    3   4
1     2    3   4

It's not super clear in the formatting of the dataframe, but the [1, 1] is the index and [A, key] are seen as the level names of the multi-indexed columns.

@jorisvandenbossche jorisvandenbossche added this to the 0.18.0 milestone Dec 1, 2015
@chris-b1
Copy link
Contributor

chris-b1 commented Dec 1, 2015

We don't support writing to this format (multi-index columns w/ no row index) because it's ambiguous on the way back in. But it seems reasonable to support reading it, I'll take a look.

@jreback jreback modified the milestones: Next Major Release, 0.18.0 Jan 24, 2016
@pheman
Copy link

pheman commented Jan 20, 2017

I got the same problem

@chris-b1
Copy link
Contributor

From #15180 (comment) - consider changing default of index_col to some kind of sentinel (e.g. 'infer') so that passing None means something.

@stephenrauch
Copy link
Contributor

@chris-b1 @jreback I have a proposed fix here: stephenrauch@1204b31 Before I do all the docs and stuff I wanna make sure I am headed in the right direction. Thanks.

@stephenrauch
Copy link
Contributor

Per review comment from @jreback here is proposed api for read_excel. It is the same as read_csv.

index_col : int or sequence or False, default None

Column (0-indexed) to use as the row labels of the DataFrame. If a
sequence is given, those columns will be combined into a MultiIndex.
If None (default), pandas will use the first column as the
index. If False, force pandas to not use the first column as the index
(row names).

Updates are here:

stephenrauch@9b37ff9

So if this proposed API looks ok, I will do the PR.

@chris-b1
Copy link
Contributor

Thanks @stephenrauch, that api looks good to me, please open it as a PR. One subtlety for the docs, if not using a MultiIndex header, index_col=None won't always use the first column as an index, it takes an inference path based on the shape of the data (same as for read_csv IIRC). e.g.

In [31]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 5]})

In [32]: df.to_excel('temp.xlsx', index=False)

In [33]: pd.read_excel('temp.xlsx')

   a  b
0  1  4
1  2  5
2  3  5

@stephenrauch
Copy link
Contributor

it takes an inference path based on the shape of the data

@chris-b1 @jreback Are there any test cases (or maybe some other description) that show how index_col=None works for something other than the index in the first column? Either for the csv or excel cases? I am having trouble grokking what is meant by the shape of the data in this case.

@chris-b1
Copy link
Contributor

Not sure csv actually works this way. But for current Excel behavior:

Sheet 1
image

parses a row index

In [297]: pd.read_excel('temp.xlsx', sheetname='Sheet1')
Out[297]: 
      a  b  c
row1  1  2  4
row2  1  2  4
row3  1  2  4

Sheet2
image

No row index

In [298]: pd.read_excel('temp.xlsx', sheetname='Sheet2')
Out[298]: 
   a  b  c
0  1  2  4
1  1  2  4
2  1  2  4

@chenweisomebody126
Copy link

Has this problem been solved? I met the same question.

@araespahan
Copy link

As of now I still see the same issue. when using multi headers with read_excel, pandas always assigns the first column as index.

@sdicker8
Copy link

sdicker8 commented Apr 3, 2018

Another vote for a fix. I'm running 0.22 and it seems so hackish to have to write the code to save the 'index' to a named column and then reset the index.

Pandas is a 'great' module - thank you -

@the-rccg
Copy link

Vote for index_col=False to fix this

@rileymcdowell
Copy link

Just encountered this issue and I am looking for a fix. @stephenrauch has the PR been made for this?

@WillAyd
Copy link
Member

WillAyd commented Jul 13, 2018

@rileymcdowell no PRs have been made but would welcome any if you are interested

@rileymcdowell
Copy link

I'll put together a PR from @stephenrauch's work in the next couple of days.

@rileymcdowell
Copy link

I've dug into this and ran into a decision point. Consider the following spreadsheet (Taken from the test suite).

snip

Right now, this is interpreted by the read_excel function as a MultiIndex where sheet cells A1 and A2 (values c1 and c2) represent the level names of the MultiIndex.

MultiIndex(levels=[['bar', 'foo'], ['a', 'b']],
           labels=[[1, 1, 0, 0], [0, 1, 0, 1]],
           names=['c1', 'c2'])

The situation that brought me to find this github issue is that in this example, I expect cells A1 and A2 (values c1 and c2) to be interpreted as another set of levels in the MultiIndex. For example

MultiIndex(levels=[['bar', 'c1', 'foo'], ['a', 'b', 'c2']],
           labels=[[1, 2, 2, 0, 0], [2, 0, 1, 0, 1]],
           names=['', ''])

The test suite explicitly covers the existing functionality of the former. This behavior differs from that of the read_csv function, which behaves most similarly to the latter.

A workaround is to allow a sentinel value of index_col=False as suggested in this issue, but a change to support this would also touch some of the core logic in pandas.io.parsers. For example this check which limits non-null index_col settings to have numeric values. That seems like it could have far-reaching consequences.

Any thoughts about how best to tackle this?

@WillAyd
Copy link
Member

WillAyd commented Jul 16, 2018

The situation that brought me to find this github issue is that in this example, I expect cells A1 and A2 (values c1 and c2) to be interpreted as another set of levels in the MultiIndex. For example

Hmm well I disagree since this representation matches what you'd see with a normal data frame representation, but regardless of opinions I think it just speaks to what @chris-b1 mentioned earlier that this is really ambiguous so there's not necessarily a right answer

This behavior differs from that of the read_csv function, which behaves most similarly to the latter.

Can you clarify this with an example? CSV doesn't have the concept of a merged cell like you have with the foo and bar cells so I don't think this is apples-to-apples

A workaround is to allow a sentinel value of index_col=False as suggested in this issue, but a change to support this would also touch some of the core logic in pandas.io.parsers. For example this check which limits non-null index_col settings to have numeric values. That seems like it could have far-reaching consequences.

Isn't False a valid value for read_csv? If so then there's got to be something with that implementation that isn't limited by what you've found

Thanks for the investigation!

@ian-contiamo
Copy link

@rileymcdowell I would agree with @WillAyd that the former of the two behaviors you describe is the most intuitive way to interpret an Excel file. If you can I would encourage you to submit a PR with this behavior.

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 14, 2018
@jreback jreback removed this from the Contributions Welcome milestone Nov 14, 2018
@jreback jreback added this to the 0.24.0 milestone Nov 14, 2018
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 14, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel MultiIndex
Projects
None yet
Development

No branches or pull requests