Unable to control import of 1st column from Excel file (with multi-row header) #15180

wikiped · 2017-01-20T12:16:47Z

Let's assume there are two Excel files with almost identical content:

file_1.xlsx:

file_2.xlsx:

Then reading the first file in pandas:

import pandas as pd

# Read A0 and B0 as strings so leading zeros survive
df1 = pd.read_excel('file_1.xlsx',
                    header=0, index_col=None,
                    converters={'A0': str, 'B0': str})
print(df1)

Would produce the expected result:

     A0    B0   C0  D0 E0
0  0001  0004  0.1   1  a
1  0002  0005  0.2   2  b
2  0003  0006  0.3   3  c

However, trying the same with the second file:

df2 = pd.read_excel('file_2.xlsx',
                    header=[0, 1], index_col=None,
                    converters={('A0', 'A1'): str,
                                ('A0', 'B1'): str})
print(df2)

Would yield a somewhat different and unexpected result (compared with the previous example):

A0    A0   C0    E0
A1    B1   C1 D1 E1
1   0004  0.1  1  a
2   0005  0.2  2  b
3   0006  0.3  3  c

Since it is not possible to use has_index_names=False, as it has been deprecated since 0.16.2, there seems to be no way to control how pandas imports this first column (i.e. no way to convert values before the original formatting is lost).

And there is no way to tell pandas not to assign the first column to the index, as it ignores index_col=None when header is a list.

So the question is: what would be a sensible way to regain control over the import of the first column when the header is a multi-index?

revive or de-depreciate (would that be appreciate?) has_index_names;
make index_col play a role in parsing header?
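
Until one of those options exists, one possible workaround is to read the sheet with header=None, so that the header rows arrive as ordinary data and no column is promoted to the index, and then build the column MultiIndex by hand. The sketch below is only an illustration under assumptions: promote_header_rows and fill_right are hypothetical helpers (not pandas API), the in-memory frame stands in for pd.read_excel('file_2.xlsx', header=None), and its contents are guessed from the printed output above, including the guess that blanks in the top header row come from merged cells.

```python
import pandas as pd


def fill_right(values):
    """Replace missing entries with the nearest value to their left."""
    filled, last = [], None
    for value in values:
        if pd.isna(value):
            value = last
        else:
            last = value
        filled.append(value)
    return filled


def promote_header_rows(raw, n_header_rows=2):
    """Hypothetical helper: turn the first rows of a headerless frame
    into MultiIndex columns, leaving the data rows untouched."""
    levels = [raw.iloc[i].tolist() for i in range(n_header_rows)]
    # Assumption: blanks in the upper header rows come from merged cells
    levels[:-1] = [fill_right(level) for level in levels[:-1]]
    body = raw.iloc[n_header_rows:].reset_index(drop=True)
    body.columns = pd.MultiIndex.from_arrays(levels)
    return body


# Stand-in for pd.read_excel('file_2.xlsx', header=None); contents are guessed
raw = pd.DataFrame([
    ["A0", None, "C0", None, "E0"],
    ["A1", "B1", "C1", "D1", "E1"],
    ["0001", "0004", 0.1, 1, "a"],
    ["0002", "0005", 0.2, 2, "b"],
    ["0003", "0006", 0.3, 3, "c"],
])
df2 = promote_header_rows(raw)
print(df2[("A0", "A1")].tolist())
```

With this stand-in data the first column stays a regular column and keeps its string values ('0001', '0002', '0003'). Columns that should be numeric would still need an explicit conversion afterwards, since every column starts out as object dtype.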

jreback · 2017-01-20T12:18:33Z

cc @chris-b1

chris-b1 · 2017-01-20T12:51:18Z

Thanks for the report and detailed example - this is a duplicate of #11733. I'm entirely in favor of supporting this, though it is tricky to handle all the various formats and not break back-compat.

One idea I'm not sure I had explored - right now the default for index_col is None, so passing index_col=None can't give any information. We possibly could change the default to 'infer' (since that is what is really happening) - so that passing index_col=None is meaningful. PRs / additional ideas welcome!
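
To make the 'infer' idea concrete, here is a minimal sketch of how a sentinel default could separate "argument not passed" from an explicit index_col=None. resolve_index_col is a hypothetical function written for illustration only; it is not how pandas resolves the argument, and the inference rule it encodes is just the behavior reported in this issue.

```python
def resolve_index_col(header, index_col="infer"):
    """Hypothetical helper, not part of pandas: decide which column,
    if any, should become the index."""
    multi_row_header = isinstance(header, (list, tuple)) and len(header) > 1
    if isinstance(index_col, str) and index_col == "infer":
        # Caller said nothing: keep the behavior reported in this issue,
        # where a multi-row header turns the first column into the index.
        return 0 if multi_row_header else None
    # Anything explicit, including None, is taken at face value
    return index_col


print(resolve_index_col([0, 1]))                   # argument omitted
print(resolve_index_col([0, 1], index_col=None))   # explicit "no index"
```

Under this scheme the first call would still pick column 0, preserving back-compat for callers who rely on inference, while the second call returns None and leaves every column as data.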

jreback added the IO Excel read_excel, to_excel label Jan 20, 2017

chris-b1 closed this as completed Jan 20, 2017

chris-b1 mentioned this issue Jan 20, 2017

BUG: read_excel with multi-indexed column ignores index_col=None #11733

Closed

chris-b1 added the Duplicate Report Duplicate issue or pull request label Jan 20, 2017

chris-b1 added this to the No action milestone Jan 20, 2017
