Skip to content

read_excel with multi-index columns and index_col=None, returns an index column #15660

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
stephenrauch opened this issue Mar 12, 2017 · 2 comments
Labels
Duplicate Report Duplicate issue or pull request IO Excel read_excel, to_excel

Comments

@stephenrauch
Copy link
Contributor

read_excel() seems to give the same results for a multi-index column file if index_col=None or index_col=0. Using the file from /pandas/test/io/data:

>>> df = pd.read_excel(
        "testmultiindex.xlsx", header=[0, 1], index_col=None,
        sheetname='mi_column_name')
       
>>> print(df.columns)
MultiIndex(levels=[[u'bar', u'foo'], [u'a', u'b']],
       labels=[[1, 1, 0, 0], [0, 1, 0, 1]],
       names=[u'c1', u'c2'])

>>> print(df)
c1 foo             bar       
c2   a    b          a      b
0    1  2.5 2015-01-01   True
1    2  3.5 2015-01-02  False
2    3  4.5 2015-01-03  False
3    4  5.5 2015-01-04   True

Expected Output

With index_col=None I would have expected the result to look something like:

MultiIndex(levels=[[u'bar', u'c1', u'foo'], [u'a', u'b', u'c2']],
           labels=[[1, 2, 2, 0, 0], [2, 0, 1, 0, 1]],
           names=[u'', u''])
  c1 foo             bar       
  c2   a    b          a      b
0  0   1  2.5 2015-01-01   True
1  1   2  3.5 2015-01-02  False
2  2   3  4.5 2015-01-03  False
3  3   4  5.5 2015-01-04   True

Problem description

This behavior is due to pandas.io.excel._pop_header_name().

def _pop_header_name(row, index_col):
    """ (header, new_data) for header rows in MultiIndex parsing"""
    none_fill = lambda x: None if x == '' else x

    if index_col is None:
        # no index col specified, trim data for inference path
        return none_fill(row[0]), row[1:]
    else:
        # pop out header name and fill w/ blank
        i = index_col if not is_list_like(index_col) else max(index_col)
        return none_fill(row[i]), row[:i] + [''] + row[i + 1:]

There are two things in the function that I do not understand

  1. index_col is None pulls out the first column and then treats it as an index.
  2. index_col == 0 returns something similar but adds a column of '' to the front of the return row. Why are these almost the same?

Output of pd.show_versions()

# Paste the output here pd.show_versions() here

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 30.3.0
Cython: 0.24
numpy: 1.12.0
scipy: 0.13.3
xarray: None
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: 1.3.2
bottleneck: 1.0.0
tables: 3.1.0
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.4
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.6.0

pandas_datareader: 0.3.0.post

@jreback
Copy link
Contributor

jreback commented Mar 13, 2017

@chris-b1 if you'd have a look.

@jreback jreback added the IO Excel read_excel, to_excel label Mar 13, 2017
@chris-b1
Copy link
Contributor

Thanks for the report, this is a duplicate of #11733, definitely would like to solve this.

index_col=None, contrary to what you could might expect, is treated as "infer whether or not there is an index column" and is also the default for read_excel. So as mentioned in that issue, I think the right approach is change the default for index_col to a sentinel (e.g. 'infer') so that index_col=None can mean something. PRs / ideas for the api welcome!

@chris-b1 chris-b1 modified the milestones: no acc, No action Mar 13, 2017
@chris-b1 chris-b1 added the Duplicate Report Duplicate issue or pull request label Mar 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Duplicate Report Duplicate issue or pull request IO Excel read_excel, to_excel
Projects
None yet
Development

No branches or pull requests

3 participants