BUG: misplaced index_label with DF.to_excel() #6260

iled · 2014-02-05T01:50:13Z

When saving an excel file from a DataFrame with the method to_excel, there is a misbehaviour with index_label: it is placed in the 2nd row.

For instance

>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a', 'b', 'c'])
>>> df.index.name = 'n'
>>> df
   a  b  c
n         
0  1  2  3
1  4  5  6

[2 rows x 3 columns]
>>> df.to_excel(path, index_label='test')

Produces the following result within the xls file:

Even if I force index_label=None, the behaviour persists

>>> df.to_excel('/home/julio/Testes/cost-home/test.xls', index_label=None)

Which is written as

This is a problem because then I can't read the same DF after I write it:

>>> df2 = pd.read_excel('/home/julio/Testes/cost-home/test.xls', 0)
>>> df2
    a   b   c
n NaN NaN NaN
0   1   2   3
1   4   5   6

[3 rows x 3 columns]

I'm using pandas 0.13.0. In version 0.12 the label index was inserted in the first row.

>>> show_versions()

INSTALLED VERSIONS
------------------
Python: 2.7.4.final.0
OS: Linux
Release: 3.8.0-19-generic
Processor: x86_64
byteorder: little
LC_ALL: None
LANG: pt_PT.UTF-8

pandas: 0.13.0
Cython: 0.19.1
Numpy: 1.8.0
Scipy: 0.13.2
statsmodels: 0.5.0
    patsy: 0.2.1
scikits.timeseries: Not installed
dateutil: 2.2
pytz: 2013.9
bottleneck: 0.7.0
PyTables: 3.0.0
    numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.6.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.5.2
sqlalchemy: Not installed
lxml: 3.2.3
bs4: Not installed
html5lib: Not installed
bigquery: Not installed
apiclient: Not installed

TomAugspurger · 2014-02-05T02:06:46Z

I'm not familiar with this part of the code base, but it looks like placing the index_label in the second row was intentional, probably to mimic the output in the terminal.

The problem here seems to be that if df.index has a name, there's no way to not have an index_label in the excel output (short of modifying df.index before writing).

From the docstring:

        index_label : string or sequence, default None
            Column label for index column(s) if desired. If None is given, and
            `header` and `index` are True, then the index names are used. A
            sequence should be given if the DataFrame uses MultiIndex.

Could we keep the current behavior of index_label=None defaulting to the index name when header and index are True, but look for index_label=False. if index_label=False then don't include and index label. Would treating False and None differently break too many rules?

TomAugspurger · 2014-02-05T02:09:33Z

@iled as a workaround for now, you can use the skiprows keyword in read_excel: pd.read_excel('tst.xls', sheetname=0, skiprows=[1]) to avoid the row of NaNs.

jreback · 2014-02-05T12:19:03Z

@TomAugspurger you think this is a bug?
@jtratner ?

iled · 2014-02-05T12:21:58Z

I guess that treating False and None differently would not be a good thing... I understand what you have quoted from the docstring and it makes sense that it would write the label even if index_label=None.

If it is intentional that the index_label is written in the 2nd row, then I would consider that a regression. I would say that more important than mimicking the output in the terminal, is to keep consistency. Here are two reasons:

As shown in the first message, it is not possible to read the file in the same manner as it was written;
When reading a file for instance with df = pd.DataFrame.from_csv(filepath), if the first cell, in the first row, has the value LABEL, then you will have df.index.name equal to LABEL.

TomAugspurger · 2014-02-05T13:49:18Z

git blame puts this section on locojaydev. Does anyone know who that is? The current behavior doesn't seem great, but I'd like to know what the original reason for the blank row was before changing anything.

jreback · 2014-02-05T13:55:27Z

IIRC he did a lot of the ofiginal to_excel stuff. I though index_label was just a convient way of saying make this the index label instead of the existing one?

cc @locojay

ghost · 2014-02-07T14:35:24Z

#2370 -> #2396 (comment) -> #5423

cc @jmcnamara

jmcnamara · 2014-02-07T16:03:42Z

There are a few issues involved here and I'll try to keep them separate. GitHub's lack of threads in long discussions like this is a pain. :-(

I am responsible for this change (not locojay). It was implemented as part of ENH: Better handling of MultiIndex with Excel #5254 (ENH: Better handling of MultiIndex with Excel).
This is the intended behaviour. It is meant to mimic the repl output. Whether that is right or wrong I can't say but it went through a lot of code reviewing at the time with screenshots so someone should have shouted before now. :-)
You can get the previous (0.12) behaviour by setting the merge_cells option in to_excel():
```
df.to_excel('test.xlsx', index_label='test', merge_cells=False)
```

The 0.13 output file can be re-read correctly using the has_index_names option:

 >>> pd.read_excel('test.xlsx', 0, has_index_names=True)
       a  b  c
 test
 0     1  2  3
 1     4  5  6

TomAugspurger · 2014-02-07T16:16:27Z

Thanks for the explanation. I don't think I see the has_index_names kwarg in the read_excel docstring. It could also use a mention in the prose docs. Perhaps a warning box describing how to round trip a dataframe with index names.

iled · 2014-02-07T16:19:49Z

I appreciate your help, jmcnamara. That was really helpful. The problem here is like TomAugspurger said: there is a lack of documentation.

I would consider this issue closed. Thank you all

jreback · 2014-02-07T17:15:19Z

@iled @jmcnamara

does anyone want to update the doc string / io.rst or just close?

iled · 2014-02-07T17:17:57Z

I never used Sphinx nor reST, but I want to, so I can try and update the doc strings and other documentation.

dsm054 · 2014-06-21T04:45:01Z

Since the docs have been updated, and everyone seems happy, can we close this?

iled · 2014-06-21T12:48:12Z

I thought it was already closed as my commit have been merged.

jreback · 2014-06-21T12:50:41Z

depending on how things r merged sometimes the ref issue is not auto closed
thanks!

jreback added this to the 0.14.0 milestone Feb 5, 2014

jreback added Bug labels Feb 5, 2014

jreback added Docs and removed Bug labels Feb 7, 2014

iled mentioned this issue Feb 8, 2014

DOC: clarified to_excel and read_excel (GH6260) #6302

Merged

jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014

jreback closed this as completed Jun 21, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: misplaced index_label with DF.to_excel() #6260

BUG: misplaced index_label with DF.to_excel() #6260

iled commented Feb 5, 2014

TomAugspurger commented Feb 5, 2014

TomAugspurger commented Feb 5, 2014

jreback commented Feb 5, 2014

iled commented Feb 5, 2014

TomAugspurger commented Feb 5, 2014

jreback commented Feb 5, 2014

ghost commented Feb 7, 2014

jmcnamara commented Feb 7, 2014

TomAugspurger commented Feb 7, 2014

iled commented Feb 7, 2014

jreback commented Feb 7, 2014

iled commented Feb 7, 2014

dsm054 commented Jun 21, 2014

iled commented Jun 21, 2014

jreback commented Jun 21, 2014

BUG: misplaced index_label with DF.to_excel() #6260

BUG: misplaced index_label with DF.to_excel() #6260

Comments

iled commented Feb 5, 2014

TomAugspurger commented Feb 5, 2014

TomAugspurger commented Feb 5, 2014

jreback commented Feb 5, 2014

iled commented Feb 5, 2014

TomAugspurger commented Feb 5, 2014

jreback commented Feb 5, 2014

ghost commented Feb 7, 2014

jmcnamara commented Feb 7, 2014

TomAugspurger commented Feb 7, 2014

iled commented Feb 7, 2014

jreback commented Feb 7, 2014

iled commented Feb 7, 2014

dsm054 commented Jun 21, 2014

iled commented Jun 21, 2014

jreback commented Jun 21, 2014