Skip to content

BUG: misplaced index_label with DF.to_excel() #6260

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
iled opened this issue Feb 5, 2014 · 15 comments
Closed

BUG: misplaced index_label with DF.to_excel() #6260

iled opened this issue Feb 5, 2014 · 15 comments
Labels
Docs IO Excel read_excel, to_excel
Milestone

Comments

@iled
Copy link
Contributor

iled commented Feb 5, 2014

When saving an excel file from a DataFrame with the method to_excel, there is a misbehaviour with index_label: it is placed in the 2nd row.

For instance

>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a', 'b', 'c'])
>>> df.index.name = 'n'
>>> df
   a  b  c
n         
0  1  2  3
1  4  5  6

[2 rows x 3 columns]
>>> df.to_excel(path, index_label='test')

Produces the following result within the xls file:

screenshot from 2014-02-05 01 36 35

Even if I force index_label=None, the behaviour persists

>>> df.to_excel('/home/julio/Testes/cost-home/test.xls', index_label=None)

Which is written as

screenshot from 2014-02-05 01 39 10

This is a problem because then I can't read the same DF after I write it:

>>> df2 = pd.read_excel('/home/julio/Testes/cost-home/test.xls', 0)
>>> df2
    a   b   c
n NaN NaN NaN
0   1   2   3
1   4   5   6

[3 rows x 3 columns]

I'm using pandas 0.13.0. In version 0.12 the label index was inserted in the first row.

>>> show_versions()

INSTALLED VERSIONS
------------------
Python: 2.7.4.final.0
OS: Linux
Release: 3.8.0-19-generic
Processor: x86_64
byteorder: little
LC_ALL: None
LANG: pt_PT.UTF-8

pandas: 0.13.0
Cython: 0.19.1
Numpy: 1.8.0
Scipy: 0.13.2
statsmodels: 0.5.0
    patsy: 0.2.1
scikits.timeseries: Not installed
dateutil: 2.2
pytz: 2013.9
bottleneck: 0.7.0
PyTables: 3.0.0
    numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.6.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.5.2
sqlalchemy: Not installed
lxml: 3.2.3
bs4: Not installed
html5lib: Not installed
bigquery: Not installed
apiclient: Not installed
@TomAugspurger
Copy link
Contributor

I'm not familiar with this part of the code base, but it looks like placing the index_label in the second row was intentional, probably to mimic the output in the terminal.

The problem here seems to be that if df.index has a name, there's no way to not have an index_label in the excel output (short of modifying df.index before writing).

From the docstring:

        index_label : string or sequence, default None
            Column label for index column(s) if desired. If None is given, and
            `header` and `index` are True, then the index names are used. A
            sequence should be given if the DataFrame uses MultiIndex.

Could we keep the current behavior of index_label=None defaulting to the index name when header and index are True, but look for index_label=False. if index_label=False then don't include and index label. Would treating False and None differently break too many rules?

@TomAugspurger
Copy link
Contributor

@iled as a workaround for now, you can use the skiprows keyword in read_excel: pd.read_excel('tst.xls', sheetname=0, skiprows=[1]) to avoid the row of NaNs.

@jreback
Copy link
Contributor

jreback commented Feb 5, 2014

@TomAugspurger you think this is a bug?
@jtratner ?

@iled
Copy link
Contributor Author

iled commented Feb 5, 2014

I guess that treating False and None differently would not be a good thing... I understand what you have quoted from the docstring and it makes sense that it would write the label even if index_label=None.

If it is intentional that the index_label is written in the 2nd row, then I would consider that a regression. I would say that more important than mimicking the output in the terminal, is to keep consistency. Here are two reasons:

  1. As shown in the first message, it is not possible to read the file in the same manner as it was written;
  2. When reading a file for instance with df = pd.DataFrame.from_csv(filepath), if the first cell, in the first row, has the value LABEL, then you will have df.index.name equal to LABEL.

@jreback jreback added this to the 0.14.0 milestone Feb 5, 2014
@TomAugspurger
Copy link
Contributor

git blame puts this section on locojaydev. Does anyone know who that is? The current behavior doesn't seem great, but I'd like to know what the original reason for the blank row was before changing anything.

@jreback
Copy link
Contributor

jreback commented Feb 5, 2014

IIRC he did a lot of the ofiginal to_excel stuff. I though index_label was just a convient way of saying make this the index label instead of the existing one?

cc @locojay

@ghost
Copy link

ghost commented Feb 7, 2014

#2370 -> #2396 (comment) -> #5423

cc @jmcnamara

@jmcnamara
Copy link
Contributor

There are a few issues involved here and I'll try to keep them separate. GitHub's lack of threads in long discussions like this is a pain. :-(

  1. I am responsible for this change (not locojay). It was implemented as part of ENH: Better handling of MultiIndex with Excel #5254 (ENH: Better handling of MultiIndex with Excel).

  2. This is the intended behaviour. It is meant to mimic the repl output. Whether that is right or wrong I can't say but it went through a lot of code reviewing at the time with screenshots so someone should have shouted before now. :-)

  3. You can get the previous (0.12) behaviour by setting the merge_cells option in to_excel():

    df.to_excel('test.xlsx', index_label='test', merge_cells=False)
    
  4. The 0.13 output file can be re-read correctly using the has_index_names option:

     >>> pd.read_excel('test.xlsx', 0, has_index_names=True)
           a  b  c
     test
     0     1  2  3
     1     4  5  6
    

@TomAugspurger
Copy link
Contributor

Thanks for the explanation. I don't think I see the has_index_names kwarg in the read_excel docstring. It could also use a mention in the prose docs. Perhaps a warning box describing how to round trip a dataframe with index names.

@iled
Copy link
Contributor Author

iled commented Feb 7, 2014

I appreciate your help, jmcnamara. That was really helpful. The problem here is like TomAugspurger said: there is a lack of documentation.

I would consider this issue closed. Thank you all

@jreback
Copy link
Contributor

jreback commented Feb 7, 2014

@iled @jmcnamara

does anyone want to update the doc string / io.rst or just close?

@iled
Copy link
Contributor Author

iled commented Feb 7, 2014

I never used Sphinx nor reST, but I want to, so I can try and update the doc strings and other documentation.

@dsm054
Copy link
Contributor

dsm054 commented Jun 21, 2014

Since the docs have been updated, and everyone seems happy, can we close this?

@iled
Copy link
Contributor Author

iled commented Jun 21, 2014

I thought it was already closed as my commit have been merged.

@jreback jreback closed this as completed Jun 21, 2014
@jreback
Copy link
Contributor

jreback commented Jun 21, 2014

depending on how things r merged sometimes the ref issue is not auto closed
thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs IO Excel read_excel, to_excel
Projects
None yet
Development

No branches or pull requests

5 participants