NaN label in MultiIndex is assigned a non NaN value when writing to excel file #13511

mpuels · 2016-06-25T09:54:49Z

Given a DataFrame with a MultiIndex: when a label of the MultiIndex is NaN and the DataFrame is written to an Excel file, the label has a non-NaN value in the Excel file.

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame({'c1': [1, 1, 2, 2],
                   'c2': [None] + "b a b".split()})
df

returns a DataFrame where the first element of column 'c2' is NaN.

Set both columns as the index:

df_midx = df.set_index(['c1', 'c2'])
df_midx

returns

Write DataFrame to excel file and read it back in:

df_midx.to_excel('df_midx.xlsx')
df_midx_from_xlsx = pd.read_excel('df_midx.xlsx')
df_midx_from_xlsx

returns

The first element of column 'c2' is now set to 'b' instead of NaN.

Expected Output

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: ac174349b0e1525475c2354e1c0b8ee1ed1cabad
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-88-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 1.5.4
setuptools: 2.2
Cython: None
numpy: 1.11.0
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.0.1
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: 2.5.2
matplotlib: 1.5.0
openpyxl: 2.3.5
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None


jorisvandenbossche · 2016-06-25T10:53:13Z

This indeed seems like a bug. (Your example is a bit strange, as you end up with an empty DataFrame with a MultiIndex because you set all columns as the index, but the problem occurs for a non-empty DataFrame as well.)

Interested in trying to look for a fix?

jreback · 2016-06-25T19:08:58Z

this is prob a duplicate of #6322 or #5286

mpuels · 2016-06-27T17:56:07Z

I think the bug is in the method pandas.formats.format.ExcelFormatter._format_hierarchical_rows(), which is called by the public method pandas.formats.format.ExcelFormatter.get_formatted_cells(). More precisely, the problem is the statement

values = levels.take(labels)

where levels and labels both describe the same level of the MultiIndex. levels contains the possible labels for that level (e.g. ['bar', 'foo']) and labels contains integer positions (e.g. [0,0,0,1,1,1,1]) that point to elements in levels. If a label is null, it is represented by -1 in labels (e.g. [0,0,-1,1,1,1,1], if the label of the third row is null). The problem is that the statement doesn't treat -1 as a special value: take() interprets it as an ordinary negative index, so the null label is replaced by the last element of levels.
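The wrap-around can be reproduced with plain NumPy, independent of pandas. This is only a minimal sketch: the arrays are illustrative stand-ins for one level and its positions, and the masking step shows the general idea of treating -1 as a sentinel, not the code pandas uses.

```python
import numpy as np

# Illustrative stand-ins for one MultiIndex level and its integer positions.
levels = np.array(["bar", "foo"], dtype=object)
labels = np.array([0, 0, -1, 1, 1, 1, 1])  # -1 marks a null label (third row)

# A plain take() applies ordinary negative indexing: -1 selects the last level.
naive = levels.take(labels)

# Treating -1 as a sentinel: take first, then overwrite the masked positions.
fixed = naive.copy()
fixed[labels == -1] = np.nan

print(naive.tolist())
print(fixed.tolist())
```

The third entry of naive is 'foo' (the last element of levels), while the third entry of fixed is NaN.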

To fix the bug, the above statement could be replaced by

if levels._can_hold_na:
    values = levels.take(labels, fill_value=True)
else:
    values = levels.take(labels)
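As a rough illustration of what the fill does, the sketch below calls Index.take on an object-dtype Index with and without a fill value. It assumes a reasonably recent pandas, where a non-None fill_value together with allow_fill=True makes take() treat -1 as missing; it passes np.nan rather than True as the fill value, and the names level and codes are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical level and positions, mirroring the 'c2' level of the example.
level = pd.Index(["b", "c", "d"], name="c2")
codes = np.array([-1, 0, 1, 2])

# Without a fill value, -1 is an ordinary negative index (last element).
wrapped = level.take(codes)

# With a non-None fill value, -1 is treated as "missing" and becomes NaN.
filled = level.take(codes, allow_fill=True, fill_value=np.nan)

print(wrapped)
print(filled)
```

An integer Index cannot hold NaN, which is why the proposed change only asks for the fill when levels._can_hold_na is true.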

What follows is an example:

df = (pd.DataFrame({'c1': [1,1,2,2],
                    'c2': [None] + "b c d".split(),
                    'v' : [6,7,8,9]})
        .set_index(['c1', 'c2']))

df

yields

c1  c2  v
1       6
1   b   7
2   c   8
2   d   9

df.index

yields

MultiIndex(levels=[[1, 2], [u'b', u'c', u'd']],
           labels=[[0, 0, 1, 1], [-1, 0, 1, 2]],
           names=[u'c1', u'c2'])

for levels, labels in zip(df.index.levels, df.index.labels):
    print levels.take(labels)
    print levels._can_hold_na
    if levels._can_hold_na:
        print levels.take(labels, fill_value=True)
    print levels
    print labels
    print "------"

yields

Int64Index([1, 1, 2, 2], dtype='int64', name=u'c1')
False
Int64Index([1, 2], dtype='int64', name=u'c1')
FrozenNDArray([0, 0, 1, 1], dtype='int8')
------
Index([u'd', u'b', u'c', u'd'], dtype='object', name=u'c2')
True
Index([nan, u'b', u'c', u'd'], dtype='object', name=u'c2')
Index([u'b', u'c', u'd'], dtype='object', name=u'c2')
FrozenNDArray([-1, 0, 1, 2], dtype='int8')
------
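For readers on a newer pandas, the same inspection can be sketched as follows. It assumes a release in which MultiIndex.labels has been renamed to MultiIndex.codes, and it uses get_level_values() to show the NaN-aware result for comparison; the variable names are illustrative.

```python
import pandas as pd

df = (pd.DataFrame({'c1': [1, 1, 2, 2],
                    'c2': [None, 'b', 'c', 'd'],
                    'v': [6, 7, 8, 9]})
        .set_index(['c1', 'c2']))

# Positions into the second level; the null label is stored as -1.
codes_c2 = df.index.codes[1].tolist()
# The level itself holds only the non-null labels.
levels_c2 = df.index.levels[1].tolist()
# get_level_values() maps -1 back to NaN instead of wrapping around.
values_c2 = df.index.get_level_values(1)

print(codes_c2)
print(levels_c2)
print(values_c2)
```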

I'd like to fix that bug, but I haven't found any unit tests for pandas.formats.format.ExcelFormatter. Shall I create one in pandas.tests.formats.test_format.py? And do you have any suggestions on how to assert that the correct ExcelCells are returned by get_formatted_cells()?

jorisvandenbossche · 2016-06-27T18:35:35Z

That looks like a very reasonable explanation! PR very welcome.

For the tests, I don't think we have tests for ExcelFormatter directly, but tests for read_excel/to_excel are in https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_excel.py. The basic tests are, e.g., here: https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_excel.py#L1311. Those typically read the written file back in to check its correctness. That approach should be possible here as well.

jorisvandenbossche added the Bug, IO Excel (read_excel, to_excel), and MultiIndex labels Jun 25, 2016

mpuels mentioned this issue Jul 3, 2016

BUG: Fix .to_excel() for MultiIndex containing a NaN value #13511 #13551

Merged

4 tasks

mpuels added a commit to mpuels/pandas that referenced this issue Jul 24, 2016

BUG: Fix .to_excel() for MultiIndex containing a NaN value pandas-dev…

9abc4e8

…#13511

mpuels added a commit to mpuels/pandas that referenced this issue Jul 24, 2016

BUG: Fix .to_excel() for MultiIndex containing a NaN value pandas-dev…

335cf86

…#13511

jreback added this to the 0.19.0 milestone Jul 25, 2016

jorisvandenbossche closed this as completed in #13551 Jul 25, 2016

jorisvandenbossche pushed a commit that referenced this issue Jul 25, 2016

BUG: Fix .to_excel() for MultiIndex containing a NaN value #13511 (#1…

4c2840e

…3551)
