BUG: read_excel surprisingly filling empty levels in MultiIndex after first value #44837

phofl · 2021-12-10T09:43:29Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": [np.nan, 1, np.nan], "c": 1})

df.to_excel("test.xlsx", index=False)

pd.read_excel("test.xlsx", index_col=[0, 1])

Issue Description

The code snippet returns

Based on the discussions when this was implemented I think the filling of NaNs was implemented to handle merged cells, but in this case this is quite surprising and I think should be considered a bug.

Expected Behavior

I would expect:

       c
a b     
1 NaN  1
2 1.0  1
3 NaN  1

cc @rhshadrach I think you have done a few things with read_excel in the past?

Installed Versions

INSTALLED VERSIONS ------------------ commit : ff84f69 python : 3.8.12.final.0 python-bits : 64 OS : Linux OS-release : 5.11.0-41-generic Version : #45~20.04.1-Ubuntu SMP Wed Nov 10 10:20:10 UTC 2021 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 1.4.0.dev0+1349.gff84f69269
numpy : 1.21.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.23.1
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.28.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : 2021.11.0
fastparquet : 0.7.1
gcsfs : 2021.05.0
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.7.2
sqlalchemy : 1.4.25
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1
None

The text was updated successfully, but these errors were encountered:

abatomunkuev · 2021-12-11T12:43:43Z

Hello @phofl ! I would like to work on this issue.

abatomunkuev · 2021-12-11T14:18:52Z

@phofl I have done some root cause analysis and found a potential problem that causes filling missing values in read_excel() method.

I have found the cause of the problem in file: pandas/io/excel/_base.py

Specifically, in this portion of code:

pandas/pandas/io/excel/_base.py

Lines 608 to 618 in cbb6c73

    
           # Check if we have an empty dataset 
        
           # before trying to collect data. 
        
           if offset < len(data): 
        
               for col in index_col: 
        
                   last = data[offset][col] 
        
                   for row in range(offset + 1, len(data)): 
        
                       if data[row][col] == "" or data[row][col] is None: 
        
                           data[row][col] = last 
        
                       else: 
        
                           last = data[row][col]

I have some questions regarding the portion of code above:

What does the variable offset do ?
In the loop below, why do we define the range from offset + 1 to length of the DataFrame.
pandas/pandas/io/excel/_base.py

Line 614 in cbb6c73

for row in range(offset + 1, len(data)):
Since we start with offset + 1, the iteration starts with the second row of the DataFrame. I have logged the row. It is starting from the second row of the DataFrame, skipping the first row containing the missing value (N/A).

Since, it skipped the first row, it didn't find the first missing value (N/A)

Secondly, could you explain the portion of code below?

pandas/pandas/io/excel/_base.py

Lines 608 to 618 in cbb6c73

    
           # Check if we have an empty dataset 
        
           # before trying to collect data. 
        
           if offset < len(data): 
        
               for col in index_col: 
        
                   last = data[offset][col] 
        
                   for row in range(offset + 1, len(data)): 
        
                       if data[row][col] == "" or data[row][col] is None: 
        
                           data[row][col] = last 
        
                       else: 
        
                           last = data[row][col]

Question is should we remove that portion of code to allow having missing values in read_excel function.
After removing that portion of the code, the method read_excel() doesn't fill the missing values.

So my concerns are:

Should we remove the following portion of code:

pandas/pandas/io/excel/_base.py

Lines 613 to 619 in cbb6c73

    
           for row in range(offset + 1, len(data)): 
        
               if data[row][col] == "" or data[row][col] is None: 
        
                   data[row][col] = last 
        
               else: 
        
                   last = data[row][col]

If we remove the portion of code above, what will be possible bad effects of it?
Is there a way to not removing, just modifying that portion of code? Let me know, I can work on it with your assistance.

abatomunkuev · 2021-12-11T14:20:58Z

@phofl I will share Google Colab notebook, where I have been doing the root cause analysis.

https://colab.research.google.com/drive/1_TC3Vd4fE2xzEoUJbuqnxtsFeH2nQW0a?usp=sharing

rhshadrach · 2021-12-11T18:44:41Z

This behavior was introduced in #10967. pandas is forward filling index columns under the assumption that the excel sheet was written from a pandas MultiIndex. It seems we could write a function that tries to determine if a given set of columns arises this way. It won't ever able to be perfect. I think better would be to have it be user specified, but there are already an overwhelming number of arguments to read_excel.

abatomunkuev · 2021-12-12T23:00:58Z

@rhshadrach Thanks for clarification.

I am going to try to implement one of the mentioned ideas

abatomunkuev · 2021-12-15T00:14:18Z

@rhshadrach @phofl

I have found out that the variable last gets overridden when the current MultiIndex cell is not empty (not N/A).

pandas/pandas/io/excel/_base.py

Lines 614 to 618 in cbb6c73

    
           for row in range(offset + 1, len(data)): 
        
               if data[row][col] == "" or data[row][col] is None: 
        
                   data[row][col] = last 
        
               else: 
        
                   last = data[row][col]

Consider the DataFrame that was provided to trigger the issue by @phofl. Suppose that columns a and b are MultiIndexes.

Once the iteration reaches the second row of the B index, it will override the last variable with the value of 1. After that, it finds the third row of the B index has a missing value (N/A), it assigns the value of the variable last to it. Hence, the missing value fills with 1.

To possibly solve this issue, can we add an argument to the read_excel() method so that it will not override Indexes with N/A?

@rhshadrach Could you please clarify this sentence:
It seems we could write a function that tries to determine if a given set of columns arises this way. It won't ever able to be perfect. #44837 (comment)

In which way does the specified set of columns arise? in case we will choose to implement this function.

phofl · 2021-12-15T15:31:59Z

I am not sure if we want to add an argument to read_excel for this, an easy workaround would be to simply call set_index afterwards.

Implementing a function guessing this sounds nice, but not sure if we can do this in a way to avoid inconsistencies?

abatomunkuev · 2021-12-15T17:38:43Z

@phofl How we would call the set_index() method, if we already specified the MultiIndexes using the argument index_col?

Do you mean that we could use set_index() method inside the read_excel() method?

phofl · 2021-12-15T17:46:35Z

No after calling read_excel

abatomunkuev · 2021-12-15T17:50:47Z

The method chaining works.

I think we still need to solve the issue with index_col

rhshadrach · 2021-12-16T02:54:42Z

@rhshadrach Could you please clarify this sentence: It seems we could write a function that tries to determine if a given set of columns arises this way. It won't ever able to be perfect. #44837 (comment)

In which way does the specified set of columns arise? in case we will choose to implement this function.

The following two sheets are the same (ignoring formatting)

df = pd.DataFrame({'a': [1., 1., 2.], 'b': [4, 5, 6], 'c': [7, 8, 9]}).set_index(list('ab'))
df.to_excel('test.xlsx')

df = pd.DataFrame({'a': [1., np.nan, 2.], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df.to_excel('test2.xlsx', index=False)

so it is impossible to tell (at least, using cell values) whether a frame was written with a MultiIndex and therefore blanks should be filled in, or if the blanks correspond to null values.

rhshadrach · 2021-12-16T03:02:21Z

-1 on adding an argument to excel for this. Instead, I think we should enable it to be easily accomplished. First, the behavior of read_excel should be changed to not ffill: ffill is not invertible. It is then up to the user whether they need to apply ffill themselves. Now, Index does not have a ffill, but if it did, I think we could do something like

df = pd.read_excel("test.xlsx", index_col=[0, 1])
df.index = df.index.ffill()  # for when blank cells are from writing a MultiIndex rather than null values

abatomunkuev · 2021-12-16T12:50:47Z

@rhshadrach I think we can take this approach.

So, as I understood, we will leave the empty indexes when parsing the MultiIndex data using read_excel(). (By default we will leave the empty indexes)

We will implement ffill() method to allow a user to fill all the empty indexes. So, it will be up to the user whether the user wants to fill empty indexes or not.

phofl · 2021-12-16T13:32:13Z

@rhshadrach

Actually I was leaning a bit into the other direction. We could also point the user to not use index_col if missing values are expected and just call set_index afterwards. Do you have a strong opinion here?

rhshadrach · 2021-12-18T03:42:40Z

@phofl - ~~Agreed~~ (Edit: I misread your comment, previously I was agreeing with your 2nd sentence. I realize now I don't understand your first). And your comment makes me wonder how much index_col is needed at all.

Completely tangential, but I also wonder if users really want index values missing by default. For me, one of the primary things I do with excel files is filter them to values I'm interested in, for which missing index values doesn't work.

phofl · 2022-02-14T15:52:54Z

Coming back to this:

The intention for index_col was (I think) to allow roundtripping:

df = pd.DataFrame({"a": 1, "b": [4, 5, 6], "c": [10, 11, 12]}).set_index(["a", "b"])

df.to_excel("test.xlsx")

pd.read_excel("test.xlsx", index_col=[0, 1])

The values in row 2 and 3 look empty, because they are written as a merged cell. If we would not fill them, the MultiIndex could no be recovered. I think the best course of action is to document, that using index_col fills missing values with the previous valid value and keep the current behavior. If you want actual missing values in the Index, you have to call set_index afterwards

rhshadrach · 2022-02-14T16:27:38Z

+1 - docs could use some improvement here. Something akin to adding "missing values will be forward filled, to allow for round tripping with to_excel when merge_cells is True. Use set_index after reading instead of index_col if you want to avoid forward filling".

phofl added Bug Needs Triage Issue that has not been reviewed by a pandas team member IO Excel read_excel, to_excel labels Dec 10, 2021

simonjayhawkins removed the Needs Triage Issue that has not been reviewed by a pandas team member label Dec 17, 2021

phofl added Docs and removed Bug labels Feb 14, 2022

phofl mentioned this issue Feb 14, 2022

DOCS: Add comment about index_col and forward filling missing values #45989

Merged

4 tasks

rhshadrach added this to the 1.5 milestone Feb 14, 2022

rhshadrach closed this as completed in #45989 Feb 14, 2022

phofl mentioned this issue Nov 16, 2022

DOC: read_excel example for forward filling with index_col noatamir/pyladies-berlin-sprints#5

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: read_excel surprisingly filling empty levels in MultiIndex after first value #44837

BUG: read_excel surprisingly filling empty levels in MultiIndex after first value #44837

phofl commented Dec 10, 2021

abatomunkuev commented Dec 11, 2021 •

edited

Loading

abatomunkuev commented Dec 11, 2021

abatomunkuev commented Dec 11, 2021

rhshadrach commented Dec 11, 2021

abatomunkuev commented Dec 12, 2021

abatomunkuev commented Dec 15, 2021

phofl commented Dec 15, 2021

abatomunkuev commented Dec 15, 2021

phofl commented Dec 15, 2021

abatomunkuev commented Dec 15, 2021

rhshadrach commented Dec 16, 2021

rhshadrach commented Dec 16, 2021 •

edited

Loading

abatomunkuev commented Dec 16, 2021

phofl commented Dec 16, 2021

rhshadrach commented Dec 18, 2021 •

edited

Loading

phofl commented Feb 14, 2022

rhshadrach commented Feb 14, 2022

BUG: read_excel surprisingly filling empty levels in MultiIndex after first value #44837

BUG: read_excel surprisingly filling empty levels in MultiIndex after first value #44837

Comments

phofl commented Dec 10, 2021

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

abatomunkuev commented Dec 11, 2021 • edited Loading

abatomunkuev commented Dec 11, 2021

abatomunkuev commented Dec 11, 2021

rhshadrach commented Dec 11, 2021

abatomunkuev commented Dec 12, 2021

abatomunkuev commented Dec 15, 2021

phofl commented Dec 15, 2021

abatomunkuev commented Dec 15, 2021

phofl commented Dec 15, 2021

abatomunkuev commented Dec 15, 2021

rhshadrach commented Dec 16, 2021

rhshadrach commented Dec 16, 2021 • edited Loading

abatomunkuev commented Dec 16, 2021

phofl commented Dec 16, 2021

rhshadrach commented Dec 18, 2021 • edited Loading

phofl commented Feb 14, 2022

rhshadrach commented Feb 14, 2022

abatomunkuev commented Dec 11, 2021 •

edited

Loading

rhshadrach commented Dec 16, 2021 •

edited

Loading

rhshadrach commented Dec 18, 2021 •

edited

Loading