Skip to content

BUG: read_excel surprisingly filling empty levels in MultiIndex after first value #44837

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
phofl opened this issue Dec 10, 2021 · 17 comments · Fixed by #45989
Closed
3 tasks done

BUG: read_excel surprisingly filling empty levels in MultiIndex after first value #44837

phofl opened this issue Dec 10, 2021 · 17 comments · Fixed by #45989
Labels
Docs IO Excel read_excel, to_excel
Milestone

Comments

@phofl
Copy link
Member

phofl commented Dec 10, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": [np.nan, 1, np.nan], "c": 1})

df.to_excel("test.xlsx", index=False)

pd.read_excel("test.xlsx", index_col=[0, 1])

Issue Description

The code snippet returns

       c
a b     
1 NaN  1
2 1.0  1
3 1.0  1

Based on the discussions when this was implemented I think the filling of NaNs was implemented to handle merged cells, but in this case this is quite surprising and I think should be considered a bug.

Expected Behavior

I would expect:

       c
a b     
1 NaN  1
2 1.0  1
3 NaN  1

cc @rhshadrach I think you have done a few things with read_excel in the past?

Installed Versions

INSTALLED VERSIONS ------------------ commit : ff84f69 python : 3.8.12.final.0 python-bits : 64 OS : Linux OS-release : 5.11.0-41-generic Version : #45~20.04.1-Ubuntu SMP Wed Nov 10 10:20:10 UTC 2021 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 1.4.0.dev0+1349.gff84f69269
numpy : 1.21.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.23.1
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.28.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : 2021.11.0
fastparquet : 0.7.1
gcsfs : 2021.05.0
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.7.2
sqlalchemy : 1.4.25
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1
None

@phofl phofl added Bug Needs Triage Issue that has not been reviewed by a pandas team member IO Excel read_excel, to_excel labels Dec 10, 2021
@abatomunkuev
Copy link
Contributor

abatomunkuev commented Dec 11, 2021

Hello @phofl ! I would like to work on this issue.

@abatomunkuev
Copy link
Contributor

@phofl I have done some root cause analysis and found a potential problem that causes filling missing values in read_excel() method.

I have found the cause of the problem in file: pandas/io/excel/_base.py

Specifically, in this portion of code:

# Check if we have an empty dataset
# before trying to collect data.
if offset < len(data):
for col in index_col:
last = data[offset][col]
for row in range(offset + 1, len(data)):
if data[row][col] == "" or data[row][col] is None:
data[row][col] = last
else:
last = data[row][col]

I have some questions regarding the portion of code above:

  • What does the variable offset do ?
  • In the loop below, why do we define the range from offset + 1 to length of the DataFrame.
  • for row in range(offset + 1, len(data)):
  • Since we start with offset + 1, the iteration starts with the second row of the DataFrame. I have logged the row. It is starting from the second row of the DataFrame, skipping the first row containing the missing value (N/A).

image

  • Since, it skipped the first row, it didn't find the first missing value (N/A)
  • Secondly, could you explain the portion of code below?
    # Check if we have an empty dataset
    # before trying to collect data.
    if offset < len(data):
    for col in index_col:
    last = data[offset][col]
    for row in range(offset + 1, len(data)):
    if data[row][col] == "" or data[row][col] is None:
    data[row][col] = last
    else:
    last = data[row][col]
  • Question is should we remove that portion of code to allow having missing values in read_excel function.
  • After removing that portion of the code, the method read_excel() doesn't fill the missing values.
    image

So my concerns are:

  • Should we remove the following portion of code:
  • for row in range(offset + 1, len(data)):
    if data[row][col] == "" or data[row][col] is None:
    data[row][col] = last
    else:
    last = data[row][col]
  • If we remove the portion of code above, what will be possible bad effects of it?
  • Is there a way to not removing, just modifying that portion of code? Let me know, I can work on it with your assistance.

@abatomunkuev
Copy link
Contributor

@phofl I will share Google Colab notebook, where I have been doing the root cause analysis.

https://colab.research.google.com/drive/1_TC3Vd4fE2xzEoUJbuqnxtsFeH2nQW0a?usp=sharing

@rhshadrach
Copy link
Member

This behavior was introduced in #10967. pandas is forward filling index columns under the assumption that the excel sheet was written from a pandas MultiIndex. It seems we could write a function that tries to determine if a given set of columns arises this way. It won't ever able to be perfect. I think better would be to have it be user specified, but there are already an overwhelming number of arguments to read_excel.

@abatomunkuev
Copy link
Contributor

@rhshadrach Thanks for clarification.

I am going to try to implement one of the mentioned ideas

@abatomunkuev
Copy link
Contributor

@rhshadrach @phofl

I have found out that the variable last gets overridden when the current MultiIndex cell is not empty (not N/A).

for row in range(offset + 1, len(data)):
if data[row][col] == "" or data[row][col] is None:
data[row][col] = last
else:
last = data[row][col]

Consider the DataFrame that was provided to trigger the issue by @phofl. Suppose that columns a and b are MultiIndexes.
Screen Shot 2021-12-14 at 6 56 07 PM

Once the iteration reaches the second row of the B index, it will override the last variable with the value of 1. After that, it finds the third row of the B index has a missing value (N/A), it assigns the value of the variable last to it. Hence, the missing value fills with 1.
image

To possibly solve this issue, can we add an argument to the read_excel() method so that it will not override Indexes with N/A?

@rhshadrach Could you please clarify this sentence:
It seems we could write a function that tries to determine if a given set of columns arises this way. It won't ever able to be perfect. #44837 (comment)

In which way does the specified set of columns arise? in case we will choose to implement this function.

@phofl
Copy link
Member Author

phofl commented Dec 15, 2021

I am not sure if we want to add an argument to read_excel for this, an easy workaround would be to simply call set_index afterwards.

Implementing a function guessing this sounds nice, but not sure if we can do this in a way to avoid inconsistencies?

@abatomunkuev
Copy link
Contributor

@phofl How we would call the set_index() method, if we already specified the MultiIndexes using the argument index_col?

Do you mean that we could use set_index() method inside the read_excel() method?

@phofl
Copy link
Member Author

phofl commented Dec 15, 2021

No after calling read_excel

@abatomunkuev
Copy link
Contributor

The method chaining works.
image

I think we still need to solve the issue with index_col

@rhshadrach
Copy link
Member

@rhshadrach Could you please clarify this sentence: It seems we could write a function that tries to determine if a given set of columns arises this way. It won't ever able to be perfect. #44837 (comment)

In which way does the specified set of columns arise? in case we will choose to implement this function.

The following two sheets are the same (ignoring formatting)

df = pd.DataFrame({'a': [1., 1., 2.], 'b': [4, 5, 6], 'c': [7, 8, 9]}).set_index(list('ab'))
df.to_excel('test.xlsx')

df = pd.DataFrame({'a': [1., np.nan, 2.], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df.to_excel('test2.xlsx', index=False)

so it is impossible to tell (at least, using cell values) whether a frame was written with a MultiIndex and therefore blanks should be filled in, or if the blanks correspond to null values.

@rhshadrach
Copy link
Member

rhshadrach commented Dec 16, 2021

-1 on adding an argument to excel for this. Instead, I think we should enable it to be easily accomplished. First, the behavior of read_excel should be changed to not ffill: ffill is not invertible. It is then up to the user whether they need to apply ffill themselves. Now, Index does not have a ffill, but if it did, I think we could do something like

df = pd.read_excel("test.xlsx", index_col=[0, 1])
df.index = df.index.ffill()  # for when blank cells are from writing a MultiIndex rather than null values

@abatomunkuev
Copy link
Contributor

@rhshadrach I think we can take this approach.

So, as I understood, we will leave the empty indexes when parsing the MultiIndex data using read_excel(). (By default we will leave the empty indexes)

We will implement ffill() method to allow a user to fill all the empty indexes. So, it will be up to the user whether the user wants to fill empty indexes or not.

@phofl
Copy link
Member Author

phofl commented Dec 16, 2021

@rhshadrach

Actually I was leaning a bit into the other direction. We could also point the user to not use index_col if missing values are expected and just call set_index afterwards. Do you have a strong opinion here?

@simonjayhawkins simonjayhawkins removed the Needs Triage Issue that has not been reviewed by a pandas team member label Dec 17, 2021
@rhshadrach
Copy link
Member

rhshadrach commented Dec 18, 2021

@phofl - Agreed (Edit: I misread your comment, previously I was agreeing with your 2nd sentence. I realize now I don't understand your first). And your comment makes me wonder how much index_col is needed at all.

Completely tangential, but I also wonder if users really want index values missing by default. For me, one of the primary things I do with excel files is filter them to values I'm interested in, for which missing index values doesn't work.

@phofl
Copy link
Member Author

phofl commented Feb 14, 2022

Coming back to this:

The intention for index_col was (I think) to allow roundtripping:

df = pd.DataFrame({"a": 1, "b": [4, 5, 6], "c": [10, 11, 12]}).set_index(["a", "b"])

df.to_excel("test.xlsx")

pd.read_excel("test.xlsx", index_col=[0, 1])

The values in row 2 and 3 look empty, because they are written as a merged cell. If we would not fill them, the MultiIndex could no be recovered. I think the best course of action is to document, that using index_col fills missing values with the previous valid value and keep the current behavior. If you want actual missing values in the Index, you have to call set_index afterwards

@rhshadrach
Copy link
Member

+1 - docs could use some improvement here. Something akin to adding "missing values will be forward filled, to allow for round tripping with to_excel when merge_cells is True. Use set_index after reading instead of index_col if you want to avoid forward filling".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants