-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: read_excel surprisingly filling empty levels in MultiIndex after first value #44837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hello @phofl ! I would like to work on this issue. |
@phofl I have done some root cause analysis and found a potential problem that causes filling missing values in I have found the cause of the problem in file: Specifically, in this portion of code: pandas/pandas/io/excel/_base.py Lines 608 to 618 in cbb6c73
I have some questions regarding the portion of code above:
So my concerns are:
|
@phofl I will share Google Colab notebook, where I have been doing the root cause analysis. https://colab.research.google.com/drive/1_TC3Vd4fE2xzEoUJbuqnxtsFeH2nQW0a?usp=sharing |
This behavior was introduced in #10967. pandas is forward filling index columns under the assumption that the excel sheet was written from a pandas MultiIndex. It seems we could write a function that tries to determine if a given set of columns arises this way. It won't ever able to be perfect. I think better would be to have it be user specified, but there are already an overwhelming number of arguments to read_excel. |
@rhshadrach Thanks for clarification. I am going to try to implement one of the mentioned ideas |
I have found out that the variable pandas/pandas/io/excel/_base.py Lines 614 to 618 in cbb6c73
Consider the DataFrame that was provided to trigger the issue by @phofl. Suppose that columns Once the iteration reaches the second row of the To possibly solve this issue, can we add an argument to the @rhshadrach Could you please clarify this sentence: In which way does the specified set of columns arise? in case we will choose to implement this function. |
I am not sure if we want to add an argument to read_excel for this, an easy workaround would be to simply call set_index afterwards. Implementing a function guessing this sounds nice, but not sure if we can do this in a way to avoid inconsistencies? |
@phofl How we would call the Do you mean that we could use |
No after calling read_excel |
The following two sheets are the same (ignoring formatting)
so it is impossible to tell (at least, using cell values) whether a frame was written with a MultiIndex and therefore blanks should be filled in, or if the blanks correspond to null values. |
-1 on adding an argument to excel for this. Instead, I think we should enable it to be easily accomplished. First, the behavior of read_excel should be changed to not ffill: ffill is not invertible. It is then up to the user whether they need to apply ffill themselves. Now, Index does not have a ffill, but if it did, I think we could do something like
|
@rhshadrach I think we can take this approach. So, as I understood, we will leave the empty indexes when parsing the MultiIndex data using We will implement |
Actually I was leaning a bit into the other direction. We could also point the user to not use index_col if missing values are expected and just call set_index afterwards. Do you have a strong opinion here? |
@phofl - Completely tangential, but I also wonder if users really want index values missing by default. For me, one of the primary things I do with excel files is filter them to values I'm interested in, for which missing index values doesn't work. |
Coming back to this: The intention for index_col was (I think) to allow roundtripping:
The values in row 2 and 3 look empty, because they are written as a merged cell. If we would not fill them, the MultiIndex could no be recovered. I think the best course of action is to document, that using index_col fills missing values with the previous valid value and keep the current behavior. If you want actual missing values in the Index, you have to call set_index afterwards |
+1 - docs could use some improvement here. Something akin to adding "missing values will be forward filled, to allow for round tripping with to_excel when merge_cells is True. Use set_index after reading instead of index_col if you want to avoid forward filling". |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
Issue Description
The code snippet returns
Based on the discussions when this was implemented I think the filling of NaNs was implemented to handle merged cells, but in this case this is quite surprising and I think should be considered a bug.
Expected Behavior
I would expect:
cc @rhshadrach I think you have done a few things with read_excel in the past?
Installed Versions
pandas : 1.4.0.dev0+1349.gff84f69269
numpy : 1.21.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.23.1
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.28.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : 2021.11.0
fastparquet : 0.7.1
gcsfs : 2021.05.0
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.7.2
sqlalchemy : 1.4.25
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1
None
The text was updated successfully, but these errors were encountered: