-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: xl.parse index_col ignoring skiprows #50953
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I reproduced the issue. It seems like the headers or labels are being propagated across multiple rows before they are phased out with skiprows. Honestly the best solution would be to implement the skiprows functionality at the highest level possible in the process, but that might require some refactoring. So instead what needs to happen is the skiprows offset needs to be taken into account when assigning headers. I think the issue might be present around here: pandas/pandas/io/excel/_base.py Lines 794 to 833 in 2e218d1
I'll keep doing some digging on this and let you know if I find a fix. |
Hi, I did some debugging and analysis on this issue. Below are my findings : Issue 1: The argument passed in the sample code provided by @musshorn to parse the excel is incorrect as the header argument is spelled as "headers". Due to this, the skip rows are non-functional in the above case. However, if I read the same data using the correct parameters. This data looks meaningful. Issue 2: Even if the header is not provided to the input, it should allow us to skip rows for reading the data frame. So, for that, we may need to apply some fixes. @axeleschholz @rhshadrach - Let me know your thoughts on this. Also, can I work on this issue? |
If I remove the first 8 rows of the excel file and then run
I get
This appears the same as the output in #50953 (comment), except for the numbers (the numbers here agree with those in the OP file, so maybe they got changed?). This makes me think In any case, passing the invalid argument
Sure! Just comment |
/take |
Issue 1 : The reason why this is being ignored is that the call to parse function is having keyword arguments (**kwds) which allows passing the The sample code used to reproduce the bug directly uses the parse function to read the excel.
However, the recommended way to perform the same action is :
The
But, I think the real issue in this bug is that we are not able to skip rows if a header is not provided. Sample Code to reproduce this :
After running the above code if we check the data frame, the But the reason it's happening is due to the implementation of forward-filling values if the MultiIndex index is provided. pandas/pandas/io/excel/_base.py Lines 835 to 862 in 2e218d1
@rhshadrach Let me know if this makes sense. Which one will take higher priority in this case index_cols or skiprows. |
take |
Sorry @pacificdragon - there is no / in the command. I've assigned the issue to you (but in the future, just
What happens to the
No disagreement that there may be an issue there, but if that is the case then I'm confused as to why in #50953 (comment) I get the same result when removing the top eight rows and the |
No its not used anywhere, the kwargs are passed from parse function to TextParser which further are passed to TextFileReader in io/parsers/readers.py pandas/pandas/io/excel/_base.py Lines 866 to 889 in 2e218d1
Yes, you are correct if we compare that is same because issue happens only when @rhshadrach - Let me know if its clear now. |
I see - thanks. Agreed on this issue. Also passing invalid kwargs like |
Just to align on the changes. I did these tests again. File : TestPandas.xlsx Case 1:
Comments:
Case 2:
MultiIndex([('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Unnamed: 0_level_2'), Comments:
Case 3:
MultiIndex([('NORTH AMERICA', 'Test text 1'), Comments :
@rhshadrach - Please suggest if my understanding is correct and if implementing skip rows at the top of the hierarchy will be a good option or not.
Sure, creating a PR for this. |
I'm not sure if I am 100% following, but there is definitely a problem in To motivate why one might want to do something like described in #58899 (comment), consider an excel sheet with multiple datasets having their own headers and also a shared set of headers:
Ideally, one should be able to load this data into two data frames, the first with headers Class, ID, and X1/Y1 and data rows including a-h, and the second with headers Class, ID, and X2/Y2 and data rows including i-p. This should intuitively be possible with the commands: y1 = pd.read_excel(..., index_col=0, header=[0,1,3], skiprows = lambda row: row not in [0,1,3,4,5])
y2 = pd.read_excel(..., index_col=0, header=[0,1,7], skiprows = lambda row: row not in [0,1,7,8,9]) However, as described in #58899 (comment), this does NOT work. |
I'd just note that everything works fine if the header rows are contiguous and are the first few rows of the file. So this seems to be an implicit assumption that is limiting. |
One more observation. If the first data row (non-header row) has no values in any of the non-index columns, but a value in the index column, then that row is treated as the name of the index? I find this non-intuitive. Is this behavior described? |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Run the sample code on this excel sheet. Parsing is expected to begin from row 9 however in the index column data is pulled from rows that should've been skipped.
TestPandas.xlsx
Expected Behavior
Parsing to begin from the first non-skipped row
Installed Versions
INSTALLED VERSIONS
commit : 7cb7592
python : 3.10.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_Australia.1252
pandas : 2.0.0.dev0+1147.g7cb7592523
numpy : 1.22.4
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 57.4.0
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : None
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : None
tables : 3.7.0
tabulate : 0.8.9
xarray : None
xlrd : None
zstandard : 0.17.0
tzdata : None
qtpy : 2.0.1
pyqt5 : None
The text was updated successfully, but these errors were encountered: