Skip to content

Add back skip_blank_lines to read_excel in pandas v>1.1.4 #39808

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jankaWIS opened this issue Feb 14, 2021 · 23 comments
Closed

Add back skip_blank_lines to read_excel in pandas v>1.1.4 #39808

jankaWIS opened this issue Feb 14, 2021 · 23 comments
Assignees
Labels
Closing Candidate May be closeable, needs more eyeballs good first issue IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@jankaWIS
Copy link

jankaWIS commented Feb 14, 2021

Is your feature request related to a problem?

Reading Stack Overflow, there has been several posts trying to skip blank lines when reading (post 1, post 2, post 3, ...). The proposed solution works for pandas v<0.24 but there is no mention about this feature and trying to use it I get:
In pandas 1.1.4 and 1.2: TypeError: read_excel() got an unexpected keyword argument 'skip_blank_lines' and no mention in the documentation posted above.

Describe the solution you'd like

Pandas read_excel and/or ExcelFile should get back this argument and functionality

@jankaWIS jankaWIS added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 14, 2021
@jankaWIS jankaWIS changed the title Add back skip_blank_lines Add back skip_blank_lines to read_excel in pandas 1.1.4 Feb 14, 2021
@jankaWIS jankaWIS changed the title Add back skip_blank_lines to read_excel in pandas 1.1.4 Add back skip_blank_lines to read_excel in pandas v>1.1.4 Feb 14, 2021
@lithomas1
Copy link
Member

lithomas1 commented Feb 14, 2021

@jankaWIS Thanks for the bug report! I think this is due to #34464, which dropped support for *kwds that were passed to TextFileReader(skip_blank_lines being one of them). Labeling as regression for now.
cc @topper-123 @WillAyd

The fix for this would actually be very simple if the core devs wished to add this back(Adding *kwds here

mangle_dupe_cols=mangle_dupe_cols,
and
storage_options: StorageOptions = None,

@lithomas1 lithomas1 added Regression Functionality that used to work in a prior pandas version IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member Enhancement labels Feb 14, 2021
@lithomas1 lithomas1 added this to the 1.3 milestone Feb 14, 2021
@WillAyd
Copy link
Member

WillAyd commented Feb 14, 2021

We don’t need to add back in the keyword arguments - can just add this parameter to read_excel and document

@lithomas1
Copy link
Member

lithomas1 commented Feb 22, 2021

If anyone wants to take this up, the fix would be to add skip_blank_lines to the read_excel signature and to the io.parse call. Do note that you'll also need to add back the docstring(see variable _read_excel_doc) in file pandas/io/excel/_base.py.

@ahawryluk
Copy link
Contributor

take

@ahawryluk
Copy link
Contributor

Will this require a new test, or is it enough that the underlying TextParser already has tests for skip_blank_lines?

@lithomas1
Copy link
Member

lithomas1 commented Feb 22, 2021

@ahawryluk Thanks for taking this up. I think you need to add a test specific for excel at pandas/tests/io/excel/test_readers.py with a new excel file with blank lines.

@ahawryluk
Copy link
Contributor

ahawryluk commented Feb 23, 2021

I've figured out how to pass skip_blank_lines (default True) from read_excel down to TextParser, but I was surprised to discover that it only changes the returned result if the spreadsheet has exactly one column. If the spreadsheet has multiple columns, a blank row is passed to TextParser as ['', '', ...] and TextParser won't skip it. The TextParser behaviour makes sense because a blank line in a CSV (\n) is clearly a different thing than a line of empty cells (,,,\n).

Should we proceed as planned, restoring the <0.24 behaviour, but clarify the docstring? (e.g. "skips blank rows if the worksheet has a single column") I can think of other options, but they're not backwards compatible:

  1. Proceed as planned but change the read_excel default to skip_blank_lines=False to eliminate the surprise of single-column and multi-column sheets behaving differently.
  2. Pass skip_blank_lines=False to TextParser but don't expose the option from read_excel. In this case, the rationale would be that spreadsheets are structured around specific cell locations and read_excel should never conceal that. Users can always drop_na after the fact.
  3. Change the default to skip_blank_lines=False but change BaseExcelReader so that the argument works for multi-column worksheets as well.

@jankaWIS
Copy link
Author

I'd vote for 2. but I'm not really in charge :).

@ahawryluk
Copy link
Contributor

The more I think about it, I also lean towards option 2. The current behaviour (read_excel drops empty cells from single-column spreadsheets) seems more like a bug than a feature, and most of the SO posts that @jankaWIS provided are looking for a way to turn it off.

@WillAyd what do you think of this option? It's not really a deprecation, but would it require a FutureWarning?

@WillAyd
Copy link
Member

WillAyd commented Feb 24, 2021 via email

@ahawryluk
Copy link
Contributor

If the CSV is generated by pandas (or any other library that depends on the standard library csv module), then this nuance is specific to excel. Python csv writes a single-element empty row as ""\n so read_csv will not skip it even if skip_blank_lines=True. In contrast, a single-element empty row from a spreadsheet is passed to TextParser as [] and will be skipped.

So, a single-column of data with empty elements can be saved to CSV with index=False and reloaded intact. Saving the same data to excel with index=False and reloading will drop all the empty rows.

Python 3.5 had a bug in csv that represented single-element empty rows as \n, so single-column CSV data would not round-trip for that version. See #12443, #18676, and Python bug 32255. There are probably other CSVs in the wild that represent single-element empty rows as \n but I suppose pandas should at least be internally consistent.

@WillAyd
Copy link
Member

WillAyd commented Feb 25, 2021

Hmm sorry but I think I'm missing the big picture as to how this all relates. So ideally since we have a default of skip_blank_lines=True in CSV I think we should strive for that same default on the Excel side. It might be helpful if you've tried to that to just push what you have with the failing test case and we can debug together

@ahawryluk ahawryluk mentioned this issue Feb 27, 2021
4 tasks
@ahawryluk
Copy link
Contributor

The pull request resolves this issue and the new tests pass, so it's definitely an improvement. Someone might read the new docstring and think, "that's an odd option," but at least they can turn it off now.

@rhshadrach
Copy link
Member

#40214 fixed the buggy behavior of read_excel sometimes dropping blank lines and sometimes not. I'm -1 on exposing the skip_blank_lines argument to the user. It would need it's own implementation for excel (passing through to the parser is not sufficient), and df.dropna(how='all') accomplishes the same. With CSV files, one cannot differentiate between a truly blank line and ,,,,,. That does not exist for excel files

@rhshadrach rhshadrach added the Closing Candidate May be closeable, needs more eyeballs label Mar 9, 2021
@jmathew
Copy link

jmathew commented Mar 19, 2021

Was there a work around in 1.1 or 1.2 to get skip_blank_lines=False functionality?

@rhshadrach
Copy link
Member

In both 1.1 and 1.2, I believe blank lines will not be skipped unless a single column is being read. When a single column is read, there is no way currently to not skip blank lines. Is that what you're experiencing @jmathew?

@jmathew
Copy link

jmathew commented Mar 20, 2021

Yeah it has only one column.

@rhshadrach
Copy link
Member

Thanks @jmathew. I haven't been able to find any workaround for 1.1/1.2, it seems blank lines are removed before any of the other processing in done by the parser.

The only viable option I see to not skipping blank lines in 1.1./1.2 would be to backport #40214. Since the buggy behavior is not a regression (although this fact is complicated by having the possible workaround for this buggy behavior removed in 1.1), I do not think the backport should be done in this case. This opinion is also partially based on the thought (hope?) that the number of users impacted is small, and the severity of the impact on those users is also small.

If you have any thoughts on this - please do share.

@jmathew
Copy link

jmathew commented Mar 21, 2021

Thanks for looking @rhshadrach

I only need a solution to tide me over until 1.3 is released. So Im fine with the decision not to backport.

Off the top of my head, the fix seems really simple. Is monkey patching the fix from my code at runtime an option? I don't know how to do it in python but it seems like it would be a good fit for this scenario.

It is blocking for my use case as I need to know the original cell location for each value. However, worst case is I reopen the spreadsheet in a different reader and translate the data frame location to the original for this one narrow situation.

@jmathew
Copy link

jmathew commented Mar 23, 2021

We ended up making a build of 1.1 with #40214 applied. That seemed to work well.

We also explored patching it in at runtime. Works on my machine, but hasn't been tested significantly:

import pandas as pd 
import ast
import inspect

read_excel_ast = ast.parse(inspect.getsource(pd.read_excel))

# Clear the decorators because they cause errors when exec.
read_excel_ast.body[0].decorator_list.clear()

# Modify the return call to include the skip_blank_lines arg
return_args = read_excel_ast.body[0].body[1].value.keywords
new_arg = ast.keyword('skip_blank_lines',ast.Constant(False))
new_arg.value.col_offset = 27
new_arg.value.lineno = return_args[-1].value.lineno + 1
return_args.append(new_arg)

exec(compile(read_excel_ast, '<string>', 'exec'), pd.__dict__)

@rhshadrach
Copy link
Member

Thanks @jmathew - glad to hear that worked. I'll guess you already considered this, but adding an additional column (e.g. "dropme") to the excel files is another potential workaround, but that is not viable in many situations.

@ahawryluk
Copy link
Contributor

@rhshadrach Did you have any other thoughts on this issue before it can be closed? #40214 (comment)

@rhshadrach
Copy link
Member

Thanks for the ping @ahawryluk - with 1.3 around the corner I agree this can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closing Candidate May be closeable, needs more eyeballs good first issue IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants