Add back skip_blank_lines to read_excel in pandas v>1.1.4 #39808

jankaWIS · 2021-02-14T14:38:16Z

Is your feature request related to a problem?

Reading Stack Overflow, there has been several posts trying to skip blank lines when reading (post 1, post 2, post 3, ...). The proposed solution works for pandas v<0.24 but there is no mention about this feature and trying to use it I get:
In pandas 1.1.4 and 1.2: TypeError: read_excel() got an unexpected keyword argument 'skip_blank_lines' and no mention in the documentation posted above.

Describe the solution you'd like

Pandas read_excel and/or ExcelFile should get back this argument and functionality

The text was updated successfully, but these errors were encountered:

lithomas1 · 2021-02-14T18:00:57Z

@jankaWIS Thanks for the bug report! I think this is due to #34464, which dropped support for *kwds that were passed to TextFileReader(skip_blank_lines being one of them). Labeling as regression for now.
cc @topper-123 @WillAyd

The fix for this would actually be very simple if the core devs wished to add this back(Adding *kwds here

pandas/pandas/io/excel/_base.py

Line 373 in fc9fdba

mangle_dupe_cols=mangle_dupe_cols,

and

pandas/pandas/io/excel/_base.py

Line 336 in fc9fdba

storage_options: StorageOptions = None,

WillAyd · 2021-02-14T19:37:29Z

We don’t need to add back in the keyword arguments - can just add this parameter to read_excel and document

lithomas1 · 2021-02-22T01:54:47Z

If anyone wants to take this up, the fix would be to add skip_blank_lines to the read_excel signature and to the io.parse call. Do note that you'll also need to add back the docstring(see variable _read_excel_doc) in file pandas/io/excel/_base.py.

ahawryluk · 2021-02-22T04:14:52Z

take

ahawryluk · 2021-02-22T04:17:38Z

Will this require a new test, or is it enough that the underlying TextParser already has tests for skip_blank_lines?

lithomas1 · 2021-02-22T14:39:13Z

@ahawryluk Thanks for taking this up. I think you need to add a test specific for excel at pandas/tests/io/excel/test_readers.py with a new excel file with blank lines.

ahawryluk · 2021-02-23T04:01:22Z

I've figured out how to pass skip_blank_lines (default True) from read_excel down to TextParser, but I was surprised to discover that it only changes the returned result if the spreadsheet has exactly one column. If the spreadsheet has multiple columns, a blank row is passed to TextParser as ['', '', ...] and TextParser won't skip it. The TextParser behaviour makes sense because a blank line in a CSV (\n) is clearly a different thing than a line of empty cells (,,,\n).

Should we proceed as planned, restoring the <0.24 behaviour, but clarify the docstring? (e.g. "skips blank rows if the worksheet has a single column") I can think of other options, but they're not backwards compatible:

Proceed as planned but change the read_excel default to skip_blank_lines=False to eliminate the surprise of single-column and multi-column sheets behaving differently.
Pass skip_blank_lines=False to TextParser but don't expose the option from read_excel. In this case, the rationale would be that spreadsheets are structured around specific cell locations and read_excel should never conceal that. Users can always drop_na after the fact.
Change the default to skip_blank_lines=False but change BaseExcelReader so that the argument works for multi-column worksheets as well.

jankaWIS · 2021-02-23T17:23:17Z

I'd vote for 2. but I'm not really in charge :).

ahawryluk · 2021-02-24T19:19:59Z

The more I think about it, I also lean towards option 2. The current behaviour (read_excel drops empty cells from single-column spreadsheets) seems more like a bug than a feature, and most of the SO posts that @jankaWIS provided are looking for a way to turn it off.

@WillAyd what do you think of this option? It's not really a deprecation, but would it require a FutureWarning?

WillAyd · 2021-02-24T19:36:24Z

Is the nuance you are seeing with the single column specific to excel or is it seen in the CSV parser as well?

…

Sent from my iPhone

On Feb 24, 2021, at 11:20 AM, Andrew Hawyrluk ***@***.***> wrote: The more I think about it, I also lean towards option 2. The current behaviour (read_excel drops empty cells from single-column spreadsheets) seems more like a bug than a feature, and most of the SO posts that @jankaWIS provided are looking for a way to turn it off. @WillAyd what do you think of this option? It's not really a deprecation, but would it require a FutureWarning? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

ahawryluk · 2021-02-24T23:40:55Z

If the CSV is generated by pandas (or any other library that depends on the standard library csv module), then this nuance is specific to excel. Python csv writes a single-element empty row as ""\n so read_csv will not skip it even if skip_blank_lines=True. In contrast, a single-element empty row from a spreadsheet is passed to TextParser as [] and will be skipped.

So, a single-column of data with empty elements can be saved to CSV with index=False and reloaded intact. Saving the same data to excel with index=False and reloading will drop all the empty rows.

Python 3.5 had a bug in csv that represented single-element empty rows as \n, so single-column CSV data would not round-trip for that version. See #12443, #18676, and Python bug 32255. There are probably other CSVs in the wild that represent single-element empty rows as \n but I suppose pandas should at least be internally consistent.

WillAyd · 2021-02-25T00:02:06Z

Hmm sorry but I think I'm missing the big picture as to how this all relates. So ideally since we have a default of skip_blank_lines=True in CSV I think we should strive for that same default on the Excel side. It might be helpful if you've tried to that to just push what you have with the failing test case and we can debug together

ahawryluk · 2021-02-27T05:08:05Z

The pull request resolves this issue and the new tests pass, so it's definitely an improvement. Someone might read the new docstring and think, "that's an odd option," but at least they can turn it off now.

rhshadrach · 2021-03-09T23:51:24Z

#40214 fixed the buggy behavior of read_excel sometimes dropping blank lines and sometimes not. I'm -1 on exposing the skip_blank_lines argument to the user. It would need it's own implementation for excel (passing through to the parser is not sufficient), and df.dropna(how='all') accomplishes the same. With CSV files, one cannot differentiate between a truly blank line and ,,,,,. That does not exist for excel files

jmathew · 2021-03-19T15:56:54Z

Was there a work around in 1.1 or 1.2 to get skip_blank_lines=False functionality?

rhshadrach · 2021-03-20T14:22:45Z

In both 1.1 and 1.2, I believe blank lines will not be skipped unless a single column is being read. When a single column is read, there is no way currently to not skip blank lines. Is that what you're experiencing @jmathew?

jmathew · 2021-03-20T15:19:14Z

Yeah it has only one column.

rhshadrach · 2021-03-21T22:05:40Z

Thanks @jmathew. I haven't been able to find any workaround for 1.1/1.2, it seems blank lines are removed before any of the other processing in done by the parser.

The only viable option I see to not skipping blank lines in 1.1./1.2 would be to backport #40214. Since the buggy behavior is not a regression (although this fact is complicated by having the possible workaround for this buggy behavior removed in 1.1), I do not think the backport should be done in this case. This opinion is also partially based on the thought (hope?) that the number of users impacted is small, and the severity of the impact on those users is also small.

If you have any thoughts on this - please do share.

jmathew · 2021-03-21T22:48:39Z

Thanks for looking @rhshadrach

I only need a solution to tide me over until 1.3 is released. So Im fine with the decision not to backport.

Off the top of my head, the fix seems really simple. Is monkey patching the fix from my code at runtime an option? I don't know how to do it in python but it seems like it would be a good fit for this scenario.

It is blocking for my use case as I need to know the original cell location for each value. However, worst case is I reopen the spreadsheet in a different reader and translate the data frame location to the original for this one narrow situation.

jmathew · 2021-03-23T22:55:26Z

We ended up making a build of 1.1 with #40214 applied. That seemed to work well.

We also explored patching it in at runtime. Works on my machine, but hasn't been tested significantly:

import pandas as pd 
import ast
import inspect

read_excel_ast = ast.parse(inspect.getsource(pd.read_excel))

# Clear the decorators because they cause errors when exec.
read_excel_ast.body[0].decorator_list.clear()

# Modify the return call to include the skip_blank_lines arg
return_args = read_excel_ast.body[0].body[1].value.keywords
new_arg = ast.keyword('skip_blank_lines',ast.Constant(False))
new_arg.value.col_offset = 27
new_arg.value.lineno = return_args[-1].value.lineno + 1
return_args.append(new_arg)

exec(compile(read_excel_ast, '<string>', 'exec'), pd.__dict__)

rhshadrach · 2021-03-25T01:57:37Z

Thanks @jmathew - glad to hear that worked. I'll guess you already considered this, but adding an additional column (e.g. "dropme") to the excel files is another potential workaround, but that is not viable in many situations.

ahawryluk · 2021-05-26T17:31:03Z

@rhshadrach Did you have any other thoughts on this issue before it can be closed? #40214 (comment)

rhshadrach · 2021-06-09T03:57:22Z

Thanks for the ping @ahawryluk - with 1.3 around the corner I agree this can be closed.

jankaWIS added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 14, 2021

jankaWIS changed the title ~~Add back skip_blank_lines~~ Add back skip_blank_lines to read_excel in pandas 1.1.4 Feb 14, 2021

jankaWIS changed the title ~~Add back skip_blank_lines to read_excel in pandas 1.1.4~~ Add back skip_blank_lines to read_excel in pandas v>1.1.4 Feb 14, 2021

lithomas1 added Regression Functionality that used to work in a prior pandas version IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member Enhancement labels Feb 14, 2021

lithomas1 added this to the 1.3 milestone Feb 14, 2021

lithomas1 added the good first issue label Feb 22, 2021

github-actions bot assigned ahawryluk Feb 22, 2021

ahawryluk mentioned this issue Feb 27, 2021

Excel skip blank lines #40095

Closed

4 tasks

ahawryluk mentioned this issue Mar 4, 2021

BUG: read_excel skips single-column empty rows #40214

Merged

3 tasks

rhshadrach added the Closing Candidate May be closeable, needs more eyeballs label Mar 9, 2021

rhshadrach closed this as completed Jun 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add back skip_blank_lines to read_excel in pandas v>1.1.4 #39808

Add back skip_blank_lines to read_excel in pandas v>1.1.4 #39808

jankaWIS commented Feb 14, 2021 •

edited

Loading

lithomas1 commented Feb 14, 2021 •

edited

Loading

WillAyd commented Feb 14, 2021

lithomas1 commented Feb 22, 2021 •

edited

Loading

ahawryluk commented Feb 22, 2021

ahawryluk commented Feb 22, 2021

lithomas1 commented Feb 22, 2021 •

edited

Loading

ahawryluk commented Feb 23, 2021 •

edited

Loading

jankaWIS commented Feb 23, 2021

ahawryluk commented Feb 24, 2021

WillAyd commented Feb 24, 2021 via email

ahawryluk commented Feb 24, 2021

WillAyd commented Feb 25, 2021

ahawryluk commented Feb 27, 2021

rhshadrach commented Mar 9, 2021

jmathew commented Mar 19, 2021

rhshadrach commented Mar 20, 2021

jmathew commented Mar 20, 2021

rhshadrach commented Mar 21, 2021

jmathew commented Mar 21, 2021 •

edited

Loading

jmathew commented Mar 23, 2021

rhshadrach commented Mar 25, 2021

ahawryluk commented May 26, 2021

rhshadrach commented Jun 9, 2021

Add back skip_blank_lines to read_excel in pandas v>1.1.4 #39808

Add back skip_blank_lines to read_excel in pandas v>1.1.4 #39808

Comments

jankaWIS commented Feb 14, 2021 • edited Loading

Is your feature request related to a problem?

Describe the solution you'd like

lithomas1 commented Feb 14, 2021 • edited Loading

WillAyd commented Feb 14, 2021

lithomas1 commented Feb 22, 2021 • edited Loading

ahawryluk commented Feb 22, 2021

ahawryluk commented Feb 22, 2021

lithomas1 commented Feb 22, 2021 • edited Loading

ahawryluk commented Feb 23, 2021 • edited Loading

jankaWIS commented Feb 23, 2021

ahawryluk commented Feb 24, 2021

WillAyd commented Feb 24, 2021 via email

ahawryluk commented Feb 24, 2021

WillAyd commented Feb 25, 2021

ahawryluk commented Feb 27, 2021

rhshadrach commented Mar 9, 2021

jmathew commented Mar 19, 2021

rhshadrach commented Mar 20, 2021

jmathew commented Mar 20, 2021

rhshadrach commented Mar 21, 2021

jmathew commented Mar 21, 2021 • edited Loading

jmathew commented Mar 23, 2021

rhshadrach commented Mar 25, 2021

ahawryluk commented May 26, 2021

rhshadrach commented Jun 9, 2021

jankaWIS commented Feb 14, 2021 •

edited

Loading

lithomas1 commented Feb 14, 2021 •

edited

Loading

lithomas1 commented Feb 22, 2021 •

edited

Loading

lithomas1 commented Feb 22, 2021 •

edited

Loading

ahawryluk commented Feb 23, 2021 •

edited

Loading

jmathew commented Mar 21, 2021 •

edited

Loading