BUG: read_excel skips single-column empty rows #40214

ahawryluk · 2021-03-04T04:02:14Z

Ref #39808

tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

This is the fix proposed in the discussion of #40095

rhshadrach

Changes look good - two requests below.

pandas/tests/io/excel/test_readers.py

gfyoung · 2021-03-04T10:05:45Z

pandas/io/excel/_base.py

@@ -598,6 +598,7 @@ def parse(
                    skiprows=skiprows,
                    nrows=nrows,
                    na_values=na_values,
+                    skip_blank_lines=False,


Why not expose this parameter in read_excel?

We tried that (#40095), but the behaviour when skip_blank_lines=True isn't very useful: it only skips empty spreadsheet rows if the entire spreadsheet is contained in column A. The original intent of skip_blank_lines was to skip \n lines in CSV files, but not lines with empty data elements such as ""\n or ,,,,\n, so there isn't an equivalent feature for spreadsheets.
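To make the CSV distinction concrete, here is a minimal sketch using read_csv on an in-memory string. The sample data and the variable names (csv_text, skipped, kept) are made up for illustration and are not taken from the PR.

```python
import io

import pandas as pd

# Illustrative data: one truly blank line ("\n") and one line holding
# only a delimiter (","), i.e. a row of empty data elements.
csv_text = "a,b\n1,2\n\n,\n3,4\n"

# Default behaviour (skip_blank_lines=True): the blank line is dropped,
# but the "," line is kept and parsed as a row of NaNs.
skipped = pd.read_csv(io.StringIO(csv_text))

# With skip_blank_lines=False the blank line is also kept as a NaN row.
kept = pd.read_csv(io.StringIO(csv_text), skip_blank_lines=False)

print(len(skipped), len(kept))
```

The only difference between the two frames is the truly blank line, which is the case that has no direct counterpart in a spreadsheet.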

I see. Can we make a comment about this and reference your original PR?

For sure. Do you mean an inline comment in the code, or in the description of this pull request?

lithomas1 · 2021-03-04T16:16:31Z

Not sure if this closes the original issue, as the parameter would have to be exposed to truly fix it.
I think modifying this function

pandas/pandas/io/parsers/python_parser.py

Line 596 in fad96e1

def _is_line_empty(self, line):

might be able to fix the issue, but it could also introduce some side effects for CSV parsing.

ahawryluk · 2021-03-06T00:02:02Z

@lithomas1 I'm nervous about breaking the CSV behaviour, because it seems deliberate. I found one test where an input line ,, is supposed to produce a row of NaNs:

pandas/pandas/tests/io/parser/common/test_common_basic.py

Lines 412 to 414 in 154026c

            ",,",
            {"names": ["Dummy", "X", "Dummy_2"], "usecols": ["X"]},
            DataFrame(columns=["X"], index=[0], dtype=np.float64),

Maybe this behaviour should be documented better?
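For reference, the test case quoted above can be run on its own roughly as follows. This is a sketch that reuses the test's inputs; wrapping the string in io.StringIO and the variable name df are additions for illustration.

```python
import io

import pandas as pd

# Same input line and keyword arguments as the quoted test case:
# a single line ",," read with explicit names and a usecols filter.
df = pd.read_csv(
    io.StringIO(",,"),
    names=["Dummy", "X", "Dummy_2"],
    usecols=["X"],
)

# The delimiter-only line is not treated as blank, so one NaN row remains.
print(df)
```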

rhshadrach

lgtm; @lithomas1 - agreed that this PR does not fully close the issue as stated there, however I'm -0 on adding the skip_blank_lines argument to read_excel. read_csv needs this because you can't tell a blank line from e.g. ,,,, from the resulting DataFrame; this does not exist for excel files and it is straightforward to find and drop blank lines with df.dropna(how='all').
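A minimal sketch of the dropna approach mentioned above, using a hand-built DataFrame in place of one returned by read_excel. The column names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in for a frame read from a spreadsheet: row 1 is entirely blank,
# row 2 is only partially blank.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": ["x", None, None]})

# how="all" drops only rows where every column is missing.
cleaned = df.dropna(how="all")

print(cleaned)
```

Rows that are only partially empty survive, so this removes blank spreadsheet rows without discarding rows that still carry data.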

@ahawryluk - can you merge master?

rhshadrach · 2021-03-07T16:21:01Z

@simonjayhawkins @jreback - This is a bit of an odd case, wanted to get some eyes to make sure it's being handled appropriately. This PR fixes the following bug (not a regression, AFAICT) on master: when reading an excel file, blank rows would be skipped if and only if a single column is being read. This PR makes it so that blank rows are never skipped.

However, the previous workaround was to supply skip_blank_lines=False in **kwds; this argument was removed in 1.1 and so is no longer available. I'm -0 on adding skip_blank_lines back in (see #40214 (review)).

Since this is a rather localized case, my thinking is to have this be a bugfix for 1.3 and leave the regression in 1.1.x and 1.2.x unresolved.

simonjayhawkins · 2021-03-08T11:39:06Z

> Since this is a rather localized case, my thinking is to have this be a bugfix for 1.3 and leave the regression in 1.1.x and 1.2.x unresolved.

sgtm. does the OP need to change to not close the issue?

ahawryluk · 2021-03-08T18:41:03Z

@simonjayhawkins, thanks for taking a look at this. My preference is to close #39808 based on:

- the OP of #39808 (comment) prefers the current solution now that we understand how skip_blank_lines really works
- IIUC, @rhshadrach and I don't think skip_blank_lines should be added to read_excel in the future

But I'm happy to edit the PR if everyone prefers to leave #39808 open.

simonjayhawkins · 2021-03-08T18:48:13Z

> IIUC, @rhshadrach and I don't think skip_blank_lines should be added to read_excel in the future

The issue could be closed once that is agreed as the way forward on the issue itself.

jreback

lgtm. back to you @rhshadrach

jreback · 2021-03-09T01:48:23Z

for 1.3

rhshadrach · 2021-03-09T23:47:01Z

Thanks @ahawryluk! I've left the issue open for now, but will put my thoughts there shortly.

BUG: read_excel skips single-column empty rows

f0ee057

rhshadrach requested changes Mar 4, 2021

Two improvements to the test

f7adad2

gfyoung added the IO Excel and Regression labels Mar 4, 2021

gfyoung reviewed Mar 4, 2021

Add comment

ae78f1e

rhshadrach approved these changes Mar 7, 2021

Merge branch 'master' into excel_noskip_blank

6af3986

jreback approved these changes Mar 9, 2021

jreback added this to the 1.3 milestone Mar 9, 2021

rhshadrach merged commit 93c52e4 into pandas-dev:master Mar 9, 2021

rhshadrach mentioned this pull request Mar 9, 2021

Add back skip_blank_lines to read_excel in pandas v>1.1.4 #39808

Closed

rhshadrach added the Bug label and removed the Regression label Mar 9, 2021

ahawryluk deleted the excel_noskip_blank branch March 9, 2021 23:55

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 11, 2021

BUG: read_excel skips single-column empty rows (pandas-dev#40214)

4050034
