Skip to content

BUG: read_excel skips single-column empty rows #40214

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 9, 2021

Conversation

ahawryluk
Copy link
Contributor

@ahawryluk ahawryluk commented Mar 4, 2021

Ref #39808

  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

This is the fix proposed in the discussion of #40095

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good - two requests below.

@gfyoung gfyoung added IO Excel read_excel, to_excel Regression Functionality that used to work in a prior pandas version labels Mar 4, 2021
@@ -598,6 +598,7 @@ def parse(
skiprows=skiprows,
nrows=nrows,
na_values=na_values,
skip_blank_lines=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not expose this parameter in read_excel ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tried that (#40095) but the behaviour when skip_blank_lines=True isn't very useful---it only skips empty spreadsheet rows if the entire spreadsheet is contained in column A. The original intent of skip_blank_lines was to skip \n lines in CSV files, but not lines with empty data elements such as ""\n or ,,,,\n, so there isn't an equivalent feature for spreadsheets.

Copy link
Member

@gfyoung gfyoung Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Can we make a comment about this and reference your original PR?

Copy link
Contributor Author

@ahawryluk ahawryluk Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure. Do you mean an inline comment in the code, or in the description of this pull request?

@lithomas1
Copy link
Member

Not sure if this closes the original issue, as the parameter would have to be exposed in order to truly fix.
I think modifying this function

def _is_line_empty(self, line):
might be able to fix the issue, but it could also introduce some side effects for CSV parsing.

@ahawryluk
Copy link
Contributor Author

@lithomas1 I'm nervous about breaking the CSV behaviour, because it seems deliberate. I found one test where an input line ,, is supposed to produce a row of NaNs:

",,",
{"names": ["Dummy", "X", "Dummy_2"], "usecols": ["X"]},
DataFrame(columns=["X"], index=[0], dtype=np.float64),

Maybe this behaviour should be documented better?

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm; @lithomas1 - agreed that this PR does not fully close the issue as stated there, however I'm -0 on adding the skip_blank_lines argument to read_excel. read_csv needs this because you can't tell a blank line from e.g. ,,,, from the resulting DataFrame; this does not exist for excel files and it is straightforward to find and drop blank lines with df.dropna(how='all').

@ahawryluk - can you merge master

@rhshadrach
Copy link
Member

rhshadrach commented Mar 7, 2021

@simonjayhawkins @jreback - This is a bit of an odd case, wanted to get some eyes to make sure it's being handled appropriately. This PR fixes the following bug (not a regression, AFAICT) on master: when reading an excel file, blank rows would be skipped if and only if a single column is being read. This PR makes it so that blank rows are never skipped.

However, the previous workaround was to supply skip_blank_lines=False in **kwds; this argument was removed in 1.1 and so is no longer available. I'm -0 on adding skip_blank_lines back in (see #40214 (review)).

Since this is a rather localized case, my thinking is to have this be a bugfix for 1.3 and leave the regression in 1.1.x and 1.2.x unresolved.

@simonjayhawkins
Copy link
Member

Since this is a rather localized case, my thinking is to have this be a bugfix for 1.3 and leave the regression in 1.1.x and 1.2.x unresolved.

sgtm. does the OP need to change to not close the issue?

@ahawryluk
Copy link
Contributor Author

@simonjayhawkins, thanks for taking a look at this. My preference is to close #39808 based on

But I'm happy to edit the PR if everyone prefers to leave #39808 open.

@simonjayhawkins
Copy link
Member

  • IIUC, @rhshadrach and I don't think skip_blank_lines should be added to read_excel in the future

The issue could be closed once that is agreed as they way forward on the issue itself.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. back to you @rhshadrach

@jreback jreback added this to the 1.3 milestone Mar 9, 2021
@jreback
Copy link
Contributor

jreback commented Mar 9, 2021

for 1.3

@rhshadrach rhshadrach merged commit 93c52e4 into pandas-dev:master Mar 9, 2021
@rhshadrach
Copy link
Member

Thanks @ahawryluk! I've left the issue open for now, but will put my thoughts there shortly.

@rhshadrach rhshadrach added Bug and removed Regression Functionality that used to work in a prior pandas version labels Mar 9, 2021
@ahawryluk ahawryluk deleted the excel_noskip_blank branch March 9, 2021 23:55
jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants