BUG: read_excel failed with empty rows after MultiIndex header #40649

Merged: 8 commits, Apr 23, 2021
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -750,6 +750,7 @@ I/O
- Bug in :func:`read_hdf` returning unexpected records when filtering on categorical string columns using ``where`` parameter (:issue:`39189`)
- Bug in :func:`read_sas` raising ``ValueError`` when ``datetimes`` were null (:issue:`39725`)
- Bug in :func:`read_excel` dropping empty values from single-column spreadsheets (:issue:`39808`)
- Bug in :func:`read_excel` raising ``AttributeError`` with ``MultiIndex`` header followed by two empty rows and no index, and bug affecting :func:`read_excel`, :func:`read_csv`, :func:`read_table`, :func:`read_fwf`, and :func:`read_clipboard` where one blank row after a ``MultiIndex`` header with no index would be dropped (:issue:`40442`)
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)

Period
3 changes: 2 additions & 1 deletion pandas/_libs/parsers.pyx
@@ -707,7 +707,8 @@ cdef class TextReader:
ic = (len(self.index_col) if self.index_col
is not None else 0)

-if lc != unnamed_count and lc - ic > unnamed_count:
+# if wrong number of blanks or no index, not our format
+if (lc != unnamed_count and lc - ic > unnamed_count) or ic == 0:
     hr -= 1
     self.parser_start -= 1
     this_header = [None] * lc

Review comment (Contributor), on the new condition: can you add a similar comment below (e.g. so future readers know what the inference rules are)?
6 changes: 5 additions & 1 deletion pandas/io/excel/_base.py
@@ -551,7 +551,11 @@ def parse(
header_name, _ = pop_header_name(data[row], index_col)
header_names.append(header_name)

-has_index_names = is_list_like(header) and len(header) > 1
+# If there is a MultiIndex header and an index then there is also
+# a row containing just the index name(s)
+has_index_names = (
+    is_list_like(header) and len(header) > 1 and index_col is not None
+)

Review comment (Contributor), on the has_index_names assignment: can you add a comment here on what the inference rules are?

if is_list_like(index_col):
# Forward fill values for MultiIndex index.
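
The rule stated in the new comment can be sketched in isolation. This is a minimal illustration, not the pandas code: `has_index_names` is written here as a standalone function, and `_is_list_like` is a simplified, hypothetical stand-in for the `is_list_like` helper used in the diff (it only recognises lists and tuples).

```python
from typing import Any


def _is_list_like(obj: Any) -> bool:
    # Simplified stand-in (assumption): the real helper accepts more types.
    return isinstance(obj, (list, tuple))


def has_index_names(header: Any, index_col: Any) -> bool:
    """Return True when a separate index-names row is expected.

    Mirrors the updated expression in the diff: a multi-row header
    AND an index column must both be present.
    """
    return _is_list_like(header) and len(header) > 1 and index_col is not None


# Multi-row header without an index: no index-names row is expected,
# so rows after the header are treated as data.
print(has_index_names([0, 1], None))
# Multi-row header with an index column: an index-names row is expected.
print(has_index_names([0, 1], 0))
# Single header row: never expects an index-names row.
print(has_index_names(0, 0))
```

Before this change the expression ignored `index_col`, so the first call above would have evaluated to True under the old rule.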
3 changes: 2 additions & 1 deletion pandas/io/parsers/python_parser.py
@@ -431,7 +431,8 @@ def _infer_columns(self):
ic = len(self.index_col) if self.index_col is not None else 0
unnamed_count = len(this_unnamed_cols)

-if lc != unnamed_count and lc - ic > unnamed_count:
+# if wrong number of blanks or no index, not our format
+if (lc != unnamed_count and lc - ic > unnamed_count) or ic == 0:
clear_buffer = False
this_columns = [None] * lc
self.buf = [self.buf[-1]]
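
The condition changed in both parsers can be read as a small predicate. The following is a minimal sketch, not the pandas implementation: the function names are hypothetical, and it assumes, as the variable names in the diff suggest, that `lc` is the number of columns in the extra row read after the header, `ic` is the number of index columns, and `unnamed_count` is the number of blank (unnamed) cells in that row.

```python
def old_not_our_format(lc: int, ic: int, unnamed_count: int) -> bool:
    # Condition before the change: only the number of blanks was checked.
    return lc != unnamed_count and lc - ic > unnamed_count


def new_not_our_format(lc: int, ic: int, unnamed_count: int) -> bool:
    # Condition after the change: with no index columns, the extra row
    # can never be an index-names row.
    wrong_number_of_blanks = lc != unnamed_count and lc - ic > unnamed_count
    no_index = ic == 0
    return wrong_number_of_blanks or no_index


# Assumed case: a fully blank two-column row, no index_col.
blank_row = dict(lc=2, ic=0, unnamed_count=2)
print(old_not_our_format(**blank_row), new_not_our_format(**blank_row))

# Assumed case: an index-names row such as "idx,," with one index column.
names_row = dict(lc=3, ic=1, unnamed_count=2)
print(old_not_our_format(**names_row), new_not_our_format(**names_row))
```

Under these assumptions, the old check did not flag a fully blank row when there was no index, which is consistent with the whatsnew entry describing that row as dropped. The added `or ic == 0` flags it, while a genuine index-names row with an index column is still accepted.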
Binary file modified pandas/tests/io/data/excel/testmultiindex.ods
Binary file not shown.
Binary file modified pandas/tests/io/data/excel/testmultiindex.xls
Binary file not shown.
Binary file modified pandas/tests/io/data/excel/testmultiindex.xlsb
Binary file not shown.
Binary file modified pandas/tests/io/data/excel/testmultiindex.xlsm
Binary file not shown.
Binary file modified pandas/tests/io/data/excel/testmultiindex.xlsx
Binary file not shown.
11 changes: 11 additions & 0 deletions pandas/tests/io/excel/test_readers.py
@@ -1193,6 +1193,17 @@ def test_one_col_noskip_blank_line(self, read_ext):
result = pd.read_excel(file_name)
tm.assert_frame_equal(result, expected)

+    def test_multiheader_two_blank_lines(self, read_ext):
+        # GH 40442
+        file_name = "testmultiindex" + read_ext
+        columns = MultiIndex.from_tuples([("a", "A"), ("b", "B")])
+        data = [[np.nan, np.nan], [np.nan, np.nan], [1, 3], [2, 4]]
+        expected = DataFrame(data, columns=columns)
+        result = pd.read_excel(
+            file_name, sheet_name="mi_column_empty_rows", header=[0, 1]
+        )
+        tm.assert_frame_equal(result, expected)


class TestExcelFileRead:
@pytest.fixture(autouse=True)
11 changes: 11 additions & 0 deletions pandas/tests/io/parser/test_header.py
@@ -389,6 +389,17 @@ def test_header_multi_index_common_format_malformed3(all_parsers):
tm.assert_frame_equal(expected, result)


+def test_header_multi_index_blank_line(all_parsers):
+    # GH 40442
+    parser = all_parsers
+    data = [[None, None], [1, 2], [3, 4]]
+    columns = MultiIndex.from_tuples([("a", "A"), ("b", "B")])
+    expected = DataFrame(data, columns=columns)
+    data = "a,b\nA,B\n,\n1,2\n3,4"
+    result = parser.read_csv(StringIO(data), header=[0, 1])
+    tm.assert_frame_equal(expected, result)

Review comment (Contributor), on the read_csv call: what happens with index_col not None (do we already test this)?

Reply (Contributor Author): Yes, these two tests both have MultiIndex headers and index_col not None:

def test_header_multi_index(all_parsers):

def test_header_multi_index_common_format1(all_parsers, kwargs):


@pytest.mark.parametrize(
"data,header", [("1,2,3\n4,5,6", None), ("foo,bar,baz\n1,2,3\n4,5,6", 0)]
)