Skip to content

BUG: read_excel for ods files raising UnboundLocalError in certain cases #36175

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Sep 13, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -296,6 +296,7 @@ I/O
- :meth:`to_csv` did not support zip compression for binary file object not having a filename (:issue: `35058`)
- :meth:`to_csv` and :meth:`read_csv` did not honor `compression` and `encoding` for path-like objects that are internally converted to file-like objects (:issue:`35677`, :issue:`26124`, and :issue:`32392`)
- :meth:`to_picke` and :meth:`read_pickle` did not support compression for file-objects (:issue:`26237`, :issue:`29054`, and :issue:`29570`)
- Bug in :meth:`read_excel` with `engine="odf"` caused ``UnboundLocalError`` in some cases where cells had nested child nodes (:issue:`36122`, and :issue:`35802`)

Plotting
^^^^^^^^
Expand Down
26 changes: 14 additions & 12 deletions pandas/io/excel/_odfreader.py
Original file line number Diff line number Diff line change
Expand Up @@ -197,22 +197,24 @@ def _get_cell_string_value(self, cell) -> str:
Find and decode OpenDocument text:s tags that represent
a run length encoded sequence of space characters.
"""
from odf.element import Element, Text
from odf.element import Element
from odf.namespaces import TEXTNS
from odf.text import P, S
from odf.text import S

text_p = P().qname
text_s = S().qname

p = cell.childNodes[0]

value = []
if p.qname == text_p:
for k, fragment in enumerate(p.childNodes):
if isinstance(fragment, Text):
value.append(fragment.data)
elif isinstance(fragment, Element):
if fragment.qname == text_s:
spaces = int(fragment.attributes.get((TEXTNS, "c"), 1))

for fragment in cell.childNodes:
if isinstance(fragment, Element):
if fragment.qname == text_s:
spaces = int(fragment.attributes.get((TEXTNS, "c"), 1))
value.append(" " * spaces)
else:
# recursive impl needed in case of nested fragments
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a reader it isn't clear to me what this means; is this a bug that needs to be fixed upstream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before #33233, the value was str(cell) which relied on the upstream odfpy implementation.

#33233 attempted to fix cases where multiple spaces was getting skipped in the output. That is because odfpy's str implementations for Nodes/Elements that are of that specific type (from odf.text import S) does not include the actual number of spaces. This imo should have been addressed upstream.

#33233 also introduced the bug that is causing the current UnboundLocalError as one line is misaligned. But fixing the misalignment adds in more bugs. With just the indentation fix, the output will still be missing certain fragments from the cell (for the file in #36122).

This was my original reason for doing a monkeypatch that patched __str__ implementation of the Element nodes to include the S spaces Node. I changed the monkeypatch but instead changed the implementation of _get_cell_string_value to have the same behavior as the monkeypatch. It still relies on odfpy's str implementation for all other Element types except for the Space Node, otherwise it pretty much mirrors Element.__str__ from odfpy.

Sorry for the wall of text.

# with multiple spaces
# https://github.com/pandas-dev/pandas/pull/36175#discussion_r484639704
value.append(self._get_cell_string_value(fragment))
else:
value.append(str(fragment))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to call str(fragment) instead of always just calling self._get_cell_string_value(fragment)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we'd need to handle the str method for all the different cases which would require going into the spec. odfpy already has the str implementations done. the only case that was problematic was the multiple spaces that was addressed in the previous linked PR.

return "".join(value)
Binary file added pandas/tests/io/data/excel/gh-35802.ods
Binary file not shown.
Binary file added pandas/tests/io/data/excel/gh-36122.ods
Binary file not shown.
17 changes: 17 additions & 0 deletions pandas/tests/io/excel/test_readers.py
Original file line number Diff line number Diff line change
Expand Up @@ -499,6 +499,23 @@ def test_reader_spaces(self, read_ext):
)
tm.assert_frame_equal(actual, expected)

# gh-36122, gh-35802
@pytest.mark.parametrize(
"basename,expected",
[
("gh-35802", DataFrame({"COLUMN": ["Test (1)"]})),
("gh-36122", DataFrame(columns=["got 2nd sa"])),
],
)
def test_read_excel_ods_nested_xml(self, read_ext, basename, expected):
# see gh-35802
engine = pd.read_excel.keywords["engine"]
if engine != "odf":
pytest.skip(f"Skipped for engine: {engine}")

actual = pd.read_excel(basename + read_ext)
tm.assert_frame_equal(actual, expected)

def test_reading_all_sheets(self, read_ext):
# Test reading all sheet names by setting sheet_name to None,
# Ensure a dict is returned.
Expand Down