Skip to content

BUG: read_excel for ods files raising UnboundLocalError in certain cases #36175

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Sep 13, 2020
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
47 changes: 21 additions & 26 deletions pandas/io/excel/_odfreader.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,28 @@ def __init__(
storage_options: StorageOptions = None,
):
import_optional_dependency("odf")
self._monkeypatch_odf_element_str()
super().__init__(filepath_or_buffer, storage_options=storage_options)

def _monkeypatch_odf_element_str(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woa, why don't you just change the impl to be correct? we never want to monkeypatch things in library code.

Copy link
Contributor Author

@asishm asishm Sep 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should not be monkeypatching the library code and ideally this should be fixed in odfpy directly.

If you look at the fix it essentially does the exact thing that the library does https://github.com/eea/odfpy/blob/master/odf/element.py#L240-L244 except add the case that PR #33233 addresses. This is relevant mostly because these Nodes can get specially if the nested child nodes also contain multiple spaces that #33233 addresses. Then it would become a recursive implementation.

For a cell like this:
image
there is no way to get all the spaces without recursive (or equivalent). The original PR is also somewhat of a monkeypatch as it essentially replaced str(cell) with _get_cell_string_value(cell)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right why can't you just change _get_string_cell_value to what you are doing here? (and put a nice comment / link to the source & issue)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. done

not sure why the tests are failing. they seem unrelated.

from odf.element import Element
from odf.namespaces import TEXTNS
from odf.text import S

text_s = S().qname

def element_str(self):
value = []
for c in self.childNodes:
if isinstance(c, Element) and c.qname == text_s:
spaces = int(c.attributes.get((TEXTNS, "c"), 1))
value.append(" " * spaces)
else:
value.append(str(c))
return "".join(value)

Element.__str__ = element_str

@property
def _workbook_class(self):
from odf.opendocument import OpenDocument
Expand Down Expand Up @@ -178,7 +198,7 @@ def _get_cell_value(self, cell, convert_float: bool) -> Scalar:
cell_value = cell.attributes.get((OFFICENS, "value"))
return float(cell_value)
elif cell_type == "string":
return self._get_cell_string_value(cell)
return str(cell)
elif cell_type == "currency":
cell_value = cell.attributes.get((OFFICENS, "value"))
return float(cell_value)
Expand All @@ -191,28 +211,3 @@ def _get_cell_value(self, cell, convert_float: bool) -> Scalar:
return result.time()
else:
raise ValueError(f"Unrecognized type {cell_type}")

def _get_cell_string_value(self, cell) -> str:
"""
Find and decode OpenDocument text:s tags that represent
a run length encoded sequence of space characters.
"""
from odf.element import Element, Text
from odf.namespaces import TEXTNS
from odf.text import P, S

text_p = P().qname
text_s = S().qname

p = cell.childNodes[0]

value = []
if p.qname == text_p:
for k, fragment in enumerate(p.childNodes):
if isinstance(fragment, Text):
value.append(fragment.data)
elif isinstance(fragment, Element):
if fragment.qname == text_s:
spaces = int(fragment.attributes.get((TEXTNS, "c"), 1))
value.append(" " * spaces)
return "".join(value)
Binary file added pandas/tests/io/data/excel/gh-35802.ods
Binary file not shown.
Binary file added pandas/tests/io/data/excel/gh-36122.ods
Binary file not shown.
17 changes: 17 additions & 0 deletions pandas/tests/io/excel/test_readers.py
Original file line number Diff line number Diff line change
Expand Up @@ -499,6 +499,23 @@ def test_reader_spaces(self, read_ext):
)
tm.assert_frame_equal(actual, expected)

# gh-36122, gh-35802
@pytest.mark.parametrize(
"basename,expected",
[
("gh-35802", DataFrame({"COLUMN": ["Test (1)"]})),
("gh-36122", DataFrame(columns=["got 2nd sa"])),
],
)
def test_read_excel_ods_nested_xml(self, read_ext, basename, expected):
# see gh-35802
engine = pd.read_excel.keywords["engine"]
if engine != "odf":
pytest.skip(f"Skipped for engine: {engine}")

actual = pd.read_excel(basename + read_ext)
tm.assert_frame_equal(actual, expected)

def test_reading_all_sheets(self, read_ext):
# Test reading all sheet names by setting sheet_name to None,
# Ensure a dict is returned.
Expand Down