Ods loses spaces 32207 #33233

Merged
merged 7 commits into from
Apr 6, 2020
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -408,6 +408,7 @@ I/O
 - Bug in :meth:`read_csv` was raising a misleading exception on a permissions issue (:issue:`23784`)
 - Bug in :meth:`read_csv` was raising an ``IndexError`` when header=None and 2 extra data columns
 - Bug in :meth:`DataFrame.to_sql` where an ``AttributeError`` was raised when saving an out of bounds date (:issue:`26761`)
+- Bug in :meth:`read_excel` did not correctly handle multiple embedded spaces in OpenDocument text cells. (:issue:`32207`)

 Plotting
 ^^^^^^^^
27 changes: 26 additions & 1 deletion pandas/io/excel/_odfreader.py
@@ -171,7 +171,7 @@ def _get_cell_value(self, cell, convert_float: bool) -> Scalar:
             cell_value = cell.attributes.get((OFFICENS, "value"))
             return float(cell_value)
         elif cell_type == "string":
-            return str(cell)
+            return self._get_cell_string_value(cell)
         elif cell_type == "currency":
             cell_value = cell.attributes.get((OFFICENS, "value"))
             return float(cell_value)
@@ -182,3 +182,28 @@ def _get_cell_value(self, cell, convert_float: bool) -> Scalar:
             return pd.to_datetime(str(cell)).time()
         else:
             raise ValueError(f"Unrecognized type {cell_type}")

+    def _get_cell_string_value(self, cell) -> str:
+        """
+        Find and decode OpenDocument text:s tags that represent
+        a run length encoded sequence of space characters.
+        """
+        from odf.element import Text, Element
+        from odf.text import S, P
+        from odf.namespaces import TEXTNS
+
+        text_p = P().qname
+        text_s = S().qname
+
+        p = cell.childNodes[0]
+
+        value = []
+        if p.qname == text_p:
+            for k, fragment in enumerate(p.childNodes):
+                if isinstance(fragment, Text):
+                    value.append(fragment.data)
+                elif isinstance(fragment, Element):
+                    if fragment.qname == text_s:
+                        spaces = int(fragment.attributes.get((TEXTNS, "c"), 1))
+                        value.append(" " * spaces)
+        return "".join(value)
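
Reviewer note: the decoding that `_get_cell_string_value` performs can be tried without odfpy. The sketch below is not the code from this PR; it uses the standard library's `xml.etree.ElementTree` on a hand-written `text:p` fragment, and the names `decode_paragraph` and `TEXT_NS` are made up for illustration. It assumes, as the docstring above describes, that a `text:s` element stands for a run of spaces whose length is given by its `text:c` attribute, defaulting to 1 when the attribute is absent.

```python
import xml.etree.ElementTree as ET

# OpenDocument text namespace (the value odfpy exposes as TEXTNS).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def decode_paragraph(p: ET.Element) -> str:
    """Return the text of a text:p element, expanding text:s space runs.

    Hypothetical helper for illustration only; it mirrors the idea of the
    PR's method but is written against ElementTree, not odfpy.
    """
    parts = [p.text or ""]
    for child in p:
        if child.tag == f"{{{TEXT_NS}}}s":
            # text:c is the run length; a missing attribute means one space.
            count = int(child.get(f"{{{TEXT_NS}}}c", "1"))
            parts.append(" " * count)
        # In ElementTree, text that follows a child element is its tail.
        parts.append(child.tail or "")
    return "".join(parts)


# One literal space followed by a run of three encoded spaces.
four_spaces = ET.fromstring(
    f'<text:p xmlns:text="{TEXT_NS}">4 <text:s text:c="3"/>spaces</text:p>'
)
# A text:s element without text:c counts as a single space.
one_space = ET.fromstring(
    f'<text:p xmlns:text="{TEXT_NS}">a<text:s/>b</text:p>'
)

print(repr(decode_paragraph(four_spaces)))
print(repr(decode_paragraph(one_space)))
```

Calling `str(cell)` on such a cell, as the old code did, would concatenate only the text nodes, which is why the encoded spaces went missing.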
Binary file added pandas/tests/io/data/excel/test_spaces.ods
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test_spaces.xls
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test_spaces.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test_spaces.xlsm
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test_spaces.xlsx
Binary file not shown.
18 changes: 18 additions & 0 deletions pandas/tests/io/excel/test_readers.py
@@ -464,6 +464,24 @@ def test_reader_dtype_str(self, read_ext, dtype, expected):
         actual = pd.read_excel(basename + read_ext, dtype=dtype)
         tm.assert_frame_equal(actual, expected)

+    def test_reader_spaces(self, read_ext):
+        # see gh-32207
+        basename = "test_spaces"
+
+        actual = pd.read_excel(basename + read_ext)
+        expected = DataFrame(
+            {
+                "testcol": [
+                    "this is great",
+                    "4    spaces",
+                    "1 trailing ",
+                    " 1 leading",
+                    "2  spaces  multiple  times",
+                ]
+            }
+        )
+        tm.assert_frame_equal(actual, expected)
+
     def test_reading_all_sheets(self, read_ext):
         # Test reading all sheetnames by setting sheetname to None,
         # Ensure a dict is returned.