Commit 23099f7

detrout authored and jreback committed
Class to read OpenDocument Tables (#25427)
1 parent 1659fff commit 23099f7

31 files changed, +295 -11 lines changed

ci/deps/travis-36-cov.yaml

+1
@@ -16,6 +16,7 @@ dependencies:
   - nomkl
   - numexpr
   - numpy=1.15.*
+  - odfpy
   - openpyxl
   - pandas-gbq
   # https://github.com/pydata/pandas-gbq/issues/271

doc/source/user_guide/io.rst

+25-3
@@ -32,6 +32,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
     text;`HTML <https://en.wikipedia.org/wiki/HTML>`__;:ref:`read_html<io.read_html>`;:ref:`to_html<io.html>`
     text; Local clipboard;:ref:`read_clipboard<io.clipboard>`;:ref:`to_clipboard<io.clipboard>`
     binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
+    binary;`OpenDocument <http://www.opendocumentformat.org>`__;:ref:`read_excel<io.ods>`;
     binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
     binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
     binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
@@ -2791,9 +2792,10 @@ parse HTML tables in the top-level pandas io function ``read_html``.
 Excel files
 -----------

-The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``) and
-Excel 2007+ (``.xlsx``) files using the ``xlrd`` Python
-module. The :meth:`~DataFrame.to_excel` instance method is used for
+The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``)
+files using the ``xlrd`` Python module. Excel 2007+ (``.xlsx``) files
+can be read using either ``xlrd`` or ``openpyxl``.
+The :meth:`~DataFrame.to_excel` instance method is used for
 saving a ``DataFrame`` to Excel. Generally the semantics are
 similar to working with :ref:`csv<io.read_csv_table>` data.
 See the :ref:`cookbook<cookbook.excel>` for some advanced strategies.
@@ -3229,7 +3231,27 @@ The look and feel of Excel worksheets created from pandas can be modified using
 * ``float_format`` : Format string for floating point numbers (default ``None``).
 * ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default ``None``).

+.. _io.ods:

+OpenDocument Spreadsheets
+-------------------------
+
+.. versionadded:: 0.25
+
+The :func:`~pandas.read_excel` method can also read OpenDocument spreadsheets
+using the ``odfpy`` module. With ``engine='odf'``, the semantics and features
+for reading OpenDocument spreadsheets match what can be done for
+`Excel files`_.
+
+.. code-block:: python
+
+   # Returns a DataFrame
+   pd.read_excel('path_to_file.ods', engine='odf')
+
+.. note::
+
+   Currently pandas only supports *reading* OpenDocument spreadsheets. Writing
+   is not implemented.

 .. _io.clipboard:

doc/source/whatsnew/v0.25.0.rst

+1
@@ -187,6 +187,7 @@ Other enhancements
 - Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module>`` is a library implementing the pandas plotting API (:issue:`14130`)
 - :class:`pandas.offsets.BusinessHour` supports multiple opening hours intervals (:issue:`15481`)
 - :func:`read_excel` can now use ``openpyxl`` to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`)
+- :func:`pandas.io.excel.read_excel` supports reading OpenDocument tables. Specify ``engine='odf'`` to enable. Consult the :ref:`IO User Guide <io.ods>` for more details (:issue:`9070`)

 .. _whatsnew_0250.api_breaking:

pandas/compat/_optional.py

+1
@@ -13,6 +13,7 @@
     "lxml.etree": "3.8.0",
     "matplotlib": "2.2.2",
     "numexpr": "2.6.2",
+    "odfpy": "1.3.0",
     "openpyxl": "2.4.8",
     "pandas_gbq": "0.8.0",
     "pyarrow": "0.9.0",

pandas/core/config_init.py

+9
@@ -422,6 +422,7 @@ def use_inf_as_na_cb(key):
 _xls_options = ['xlrd']
 _xlsm_options = ['xlrd', 'openpyxl']
 _xlsx_options = ['xlrd', 'openpyxl']
+_ods_options = ['odf']


 with cf.config_prefix("io.excel.xls"):
@@ -447,6 +448,14 @@ def use_inf_as_na_cb(key):
                        validator=str)


+with cf.config_prefix("io.excel.ods"):
+    cf.register_option("reader", "auto",
+                       reader_engine_doc.format(
+                           ext='ods',
+                           others=', '.join(_ods_options)),
+                       validator=str)
+
+
 # Set up the io.excel specific writer configuration.
 writer_engine_doc = """
 : string

pandas/io/excel/_base.py

+3-1
@@ -768,12 +768,14 @@ class ExcelFile:
         Acceptable values are None or ``xlrd``.
     """

-    from pandas.io.excel._xlrd import _XlrdReader
+    from pandas.io.excel._odfreader import _ODFReader
     from pandas.io.excel._openpyxl import _OpenpyxlReader
+    from pandas.io.excel._xlrd import _XlrdReader

     _engines = {
         'xlrd': _XlrdReader,
         'openpyxl': _OpenpyxlReader,
+        'odf': _ODFReader,
     }

     def __init__(self, io, engine=None):
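
The `_engines` mapping in this hunk ties an engine name to a reader class. A minimal sketch of that lookup pattern follows. The class names, `make_reader`, and the fallback to `"xlrd"` when no engine is given are assumptions made for illustration, not the pandas implementation.

```python
class XlrdReader:
    """Hypothetical stand-in for an Excel reader class."""

    def __init__(self, path):
        self.path = path


class OdfReader:
    """Hypothetical stand-in for an OpenDocument reader class."""

    def __init__(self, path):
        self.path = path


# Engine name -> reader class, mirroring the shape of ``_engines``.
ENGINES = {"xlrd": XlrdReader, "odf": OdfReader}


def make_reader(path, engine=None):
    """Instantiate the reader registered under ``engine``."""
    # Assumption: fall back to "xlrd" when no engine is specified.
    if engine is None:
        engine = "xlrd"
    if engine not in ENGINES:
        raise ValueError("Unknown engine: {engine}".format(engine=engine))
    return ENGINES[engine](path)
```

With this layout, supporting a new format only requires one more entry in the mapping, which is the same shape of change this commit makes for `'odf'`.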

pandas/io/excel/_odfreader.py

+176
@@ -0,0 +1,176 @@
+from typing import List
+
+from pandas.compat._optional import import_optional_dependency
+
+import pandas as pd
+from pandas._typing import FilePathOrBuffer, Scalar
+
+from pandas.io.excel._base import _BaseExcelReader
+
+
+class _ODFReader(_BaseExcelReader):
+    """Read tables out of OpenDocument formatted files
+
+    Parameters
+    ----------
+    filepath_or_buffer: string, path to be parsed or
+        an open readable stream.
+    """
+    def __init__(self, filepath_or_buffer: FilePathOrBuffer):
+        import_optional_dependency("odf")
+        super().__init__(filepath_or_buffer)
+
+    @property
+    def _workbook_class(self):
+        from odf.opendocument import OpenDocument
+        return OpenDocument
+
+    def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
+        from odf.opendocument import load
+        return load(filepath_or_buffer)
+
+    @property
+    def empty_value(self) -> str:
+        """Property for compat with other readers."""
+        return ''
+
+    @property
+    def sheet_names(self) -> List[str]:
+        """Return a list of sheet names present in the document"""
+        from odf.table import Table
+
+        tables = self.book.getElementsByType(Table)
+        return [t.getAttribute("name") for t in tables]
+
+    def get_sheet_by_index(self, index: int):
+        from odf.table import Table
+        tables = self.book.getElementsByType(Table)
+        return tables[index]
+
+    def get_sheet_by_name(self, name: str):
+        from odf.table import Table
+
+        tables = self.book.getElementsByType(Table)
+
+        for table in tables:
+            if table.getAttribute("name") == name:
+                return table
+
+        raise ValueError("sheet {name} not found".format(name=name))
+
+    def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]:
+        """Parse an ODF Table into a list of lists
+        """
+        from odf.table import CoveredTableCell, TableCell, TableRow
+
+        covered_cell_name = CoveredTableCell().qname
+        table_cell_name = TableCell().qname
+        cell_names = {covered_cell_name, table_cell_name}
+
+        sheet_rows = sheet.getElementsByType(TableRow)
+        empty_rows = 0
+        max_row_len = 0
+
+        table = []  # type: List[List[Scalar]]
+
+        for i, sheet_row in enumerate(sheet_rows):
+            sheet_cells = [x for x in sheet_row.childNodes
+                           if x.qname in cell_names]
+            empty_cells = 0
+            table_row = []  # type: List[Scalar]
+
+            for j, sheet_cell in enumerate(sheet_cells):
+                if sheet_cell.qname == table_cell_name:
+                    value = self._get_cell_value(sheet_cell, convert_float)
+                else:
+                    value = self.empty_value
+
+                column_repeat = self._get_column_repeat(sheet_cell)
+
+                # Queue up empty values, writing only if content follows them
+                if value == self.empty_value:
+                    empty_cells += column_repeat
+                else:
+                    table_row.extend([self.empty_value] * empty_cells)
+                    empty_cells = 0
+                    table_row.extend([value] * column_repeat)
+
+            if max_row_len < len(table_row):
+                max_row_len = len(table_row)
+
+            row_repeat = self._get_row_repeat(sheet_row)
+            if self._is_empty_row(sheet_row):
+                empty_rows += row_repeat
+            else:
+                # add blank rows to our table
+                table.extend([[self.empty_value]] * empty_rows)
+                empty_rows = 0
+                for _ in range(row_repeat):
+                    table.append(table_row)
+
+        # Make our table square
+        for row in table:
+            if len(row) < max_row_len:
+                row.extend([self.empty_value] * (max_row_len - len(row)))
+
+        return table
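
The expansion that `get_sheet_data` performs can be illustrated without odfpy. The sketch below is a simplified model, not the reader itself: cells and rows are plain `(value, repeat)` tuples (a hypothetical encoding), a row counts as empty when its expansion yields no content, and repeated rows are copied so they never share one list object.

```python
from typing import List, Tuple

EMPTY = ""  # stand-in for the reader's empty_value

Cell = Tuple[object, int]    # (value, number-columns-repeated)
Row = Tuple[List[Cell], int]  # (cells, number-rows-repeated)


def expand_row(cells: List[Cell]) -> List[object]:
    """Expand run-length encoded cells, dropping trailing empties."""
    out: List[object] = []
    pending_empty = 0
    for value, repeat in cells:
        if value == EMPTY:
            # Defer empties; they are written only if content follows.
            pending_empty += repeat
        else:
            out.extend([EMPTY] * pending_empty)
            pending_empty = 0
            out.extend([value] * repeat)
    return out


def expand_table(rows: List[Row]) -> List[List[object]]:
    """Expand repeated rows and pad every row to the same width."""
    table: List[List[object]] = []
    pending_rows = 0
    for cells, repeat in rows:
        row = expand_row(cells)
        if not row:
            # Defer blank rows; they are written only if content follows.
            pending_rows += repeat
            continue
        table.extend([EMPTY] for _ in range(pending_rows))
        pending_rows = 0
        # Copy per repeat so rows never alias the same list.
        table.extend(list(row) for _ in range(repeat))
    width = max((len(r) for r in table), default=0)
    for r in table:
        r.extend([EMPTY] * (width - len(r)))
    return table
```

Deferring empties matters because spreadsheets commonly pad a sheet with a very large repeat count of blank cells or rows at the end; writing them eagerly would inflate the result.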
+
+    def _get_row_repeat(self, row) -> int:
+        """Return number of times this row was repeated
+        Repeating an empty row appeared to be a common way
+        of representing sparse rows in the table.
+        """
+        from odf.namespaces import TABLENS
+
+        return int(row.attributes.get((TABLENS, 'number-rows-repeated'), 1))
+
+    def _get_column_repeat(self, cell) -> int:
+        from odf.namespaces import TABLENS
+        return int(cell.attributes.get(
+            (TABLENS, 'number-columns-repeated'), 1))
+
+    def _is_empty_row(self, row) -> bool:
+        """Helper function to find empty rows
+        """
+        for column in row.childNodes:
+            if len(column.childNodes) > 0:
+                return False
+
+        return True
+
+    def _get_cell_value(self, cell, convert_float: bool) -> Scalar:
+        from odf.namespaces import OFFICENS
+        cell_type = cell.attributes.get((OFFICENS, 'value-type'))
+        if cell_type == 'boolean':
+            if str(cell) == "TRUE":
+                return True
+            return False
+        if cell_type is None:
+            return self.empty_value
+        elif cell_type == 'float':
+            # GH5394
+            cell_value = float(cell.attributes.get((OFFICENS, 'value')))
+
+            if cell_value == 0. and str(cell) != cell_value:  # NA handling
+                return str(cell)
+
+            if convert_float:
+                val = int(cell_value)
+                if val == cell_value:
+                    return val
+            return cell_value
+        elif cell_type == 'percentage':
+            cell_value = cell.attributes.get((OFFICENS, 'value'))
+            return float(cell_value)
+        elif cell_type == 'string':
+            return str(cell)
+        elif cell_type == 'currency':
+            cell_value = cell.attributes.get((OFFICENS, 'value'))
+            return float(cell_value)
+        elif cell_type == 'date':
+            cell_value = cell.attributes.get((OFFICENS, 'date-value'))
+            return pd.to_datetime(cell_value)
+        elif cell_type == 'time':
+            return pd.to_datetime(str(cell)).time()
+        else:
+            raise ValueError('Unrecognized type {}'.format(cell_type))
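
The `convert_float` branch of `_get_cell_value` returns an `int` whenever the stored float has no fractional part. A standalone sketch of that rule is shown below. The function name `coerce_number` is hypothetical, and the sketch assumes finite input, since `int()` raises on infinities and NaN.

```python
from typing import Union


def coerce_number(raw: str, convert_float: bool = True) -> Union[int, float]:
    """Parse a numeric string, returning int when the value is integral."""
    value = float(raw)
    if convert_float:
        as_int = int(value)  # truncates toward zero
        if as_int == value:
            return as_int
    return value
```

For example, `coerce_number("3.0")` yields the integer `3`, while `coerce_number("3.0", convert_float=False)` keeps the float `3.0`.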

pandas/tests/io/data/blank.ods

2.75 KB
Binary file not shown.
2.83 KB
Binary file not shown.
8.3 KB
Binary file not shown.

pandas/tests/io/data/test1.ods

4.34 KB
Binary file not shown.

pandas/tests/io/data/test2.ods

2.81 KB
Binary file not shown.

pandas/tests/io/data/test3.ods

2.82 KB
Binary file not shown.

pandas/tests/io/data/test4.ods

2.92 KB
Binary file not shown.

pandas/tests/io/data/test5.ods

2.84 KB
Binary file not shown.
3.21 KB
Binary file not shown.
3.61 KB
Binary file not shown.
3.71 KB
Binary file not shown.

pandas/tests/io/data/test_squeeze.ods

3.14 KB
Binary file not shown.

pandas/tests/io/data/test_types.ods

3.41 KB
Binary file not shown.
3.34 KB
Binary file not shown.

pandas/tests/io/data/testdtype.ods

3.12 KB
Binary file not shown.
5.44 KB
Binary file not shown.

pandas/tests/io/data/testskiprows.ods

3.16 KB
Binary file not shown.

pandas/tests/io/data/times_1900.ods

3.11 KB
Binary file not shown.

pandas/tests/io/data/times_1904.ods

3.14 KB
Binary file not shown.

pandas/tests/io/data/writertable.odt

10.1 KB
Binary file not shown.

pandas/tests/io/excel/conftest.py

+1-1
@@ -30,7 +30,7 @@ def df_ref():
     return df_ref


-@pytest.fixture(params=['.xls', '.xlsx', '.xlsm'])
+@pytest.fixture(params=['.xls', '.xlsx', '.xlsm', '.ods'])
 def read_ext(request):
     """
     Valid extensions for reading Excel files.

pandas/tests/io/excel/test_odf.py

+39
@@ -0,0 +1,39 @@
+import functools
+
+import numpy as np
+import pytest
+
+import pandas as pd
+import pandas.util.testing as tm
+
+pytest.importorskip("odf")
+
+
+@pytest.fixture(autouse=True)
+def cd_and_set_engine(monkeypatch, datapath):
+    func = functools.partial(pd.read_excel, engine="odf")
+    monkeypatch.setattr(pd, 'read_excel', func)
+    monkeypatch.chdir(datapath("io", "data"))
+
+
+def test_read_invalid_types_raises():
+    # the invalid_value_type.ods required manual editing
+    # of the included content.xml file
+    with pytest.raises(ValueError,
+                       match="Unrecognized type awesome_new_type"):
+        pd.read_excel("invalid_value_type.ods")
+
+
+def test_read_writer_table():
+    # Also test reading tables from a text OpenDocument file
+    # (.odt)
+    index = pd.Index(["Row 1", "Row 2", "Row 3"], name="Header")
+    expected = pd.DataFrame([
+        [1, np.nan, 7],
+        [2, np.nan, 8],
+        [3, np.nan, 9],
+    ], index=index, columns=["Column 1", "Unnamed: 2", "Column 3"])
+
+    result = pd.read_excel("writertable.odt", 'Table1', index_col=0)
+
+    tm.assert_frame_equal(result, expected)
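
The `cd_and_set_engine` fixture in this test module presets `engine="odf"` with `functools.partial`, so each test can call `pd.read_excel` without repeating the argument. The sketch below shows the same idea on a stand-in function; `read_table` is hypothetical and only echoes its arguments.

```python
import functools


def read_table(path, engine=None):
    """Hypothetical stand-in for a reader that accepts an engine keyword."""
    return {"path": path, "engine": engine}


# Preset the keyword once, as the fixture does for pd.read_excel.
read_odf = functools.partial(read_table, engine="odf")

result = read_odf("test1.ods")

# A keyword passed at call time still overrides the preset value.
override = read_odf("test1.ods", engine="xlrd")
```

Because call-time keywords take precedence over the preset, a test that passes `engine` explicitly would bypass the fixture's default.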
