Commit 79067a7

ENH: add calamine excel reader (close #50395) (#54998)
1 parent 705d431 commit 79067a7

20 files changed, +290 -58 lines changed

ci/deps/actions-310.yaml

+1
@@ -46,6 +46,7 @@ dependencies:
   - pymysql>=1.0.2
   - pyreadstat>=1.1.5
   - pytables>=3.7.0
+  - python-calamine>=0.1.6
   - pyxlsb>=1.0.9
   - s3fs>=2022.05.0
   - scipy>=1.8.1

ci/deps/actions-311-downstream_compat.yaml

+1
@@ -47,6 +47,7 @@ dependencies:
   - pymysql>=1.0.2
   - pyreadstat>=1.1.5
   - pytables>=3.7.0
+  - python-calamine>=0.1.6
   - pyxlsb>=1.0.9
   - s3fs>=2022.05.0
   - scipy>=1.8.1

ci/deps/actions-311.yaml

+1
@@ -46,6 +46,7 @@ dependencies:
   - pymysql>=1.0.2
   - pyreadstat>=1.1.5
   # - pytables>=3.7.0, 3.8.0 is first version that supports 3.11
+  - python-calamine>=0.1.6
   - pyxlsb>=1.0.9
   - s3fs>=2022.05.0
   - scipy>=1.8.1

ci/deps/actions-39-minimum_versions.yaml

+1
@@ -48,6 +48,7 @@ dependencies:
   - pymysql=1.0.2
   - pyreadstat=1.1.5
   - pytables=3.7.0
+  - python-calamine=0.1.6
   - pyxlsb=1.0.9
   - s3fs=2022.05.0
   - scipy=1.8.1

ci/deps/actions-39.yaml

+1
@@ -46,6 +46,7 @@ dependencies:
   - pymysql>=1.0.2
   - pyreadstat>=1.1.5
   - pytables>=3.7.0
+  - python-calamine>=0.1.6
   - pyxlsb>=1.0.9
   - s3fs>=2022.05.0
   - scipy>=1.8.1

ci/deps/circle-310-arm64.yaml

+1
@@ -47,6 +47,7 @@ dependencies:
   - pymysql>=1.0.2
   # - pyreadstat>=1.1.5 not available on ARM
   - pytables>=3.7.0
+  - python-calamine>=0.1.6
   - pyxlsb>=1.0.9
   - s3fs>=2022.05.0
   - scipy>=1.8.1

doc/source/getting_started/install.rst

+1
@@ -281,6 +281,7 @@ xlrd                      2.0.1              excel           Reading Excel
 xlsxwriter                3.0.3              excel           Writing Excel
 openpyxl                  3.0.10             excel           Reading / writing for xlsx files
 pyxlsb                    1.0.9              excel           Reading for xlsb files
+python-calamine           0.1.6              excel           Reading for xls/xlsx/xlsb/ods files
 ========================= ================== =============== =============================================================
 
 HTML

doc/source/user_guide/io.rst

+21-2
@@ -3453,7 +3453,8 @@ Excel files
 The :func:`~pandas.read_excel` method can read Excel 2007+ (``.xlsx``) files
 using the ``openpyxl`` Python module. Excel 2003 (``.xls``) files
 can be read using ``xlrd``. Binary Excel (``.xlsb``)
-files can be read using ``pyxlsb``.
+files can be read using ``pyxlsb``. All formats can be read
+using :ref:`calamine<io.calamine>` engine.
 The :meth:`~DataFrame.to_excel` instance method is used for
 saving a ``DataFrame`` to Excel. Generally the semantics are
 similar to working with :ref:`csv<io.read_csv_table>` data.
@@ -3494,6 +3495,9 @@ using internally.
 
 * For the engine odf, pandas is using :func:`odf.opendocument.load` to read in (``.ods``) files.
 
+* For the engine calamine, pandas is using :func:`python_calamine.load_workbook`
+  to read in (``.xlsx``), (``.xlsm``), (``.xls``), (``.xlsb``), (``.ods``) files.
+
 .. code-block:: python
 
    # Returns a DataFrame
@@ -3935,7 +3939,8 @@ The :func:`~pandas.read_excel` method can also read binary Excel files
 using the ``pyxlsb`` module. The semantics and features for reading
 binary Excel files mostly match what can be done for `Excel files`_ using
 ``engine='pyxlsb'``. ``pyxlsb`` does not recognize datetime types
-in files and will return floats instead.
+in files and will return floats instead (you can use :ref:`calamine<io.calamine>`
+if you need recognize datetime types).
 
 .. code-block:: python
 
@@ -3947,6 +3952,20 @@ in files and will return floats instead.
 Currently pandas only supports *reading* binary Excel files. Writing
 is not implemented.
 
+.. _io.calamine:
+
+Calamine (Excel and ODS files)
+------------------------------
+
+The :func:`~pandas.read_excel` method can read Excel file (``.xlsx``, ``.xlsm``, ``.xls``, ``.xlsb``)
+and OpenDocument spreadsheets (``.ods``) using the ``python-calamine`` module.
+This module is a binding for Rust library `calamine <https://crates.io/crates/calamine>`__
+and is faster than other engines in most cases. The optional dependency 'python-calamine' needs to be installed.
+
+.. code-block:: python
+
+   # Returns a DataFrame
+   pd.read_excel("path_to_file.xlsb", engine="calamine")
 
 .. _io.clipboard:
 

doc/source/whatsnew/v2.2.0.rst

+20-3
@@ -14,10 +14,27 @@ including other versions of pandas.
 Enhancements
 ~~~~~~~~~~~~
 
-.. _whatsnew_220.enhancements.enhancement1:
+.. _whatsnew_220.enhancements.calamine:
 
-enhancement1
-^^^^^^^^^^^^
+Calamine engine for :func:`read_excel`
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``calamine`` engine was added to :func:`read_excel`.
+It uses ``python-calamine``, which provides Python bindings for the Rust library `calamine <https://crates.io/crates/calamine>`__.
+This engine supports Excel files (``.xlsx``, ``.xlsm``, ``.xls``, ``.xlsb``) and OpenDocument spreadsheets (``.ods``) (:issue:`50395`).
+
+There are two advantages of this engine:
+
+1. Calamine is often faster than other engines, some benchmarks show results up to 5x faster than 'openpyxl', 20x - 'odf', 4x - 'pyxlsb', and 1.5x - 'xlrd'.
+   But, 'openpyxl' and 'pyxlsb' are faster in reading a few rows from large files because of lazy iteration over rows.
+2. Calamine supports the recognition of datetime in ``.xlsb`` files, unlike 'pyxlsb' which is the only other engine in pandas that can read ``.xlsb`` files.
+
+.. code-block:: python
+
+    pd.read_excel("path_to_file.xlsb", engine="calamine")
+
+
+For more, see :ref:`io.calamine` in the user guide on IO tools.
 
 .. _whatsnew_220.enhancements.enhancement2:
 
environment.yml

+1
@@ -47,6 +47,7 @@ dependencies:
   - pymysql>=1.0.2
   - pyreadstat>=1.1.5
   - pytables>=3.7.0
+  - python-calamine>=0.1.6
   - pyxlsb>=1.0.9
   - s3fs>=2022.05.0
   - scipy>=1.8.1

pandas/compat/_optional.py

+2
@@ -37,6 +37,7 @@
     "pyarrow": "7.0.0",
     "pyreadstat": "1.1.5",
     "pytest": "7.3.2",
+    "python-calamine": "0.1.6",
     "pyxlsb": "1.0.9",
     "s3fs": "2022.05.0",
     "scipy": "1.8.1",
@@ -62,6 +63,7 @@
     "lxml.etree": "lxml",
     "odf": "odfpy",
     "pandas_gbq": "pandas-gbq",
+    "python_calamine": "python-calamine",
     "sqlalchemy": "SQLAlchemy",
     "tables": "pytables",
 }
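
The two hunks above register the package under two different names: the install name `python-calamine` (with its minimum version) and the import name `python_calamine`. The following is a minimal sketch of why both are useful when reporting a missing optional dependency. It is not the pandas implementation; `install_hint` and `require` are hypothetical helpers, and only the two dictionary entries are taken from the diff.

```python
import importlib

# Import name -> install (distribution) name, as in the second hunk.
INSTALL_MAPPING = {"python_calamine": "python-calamine"}
# Minimum version, keyed as in the first hunk.
VERSIONS = {"python-calamine": "0.1.6"}


def install_hint(import_name: str) -> str:
    """Hypothetical helper: build a pip requirement string for a module."""
    install_name = INSTALL_MAPPING.get(import_name, import_name)
    minimum = VERSIONS.get(install_name)
    return f"{install_name}>={minimum}" if minimum else install_name


def require(import_name: str):
    """Hypothetical helper: import a module or raise with an install hint."""
    try:
        return importlib.import_module(import_name)
    except ImportError as err:
        hint = install_hint(import_name)
        raise ImportError(
            f"Missing optional dependency. Try: pip install '{hint}'"
        ) from err


hint = install_hint("python_calamine")
```

Here `hint` carries the install name and version bound rather than the import name, which is what a user would pass to an installer.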

pandas/core/config_init.py

+5-5
@@ -513,11 +513,11 @@ def use_inf_as_na_cb(key) -> None:
     auto, {others}.
 """
 
-_xls_options = ["xlrd"]
-_xlsm_options = ["xlrd", "openpyxl"]
-_xlsx_options = ["xlrd", "openpyxl"]
-_ods_options = ["odf"]
-_xlsb_options = ["pyxlsb"]
+_xls_options = ["xlrd", "calamine"]
+_xlsm_options = ["xlrd", "openpyxl", "calamine"]
+_xlsx_options = ["xlrd", "openpyxl", "calamine"]
+_ods_options = ["odf", "calamine"]
+_xlsb_options = ["pyxlsb", "calamine"]
 
 
 with cf.config_prefix("io.excel.xls"):
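
The option lists above amount to a table of which reader engines are accepted per file extension. As a rough illustration only, the same table can be expressed as a lookup keyed by suffix. The list contents are copied from the added lines; `READER_OPTIONS`, `reader_options_for` and `check_reader` are hypothetical names, and this sketch is not how pandas validates its options.

```python
from pathlib import Path

# Engine lists copied from the added lines of the hunk above.
READER_OPTIONS = {
    ".xls": ["xlrd", "calamine"],
    ".xlsm": ["xlrd", "openpyxl", "calamine"],
    ".xlsx": ["xlrd", "openpyxl", "calamine"],
    ".ods": ["odf", "calamine"],
    ".xlsb": ["pyxlsb", "calamine"],
}


def reader_options_for(path: str) -> list[str]:
    """Return the engines listed for the file's extension (empty if unknown)."""
    suffix = Path(path).suffix.lower()
    return list(READER_OPTIONS.get(suffix, []))


def check_reader(path: str, engine: str) -> str:
    """Raise ValueError if the engine is not listed for this extension."""
    options = reader_options_for(path)
    if engine not in options:
        raise ValueError(f"{engine!r} is not listed for {path!r}: {options}")
    return engine


xlsb_engines = reader_options_for("report.xlsb")
```

After this commit, `calamine` is the only engine that appears in every list.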

pandas/io/excel/_base.py

+11-5
@@ -159,13 +159,15 @@
     of dtype conversion.
 engine : str, default None
     If io is not a buffer or path, this must be set to identify io.
-    Supported engines: "xlrd", "openpyxl", "odf", "pyxlsb".
+    Supported engines: "xlrd", "openpyxl", "odf", "pyxlsb", "calamine".
     Engine compatibility :
 
     - "xlrd" supports old-style Excel files (.xls).
     - "openpyxl" supports newer Excel file formats.
     - "odf" supports OpenDocument file formats (.odf, .ods, .odt).
     - "pyxlsb" supports Binary Excel files.
+    - "calamine" supports Excel (.xls, .xlsx, .xlsm, .xlsb)
+      and OpenDocument (.ods) file formats.
 
     .. versionchanged:: 1.2.0
         The engine `xlrd <https://xlrd.readthedocs.io/en/latest/>`_
@@ -394,7 +396,7 @@ def read_excel(
     | Callable[[str], bool]
     | None = ...,
     dtype: DtypeArg | None = ...,
-    engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb"] | None = ...,
+    engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb", "calamine"] | None = ...,
     converters: dict[str, Callable] | dict[int, Callable] | None = ...,
     true_values: Iterable[Hashable] | None = ...,
     false_values: Iterable[Hashable] | None = ...,
@@ -433,7 +435,7 @@ def read_excel(
     | Callable[[str], bool]
     | None = ...,
     dtype: DtypeArg | None = ...,
-    engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb"] | None = ...,
+    engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb", "calamine"] | None = ...,
     converters: dict[str, Callable] | dict[int, Callable] | None = ...,
     true_values: Iterable[Hashable] | None = ...,
     false_values: Iterable[Hashable] | None = ...,
@@ -472,7 +474,7 @@ def read_excel(
     | Callable[[str], bool]
     | None = None,
     dtype: DtypeArg | None = None,
-    engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb"] | None = None,
+    engine: Literal["xlrd", "openpyxl", "odf", "pyxlsb", "calamine"] | None = None,
     converters: dict[str, Callable] | dict[int, Callable] | None = None,
     true_values: Iterable[Hashable] | None = None,
     false_values: Iterable[Hashable] | None = None,
@@ -1456,13 +1458,15 @@ class ExcelFile:
         .xls, .xlsx, .xlsb, .xlsm, .odf, .ods, or .odt file.
     engine : str, default None
         If io is not a buffer or path, this must be set to identify io.
-        Supported engines: ``xlrd``, ``openpyxl``, ``odf``, ``pyxlsb``
+        Supported engines: ``xlrd``, ``openpyxl``, ``odf``, ``pyxlsb``, ``calamine``
         Engine compatibility :
 
         - ``xlrd`` supports old-style Excel files (.xls).
         - ``openpyxl`` supports newer Excel file formats.
         - ``odf`` supports OpenDocument file formats (.odf, .ods, .odt).
         - ``pyxlsb`` supports Binary Excel files.
+        - ``calamine`` supports Excel (.xls, .xlsx, .xlsm, .xlsb)
+          and OpenDocument (.ods) file formats.
 
         .. versionchanged:: 1.2.0
 
@@ -1498,6 +1502,7 @@ class ExcelFile:
     ...     df1 = pd.read_excel(xls, "Sheet1")  # doctest: +SKIP
     """
 
+    from pandas.io.excel._calamine import CalamineReader
     from pandas.io.excel._odfreader import ODFReader
     from pandas.io.excel._openpyxl import OpenpyxlReader
     from pandas.io.excel._pyxlsb import PyxlsbReader
@@ -1508,6 +1513,7 @@ class ExcelFile:
         "openpyxl": OpenpyxlReader,
         "odf": ODFReader,
         "pyxlsb": PyxlsbReader,
+        "calamine": CalamineReader,
     }
 
     def __init__(

pandas/io/excel/_calamine.py

+127
@@ -0,0 +1,127 @@
+from __future__ import annotations
+
+from datetime import (
+    date,
+    datetime,
+    time,
+    timedelta,
+)
+from typing import (
+    TYPE_CHECKING,
+    Any,
+    Union,
+    cast,
+)
+
+from pandas._typing import Scalar
+from pandas.compat._optional import import_optional_dependency
+from pandas.util._decorators import doc
+
+import pandas as pd
+from pandas.core.shared_docs import _shared_docs
+
+from pandas.io.excel._base import BaseExcelReader
+
+if TYPE_CHECKING:
+    from python_calamine import (
+        CalamineSheet,
+        CalamineWorkbook,
+    )
+
+    from pandas._typing import (
+        FilePath,
+        ReadBuffer,
+        StorageOptions,
+    )
+
+_CellValueT = Union[int, float, str, bool, time, date, datetime, timedelta]
+
+
+class CalamineReader(BaseExcelReader["CalamineWorkbook"]):
+    @doc(storage_options=_shared_docs["storage_options"])
+    def __init__(
+        self,
+        filepath_or_buffer: FilePath | ReadBuffer[bytes],
+        storage_options: StorageOptions | None = None,
+        engine_kwargs: dict | None = None,
+    ) -> None:
+        """
+        Reader using calamine engine (xlsx/xls/xlsb/ods).
+
+        Parameters
+        ----------
+        filepath_or_buffer : str, path to be parsed or
+            an open readable stream.
+        {storage_options}
+        engine_kwargs : dict, optional
+            Arbitrary keyword arguments passed to excel engine.
+        """
+        import_optional_dependency("python_calamine")
+        super().__init__(
+            filepath_or_buffer,
+            storage_options=storage_options,
+            engine_kwargs=engine_kwargs,
+        )
+
+    @property
+    def _workbook_class(self) -> type[CalamineWorkbook]:
+        from python_calamine import CalamineWorkbook
+
+        return CalamineWorkbook
+
+    def load_workbook(
+        self, filepath_or_buffer: FilePath | ReadBuffer[bytes], engine_kwargs: Any
+    ) -> CalamineWorkbook:
+        from python_calamine import load_workbook
+
+        return load_workbook(
+            filepath_or_buffer, **engine_kwargs  # type: ignore[arg-type]
+        )
+
+    @property
+    def sheet_names(self) -> list[str]:
+        from python_calamine import SheetTypeEnum
+
+        return [
+            sheet.name
+            for sheet in self.book.sheets_metadata
+            if sheet.typ == SheetTypeEnum.WorkSheet
+        ]
+
+    def get_sheet_by_name(self, name: str) -> CalamineSheet:
+        self.raise_if_bad_sheet_by_name(name)
+        return self.book.get_sheet_by_name(name)
+
+    def get_sheet_by_index(self, index: int) -> CalamineSheet:
+        self.raise_if_bad_sheet_by_index(index)
+        return self.book.get_sheet_by_index(index)
+
+    def get_sheet_data(
+        self, sheet: CalamineSheet, file_rows_needed: int | None = None
+    ) -> list[list[Scalar]]:
+        def _convert_cell(value: _CellValueT) -> Scalar:
+            if isinstance(value, float):
+                val = int(value)
+                if val == value:
+                    return val
+                else:
+                    return value
+            elif isinstance(value, date):
+                return pd.Timestamp(value)
+            elif isinstance(value, timedelta):
+                return pd.Timedelta(value)
+            elif isinstance(value, time):
+                # cast needed here because Scalar doesn't include datetime.time
+                return cast(Scalar, value)
+
+            return value
+
+        rows: list[list[_CellValueT]] = sheet.to_python(skip_empty_area=False)
+        data: list[list[Scalar]] = []
+
+        for row in rows:
+            data.append([_convert_cell(cell) for cell in row])
+            if file_rows_needed is not None and len(data) >= file_rows_needed:
+                break
+
+        return data
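
The float handling and the row cutoff in `get_sheet_data` can be tried without pandas or `python-calamine` installed. The sketch below mirrors only those two pieces, using the standard library. `normalize_cell` and `take_rows` are hypothetical stand-ins; the `pd.Timestamp` and `pd.Timedelta` wrapping from the real `_convert_cell` is deliberately left out, so dates pass through unchanged here.

```python
from __future__ import annotations

from datetime import date


def normalize_cell(value):
    """Collapse integral floats to int; pass every other value through.

    Stand-in for the float branch of _convert_cell in the diff above.
    """
    if isinstance(value, float):
        as_int = int(value)
        return as_int if as_int == value else value
    return value


def take_rows(rows, limit: int | None = None):
    """Convert rows one by one and stop once `limit` rows are collected."""
    data = []
    for row in rows:
        data.append([normalize_cell(cell) for cell in row])
        # Same stopping rule as the file_rows_needed check in the diff.
        if limit is not None and len(data) >= limit:
            break
    return data


sample = [
    [1.0, 2.5, "a"],
    [3.0, True, date(2023, 9, 1)],
    [4.0, 5.0, 6.0],
]
result = take_rows(sample, limit=2)
```

In the diff, `sheet.to_python(skip_empty_area=False)` is called before the loop, so the cutoff appears to limit only the per-cell conversion work rather than the read itself. That reading is consistent with the release note that `openpyxl` and `pyxlsb` are faster when only a few rows of a large file are needed.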
