Commit cdffa43

Rik-de-Kort authored and WillAyd committed
ENH: XLSB support (#29836)
1 parent 8a8e967 commit cdffa43

32 files changed
+185 -14 lines changed

ci/deps/azure-37-locale.yaml

+3
@@ -34,3 +34,6 @@ dependencies:
   - xlsxwriter
   - xlwt
   - pyarrow>=0.15
+  - pip
+  - pip:
+    - pyxlsb

ci/deps/azure-macos-36.yaml

+1
@@ -33,3 +33,4 @@ dependencies:
   - pip
   - pip:
     - pyreadstat
+    - pyxlsb

ci/deps/azure-windows-37.yaml

+3
@@ -35,3 +35,6 @@ dependencies:
   - xlsxwriter
   - xlwt
   - pyreadstat
+  - pip
+  - pip:
+    - pyxlsb

ci/deps/travis-36-cov.yaml

+1
@@ -51,3 +51,4 @@ dependencies:
     - coverage
     - pandas-datareader
     - python-dateutil
+    - pyxlsb

doc/source/getting_started/install.rst

+1
@@ -264,6 +264,7 @@ pyarrow 0.12.0 Parquet, ORC (requires 0.13.0), and
 pymysql                   0.7.11             MySQL engine for sqlalchemy
 pyreadstat                                   SPSS files (.sav) reading
 pytables                  3.4.2              HDF5 reading / writing
+pyxlsb                    1.0.5              Reading for xlsb files
 qtpy                                         Clipboard I/O
 s3fs                      0.3.0              Amazon S3 access
 tabulate                  0.8.3              Printing in Markdown-friendly format (see `tabulate`_)

doc/source/user_guide/io.rst

+27 -2
@@ -23,7 +23,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
     text;`JSON <https://www.json.org/>`__;:ref:`read_json<io.json_reader>`;:ref:`to_json<io.json_writer>`
     text;`HTML <https://en.wikipedia.org/wiki/HTML>`__;:ref:`read_html<io.read_html>`;:ref:`to_html<io.html>`
     text; Local clipboard;:ref:`read_clipboard<io.clipboard>`;:ref:`to_clipboard<io.clipboard>`
-    binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
+    ;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
     binary;`OpenDocument <http://www.opendocumentformat.org>`__;:ref:`read_excel<io.ods>`;
     binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
     binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
@@ -2768,7 +2768,8 @@ Excel files

 The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``)
 files using the ``xlrd`` Python module. Excel 2007+ (``.xlsx``) files
-can be read using either ``xlrd`` or ``openpyxl``.
+can be read using either ``xlrd`` or ``openpyxl``. Binary Excel (``.xlsb``)
+files can be read using ``pyxlsb``.
 The :meth:`~DataFrame.to_excel` instance method is used for
 saving a ``DataFrame`` to Excel. Generally the semantics are
 similar to working with :ref:`csv<io.read_csv_table>` data.
@@ -3229,6 +3230,30 @@ OpenDocument spreadsheets match what can be done for `Excel files`_ using
    Currently pandas only supports *reading* OpenDocument spreadsheets. Writing
    is not implemented.

+.. _io.xlsb:
+
+Binary Excel (.xlsb) files
+--------------------------
+
+.. versionadded:: 1.0.0
+
+The :func:`~pandas.read_excel` method can also read binary Excel files
+using the ``pyxlsb`` module. The semantics and features for reading
+binary Excel files mostly match what can be done for `Excel files`_ using
+``engine='pyxlsb'``. ``pyxlsb`` does not recognize datetime types
+in files and will return floats instead.
+
+.. code-block:: python
+
+   # Returns a DataFrame
+   pd.read_excel('path_to_file.xlsb', engine='pyxlsb')
+
+.. note::
+
+   Currently pandas only supports *reading* binary Excel files. Writing
+   is not implemented.
+
+
 .. _io.clipboard:

 Clipboard
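
Note on the documentation hunk above: the new section states that ``pyxlsb`` returns floats where a sheet holds datetimes. A minimal sketch of converting such floats back by hand, assuming the workbook uses Excel's default 1900 date system (serial numbers counting days from 1899-12-30). The helper name `excel_serial_to_datetime` and the sample values are illustrative and not part of this commit.

```python
from datetime import datetime, timedelta

# Assumption: default 1900 date system, where serials count days from
# 1899-12-30. This mapping holds for dates on or after 1900-03-01.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel serial date (days since the epoch) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


# Floats as they might come back from a date column in an .xlsb file
# (illustrative values, not taken from the test data in this commit).
serials = [43831.0, 43831.5]
converted = [excel_serial_to_datetime(s) for s in serials]
print(converted)
```

On a DataFrame column the same conversion can probably be written as `pd.to_datetime(df["col"], unit="D", origin="1899-12-30")`, where `df["col"]` is a placeholder for the affected column.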

doc/source/whatsnew/v1.0.0.rst

+2 -1
@@ -215,7 +215,8 @@ Other enhancements
 - :meth:`Styler.format` added the ``na_rep`` parameter to help format the missing values (:issue:`21527`, :issue:`28358`)
 - Roundtripping DataFrames with nullable integer, string and period data types to parquet
   (:meth:`~DataFrame.to_parquet` / :func:`read_parquet`) using the `'pyarrow'` engine
-  now preserve those data types with pyarrow >= 0.16.0 (:issue:`20612`, :issue:`28371`).
+  now preserve those data types with pyarrow >= 1.0.0 (:issue:`20612`).
+- :func:`read_excel` now can read binary Excel (``.xlsb``) files by passing ``engine='pyxlsb'``. For more details and example usage, see the :ref:`Binary Excel files documentation <io.xlsb>`. Closes :issue:`8540`.
 - The ``partition_cols`` argument in :meth:`DataFrame.to_parquet` now accepts a string (:issue:`27117`)
 - :func:`pandas.read_json` now parses ``NaN``, ``Infinity`` and ``-Infinity`` (:issue:`12213`)
 - :func:`to_parquet` now appropriately handles the ``schema`` argument for user defined schemas in the pyarrow engine. (:issue:`30270`)

pandas/compat/_optional.py

+1
@@ -19,6 +19,7 @@
     "pyarrow": "0.13.0",
     "pytables": "3.4.2",
     "pytest": "5.0.1",
+    "pyxlsb": "1.0.5",
     "s3fs": "0.3.0",
     "scipy": "0.19.0",
     "sqlalchemy": "1.1.4",

pandas/core/config_init.py

+8
@@ -479,6 +479,7 @@ def use_inf_as_na_cb(key):
 _xlsm_options = ["xlrd", "openpyxl"]
 _xlsx_options = ["xlrd", "openpyxl"]
 _ods_options = ["odf"]
+_xlsb_options = ["pyxlsb"]


 with cf.config_prefix("io.excel.xls"):
@@ -515,6 +516,13 @@ def use_inf_as_na_cb(key):
         validator=str,
     )

+with cf.config_prefix("io.excel.xlsb"):
+    cf.register_option(
+        "reader",
+        "auto",
+        reader_engine_doc.format(ext="xlsb", others=", ".join(_xlsb_options)),
+        validator=str,
+    )

 # Set up the io.excel specific writer configuration.
 writer_engine_doc = """

pandas/io/excel/_base.py

+12 -5
@@ -35,8 +35,9 @@
 """
 Read an Excel file into a pandas DataFrame.

-Support both `xls` and `xlsx` file extensions from a local filesystem or URL.
-Support an option to read a single sheet or a list of sheets.
+Supports `xls`, `xlsx`, `xlsm`, `xlsb`, and `odf` file extensions
+read from a local filesystem or URL. Supports an option to read
+a single sheet or a list of sheets.

 Parameters
 ----------
@@ -789,15 +790,21 @@ class ExcelFile:
         If a string or path object, expected to be a path to xls, xlsx or odf file.
     engine : str, default None
         If io is not a buffer or path, this must be set to identify io.
-        Acceptable values are None, ``xlrd``, ``openpyxl`` or ``odf``.
+        Acceptable values are None, ``xlrd``, ``openpyxl``, ``odf``, or ``pyxlsb``.
         Note that ``odf`` reads tables out of OpenDocument formatted files.
     """

     from pandas.io.excel._odfreader import _ODFReader
     from pandas.io.excel._openpyxl import _OpenpyxlReader
     from pandas.io.excel._xlrd import _XlrdReader
-
-    _engines = {"xlrd": _XlrdReader, "openpyxl": _OpenpyxlReader, "odf": _ODFReader}
+    from pandas.io.excel._pyxlsb import _PyxlsbReader
+
+    _engines = {
+        "xlrd": _XlrdReader,
+        "openpyxl": _OpenpyxlReader,
+        "odf": _ODFReader,
+        "pyxlsb": _PyxlsbReader,
+    }

     def __init__(self, io, engine=None):
         if engine is None:
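
Note on the `_engines` hunk above: the change extends a name-to-class mapping, so supporting a new format amounts to registering one more reader class. A minimal standalone sketch of that dispatch pattern follows. All class and function names are hypothetical stand-ins, not pandas internals, and the error handling is an assumption about how such a registry is typically guarded.

```python
class BaseReader:
    """Hypothetical common base: stores the path it was asked to read."""

    def __init__(self, path: str):
        self.path = path


class XlsxReader(BaseReader):
    """Stand-in for a reader of .xlsx files."""


class XlsbReader(BaseReader):
    """Stand-in for a reader of .xlsb files."""


# Registry keyed by engine name, analogous in shape to the `_engines` dict.
ENGINES = {
    "openpyxl": XlsxReader,
    "pyxlsb": XlsbReader,
}


def make_reader(path: str, engine: str) -> BaseReader:
    """Look up the reader class for `engine` and instantiate it."""
    if engine not in ENGINES:
        raise ValueError(f"Unknown engine: {engine}")
    return ENGINES[engine](path)


reader = make_reader("path_to_file.xlsb", engine="pyxlsb")
print(type(reader).__name__)
```

Keeping the mapping in one place means the calling code does not change when an engine is added; only the dictionary grows.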

pandas/io/excel/_pyxlsb.py

+68
@@ -0,0 +1,68 @@
+from typing import List
+
+from pandas._typing import FilePathOrBuffer, Scalar
+from pandas.compat._optional import import_optional_dependency
+
+from pandas.io.excel._base import _BaseExcelReader
+
+
+class _PyxlsbReader(_BaseExcelReader):
+    def __init__(self, filepath_or_buffer: FilePathOrBuffer):
+        """Reader using pyxlsb engine.
+
+        Parameters
+        __________
+        filepath_or_buffer: string, path object, or Workbook
+            Object to be parsed.
+        """
+        import_optional_dependency("pyxlsb")
+        # This will call load_workbook on the filepath or buffer
+        # And set the result to the book-attribute
+        super().__init__(filepath_or_buffer)
+
+    @property
+    def _workbook_class(self):
+        from pyxlsb import Workbook
+
+        return Workbook
+
+    def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
+        from pyxlsb import open_workbook
+
+        # Todo: hack in buffer capability
+        # This might need some modifications to the Pyxlsb library
+        # Actual work for opening it is in xlsbpackage.py, line 20-ish
+
+        return open_workbook(filepath_or_buffer)
+
+    @property
+    def sheet_names(self) -> List[str]:
+        return self.book.sheets
+
+    def get_sheet_by_name(self, name: str):
+        return self.book.get_sheet(name)
+
+    def get_sheet_by_index(self, index: int):
+        # pyxlsb sheets are indexed from 1 onwards
+        # There's a fix for this in the source, but the pypi package doesn't have it
+        return self.book.get_sheet(index + 1)
+
+    def _convert_cell(self, cell, convert_float: bool) -> Scalar:
+        # Todo: there is no way to distinguish between floats and datetimes in pyxlsb
+        # This means that there is no way to read datetime types from an xlsb file yet
+        if cell.v is None:
+            return ""  # Prevents non-named columns from not showing up as Unnamed: i
+        if isinstance(cell.v, float) and convert_float:
+            val = int(cell.v)
+            if val == cell.v:
+                return val
+            else:
+                return float(cell.v)
+
+        return cell.v
+
+    def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]:
+        return [
+            [self._convert_cell(c, convert_float) for c in r]
+            for r in sheet.rows(sparse=False)
+        ]
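
Note on `_convert_cell` above: the conversion rules can be tried without pyxlsb installed. The sketch below mirrors the logic of the method in the diff using a hypothetical `Cell` namedtuple as a stand-in for pyxlsb's cell objects; only the `v` attribute used by the diff is assumed.

```python
from collections import namedtuple

# Hypothetical stand-in for a pyxlsb cell; only the `v` (value) attribute
# read by the reader in this commit is modelled.
Cell = namedtuple("Cell", ["v"])


def convert_cell(cell, convert_float: bool = True):
    """Mirror of the conversion rules in _PyxlsbReader._convert_cell."""
    if cell.v is None:
        # Empty cells become "" so that unnamed header columns are still
        # picked up downstream.
        return ""
    if isinstance(cell.v, float) and convert_float:
        as_int = int(cell.v)
        # Whole-number floats such as 3.0 are returned as ints.
        return as_int if as_int == cell.v else cell.v
    return cell.v


row = [Cell(None), Cell(3.0), Cell(2.5), Cell("abc")]
print([convert_cell(c) for c in row])  # ['', 3, 2.5, 'abc']
print([convert_cell(c, convert_float=False) for c in row])  # ['', 3.0, 2.5, 'abc']
```

Because dates arrive as plain floats, a whole-number date serial would also be turned into an int by this rule when `convert_float` is true.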

pandas/tests/io/data/excel/blank.xlsb

8.7 KB
Binary file not shown.

pandas/tests/io/data/excel/test1.xlsb

11.1 KB
Binary file not shown.

pandas/tests/io/data/excel/test2.xlsb

7.4 KB
Binary file not shown.

pandas/tests/io/data/excel/test3.xlsb

7.38 KB
Binary file not shown.

pandas/tests/io/data/excel/test4.xlsb

7.47 KB
Binary file not shown.

pandas/tests/io/data/excel/test5.xlsb

7.64 KB
Binary file not shown.

Further binary files (file names not captured in this extract)

Binary files not shown.

pandas/tests/io/excel/conftest.py

+1 -1
@@ -35,7 +35,7 @@ def df_ref(datapath):
     return df_ref


-@pytest.fixture(params=[".xls", ".xlsx", ".xlsm", ".ods"])
+@pytest.fixture(params=[".xls", ".xlsx", ".xlsm", ".ods", ".xlsb"])
 def read_ext(request):
     """
     Valid extensions for reading Excel files.
