ENH: XLSB support #29836

Merged on Jan 20, 2020 (31 commits)
a4f2d22
initial xlsb support
Rik-de-Kort Nov 24, 2019
62564cf
Import order fix for CI pass
Rik-de-Kort Nov 25, 2019
a7a8460
Initial tests
Rik-de-Kort Nov 26, 2019
d9be281
style fixes
Rik-de-Kort Nov 28, 2019
8bf8c78
documentation
Rik-de-Kort Nov 28, 2019
cd95dce
forgot place to document
Rik-de-Kort Nov 28, 2019
7a7390d
Fixed test issue with XLRDError
Rik-de-Kort Nov 30, 2019
248ac12
Fix for unnamed column issue
Rik-de-Kort Nov 30, 2019
6ea78de
style fix
Rik-de-Kort Dec 1, 2019
44c5439
line up with upstream master
Rik-de-Kort Dec 1, 2019
92c98cd
Merge branch 'master' of https://github.com/pandas-dev/pandas
Rik-de-Kort Dec 1, 2019
64fa6f3
Fix broken xlrd test
Rik-de-Kort Dec 2, 2019
cb276e8
get docs to build
Rik-de-Kort Dec 2, 2019
4ebcb48
Remove warning filter
Rik-de-Kort Dec 6, 2019
71436a0
Merge branch 'master' of https://github.com/Rik-de-Kort/pandas
Rik-de-Kort Dec 6, 2019
00cc66b
extended description update
Rik-de-Kort Dec 7, 2019
4c81853
Merge branch 'master' of https://github.com/pandas-dev/pandas
Rik-de-Kort Dec 7, 2019
e85da03
Xlsb options instead of odf options
Rik-de-Kort Dec 9, 2019
2348c3b
Add reference in whatsnew to docs
Rik-de-Kort Dec 11, 2019
d02a5a5
Make pyxlsb show up in install.rst and show_versions
Rik-de-Kort Dec 11, 2019
c71e021
Add pyxlsb to ci builds
Rik-de-Kort Dec 14, 2019
ae3f9ea
environment.yml update
Rik-de-Kort Dec 14, 2019
a410e51
Merge upstream master
Rik-de-Kort Dec 15, 2019
7c9dcce
One update to environment.yml too many
Rik-de-Kort Dec 19, 2019
4bd8400
Trying to fix build
Rik-de-Kort Dec 23, 2019
43ab0fe
Merge upstream
Rik-de-Kort Jan 15, 2020
024492a
Added issue number
Rik-de-Kort Jan 15, 2020
b424c8e
Updated to use .rows(sparse=False) for future compat
Rik-de-Kort Jan 15, 2020
571489b
Merge branch 'master' of https://github.com/pandas-dev/pandas
Rik-de-Kort Jan 17, 2020
dad4a53
xfails in test_readers.py
Rik-de-Kort Jan 17, 2020
9b6bc9a
xfail url loads
Rik-de-Kort Jan 18, 2020
3 changes: 3 additions & 0 deletions ci/deps/azure-37-locale.yaml
@@ -34,3 +34,6 @@ dependencies:
- xlsxwriter
- xlwt
- pyarrow>=0.15
- pip
- pip:
  - pyxlsb
1 change: 1 addition & 0 deletions ci/deps/azure-macos-36.yaml
@@ -33,3 +33,4 @@ dependencies:
- pip
- pip:
  - pyreadstat
  - pyxlsb
3 changes: 3 additions & 0 deletions ci/deps/azure-windows-37.yaml
@@ -35,3 +35,6 @@ dependencies:
- xlsxwriter
- xlwt
- pyreadstat
- pip
- pip:
  - pyxlsb
1 change: 1 addition & 0 deletions ci/deps/travis-36-cov.yaml
@@ -51,3 +51,4 @@ dependencies:
- coverage
- pandas-datareader
- python-dateutil
- pyxlsb
1 change: 1 addition & 0 deletions doc/source/getting_started/install.rst
@@ -264,6 +264,7 @@ pyarrow 0.12.0 Parquet, ORC (requires 0.13.0), and
pymysql 0.7.11 MySQL engine for sqlalchemy
pyreadstat SPSS files (.sav) reading
pytables 3.4.2 HDF5 reading / writing
pyxlsb 1.0.5 Reading for xlsb files
Review comment (Member), suggested change:
- pyxlsb 1.0.5 Reading for xlsb files
+ pyxlsb 1.0.6 Reading for xlsb files

qtpy Clipboard I/O
s3fs 0.3.0 Amazon S3 access
tabulate 0.8.3 Printing in Markdown-friendly format (see `tabulate`_)
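
The pyxlsb row above carries a minimum version (1.0.5 in the diff, 1.0.6 in the review suggestion). As a rough illustration of what a minimum-version gate involves, here is a minimal sketch in plain Python. The helper names `parse_version` and `meets_minimum` are hypothetical and are not pandas functions; pandas performs its own check through `import_optional_dependency`, whose implementation is not shown in this diff. The sketch also assumes purely numeric, dot-separated version strings.

```python
def parse_version(version: str) -> tuple:
    """Split a numeric dotted version such as '1.0.5' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if `installed` is at least `minimum`.

    Comparing tuples of ints avoids the pitfall of comparing strings,
    where '1.0.10' would sort before '1.0.9'.
    """
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("1.0.6", "1.0.5"))   # True
print(meets_minimum("1.0.4", "1.0.5"))   # False
print(meets_minimum("1.0.10", "1.0.9"))  # True
```

Versions with suffixes (for example release candidates) would need a real version parser; this sketch deliberately ignores them.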
29 changes: 27 additions & 2 deletions doc/source/user_guide/io.rst
@@ -23,7 +23,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
text;`JSON <https://www.json.org/>`__;:ref:`read_json<io.json_reader>`;:ref:`to_json<io.json_writer>`
text;`HTML <https://en.wikipedia.org/wiki/HTML>`__;:ref:`read_html<io.read_html>`;:ref:`to_html<io.html>`
text; Local clipboard;:ref:`read_clipboard<io.clipboard>`;:ref:`to_clipboard<io.clipboard>`
- binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
+ ;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
binary;`OpenDocument <http://www.opendocumentformat.org>`__;:ref:`read_excel<io.ods>`;
binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
@@ -2768,7 +2768,8 @@ Excel files

The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``)
files using the ``xlrd`` Python module. Excel 2007+ (``.xlsx``) files
- can be read using either ``xlrd`` or ``openpyxl``.
+ can be read using either ``xlrd`` or ``openpyxl``. Binary Excel (``.xlsb``)
+ files can be read using ``pyxlsb``.
The :meth:`~DataFrame.to_excel` instance method is used for
saving a ``DataFrame`` to Excel. Generally the semantics are
similar to working with :ref:`csv<io.read_csv_table>` data.
@@ -3229,6 +3230,30 @@ OpenDocument spreadsheets match what can be done for `Excel files`_ using
Currently pandas only supports *reading* OpenDocument spreadsheets. Writing
is not implemented.

.. _io.xlsb:

Binary Excel (.xlsb) files
--------------------------

.. versionadded:: 1.0.0

The :func:`~pandas.read_excel` method can also read binary Excel files
using the ``pyxlsb`` module when ``engine='pyxlsb'`` is passed. The semantics
and features for reading binary Excel files mostly match what can be done for
`Excel files`_. Note that ``pyxlsb`` does not recognize datetime types
in files and returns floats instead.

.. code-block:: python

   # Returns a DataFrame
   pd.read_excel('path_to_file.xlsb', engine='pyxlsb')

.. note::

   Currently pandas only supports *reading* binary Excel files. Writing
   is not implemented.
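
Because the documentation above says datetimes come back as floats, a reader of an ``.xlsb`` file may need to convert those floats by hand. The following is a minimal sketch using only the standard library, assuming the workbook uses Excel's default 1900 date system (workbooks saved with the 1904 date system need a different epoch). The function name `excel_serial_to_datetime` is hypothetical and is not part of pandas or pyxlsb.

```python
from datetime import datetime, timedelta

# In the 1900 date system Excel counts a non-existent 1900-02-29, so
# 1899-12-30 is the epoch that gives correct results for dates from
# March 1900 onwards. This is an assumption about the workbook, not
# something pyxlsb reports.
EXCEL_1900_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel serial number (days since the epoch) to a datetime."""
    return EXCEL_1900_EPOCH + timedelta(days=serial)


print(excel_serial_to_datetime(43831.0))  # 2020-01-01 00:00:00
print(excel_serial_to_datetime(43831.5))  # 2020-01-01 12:00:00
```

For a whole column, `pd.to_datetime(df["col"], unit="D", origin="1899-12-30")` should do the same conversion in pandas, where `"col"` is a placeholder column name.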


.. _io.clipboard:

Clipboard
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v1.0.0.rst
@@ -215,7 +215,8 @@ Other enhancements
- :meth:`Styler.format` added the ``na_rep`` parameter to help format the missing values (:issue:`21527`, :issue:`28358`)
- Roundtripping DataFrames with nullable integer, string and period data types to parquet
(:meth:`~DataFrame.to_parquet` / :func:`read_parquet`) using the `'pyarrow'` engine
- now preserve those data types with pyarrow >= 0.16.0 (:issue:`20612`, :issue:`28371`).
+ now preserve those data types with pyarrow >= 1.0.0 (:issue:`20612`).
+ - :func:`read_excel` now can read binary Excel (``.xlsb``) files by passing ``engine='pyxlsb'``. For more details and example usage, see the :ref:`Binary Excel files documentation <io.xlsb>`. Closes :issue:`8540`.
- The ``partition_cols`` argument in :meth:`DataFrame.to_parquet` now accepts a string (:issue:`27117`)
- :func:`pandas.read_json` now parses ``NaN``, ``Infinity`` and ``-Infinity`` (:issue:`12213`)
- :func:`to_parquet` now appropriately handles the ``schema`` argument for user defined schemas in the pyarrow engine. (:issue:`30270`)
1 change: 1 addition & 0 deletions pandas/compat/_optional.py
@@ -19,6 +19,7 @@
"pyarrow": "0.13.0",
"pytables": "3.4.2",
"pytest": "5.0.1",
"pyxlsb": "1.0.5",
Review comment (Member), suggested change:
- "pyxlsb": "1.0.5",
+ "pyxlsb": "1.0.6",

"s3fs": "0.3.0",
"scipy": "0.19.0",
"sqlalchemy": "1.1.4",
8 changes: 8 additions & 0 deletions pandas/core/config_init.py
@@ -479,6 +479,7 @@ def use_inf_as_na_cb(key):
_xlsm_options = ["xlrd", "openpyxl"]
_xlsx_options = ["xlrd", "openpyxl"]
_ods_options = ["odf"]
_xlsb_options = ["pyxlsb"]


with cf.config_prefix("io.excel.xls"):
@@ -515,6 +516,13 @@ def use_inf_as_na_cb(key):
validator=str,
)

with cf.config_prefix("io.excel.xlsb"):
    cf.register_option(
        "reader",
        "auto",
        reader_engine_doc.format(ext="xlsb", others=", ".join(_xlsb_options)),
        validator=str,
    )
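
The option registered above should be reachable through the public options API as `io.excel.xlsb.reader`. The snippet below is a small sketch of reading and temporarily overriding it, assuming a pandas version that includes this change (1.0.0 or later). It only touches the option value and does not open any file.

```python
import pandas as pd

# The default registered in config_init.py is "auto".
print(pd.get_option("io.excel.xlsb.reader"))

# Override the reader engine for a limited scope; the previous value is
# restored when the block exits.
with pd.option_context("io.excel.xlsb.reader", "pyxlsb"):
    print(pd.get_option("io.excel.xlsb.reader"))

print(pd.get_option("io.excel.xlsb.reader"))
```

Since the validator is `str`, any string is accepted at registration level; whether an engine name is actually usable is decided later, when a file is read.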

# Set up the io.excel specific writer configuration.
writer_engine_doc = """
17 changes: 12 additions & 5 deletions pandas/io/excel/_base.py
@@ -35,8 +35,9 @@
"""
Read an Excel file into a pandas DataFrame.
- Support both `xls` and `xlsx` file extensions from a local filesystem or URL.
- Support an option to read a single sheet or a list of sheets.
+ Supports `xls`, `xlsx`, `xlsm`, `xlsb`, and `odf` file extensions
+ read from a local filesystem or URL. Supports an option to read
+ a single sheet or a list of sheets.
Parameters
----------
@@ -789,15 +790,21 @@ class ExcelFile:
If a string or path object, expected to be a path to xls, xlsx or odf file.
engine : str, default None
If io is not a buffer or path, this must be set to identify io.
- Acceptable values are None, ``xlrd``, ``openpyxl`` or ``odf``.
+ Acceptable values are None, ``xlrd``, ``openpyxl``, ``odf``, or ``pyxlsb``.
Note that ``odf`` reads tables out of OpenDocument formatted files.
"""

from pandas.io.excel._odfreader import _ODFReader
from pandas.io.excel._openpyxl import _OpenpyxlReader
from pandas.io.excel._xlrd import _XlrdReader

- _engines = {"xlrd": _XlrdReader, "openpyxl": _OpenpyxlReader, "odf": _ODFReader}
+ from pandas.io.excel._pyxlsb import _PyxlsbReader
+
+ _engines = {
+     "xlrd": _XlrdReader,
+     "openpyxl": _OpenpyxlReader,
+     "odf": _ODFReader,
+     "pyxlsb": _PyxlsbReader,
+ }

def __init__(self, io, engine=None):
if engine is None:
68 changes: 68 additions & 0 deletions pandas/io/excel/_pyxlsb.py
@@ -0,0 +1,68 @@
from typing import List

from pandas._typing import FilePathOrBuffer, Scalar
from pandas.compat._optional import import_optional_dependency

from pandas.io.excel._base import _BaseExcelReader


class _PyxlsbReader(_BaseExcelReader):
    def __init__(self, filepath_or_buffer: FilePathOrBuffer):
        """Reader using pyxlsb engine.

        Parameters
        ----------
        filepath_or_buffer: string, path object, or Workbook
            Object to be parsed.
        """
        import_optional_dependency("pyxlsb")
        # This will call load_workbook on the filepath or buffer
        # And set the result to the book-attribute
        super().__init__(filepath_or_buffer)

    @property
    def _workbook_class(self):
        from pyxlsb import Workbook

        return Workbook

    def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
        from pyxlsb import open_workbook

        # Todo: hack in buffer capability
        # This might need some modifications to the Pyxlsb library
        # Actual work for opening it is in xlsbpackage.py, line 20-ish

        return open_workbook(filepath_or_buffer)

    @property
    def sheet_names(self) -> List[str]:
        return self.book.sheets

    def get_sheet_by_name(self, name: str):
        return self.book.get_sheet(name)

    def get_sheet_by_index(self, index: int):
        # pyxlsb sheets are indexed from 1 onwards
        # There's a fix for this in the source, but the pypi package doesn't have it
        return self.book.get_sheet(index + 1)

    def _convert_cell(self, cell, convert_float: bool) -> Scalar:
        # Todo: there is no way to distinguish between floats and datetimes in pyxlsb
        # This means that there is no way to read datetime types from an xlsb file yet
        if cell.v is None:
            return ""  # Prevents non-named columns from not showing up as Unnamed: i
        if isinstance(cell.v, float) and convert_float:
            val = int(cell.v)
            if val == cell.v:
                return val
            else:
                return float(cell.v)

        return cell.v

    def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]:
        return [
            [self._convert_cell(c, convert_float) for c in r]
            for r in sheet.rows(sparse=False)
        ]
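
To make the behaviour of `_convert_cell` easier to follow, here is a standalone sketch of the same conversion rules that runs without pyxlsb. `Cell` is a hypothetical stand-in for a pyxlsb cell that only models the `v` (value) attribute used above, and `convert_cell` is a free-function copy of the method's logic for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a pyxlsb cell; only the `v` attribute is modelled.
Cell = namedtuple("Cell", ["v"])


def convert_cell(cell, convert_float: bool):
    """Mirror the conversion rules of _PyxlsbReader._convert_cell."""
    if cell.v is None:
        return ""  # empty cells become empty strings
    if isinstance(cell.v, float) and convert_float:
        val = int(cell.v)
        if val == cell.v:
            return val  # integral floats are returned as ints
        return float(cell.v)
    return cell.v


row = [Cell(1.0), Cell(2.5), Cell(None), Cell("text")]
print([convert_cell(c, True) for c in row])   # [1, 2.5, '', 'text']
print([convert_cell(c, False) for c in row])  # [1.0, 2.5, '', 'text']
```

One consequence worth noting: a date cell arrives as a float such as `43831.0`, so with `convert_float=True` it comes out as the integer `43831`, which is why the reader cannot tell dates from numbers.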
Binary file added pandas/tests/io/data/excel/blank.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/blank_with_header.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test1.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test2.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test3.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test4.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test5.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test_converters.xlsb
Binary file not shown.
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test_multisheet.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test_squeeze.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/test_types.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/testdateoverflow.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/testdtype.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/testmultiindex.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/testskiprows.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/times_1900.xlsb
Binary file not shown.
Binary file added pandas/tests/io/data/excel/times_1904.xlsb
Binary file not shown.
2 changes: 1 addition & 1 deletion pandas/tests/io/excel/conftest.py
@@ -35,7 +35,7 @@ def df_ref(datapath):
return df_ref


- @pytest.fixture(params=[".xls", ".xlsx", ".xlsm", ".ods"])
+ @pytest.fixture(params=[".xls", ".xlsx", ".xlsm", ".ods", ".xlsb"])
def read_ext(request):
"""
Valid extensions for reading Excel files.