ENH: Add calamite engine to `read_excel` #50581

kostyafarber · 2023-01-05T08:44:45Z

Adds the Excel reader from the Rust library calamine as an engine, using the Python binding library python-calamine.

closes ENH: Adding support for calamine as Excel reader engine #50395
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

kostyafarber · 2023-01-05T08:45:27Z

Opened as a first stab. There are quite a few tests failing that we would need to investigate when adding the engine.

fangchenli · 2023-01-06T16:34:21Z

python-calamine is not even half a year old. There might be pushback.

python-calamine doesn't have a wheel for arm macOS, and the instance on CircleCI doesn't have Rust installed. You either need to get the author of python-calamine to produce a wheel for arm macOS or install the Rust toolchain on CircleCI.

ci/deps/actions-38.yaml

pandas/io/excel/_calamitereader.py

kostyafarber · 2023-01-06T18:03:33Z

> python-calamine is not even half a year old. There might be pushback.
>
> python-calamine doesn't have a wheel for arm macOS, and the instance on CircleCI doesn't have Rust installed. You either need to get the author of python-calamine to produce a wheel for arm macOS or install the Rust toolchain on CircleCI.

That's a fair point. Can we have a discussion with the relevant people if we want to go ahead with this then?

Separately, I can open an issue to get a wheel for this library, if we want to go ahead with it.

mroeschke · 2023-01-06T19:57:18Z

Yeah, I think I would be -0.5 on adding this new engine at the moment. As @fangchenli mentioned, it appears to be a relatively new package and doesn't seem to have a very active development community around it yet.

WillAyd · 2023-01-06T21:28:50Z

Yea I'm also hesitant to bring this in - both the python binding and the core library itself don't have that much development momentum. Something to watch as they have nice ideas, but I don't think they are at the point yet where pandas internally should change anything to interface with them

lithomas1 · 2023-01-07T02:42:46Z

IMO, it's better to judge this by how many tests it's able to pass. If it's able to pass most (say ~60-70%) of them for at least one of its formats, then it being new doesn't really matter.

Right now, we depend on two libraries, xlrd and odfpy, that are unmaintained, and pyxlsb seems semi-maintained, so I would argue that the level of maintenance seems just about average in terms of the Excel parsers that we use.
(The only major concern I have is if python-calamine is unable to keep pace with calamine)

What does everyone else think?

Also cc @rhshadrach

fangchenli · 2023-01-07T04:08:24Z

> IMO, it's better to judge this by how many tests it's able to pass. If it's able to pass most (say ~60-70%) of them for at least one of its formats, then it being new doesn't really matter.

I agree. Let's see how far this PR could go.

gfyoung · 2023-01-07T11:25:48Z

One other thing: even if this is not something we want to merge right now, IMO better to get a working implementation that can always be maintained as a PR for the longer term than waiting until "the time is right" to actually implement.

gfyoung · 2023-01-07T11:28:13Z

pandas/io/excel/_calamitereader.py

+
+    def get_sheet_by_index(self, index: int) -> int:
+        self.raise_if_bad_sheet_by_index(index)
+        return index


I presume there is more being added here?

This was ported over from the implementation in the library's repo here, but it evidently wasn't implemented by the maintainer.

I'm unsure if this is something the maintainer of the library will add in the future but I can remove this for now.

python-calamine doesn't return a sheet object from calamine (you can use fastexcel if you need one). These functions are used only in https://github.com/pandas-dev/pandas/blob/1f836f16b15adc0838afe634df5bda3977b3ad10/pandas/io/excel/_base.py#L733_L736 and work in that case.
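A minimal sketch of why returning the index can be enough, assuming a reader whose backing library addresses sheets by position instead of handing out sheet objects. `TinyReader` and its in-memory sheet data are hypothetical and exist only for illustration; only the method names `get_sheet_by_index` and `raise_if_bad_sheet_by_index` come from the hunk above.

```python
from __future__ import annotations


class TinyReader:
    """Hypothetical stand-in for an Excel reader; not the pandas class."""

    def __init__(self, sheets: dict[str, list[list[object]]]) -> None:
        self._sheets = sheets
        self.sheet_names = list(sheets)

    def raise_if_bad_sheet_by_index(self, index: int) -> None:
        n_sheets = len(self.sheet_names)
        if index >= n_sheets:
            raise ValueError(
                f"Worksheet index {index} is invalid, {n_sheets} worksheets found"
            )

    def get_sheet_by_index(self, index: int) -> int:
        # The "sheet" handed back is just the validated position.
        self.raise_if_bad_sheet_by_index(index)
        return index

    def get_sheet_data(self, sheet: int) -> list[list[object]]:
        # The position is resolved to rows only when data is requested.
        return self._sheets[self.sheet_names[sheet]]


reader = TinyReader({"first": [[1, 2]], "second": [["a", "b"]]})
rows = reader.get_sheet_data(reader.get_sheet_by_index(1))
```

The trade-off in this sketch is that the caller never holds a sheet object, so anything that needs sheet-level metadata has to go back through the reader.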

rhshadrach · 2023-01-07T14:30:37Z

Given the current state of development of calamine, if the standard features of the pandas Excel reader are functional, I would be okay adding this if it were tagged as experimental, maybe even going so far as having the user specify it as engine="calamine-experimental". I think this would communicate the risk involved to users.

> If it's able to pass most (say ~60-70%) of them for at least one of its formats, then it being new doesn't really matter.

This seems too low to me. Some exceptions for certain tests are okay; I'd approach on a case-by-case basis. But if users cannot reliably use an engine in its current state, then I would not add even as experimental.
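One way the "experimental" tagging could look, as a rough sketch only. The engine name `calamine-experimental` is the suggestion from the comment above, while `EXPERIMENTAL_ENGINES` and `resolve_engine` are hypothetical helpers and not part of pandas.

```python
import warnings

# Assumed mapping from an experimental engine name to its backend name.
EXPERIMENTAL_ENGINES = {"calamine-experimental": "calamine"}


def resolve_engine(engine: str) -> str:
    """Map an experimental engine name to its backend and warn the caller."""
    if engine in EXPERIMENTAL_ENGINES:
        warnings.warn(
            f"engine={engine!r} is experimental and may change or be removed",
            UserWarning,
            stacklevel=2,
        )
        return EXPERIMENTAL_ENGINES[engine]
    # Established engines pass through untouched and silently.
    return engine
```

Encoding the risk in the engine name means users opt in explicitly, and the warning repeats the message at call time.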

kostyafarber · 2023-01-09T09:18:40Z

Update: the author has built a wheel for arm macOS.

dimastbk · 2023-01-10T07:03:27Z

Current tests: 168 failed, 354 passed (67%). I need some time to fix some of the failures. But, for example, I can't fix the tests with date/datetime, because that is implemented in calamine only for xlsx (tafia/calamine#198).
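For reference, the quoted percentage can be reproduced from the two counts above. This is plain arithmetic; `pass_rate` is a hypothetical helper name, and the 60% and 70% bounds are the rough range floated earlier in the thread.

```python
def pass_rate(passed: int, failed: int) -> float:
    """Fraction of executed tests that passed."""
    return passed / (passed + failed)


rate = pass_rate(passed=354, failed=168)
print(f"{rate:.1%}")  # 67.8%, which truncates to the 67% quoted above

# Compare against the rough 60-70% range mentioned earlier in the thread.
clears_lower_bound = rate >= 0.60
clears_upper_bound = rate >= 0.70
```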

kostyafarber · 2023-01-10T13:08:58Z

I think there are some other readers that can't read datetime and are excluded from the tests that include datetime.

Perhaps we could do the same in this case.

kostyafarber · 2023-01-12T19:46:42Z

> Current tests: 168 failed, 354 passed (67%). I need some time to fix some of the failures. But, for example, I can't fix the tests with date/datetime, because that is implemented in calamine only for xlsx (tafia/calamine#198).

No worries, keep us posted on your progress. I am happy to help out where necessary.

WillAyd · 2023-03-16T20:25:09Z

pandas/io/excel/_calaminereader.py

+    inspect_excel_format,
+)
+
+ValueT = Union[int, float, str, bool, time, date, datetime]


Can you move this into the function it is actually used in? Sounds like it has a pretty localized use, so no need to be in the global namespace
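A small sketch of the scoping question, under stated assumptions: the alias is kept at module level but made private with a leading underscore (the timeline records a later rename to `_CellValueT`), because an alias defined inside a function body cannot annotate that function's own signature. `convert_cell` and its behaviour are hypothetical, not the code under review.

```python
from __future__ import annotations

from datetime import date, datetime, time
from typing import Union

# Leading underscore keeps the alias out of the module's public surface.
_CellValueT = Union[int, float, str, bool, time, date, datetime]


def convert_cell(value: _CellValueT) -> _CellValueT:
    """Hypothetical converter: collapse integral floats, pass the rest through."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value
```

For example, `convert_cell(3.0)` gives the integer `3`, while `convert_cell(2.5)` and `convert_cell("x")` are returned unchanged.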

WillAyd · 2023-03-16T20:26:19Z

doc/source/whatsnew/v2.0.0.rst

@@ -288,9 +288,11 @@ Other enhancements
 - :meth:`Series.dropna` and :meth:`DataFrame.dropna` has gained ``ignore_index`` keyword to reset index (:issue:`31725`)
 - Improved error message in :func:`to_datetime` for non-ISO8601 formats, informing users about the position of the first error (:issue:`50361`)
 - Improved error message when trying to align :class:`DataFrame` objects (for example, in :func:`DataFrame.compare`) to clarify that "identically labelled" refers to both index and columns (:issue:`50083`)
+- Performance improvement in :func:`to_datetime` when format is given or can be inferred (:issue:`50465`)


We are in the process of cutting 2.0.0 now; at this point this PR should target 2.1.0.

@kostyafarber, hi! Can you merge main in issue-50395?

Yep will do

@dimastbk merged!

Looks like this is still in the v2.0.0 whatsnew

Yes, that is why I asked for main to be merged into the PR branch. It is fixed now.

You should revert any changes to this file - the v2.0.0.rst file shouldn't be touched as part of this PR

WillAyd · 2023-03-16T20:28:48Z

pandas/io/excel/_calaminereader.py

+        super().__init__(filepath_or_buffer, storage_options=storage_options)
+
+    @property
+    def _workbook_class(self):


The protocol may not be explicitly stated in code, but whatever is returned here is supposed to represent the concept of a Workbook. I'm not very familiar with calamine, but the name CalamineReader sounds like it is responsible for loading data rather than representing a workbook concept.

WillAyd · 2023-03-16T20:33:43Z

pandas/io/excel/_calaminereader.py

+        return CalamineReader
+
+    def load_workbook(self, filepath_or_buffer: FilePath | ReadBuffer[bytes]):
+        if hasattr(filepath_or_buffer, "read") and hasattr(filepath_or_buffer, "seek"):


I think you can get rid of all of this if _workbook_class is properly implemented; looks like the base class should handle things gracefully
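To illustrate the point about `_workbook_class`, here is a simplified sketch of a base reader that dispatches on that property so subclasses need no buffer checks of their own. It assumes the dispatch happens in `__init__`; `Workbook`, `BaseReader`, and `SketchReader` are hypothetical names, and the real pandas base class may differ in detail.

```python
from __future__ import annotations

import io


class Workbook:
    """Hypothetical workbook type standing in for the library's own class."""

    def __init__(self, payload: bytes) -> None:
        self.payload = payload


class BaseReader:
    """Simplified base reader; assumes input dispatch happens in __init__."""

    def __init__(self, filepath_or_buffer) -> None:
        if isinstance(filepath_or_buffer, self._workbook_class):
            # Already a workbook: use it as-is.
            self.book = filepath_or_buffer
        elif hasattr(filepath_or_buffer, "read"):
            # File-like: rewind once here so subclasses need not.
            filepath_or_buffer.seek(0)
            self.book = self.load_workbook(filepath_or_buffer)
        else:
            raise ValueError("expected a workbook or a readable buffer")

    @property
    def _workbook_class(self):
        raise NotImplementedError

    def load_workbook(self, filepath_or_buffer):
        raise NotImplementedError


class SketchReader(BaseReader):
    @property
    def _workbook_class(self):
        return Workbook

    def load_workbook(self, filepath_or_buffer) -> Workbook:
        # No hasattr checks needed; the base class already validated the input.
        return Workbook(filepath_or_buffer.read())
```

With this shape, the subclass only names its workbook type and says how to build one.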

WillAyd

Implementation looks pretty good - getting down to minor stuff now. Great work

WillAyd · 2023-04-03T16:33:47Z

pandas/io/excel/_calaminereader.py

+    def load_workbook(self, filepath_or_buffer: FilePath | ReadBuffer[bytes]):
+        from python_calamine import load_workbook
+
+        return load_workbook(filepath_or_buffer)  # type: ignore[arg-type]


Can you advise what the mypy errors are for this and the subsequent ones? Not necessarily a blocker but surprised to see these

pyright:

pandas/io/excel/_calamine.py:60:30 - error: Argument of type "FilePath | ReadBuffer[bytes]" cannot be assigned to parameter "path_or_filelike" of type "str | PathLike | ReadBuffer" in function "load_workbook"
  Type "FilePath | ReadBuffer[bytes]" cannot be assigned to type "str | PathLike | ReadBuffer"
    Type "ReadBuffer[bytes]" cannot be assigned to type "str | PathLike | ReadBuffer"
      "ReadBuffer[bytes]" is incompatible with "str"
      "ReadBuffer[bytes]" is incompatible with protocol "PathLike"
        "__fspath__" is not present
      "ReadBuffer[bytes]" is incompatible with protocol "ReadBuffer"
        "seek" is an incompatible type
          Type "(__offset: int, __whence: int = ..., /) -> int" cannot be assigned to type "() -> int" (reportGeneralTypeIssues)

mypy:

pandas/io/excel/_calamine.py:60: error: Argument 1 to "load_workbook" has incompatible type "Union[Union[str, PathLike[str]], pandas._typing.ReadBuffer[bytes]]"; expected "Union[str, PathLike[Any], python_calamine._python_calamine.ReadBuffer]" [arg-type]
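The mismatch reported above can be reproduced in isolation. In this sketch, the two `seek` signatures follow the pyright message, while the `read` signatures and the names `WideReadBuffer`, `NarrowReadBuffer`, `load`, and `caller` are assumptions made for illustration.

```python
from __future__ import annotations

import io
from typing import Protocol


class WideReadBuffer(Protocol):
    """Caller-side shape: seek takes an offset and an optional whence."""

    def read(self, n: int = ...) -> bytes: ...
    def seek(self, offset: int, whence: int = ...) -> int: ...


class NarrowReadBuffer(Protocol):
    """Callee-side shape from the error message: seek takes no arguments."""

    def read(self, n: int = ...) -> bytes: ...
    def seek(self) -> int: ...


def load(buffer: NarrowReadBuffer) -> bytes:
    return buffer.read()


def caller(buffer: WideReadBuffer) -> bytes:
    # A type checker rejects this call: WideReadBuffer.seek requires an
    # offset, so it cannot stand in for a seek() that accepts no arguments.
    return load(buffer)  # type: ignore[arg-type]


data = caller(io.BytesIO(b"xlsx"))
```

At runtime the call works, which suggests the error concerns how the callee's protocol is declared and not how the buffer behaves.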

WillAyd · 2023-04-03T16:34:19Z

doc/source/whatsnew/v2.0.0.rst

You should revert any changes to this file - the v2.0.0.rst file shouldn't be touched as part of this PR

WillAyd · 2023-04-03T16:35:39Z

pandas/tests/io/excel/test_readers.py

+        if engine == "calamine" and read_ext in {".xls", ".xlsb", ".ods"}:
+            request.node.add_marker(
+                pytest.mark.xfail(
+                    reason="Calamine support parsing datetime only in xlsx"


It looks like the PR linked above was merged - should ods be removed from this xfail?
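The condition in the hunk above can be pulled into a small predicate, which makes the question about `.ods` a one-line change. This is a sketch only: `should_xfail_datetime` and `NO_DATETIME_EXTS` are hypothetical names, and the default set below assumes `.ods` is dropped, which the thread has not confirmed.

```python
from __future__ import annotations

# Extensions for which datetime parsing is assumed unsupported. The set in
# the hunk above also contains ".ods", which is the entry under question.
NO_DATETIME_EXTS = frozenset({".xls", ".xlsb"})


def should_xfail_datetime(
    engine: str, read_ext: str, unsupported: frozenset[str] = NO_DATETIME_EXTS
) -> bool:
    """Return True when a datetime test is expected to fail for this pair."""
    return engine == "calamine" and read_ext in unsupported
```

A test would then add the `pytest.mark.xfail` marker only when the predicate is true. Passing `unsupported=frozenset({".xls", ".xlsb", ".ods"})` restores the behaviour shown in the diff.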

WillAyd · 2023-04-03T16:37:52Z

doc/source/user_guide/io.rst

@@ -3420,7 +3420,8 @@ Excel files
 The :func:`~pandas.read_excel` method can read Excel 2007+ (``.xlsx``) files
 using the ``openpyxl`` Python module. Excel 2003 (``.xls``) files
 can be read using ``xlrd``. Binary Excel (``.xlsb``)
-files can be read using ``pyxlsb``.
+files can be read using ``pyxlsb``. Also, all this formats can be read using ``python-calamine``,


Is the datetime issue the only limitation? If so we can probably be more explicit and say something like python-calamine can be used to read all formats, but specifically does not support reading datetimes from .xls and .xlsb formats

Datetime is the main limitation, but there are a few more bugs, #50581 (comment). I suppressed them all with pytest.xfail, but should I write about them in documentation?

mroeschke · 2023-05-15T20:57:05Z

Thanks for the PR, but it appears to have gone stale. Additionally, with PDEP-9, I think this could be better supported by a separate library without being natively included in pandas, so closing for now. #51799

kostyafarber and others added 2 commits January 5, 2023 08:39

ENH: add calamite excel reader and modify test to include engine

30da9a4

Merge branch 'main' into issue-50395

a47d3fb

kostyafarber and others added 4 commits January 5, 2023 13:59

Merge branch 'main' into issue-50395

6c1dd87

fix deps for python-calamine

fd06ad9

Merge branch 'main' into issue-50395

8b6200a

fix deps for python-calamine, add as pip package

6a8d822

fangchenli added IO Excel read_excel, to_excel Dependencies Required and optional dependencies labels Jan 6, 2023

fangchenli requested changes Jan 6, 2023

ci/deps/actions-38.yaml (outdated, resolved)

fangchenli requested changes Jan 6, 2023

pandas/io/excel/_calamitereader.py (outdated, resolved)

gfyoung added the Enhancement label Jan 7, 2023

gfyoung reviewed Jan 7, 2023

kostyafarber mentioned this pull request Jan 7, 2023

Wheel for arm MacOS dimastbk/python-calamine#3

Closed

ENH: fix typo in engine declaration, add import_optional_dependency, …

efcb2fc

…fix actions-38.yaml

kostyafarber and others added 3 commits January 11, 2023 08:11

Merge branch 'main' into issue-50395

e1105de

calamite -> calamine, updated some tests for calamine

6b50e0c

calamine excel engine: skip tests with datetime

0784733

github-actions bot added the Stale label Mar 11, 2023

WillAyd reviewed Mar 16, 2023

dimastbk added a commit to dimastbk/python-calamine that referenced this pull request Mar 22, 2023

backported pandas.CalamineExcelReader from pandas-dev/pandas#50581

4dfca9b

dimastbk added a commit to dimastbk/python-calamine that referenced this pull request Mar 23, 2023

backported pandas.CalamineExcelReader from pandas-dev/pandas#50581

da21110

kostyafarber and others added 7 commits March 23, 2023 09:45

Merge branch 'main' into issue-50395

85d31ec

Merge branch 'main' into issue-50395

a0d4193

bump python-calamine to 0.1.0

a6b6fb2

_ValueT -> _CellValueT

0a431c5

Merge pull request #6 from dimastbk/issue-50395

745cd09

bump python-calamine to 0.1.0

Merge branch 'main' into issue-50395

942a16a

Merge branch 'main' into issue-50395

8803ca9

WillAyd requested changes Apr 3, 2023

dimastbk and others added 5 commits April 4, 2023 01:20

added xfail to tests, small fixes

2f5ffba

Merge pull request #7 from dimastbk/issue-50395

b8b1a9a

added xfail to tests, small fixes

Merge branch 'main' into issue-50395

f5ab40d

bump calamine to 0.1.1, update tests (472 passed, 75 xfailed), update…

02c2e7f

… docs

Merge pull request #8 from dimastbk/issue-50395

74a3e70

mroeschke closed this May 15, 2023

dimastbk added a commit to dimastbk/pandas that referenced this pull request Jul 15, 2023

ENH: add calamine excel reader (see also pandas-dev#50581)

bafe865

dimastbk pushed a commit to dimastbk/pandas that referenced this pull request Sep 4, 2023

ENH: add calamine excel reader (close pandas-dev#50395)

16b0aad

Co-author: Kostya Farber (see pandas-dev#50581)

dimastbk pushed a commit to dimastbk/pandas that referenced this pull request Sep 4, 2023

ENH: add calamine excel reader (close pandas-dev#50395)

bdf286b

Co-author: Kostya Farber (pandas-dev#50581)

dimastbk mentioned this pull request Sep 4, 2023

ENH: add calamine excel reader (close #50395) #54998

Merged

5 tasks

dimastbk added a commit to dimastbk/pandas that referenced this pull request Sep 5, 2023

ENH: add calamine excel reader (close pandas-dev#50395)

b6701f0

Co-author: Kostya Farber (pandas-dev#50581)

dimastbk added a commit to dimastbk/pandas that referenced this pull request Sep 6, 2023

ENH: add calamine excel reader (close pandas-dev#50395)

eaafb8c

Co-author: Kostya Farber (pandas-dev#50581)

dimastbk added a commit to dimastbk/pandas that referenced this pull request Sep 6, 2023

ENH: add calamine excel reader (close pandas-dev#50395)

30288dc

Co-author: Kostya Farber (pandas-dev#50581)

dimastbk added a commit to dimastbk/pandas that referenced this pull request Sep 6, 2023

ENH: add calamine excel reader (close pandas-dev#50395)

fe2f6de

Co-author: Kostya Farber (pandas-dev#50581)

dimastbk added a commit to dimastbk/pandas that referenced this pull request Sep 7, 2023

ENH: add calamine excel reader (close pandas-dev#50395)

13146e1

Co-author: Kostya Farber (pandas-dev#50581)

dimastbk added a commit to dimastbk/pandas that referenced this pull request Sep 9, 2023

ENH: add calamine excel reader (close pandas-dev#50395)

c43a34b

Co-author: Kostya Farber (pandas-dev#50581)
