ENH: Add calamite engine to read_excel
#50581
Opened as a first stab. There are quite a few tests failing that we would need to investigate when adding the engine.
That's a fair point. Can we have a discussion with the relevant people about whether we want to go ahead with this? Separately, I can open an issue to get a wheel for this library, if we want to go ahead with it.
Yeah, I think I would be -0.5 on adding this new engine at the moment. As @fangchenli mentioned, it appears to be a relatively new package and doesn't seem to have a very active development community around it yet.
Yeah, I'm also hesitant to bring this in: both the Python binding and the core library itself don't have that much development momentum. Something to watch, as they have nice ideas, but I don't think they are at the point yet where pandas internally should change anything to interface with them.
IMO, it's better to judge this by how many tests it's able to pass. If it's able to pass most (say ~60-70%) of them for at least one of its formats, then its being new doesn't really matter. Right now, we depend on two libraries, xlrd and odfpy, that are unmaintained, and pyxlsb seems semi-maintained, so I would argue that the level of maintenance seems just about average in terms of the Excel parsers that we use. What does everyone else think? Also cc @rhshadrach
I agree. Let's see how far this PR could go.
One other thing: even if this is not something we want to merge right now, IMO it is better to get a working implementation that can always be maintained as a PR for the longer term than to wait until "the time is right" to actually implement it.
pandas/io/excel/_calamitereader.py
Outdated
    def get_sheet_by_index(self, index: int) -> int:
        self.raise_if_bad_sheet_by_index(index)
        return index
I presume there is more being added here?
This was ported over from the implementation in the library's repo here, but obvs wasn't implemented by the maintainer.
I'm unsure if this is something the maintainer of the library will add in the future but I can remove this for now.
python-calamine doesn't return a sheet object from calamine (you can use fastexcel if you need one). These functions are used only in https://github.com/pandas-dev/pandas/blob/1f836f16b15adc0838afe634df5bda3977b3ad10/pandas/io/excel/_base.py#L733_L736 and work in this case.
Fixed.
Given the current state of development of calamine, if the standard features of the pandas Excel reader are functional, I would be okay adding this if it were tagged as experimental, maybe even going so far as to having the user specify it as
This seems too low to me. Some exceptions for certain tests are okay; I'd approach them on a case-by-case basis. But if users cannot reliably use an engine in its current state, then I would not add it even as experimental.
…fix actions-38.yaml
Update: author has built a wheel for Mac OS arm
Current tests: 168 failed, 354 passed (67%). I need some time to fix some fails. But for example I can't fix tests with date/datetime, because this is implemented in calamine only for xlsx (tafia/calamine#198).
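For reference, the percentage quoted above follows directly from the two counts. A minimal sketch of the arithmetic (the helper name `pass_rate` is made up for illustration and is not part of pandas or the test suite):

```python
def pass_rate(passed: int, failed: int) -> float:
    """Return the fraction of executed tests that passed."""
    total = passed + failed
    if total == 0:
        raise ValueError("no tests were run")
    return passed / total


# Counts taken from the comment above: 354 passed, 168 failed.
rate = pass_rate(passed=354, failed=168)
print(f"{rate:.1%}")  # prints "67.8%"
```

That figure sits inside the ~60-70% band suggested earlier in the thread as a rough bar.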
I think there are some other readers that can't read datetime and are excluded from the tests that include datetime. Perhaps we could do the same in this case.
No worries, let us know about your progress. I am happy to help out where necessary.
pandas/io/excel/_calaminereader.py
Outdated
    inspect_excel_format,
)

ValueT = Union[int, float, str, bool, time, date, datetime]
Can you move this into the function it is actually used in? Sounds like it has a pretty localized use, so no need to be in the global namespace
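A rough sketch of what the suggestion could look like, assuming the alias is only needed by one cell-conversion helper. The function names `normalize_row` and `_convert_cell` and the float-to-int logic are illustrative assumptions, not the PR's actual code:

```python
from __future__ import annotations

from datetime import date, datetime, time
from typing import Union


def normalize_row(row: list) -> list:
    # The alias lives inside the only function that needs it,
    # so it never appears in the module's global namespace.
    ValueT = Union[int, float, str, bool, time, date, datetime]

    def _convert_cell(value: ValueT) -> ValueT:
        # Illustrative rule (an assumption): whole-number floats become ints.
        if isinstance(value, float) and value.is_integer():
            return int(value)
        return value

    return [_convert_cell(v) for v in row]


cells = normalize_row([1.0, 2.5, "a", True])
```

The trade-off is that a function-local alias cannot be reused in other signatures, which is fine when its use really is this localized.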
doc/source/whatsnew/v2.0.0.rst
Outdated
@@ -288,9 +288,11 @@ Other enhancements
- :meth:`Series.dropna` and :meth:`DataFrame.dropna` has gained ``ignore_index`` keyword to reset index (:issue:`31725`)
- Improved error message in :func:`to_datetime` for non-ISO8601 formats, informing users about the position of the first error (:issue:`50361`)
- Improved error message when trying to align :class:`DataFrame` objects (for example, in :func:`DataFrame.compare`) to clarify that "identically labelled" refers to both index and columns (:issue:`50083`)
- Performance improvement in :func:`to_datetime` when format is given or can be inferred (:issue:`50465`)
We are in the process of cutting 2.0.0 now; at this point this PR should target 2.1.0.
@kostyafarber, hi! Can you merge main in issue-50395?
Yep will do
@dimastbk merged!
Looks like this is still in the v2.0.0 whatsnew
Yes, I asked for merging main in pr branch. Now fixed.
merged.
You should revert any changes to this file - the v2.0.0.rst file shouldn't be touched as part of this PR
pandas/io/excel/_calaminereader.py
Outdated
        super().__init__(filepath_or_buffer, storage_options=storage_options)

    @property
    def _workbook_class(self):
The protocol must not be explicitly stated in code, but whatever is returned here is supposed to represent the concept of a Workbook. Not very familiar with calamine, but the name CalamineReader sounds more responsible for loading data than representing a workbook concept.
Fixed.
pandas/io/excel/_calaminereader.py
Outdated
        return CalamineReader

    def load_workbook(self, filepath_or_buffer: FilePath | ReadBuffer[bytes]):
        if hasattr(filepath_or_buffer, "read") and hasattr(filepath_or_buffer, "seek"):
I think you can get rid of all of this if _workbook_class is properly implemented; looks like the base class should handle things gracefully.
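To illustrate the division of labour being described, here is a simplified, self-contained sketch. All class names (`WorkbookSketch`, `BaseReaderSketch`, `EngineReaderSketch`) are hypothetical stand-ins; this is not pandas' actual base reader, which handles more cases (buffers, storage options, and so on):

```python
class WorkbookSketch:
    """Hypothetical stand-in for whatever object represents an opened workbook."""

    def __init__(self, source: str) -> None:
        self.source = source


class BaseReaderSketch:
    """Simplified base class: it decides whether loading is needed at all."""

    def __init__(self, source) -> None:
        if isinstance(source, self._workbook_class):
            # Already a workbook: reuse it instead of loading again.
            self.book = source
        else:
            self.book = self.load_workbook(source)

    @property
    def _workbook_class(self):
        raise NotImplementedError

    def load_workbook(self, source):
        raise NotImplementedError


class EngineReaderSketch(BaseReaderSketch):
    @property
    def _workbook_class(self):
        # Must name the workbook type, not the reader itself.
        return WorkbookSketch

    def load_workbook(self, source):
        # The engine only opens the file; no buffer/path branching here.
        return WorkbookSketch(source)


from_path = EngineReaderSketch("report.xlsx")
reused = EngineReaderSketch(from_path.book)
```

Under these assumptions, once `_workbook_class` names the right type, the subclass's `load_workbook` can stay a one-liner because the dispatch lives in the base class.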
Fixed.
bump python-calamine to 0.1.0
Implementation looks pretty good - getting down to minor stuff now. Great work
pandas/io/excel/_calaminereader.py
Outdated
    def load_workbook(self, filepath_or_buffer: FilePath | ReadBuffer[bytes]):
        from python_calamine import load_workbook

        return load_workbook(filepath_or_buffer)  # type: ignore[arg-type]
Can you advise what the mypy errors are for this and the subsequent ones? Not necessarily a blocker but surprised to see these
pyright:
pandas/io/excel/_calamine.py:60:30 - error: Argument of type "FilePath | ReadBuffer[bytes]" cannot be assigned to parameter "path_or_filelike" of type "str | PathLike | ReadBuffer" in function "load_workbook"
  Type "FilePath | ReadBuffer[bytes]" cannot be assigned to type "str | PathLike | ReadBuffer"
    Type "ReadBuffer[bytes]" cannot be assigned to type "str | PathLike | ReadBuffer"
      "ReadBuffer[bytes]" is incompatible with "str"
      "ReadBuffer[bytes]" is incompatible with protocol "PathLike"
        "__fspath__" is not present
      "ReadBuffer[bytes]" is incompatible with protocol "ReadBuffer"
        "seek" is an incompatible type
          Type "(__offset: int, __whence: int = ..., /) -> int" cannot be assigned to type "() -> int" (reportGeneralTypeIssues)
mypy:
pandas/io/excel/_calamine.py:60: error: Argument 1 to "load_workbook" has incompatible type "Union[Union[str, PathLike[str]], pandas._typing.ReadBuffer[bytes]]"; expected "Union[str, PathLike[Any], python_calamine._python_calamine.ReadBuffer]" [arg-type]
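The last pyright line points at the root cause: the two `ReadBuffer` protocols disagree on the signature of `seek`. A small sketch of that mismatch, using made-up protocol names (`NoArgSeek`, `OffsetSeek`) that only mirror the two signatures shown in the error output and are not the real definitions from either library:

```python
import io
from typing import Protocol, runtime_checkable


@runtime_checkable
class NoArgSeek(Protocol):
    # Shape the checker reported as expected: seek() with no arguments.
    def seek(self) -> int: ...


@runtime_checkable
class OffsetSeek(Protocol):
    # Shape the checker reported for the pandas-side buffer type.
    def seek(self, offset: int, whence: int = ..., /) -> int: ...


buf = io.BytesIO(b"abc")

# runtime_checkable protocols only test that the attribute exists, so both
# checks pass at runtime. A static checker compares the signatures as well
# and rejects an OffsetSeek where a NoArgSeek is expected.
runtime_ok = isinstance(buf, NoArgSeek) and isinstance(buf, OffsetSeek)
position = buf.seek(0)
```

This is why the call can work in practice while still needing `# type: ignore[arg-type]` to satisfy the checkers.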
        if engine == "calamine" and read_ext in {".xls", ".xlsb", ".ods"}:
            request.node.add_marker(
                pytest.mark.xfail(
                    reason="Calamine support parsing datetime only in xlsx"
It looks like the PR linked above was merged - should ods be removed from this xfail?
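One way to make that change a one-line edit is to keep the affected extensions in a single set. A sketch under stated assumptions: the helper `needs_datetime_xfail` and both constants are hypothetical, and whether `.ods` can really be dropped depends on the upstream fix being released in the installed version:

```python
from __future__ import annotations


def needs_datetime_xfail(
    engine: str, read_ext: str, unsupported_exts: frozenset[str]
) -> bool:
    """Return True when a datetime test should be marked xfail for this combo."""
    return engine == "calamine" and read_ext in unsupported_exts


# Set used in the diff above.
BEFORE = frozenset({".xls", ".xlsb", ".ods"})
# Assumption: .ods datetime parsing now works upstream, so it is dropped.
AFTER = frozenset({".xls", ".xlsb"})

ods_before = needs_datetime_xfail("calamine", ".ods", BEFORE)
ods_after = needs_datetime_xfail("calamine", ".ods", AFTER)
xls_after = needs_datetime_xfail("calamine", ".xls", AFTER)
other_engine = needs_datetime_xfail("openpyxl", ".xls", AFTER)
```

In a test, the boolean would gate the existing `request.node.add_marker(pytest.mark.xfail(...))` call.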
@@ -3420,7 +3420,8 @@ Excel files
The :func:`~pandas.read_excel` method can read Excel 2007+ (``.xlsx``) files
using the ``openpyxl`` Python module. Excel 2003 (``.xls``) files
can be read using ``xlrd``. Binary Excel (``.xlsb``)
files can be read using ``pyxlsb``.
files can be read using ``pyxlsb``. Also, all this formats can be read using ``python-calamine``,
Is the datetime issue the only limitation? If so we can probably be more explicit and say something like python-calamine can be used to read all formats, but specifically does not support reading datetimes from .xls and .xlsb formats
Datetime is the main limitation, but there are a few more bugs, #50581 (comment). I suppressed them all with pytest.xfail, but should I write about them in documentation?
added xfail to tests, small fixes
Thanks for the PR, but it appears to have gone stale. Additionally, with PDEP-9, I think this could be better supported by a separate library without being natively included in pandas, so closing for now. #51799
Co-author: Kostya Farber (see pandas-dev#50581)
Adds an Excel reader based on the Rust library calamine as an engine, using the Python binding library python-calamine.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.