BUG: Fix pd.read_orc raising AttributeError #40970

Merged
15 changes: 15 additions & 0 deletions doc/source/getting_started/install.rst
@@ -362,6 +362,21 @@ pyarrow 0.15.0 Parquet, ORC, and feather reading /
pyreadstat SPSS files (.sav) reading
========================= ================== =============================================================

.. _install.warn_orc:

.. warning::

* If you want to use :func:`~pandas.read_orc`, it is highly recommended to install pyarrow using conda.
The following is a summary of the environment in which :func:`~pandas.read_orc` can work.

========================= ================== =============================================================
System                    Conda              PyPI
========================= ================== =============================================================
Linux                     Successful         Failed(pyarrow==3.0 Successful)
macOS                     Successful         Failed
Windows                   Failed             Failed
========================= ================== =============================================================

Access data in the cloud
^^^^^^^^^^^^^^^^^^^^^^^^

5 changes: 5 additions & 0 deletions doc/source/user_guide/io.rst
@@ -5443,6 +5443,11 @@ Similar to the :ref:`parquet <io.parquet>` format, the `ORC Format <https://orc.
for data frames. It is designed to make reading data frames efficient. pandas provides *only* a reader for the
ORC format, :func:`~pandas.read_orc`. This requires the `pyarrow <https://arrow.apache.org/docs/python/>`__ library.

.. warning::

* It is *highly recommended* to install pyarrow using conda due to some issues occurred by pyarrow.
* :func:`~pandas.read_orc` is not supported on Windows yet, you can find valid environments on :ref:`install optional dependencies <install.warn_orc>`.

.. _io.sql:

SQL queries
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -783,6 +783,7 @@ I/O
- Bug in :func:`read_sas` raising ``ValueError`` when ``datetimes`` were null (:issue:`39725`)
- Bug in :func:`read_excel` dropping empty values from single-column spreadsheets (:issue:`39808`)
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)

Period
^^^^^^
13 changes: 8 additions & 5 deletions pandas/io/orc.py
@@ -1,10 +1,10 @@
 """ orc compat """
 from __future__ import annotations

-import distutils
 from typing import TYPE_CHECKING

 from pandas._typing import FilePathOrBuffer
+from pandas.compat._optional import import_optional_dependency

 from pandas.io.common import get_handle

@@ -42,13 +42,16 @@ def read_orc(
     Returns
     -------
     DataFrame
+
+    Notes
+    -------
+    Before using this function you should read the :ref:`user guide about ORC <io.orc>`
+    and :ref:`install optional dependencies <install.warn_orc>`.
     """
     # we require a newer version of pyarrow than we support for parquet
-    import pyarrow

-    if distutils.version.LooseVersion(pyarrow.__version__) < "0.13.0":
-        raise ImportError("pyarrow must be >= 0.13.0 for read_orc")
+    orc = import_optional_dependency("pyarrow.orc")

     with get_handle(path, "rb", is_text=False) as handles:
-        orc_file = pyarrow.orc.ORCFile(handles.handle)
+        orc_file = orc.ORCFile(handles.handle)
         return orc_file.read(columns=columns, **kwargs).to_pandas()
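The change to pandas/io/orc.py replaces attribute access on the top-level pyarrow package with an explicit import of the pyarrow.orc submodule. A minimal sketch of that idea follows, using only the standard library. `import_optional` is a hypothetical helper written for illustration; it is not pandas's `import_optional_dependency`, whose actual behaviour (version checks, error text) is not shown in this diff. `json.decoder` merely stands in for `pyarrow.orc`.

```python
import importlib
from types import ModuleType


def import_optional(name: str) -> ModuleType:
    """Import ``name`` (dotted submodule paths allowed) or raise a clear ImportError.

    Hypothetical helper for illustration only.
    """
    try:
        # import_module imports the submodule itself, not just its parent
        # package, and returns the submodule object.
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(f"Missing optional dependency '{name}'.") from err


# Standard-library stand-in for "pyarrow.orc": a submodule of a package.
decoder = import_optional("json.decoder")
print(decoder.JSONDecoder.__name__)
```

Because the helper returns the submodule object directly, the caller never relies on the submodule having been bound as an attribute of its parent package by some earlier import.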
1 change: 0 additions & 1 deletion pandas/tests/io/test_orc.py
@@ -9,7 +9,6 @@
from pandas import read_orc
import pandas._testing as tm

pytest.importorskip("pyarrow", minversion="0.13.0")
-pytest.importorskip("pyarrow.orc")
Member

I think the tests might fail without this, since orc doesn't import properly for certain OSes.

Contributor Author

Yes, it will raise errors in a Windows environment.

But if we add this line to test_orc.py, the test case cannot find the bug mentioned in #40918; it just skips this test case.

Member

Maybe try pyarrow._orc? (That's what pyarrow.orc tries to find for me on Windows.)

Contributor Author

Sorry, let me re-describe the reason for deleting this line.

If the PyArrow package is installed from Conda, pytest.importorskip("pyarrow.orc") successfully imports the pyarrow.orc module, so line 143 of test_orc.py, got = read_orc(inputfile).iloc[:10], works fine.

But AttributeError is raised if the user calls pd.read_orc directly without importing pyarrow.orc first. The test case fails to find this bug.


Maybe we should keep pytest.importorskip("pyarrow.orc") and delete pytest.importorskip("pyarrow.orc"). Then fix this bug in pandas/io/orc.py, and make sure pyarrow is imported only once (see the discussion below).
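
The mechanism described in this comment can be reproduced without pyarrow. The sketch below builds a throwaway package on disk; `demo_pkg` and `sub` are made-up names standing in for pyarrow and pyarrow.orc, and the empty `__init__.py` is an assumption that mirrors a parent package which does not import its submodule on its own.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package: demo_pkg/__init__.py (empty) and demo_pkg/sub.py.
tmp = tempfile.mkdtemp()
pkg_dir = Path(tmp) / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "sub.py").write_text("VALUE = 1\n")
sys.path.insert(0, tmp)
importlib.invalidate_caches()

import demo_pkg

# Importing the package alone does not load the submodule, so attribute
# access on the package fails.
try:
    demo_pkg.sub
    before = "attribute present"
except AttributeError:
    before = "AttributeError"

# Once anything imports the submodule, it is bound as an attribute of the
# package for the rest of the process.
importlib.import_module("demo_pkg.sub")
after = demo_pkg.sub.VALUE

print(before, after)
```

This is why an import of the submodule at the top of a test module can hide the failure: by the time the function under test runs, the attribute already exists.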

Member

Well, I am hypothesizing that the bug will still reproduce, since pytest.importorskip("pyarrow._orc") does not import pyarrow.orc. Running pytest.importorskip("pyarrow._orc") would also skip the test on Windows, where pyarrow ORC support is not present, because I think pyarrow._orc is where the ORC functionality is actually implemented. I cannot verify that this works, though.
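
A related way to probe for a module without importing it is `importlib.util.find_spec`, which locates a module but does not execute it. The sketch below is an alternative to the `importorskip` approach suggested in this comment, not something the PR uses. `probe_pkg`, `_impl`, and `is_available` are made-up names, and whether pyarrow's ORC extension can be detected this way on every platform is an assumption that has not been verified here.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

# Throwaway package standing in for pyarrow; all names are hypothetical.
tmp = tempfile.mkdtemp()
pkg_dir = Path(tmp) / "probe_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "_impl.py").write_text("READY = True\n")
sys.path.insert(0, tmp)
importlib.invalidate_caches()


def is_available(name: str) -> bool:
    """Return True if ``name`` can be located, without importing ``name`` itself.

    For a dotted name, find_spec imports the parent package but not the submodule.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when the parent package itself is missing.
        return False


present = is_available("probe_pkg._impl")
missing = is_available("probe_pkg._absent")
loaded = "probe_pkg._impl" in sys.modules

print(present, missing, loaded)
```

In a pytest module, a check like this could feed `pytest.mark.skipif(not is_available(...), reason=...)`, so the test is skipped where the module is absent while the submodule stays unimported until the code under test imports it.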


pytestmark = pytest.mark.filterwarnings(