BUG: Fix pd.read_orc raising AttributeError #40970
Changes from 7 commits
67a6e5b
0623037
e61629a
728d26d
ef76984
8728848
d0c34db
2d665cd
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5443,6 +5443,11 @@ Similar to the :ref:`parquet <io.parquet>` format, the `ORC Format <https://orc. | |
for data frames. It is designed to make reading data frames efficient. pandas provides *only* a reader for the | ||
ORC format, :func:`~pandas.read_orc`. This requires the `pyarrow <https://arrow.apache.org/docs/python/>`__ library. | ||
|
||
Several caveats. | ||
Review comment: You can make this a note or warning.
||
|
||
* It is *highly recommended* to install pyarrow using conda due to some issues caused by pyarrow. | ||
* :func:`~pandas.read_orc` is not supported on Windows yet. | ||
Review comment: Can you link to the warning you added in install.rst?
||
|
||
.. _io.sql: | ||
|
||
SQL queries | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,10 +1,10 @@ | ||
""" orc compat """ | ||
from __future__ import annotations | ||
|
||
import distutils | ||
from typing import TYPE_CHECKING | ||
|
||
from pandas._typing import FilePathOrBuffer | ||
from pandas.compat._optional import import_optional_dependency | ||
|
||
from pandas.io.common import get_handle | ||
|
||
|
@@ -42,13 +42,15 @@ def read_orc( | |
Returns | ||
------- | ||
DataFrame | ||
|
||
Notes | ||
------- | ||
Before using this function you should read the :ref:`user guide about ORC <io.orc>`. | ||
Review comment: You can provide a link to install.rst here.
||
""" | ||
# we require a newer version of pyarrow than we support for parquet | ||
import pyarrow | ||
|
||
if distutils.version.LooseVersion(pyarrow.__version__) < "0.13.0": | ||
raise ImportError("pyarrow must be >= 0.13.0 for read_orc") | ||
orc = import_optional_dependency("pyarrow.orc") | ||
|
||
with get_handle(path, "rb", is_text=False) as handles: | ||
orc_file = pyarrow.orc.ORCFile(handles.handle) | ||
orc_file = orc.ORCFile(handles.handle) | ||
return orc_file.read(columns=columns, **kwargs).to_pandas() |
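The switch from `pyarrow.orc.ORCFile` to a module obtained through `import_optional_dependency("pyarrow.orc")` appears to rely on a general Python behaviour: importing a package does not necessarily load its submodules, so attribute access on the package can raise `AttributeError` until the submodule itself has been imported. The sketch below illustrates that behaviour with a throwaway package; `demo_pkg_orc` and `demo_submodule_access` are hypothetical names made up for this illustration and are not part of pyarrow or pandas.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def demo_submodule_access() -> Tuple[bool, bool]:
    """Report whether `package.orc` is reachable before and after importing it.

    Builds a temporary package named demo_pkg_orc (a hypothetical stand-in)
    whose __init__.py is empty, so the `orc` submodule is not loaded when the
    package itself is imported.
    """
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / "demo_pkg_orc"
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("")
        (pkg_dir / "orc.py").write_text("class ORCFile:\n    pass\n")

        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            package = importlib.import_module("demo_pkg_orc")
            # Only the package was imported, so the submodule is not bound yet.
            before = hasattr(package, "orc")
            # Importing the submodule binds it as an attribute of the package.
            importlib.import_module("demo_pkg_orc.orc")
            after = hasattr(package, "orc")
        finally:
            # Clean up so the demo leaves no trace in the interpreter state.
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg_orc", None)
            sys.modules.pop("demo_pkg_orc.orc", None)
    return before, after
```

Calling `demo_submodule_access()` returns `(False, True)`: the submodule attribute exists only after the explicit submodule import, which is why importing the `orc` module by its full dotted name is the safer pattern.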
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,7 +9,6 @@ | |
from pandas import read_orc | ||
import pandas._testing as tm | ||
|
||
pytest.importorskip("pyarrow", minversion="0.13.0") | ||
pytest.importorskip("pyarrow.orc") | ||
Review comment: I think the tests might fail without this, since orc doesn't import properly for certain OSes.
Review comment: Yes, it will raise some errors on Windows env. But if add this line to
Review comment: Maybe try
Review comment: Sorry, let me re-describe the reason for deleting this line. If the PyArrow package is installed from Conda, But, AttributeError will be raised if the user uses Maybe we should keep
Review comment: Well, I am hypothesizing that the bug will reproduce since
||
|
||
pytestmark = pytest.mark.filterwarnings( | ||
|
Review comment: For Linux on PyPI, can you just say pyarrow>=3.0 Successful
Review comment: Yes, I tested it on Linux (Ubuntu 18.04), macOS, and Windows 10. The above summary is based on the test results.
P.S. The latest version of PyArrow (3.0) installed from PyPI works well on Linux. But I'm not sure whether the above error will happen in future releases, because JIRA-7811 has not been fixed yet. So I use pyarrow==3.0 in the doc.
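The `pytest.importorskip("pyarrow", minversion="0.13.0")` guard in the test module and the version discussion above both amount to "use the dependency only if it imports and is new enough". As a rough, stdlib-only sketch of that idea (not how pandas or pytest implement it), the helper below returns `None` when a module is missing or too old. `parse_version` and `import_or_none` are hypothetical names introduced for this illustration, and the version parsing is deliberately simplistic: it only reads leading integer components. It also avoids `distutils`, which is deprecated since Python 3.10 and removed in Python 3.12.

```python
import importlib
import re
from types import ModuleType
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '3.0.0' or '0.13.0.dev1' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at the first non-numeric component such as 'dev1'.
            break
        parts.append(int(match.group()))
    return tuple(parts)


def import_or_none(name: str, min_version: Optional[str] = None) -> Optional[ModuleType]:
    """Import `name`, returning None if it is missing or older than min_version."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    if min_version is not None:
        # Assumption: the module exposes __version__; treat a missing one as "0".
        found = parse_version(getattr(module, "__version__", "0"))
        wanted = parse_version(min_version)
        # Pad with zeros so that "0.13" and "0.13.0" compare as equal.
        width = max(len(found), len(wanted))
        found += (0,) * (width - len(found))
        wanted += (0,) * (width - len(wanted))
        if found < wanted:
            return None
    return module
```

A caller could then write `orc = import_or_none("pyarrow.orc")` and fall back or raise a clear `ImportError` when the result is `None`. Platform-specific failures that raise something other than `ImportError` are not handled by this sketch.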