-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix pd.read_orc raising AttributeError #40970
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix pd.read_orc raising AttributeError #40970
Conversation
@@ -10,7 +10,6 @@ | |||
import pandas._testing as tm | |||
|
|||
pytest.importorskip("pyarrow", minversion="0.13.0") | |||
pytest.importorskip("pyarrow.orc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the tests might fail without this, since orc doesn't import properly for certain OSes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will raise some errors on Windows env.
But if add this line to test_orc.py
, the test case can not find the bug mentioned in #40918, it just skips this test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe try pyarrow._orc
? (That's what it tries to pyarrow.orc
tries to find for me on windows)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, let me re-describe the reason for deleting this line.
If the PyArrow package is installed from Conda, pytest.importorskip("pyarrow.orc")
will successfully import pyarrow.orc module, so test_orc.py-Line143 got = read_orc(inputfile).iloc[:10]
will works fine.
But, AttributeError will be raised if the user uses pd.read_orc
directly without importing pyarrow.orc
first. The test case failed to find this bug.
Maybe we should keep pytest.importorskip("pyarrow.orc")
and delete pytest.importorskip("pyarrow.orc")
. Then fix this bug in pandas/io/orc.py, and make pyarrow
be imported only once(discuss below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I am hypothesizing that the bug will reproduce since pyarrow.orc
is not imported by pytest.importorskip(pyarrow._orc)
, and running pytest.importorskip(pyarrow._orc)
will skip the test on Windows where pyarrow orc support is not present, as I think it is pyarrow._orc
where the orc stuff is actually implemented. I cannot verify this works, though.
pandas/io/orc.py
Outdated
@@ -49,6 +49,8 @@ def read_orc( | |||
if distutils.version.LooseVersion(pyarrow.__version__) < "0.13.0": | |||
raise ImportError("pyarrow must be >= 0.13.0 for read_orc") | |||
|
|||
import pyarrow.orc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pyarrrow should only be imported once, so should remove the import pyarrow
statement a couple lines up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line-49 will check the PyArrow's version, it depends on line-47(import PyArrow
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking again, it's better to use
pandas/pandas/compat/_optional.py
Line 63 in 04f9a4b
def import_optional_dependency( |
so would be something like
from pandas.compat._optional import import_optional_dependency
orc = import_optional_dependency("pyarrow.orc")
...
orc_file = orc.ORCFile(handles.handle)
and then we could get rid of the other code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already tested it and it looks good. But it will remind users to install pyarrow.orc
instead of pyarrow
since import_optional_dependency cannot receive customized error messages. Is that OK?
Log:
ImportError: Missing optional dependency 'pyarrow.orc'. Use pip or conda to install pyarrow.orc.
Thanks @amznero for the PR. I think it might be worth deprecating pd.read_orc. Looking at the code, read_orc is a very simple wrapper around pyarrow, so it might be better for the user to call the pyarrow code themselves. Looks good generally(few comments), but needs whatsnew(https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v1.3.0.rst and also maybe operating system requirements in docs). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good a couple of doc comments, are we actually testing this on the appropriate platforms? (e.g. are we skpping on windows) and the pyarrow >=3.0 build succeeds on linux while the others are skipped?
ping on green.
========================= ================== ============================================================= | ||
System Conda PyPI | ||
========================= ================== ============================================================= | ||
Linux Successful Failed(pyarrow==3.0 excepted) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for Linux on PyPI can you just say pyarrow>=3.0 Successful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I tested it on Linux(ubuntu18.04)/macOS/Windows10.
The above summary is based on the test results.
ps. The latest version of PyArrow(3.0) installed from PyPI works well on Linux. But I'm not sure if the above error will happen in future releases because JIRA-7811 has not been fixed yet. So I use pyarrow==3.0
in doc.
doc/source/user_guide/io.rst
Outdated
Several caveats. | ||
|
||
* It is *highly recommended* to install pyarrow using conda due to some issues occurred by pyarrow. | ||
* :func:`~pandas.read_orc` is not supported on Windows yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you link to the warning you added in install.rst
doc/source/user_guide/io.rst
Outdated
@@ -5443,6 +5443,11 @@ Similar to the :ref:`parquet <io.parquet>` format, the `ORC Format <https://orc. | |||
for data frames. It is designed to make reading data frames efficient. pandas provides *only* a reader for the | |||
ORC format, :func:`~pandas.read_orc`. This requires the `pyarrow <https://arrow.apache.org/docs/python/>`__ library. | |||
|
|||
Several caveats. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can make this a note or warning
pandas/io/orc.py
Outdated
|
||
Notes | ||
------- | ||
Before using this function you should read the :ref:`user guide about ORC <io.orc>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can provide a link to install.rst here
thanks @amznero |
I am not sure if this is a reasonable solution. Cause If users install PyArrow from PyPI on MacOS or Win10(works fine on Linux), it will always raise an
AttributeError
exception.Should we add some operating system requirments to
pd.read_orc
? (Related to PyArrow's JIRA #7811).