Skip to content

BUG: Fix pd.read_orc raising AttributeError #40970

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

amznero
Copy link
Contributor

@amznero amznero commented Apr 16, 2021


I am not sure if this is a reasonable solution. Cause If users install PyArrow from PyPI on MacOS or Win10(works fine on Linux), it will always raise an AttributeError exception.

Should we add some operating system requirments to pd.read_orc? (Related to PyArrow's JIRA #7811).

@@ -10,7 +10,6 @@
import pandas._testing as tm

pytest.importorskip("pyarrow", minversion="0.13.0")
pytest.importorskip("pyarrow.orc")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tests might fail without this, since orc doesn't import properly for certain OSes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will raise some errors on Windows env.

But if add this line to test_orc.py, the test case can not find the bug mentioned in #40918, it just skips this test case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe try pyarrow._orc? (That's what it tries to pyarrow.orc tries to find for me on windows)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, let me re-describe the reason for deleting this line.

If the PyArrow package is installed from Conda, pytest.importorskip("pyarrow.orc") will successfully import pyarrow.orc module, so test_orc.py-Line143 got = read_orc(inputfile).iloc[:10] will works fine.

But, AttributeError will be raised if the user uses pd.read_orc directly without importing pyarrow.orc first. The test case failed to find this bug.


Maybe we should keep pytest.importorskip("pyarrow.orc") and delete pytest.importorskip("pyarrow.orc"). Then fix this bug in pandas/io/orc.py, and make pyarrow be imported only once(discuss below).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I am hypothesizing that the bug will reproduce since pyarrow.orc is not imported by pytest.importorskip(pyarrow._orc), and running pytest.importorskip(pyarrow._orc) will skip the test on Windows where pyarrow orc support is not present, as I think it is pyarrow._orc where the orc stuff is actually implemented. I cannot verify this works, though.

pandas/io/orc.py Outdated
@@ -49,6 +49,8 @@ def read_orc(
if distutils.version.LooseVersion(pyarrow.__version__) < "0.13.0":
raise ImportError("pyarrow must be >= 0.13.0 for read_orc")

import pyarrow.orc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyarrrow should only be imported once, so should remove the import pyarrow statement a couple lines up.

Copy link
Contributor Author

@amznero amznero Apr 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line-49 will check the PyArrow's version, it depends on line-47(import PyArrow).

Copy link
Member

@lithomas1 lithomas1 Apr 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking again, it's better to use

def import_optional_dependency(

so would be something like

from pandas.compat._optional import import_optional_dependency
orc = import_optional_dependency("pyarrow.orc")
...
orc_file = orc.ORCFile(handles.handle)

and then we could get rid of the other code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already tested it and it looks good. But it will remind users to install pyarrow.orc instead of pyarrow since import_optional_dependency cannot receive customized error messages. Is that OK?

Log:
ImportError: Missing optional dependency 'pyarrow.orc'. Use pip or conda to install pyarrow.orc.

@lithomas1
Copy link
Member

lithomas1 commented Apr 17, 2021

Thanks @amznero for the PR.

I think it might be worth deprecating pd.read_orc. Looking at the code, read_orc is a very simple wrapper around pyarrow, so it might be better for the user to call the pyarrow code themselves.

Looks good generally(few comments), but needs whatsnew(https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v1.3.0.rst and also maybe operating system requirements in docs).

@lithomas1 lithomas1 added the IO Data IO issues that don't fit into a more specific label label Apr 17, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good a couple of doc comments, are we actually testing this on the appropriate platforms? (e.g. are we skpping on windows) and the pyarrow >=3.0 build succeeds on linux while the others are skipped?

ping on green.

========================= ================== =============================================================
System Conda PyPI
========================= ================== =============================================================
Linux Successful Failed(pyarrow==3.0 excepted)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for Linux on PyPI can you just say pyarrow>=3.0 Successful

Copy link
Contributor Author

@amznero amznero Apr 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I tested it on Linux(ubuntu18.04)/macOS/Windows10.
The above summary is based on the test results.

ps. The latest version of PyArrow(3.0) installed from PyPI works well on Linux. But I'm not sure if the above error will happen in future releases because JIRA-7811 has not been fixed yet. So I use pyarrow==3.0 in doc.

Several caveats.

* It is *highly recommended* to install pyarrow using conda due to some issues occurred by pyarrow.
* :func:`~pandas.read_orc` is not supported on Windows yet.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you link to the warning you added in install.rst

@@ -5443,6 +5443,11 @@ Similar to the :ref:`parquet <io.parquet>` format, the `ORC Format <https://orc.
for data frames. It is designed to make reading data frames efficient. pandas provides *only* a reader for the
ORC format, :func:`~pandas.read_orc`. This requires the `pyarrow <https://arrow.apache.org/docs/python/>`__ library.

Several caveats.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can make this a note or warning

pandas/io/orc.py Outdated

Notes
-------
Before using this function you should read the :ref:`user guide about ORC <io.orc>`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can provide a link to install.rst here

@jreback jreback added IO Parquet parquet, feather and removed IO Data IO issues that don't fit into a more specific label labels Apr 20, 2021
@jreback jreback added this to the 1.3 milestone Apr 20, 2021
@amznero amznero requested a review from jreback April 21, 2021 11:52
@jreback jreback merged commit 6f3488f into pandas-dev:master Apr 21, 2021
@jreback
Copy link
Contributor

jreback commented Apr 21, 2021

thanks @amznero

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Parquet parquet, feather
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Load ORC-format data failed when pandas version>1.2.0.dev0
3 participants