BUG: Fix pd.read_orc raising AttributeError #40970

amznero · 2021-04-16T06:25:47Z

closes BUG: Load ORC-format data failed when pandas version>1.2.0.dev0 #40918
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

I am not sure if this is a reasonable solution. Cause If users install PyArrow from PyPI on MacOS or Win10(works fine on Linux), it will always raise an AttributeError exception.

Should we add some operating system requirments to pd.read_orc? (Related to PyArrow's JIRA #7811).

lithomas1 · 2021-04-17T01:26:46Z

pandas/tests/io/test_orc.py

@@ -10,7 +10,6 @@
 import pandas._testing as tm

 pytest.importorskip("pyarrow", minversion="0.13.0")
-pytest.importorskip("pyarrow.orc")


I think the tests might fail without this, since orc doesn't import properly for certain OSes.

Yes, it will raise some errors on Windows env.

But if add this line to test_orc.py, the test case can not find the bug mentioned in #40918, it just skips this test case.

Maybe try pyarrow._orc? (That's what it tries to pyarrow.orc tries to find for me on windows)

Sorry, let me re-describe the reason for deleting this line.

If the PyArrow package is installed from Conda, pytest.importorskip("pyarrow.orc") will successfully import pyarrow.orc module, so test_orc.py-Line143 got = read_orc(inputfile).iloc[:10] will works fine.

But, AttributeError will be raised if the user uses pd.read_orc directly without importing pyarrow.orc first. The test case failed to find this bug.

Maybe we should keep pytest.importorskip("pyarrow.orc") and delete pytest.importorskip("pyarrow.orc"). Then fix this bug in pandas/io/orc.py, and make pyarrow be imported only once(discuss below).

Well, I am hypothesizing that the bug will reproduce since pyarrow.orc is not imported by pytest.importorskip(pyarrow._orc), and running pytest.importorskip(pyarrow._orc) will skip the test on Windows where pyarrow orc support is not present, as I think it is pyarrow._orc where the orc stuff is actually implemented. I cannot verify this works, though.

lithomas1 · 2021-04-17T01:34:02Z

pandas/io/orc.py

@@ -49,6 +49,8 @@ def read_orc(
    if distutils.version.LooseVersion(pyarrow.__version__) < "0.13.0":
        raise ImportError("pyarrow must be >= 0.13.0 for read_orc")

+    import pyarrow.orc


pyarrrow should only be imported once, so should remove the import pyarrow statement a couple lines up.

Line-49 will check the PyArrow's version, it depends on line-47(import PyArrow).

Looking again, it's better to use

pandas/pandas/compat/_optional.py

Line 63 in 04f9a4b

def import_optional_dependency(

so would be something like

from pandas.compat._optional import import_optional_dependency orc = import_optional_dependency("pyarrow.orc") ... orc_file = orc.ORCFile(handles.handle)

and then we could get rid of the other code.

I already tested it and it looks good. But it will remind users to install pyarrow.orc instead of pyarrow since import_optional_dependency cannot receive customized error messages. Is that OK?

Log:
ImportError: Missing optional dependency 'pyarrow.orc'. Use pip or conda to install pyarrow.orc.

lithomas1 · 2021-04-17T01:35:16Z

Thanks @amznero for the PR.

I think it might be worth deprecating pd.read_orc. Looking at the code, read_orc is a very simple wrapper around pyarrow, so it might be better for the user to call the pyarrow code themselves.

Looks good generally(few comments), but needs whatsnew(https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v1.3.0.rst and also maybe operating system requirements in docs).

jreback

looks good a couple of doc comments, are we actually testing this on the appropriate platforms? (e.g. are we skpping on windows) and the pyarrow >=3.0 build succeeds on linux while the others are skipped?

ping on green.

jreback · 2021-04-20T12:48:17Z

doc/source/getting_started/install.rst

+      ========================= ================== =============================================================
+      System                    Conda              PyPI
+      ========================= ================== =============================================================
+      Linux                     Successful         Failed(pyarrow==3.0 excepted)


for Linux on PyPI can you just say pyarrow>=3.0 Successful

Yes, I tested it on Linux(ubuntu18.04)/macOS/Windows10.
The above summary is based on the test results.

ps. The latest version of PyArrow(3.0) installed from PyPI works well on Linux. But I'm not sure if the above error will happen in future releases because JIRA-7811 has not been fixed yet. So I use pyarrow==3.0 in doc.

jreback · 2021-04-20T12:48:39Z

doc/source/user_guide/io.rst

+Several caveats.
+
+* It is *highly recommended* to install pyarrow using conda due to some issues occurred by pyarrow.
+* :func:`~pandas.read_orc` is not supported on Windows yet.


can you link to the warning you added in install.rst

jreback · 2021-04-20T12:48:49Z

doc/source/user_guide/io.rst

@@ -5443,6 +5443,11 @@ Similar to the :ref:`parquet <io.parquet>` format, the `ORC Format <https://orc.
 for data frames. It is designed to make reading data frames efficient. pandas provides *only* a reader for the
 ORC format, :func:`~pandas.read_orc`. This requires the `pyarrow <https://arrow.apache.org/docs/python/>`__ library.

+Several caveats.


you can make this a note or warning

jreback · 2021-04-20T12:49:03Z

pandas/io/orc.py

+
+    Notes
+    -------
+    Before using this function you should read the :ref:`user guide about ORC <io.orc>`.


you can provide a link to install.rst here

jreback · 2021-04-21T12:41:06Z

thanks @amznero

BUG: Fix pd.read_orc raising AttributeError

67a6e5b

lithomas1 reviewed Apr 17, 2021

View reviewed changes

lithomas1 added the IO Data IO issues that don't fit into a more specific label label Apr 17, 2021

amznero and others added 6 commits April 19, 2021 19:19

PR comments

0623037

pre-commit

e61629a

add whatsnew

728d26d

Merge branch 'master' into gh40918-fix-import-pyarrow-orc-error

ef76984

add doc

8728848

fix ref

d0c34db

jreback requested changes Apr 20, 2021

View reviewed changes

jreback added IO Parquet parquet, feather and removed IO Data IO issues that don't fit into a more specific label labels Apr 20, 2021

jreback added this to the 1.3 milestone Apr 20, 2021

PR comments

2d665cd

amznero requested a review from jreback April 21, 2021 11:52

jreback approved these changes Apr 21, 2021

View reviewed changes

jreback merged commit 6f3488f into pandas-dev:master Apr 21, 2021

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021

BUG: Fix pd.read_orc raising AttributeError (pandas-dev#40970)

8d1aa2f

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

BUG: Fix pd.read_orc raising AttributeError (pandas-dev#40970)

e63b5d8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Fix pd.read_orc raising AttributeError #40970

BUG: Fix pd.read_orc raising AttributeError #40970

amznero commented Apr 16, 2021 •

edited

Loading

lithomas1 Apr 17, 2021

amznero Apr 17, 2021

lithomas1 Apr 18, 2021

amznero Apr 19, 2021

lithomas1 Apr 20, 2021

lithomas1 Apr 17, 2021

amznero Apr 17, 2021 •

edited

Loading

lithomas1 Apr 17, 2021 •

edited

Loading

amznero Apr 19, 2021

lithomas1 commented Apr 17, 2021 •

edited

Loading

jreback left a comment

jreback Apr 20, 2021

amznero Apr 21, 2021 •

edited

Loading

jreback Apr 20, 2021

jreback Apr 20, 2021

jreback Apr 20, 2021

jreback commented Apr 21, 2021

BUG: Fix pd.read_orc raising AttributeError #40970

BUG: Fix pd.read_orc raising AttributeError #40970

Conversation

amznero commented Apr 16, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

amznero Apr 17, 2021 • edited Loading

Choose a reason for hiding this comment

lithomas1 Apr 17, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lithomas1 commented Apr 17, 2021 • edited Loading

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

amznero Apr 21, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Apr 21, 2021

amznero commented Apr 16, 2021 •

edited

Loading

amznero Apr 17, 2021 •

edited

Loading

lithomas1 Apr 17, 2021 •

edited

Loading

lithomas1 commented Apr 17, 2021 •

edited

Loading

amznero Apr 21, 2021 •

edited

Loading