BUG: Load ORC-format data failed when pandas version>1.2.0.dev0 #40918


Closed
3 tasks done
amznero opened this issue Apr 13, 2021 · 2 comments · Fixed by #40970
Labels
Bug · Dependencies (Required and optional dependencies) · IO Data (IO issues that don't fit into a more specific label)
Milestone

Comments

@amznero
Contributor

amznero commented Apr 13, 2021


Code Sample, a copy-pastable example

...
import pandas as pd
orc_data = pd.read_orc(orc_file_path)

Problem description

Pandas uses the PyArrow package to load ORC/Parquet data.

For the ORC format, pandas uses pyarrow.orc.ORCFile to read the data (orc.py). However, PyArrow does not import the orc submodule in its __init__.py file, so pandas raises AttributeError: module 'pyarrow' has no attribute 'orc'
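The mechanism can be reproduced without pyarrow. The sketch below is only an illustration: it builds a throwaway package (the names `demo_pkg` and `sub` are hypothetical, not part of pandas or pyarrow) whose `__init__.py` does not import its submodule, and shows that the attribute exists only after an explicit submodule import.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a temporary package "demo_pkg" with a submodule "sub".
# Both names are hypothetical stand-ins for pyarrow and pyarrow.orc.
tmp = tempfile.mkdtemp()
pkg_dir = Path(tmp) / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")  # __init__.py does NOT import sub
(pkg_dir / "sub.py").write_text("VALUE = 42\n")
sys.path.insert(0, tmp)
importlib.invalidate_caches()

import demo_pkg

# Importing only the parent package does not load the submodule.
try:
    demo_pkg.sub
    attribute_before = True
except AttributeError:
    attribute_before = False

# An explicit submodule import binds it as an attribute on the parent.
import demo_pkg.sub

attribute_after = hasattr(demo_pkg, "sub")
print(attribute_before, attribute_after)  # False True
```

By the same mechanism, running `import pyarrow.orc` in user code before calling `pd.read_orc` should avoid the AttributeError on affected pandas versions, provided `pyarrow.orc` itself can be imported on that platform.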


This bug occurs in pandas versions later than v1.2.0.dev0 (after commit 6d1541e). Before that commit, pandas/io/orc.py ran import pyarrow.orc before using pyarrow to load ORC data (see pandas/io/orc.py in v1.1.5).


Testing environment:

  • Ubuntu 18.04
  • python 3.7
  • pandas v1.2.1
  • pyarrow v3.0.0 (installed via pip; I haven't tested a Conda install of pyarrow yet)

amznero added the Bug and Needs Triage labels on Apr 13, 2021
@lithomas1
Member

lithomas1 commented Apr 13, 2021

I think pyarrow.orc doesn't work in general for pyarrow > 0.15.0. With a pip install, I get

>>> import pyarrow.orc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\liwende\Anaconda3\lib\site-packages\pyarrow\orc.py", line 24, in <module>
    import pyarrow._orc as _orc
ModuleNotFoundError: No module named 'pyarrow._orc'

and with conda I get

>>> import pyarrow.orc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\liwende\Anaconda3\lib\site-packages\pyarrow\__init__.py", line 54, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ImportError: DLL load failed: The specified procedure could not be found.

(both with pyarrow 3.0.0)
Maybe related? https://issues.apache.org/jira/browse/ARROW-7811

Other than that, agreed that it's a bug.

lithomas1 added the Dependencies and IO Data labels and removed the Needs Triage label on Apr 13, 2021
@amznero
Contributor Author

amznero commented Apr 14, 2021

It seems the pyarrow.orc import problem depends on the operating system.

I can import it successfully on Linux (Ubuntu), but on Windows 10 I hit the same problem you mentioned.



Maybe related to https://stackoverflow.com/a/58967391
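Since the import seems to succeed or fail depending on the platform and install method, a small probe can report what happens on a given machine. This is only a sketch: `orc_support_status` is a hypothetical helper name, not a pandas or pyarrow function, and it assumes the failure surfaces as an ImportError (which covers both the ModuleNotFoundError and the "DLL load failed" ImportError quoted above).

```python
import importlib


def orc_support_status():
    """Return (ok, detail) describing whether pyarrow.orc can be imported.

    Hypothetical helper for diagnosing an environment; it only attempts
    the import and never reads any ORC data.
    """
    try:
        importlib.import_module("pyarrow.orc")
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # pyarrow or a missing pyarrow._orc extension both land here.
        return False, f"{type(exc).__name__}: {exc}"
    return True, "pyarrow.orc imported"


ok, detail = orc_support_status()
print(ok, detail)
```

If the probe returns False, calling `pd.read_orc` on that machine is unlikely to work regardless of the pandas version, because the underlying pyarrow module is unavailable.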

lithomas1 added this to the Contributions Welcome milestone on Apr 14, 2021
jreback changed the milestone from Contributions Welcome to 1.3 on Apr 20, 2021