-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add ORC reader #29447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add ORC reader #29447
Conversation
Note: could also likely use some work to share the functionality between parquet and ORC readers where possible. There's a lot of copy paste currently in this PR. |
I unfortunately am not going to have the bandwidth to drive this to completion likely for a while, so if someone would like to take this over it would be welcome. |
ok let's see what we can do here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
pandas/io/orc.py
Outdated
DataFrame | ||
""" | ||
|
||
impl = get_engine(engine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curiosity what other readers do you see implementing here? I assume a lot of this extra code is for config management to return the PyArrowImpl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing, can prob simply this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if you still want to do this but commenting for context of next few comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would personally simplify this a lot by removing the class.
If at some point we actually want to have different engines, we can always use the approach of the parquet code then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor things not blockers for me. Looks good overall
pandas/io/orc.py
Outdated
DataFrame | ||
""" | ||
|
||
impl = get_engine(engine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if you still want to do this but commenting for context of next few comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments.
For test testing files, did you write them yourself, or do they come from somewhere? I was wondering if you could make the > 50KB ones a bit smaller.
doc/source/user_guide/io.rst
Outdated
for data frames. It is designed to make reading data frames efficient. Pandas provides *only* a reader for the | ||
ORC format, :func:`~pandas.read_orc`. | ||
|
||
See the documentation for `pyarrow <https://arrow.apache.org/docs/python/>`__ for more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is actually no documentation about ORC there, so that link is not very helpful ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you mention this reader needs pyarrow to be installed instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see 2 lines above about the ORC format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs are a bit sparse on the pyarrow side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was just looking through this PR to see who to ask about pyarrow.orc docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iirc @jorisvandenbossche did a PR in arrow side for this
pandas/io/orc.py
Outdated
DataFrame | ||
""" | ||
|
||
impl = get_engine(engine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would personally simplify this a lot by removing the class.
If at some point we actually want to have different engines, we can always use the approach of the parquet code then.
IIRC @kkraus14 mentioned these were externally generated; these are pretty small so I don't think this is a big deal (and they are skipped when we generate the stripped builds anyhow (e.g. remove tests/data) |
updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC @kkraus14 mentioned these were externally generated; these are pretty small so I don't think this is a big deal (and they are skipped when we generate the stripped builds anyhow (e.g. remove tests/data)
OK!
This reverts commit 19e4d47.
this is good to go @jorisvandenbossche @WillAyd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
(just 2 minor doc comments that can be applied online)
pandas/io/orc.py
Outdated
By file-like object, we refer to objects with a ``read()`` method, | ||
such as a file handler (e.g. via builtin ``open`` function) | ||
or ``StringIO``. | ||
columns : list, default=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
columns : list, default=None | |
columns : list, default None |
pandas/io/orc.py
Outdated
DataFrame | ||
""" | ||
|
||
# we require a newer version of pyarrow thaN we support for parquet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# we require a newer version of pyarrow thaN we support for parquet | |
# we require a newer version of pyarrow than we support for parquet |
@jorisvandenbossche wow I can't commit these on-line (likely because this is owned by @kkraus14 ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. very nice implementation
actually this was on my office computer; we block push so likely that’s the issue; thanks for merge |
Apologies that I didn't have the bandwidth to drive this to completion. Thanks @jreback and @jorisvandenbossche for driving this to completion. |
thanks for putting this up @kkraus14 ! |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Added an ORC reader following the
read_parquet
API. Still need to give some additional love to the docstrings but this is at least ready for some discussion and eyes on it.