
DISCUSS: What would an ORC reader/writer API look like? #25229


Closed · kkraus14 opened this issue Feb 8, 2019 · 13 comments · Fixed by #29447
Labels: API Design, IO Data
Milestone: 1.0
Comments

@kkraus14 (Contributor) commented Feb 8, 2019

cc @mrocklin for dask.dataframe visibility

I'm one of the developers of https://github.com/rapidsai/cudf, and we're working on adding GPU-accelerated file readers/writers to our library. Most of the standard formats are covered quite nicely in the pandas API, but ORC isn't. Before we go off and define our own API, I wanted to open a discussion about what that API should look like, so we can stay consistent with the pandas and pandas-like community.

At the top level, I imagine it would look almost identical to the Parquet API, something like the following:

def read_orc(path, engine='auto', columns=None, **kwargs):
    """
    Load an ORC object from the file path, returning a DataFrame.

    Parameters
    ----------
    path : str
        File path.
    engine : {'auto', 'pyarrow'}, default 'auto'
        ORC library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    columns : list, default None
        If not None, only these columns will be read from the file.
    **kwargs
        Additional keyword arguments passed to the engine.

    Returns
    -------
    DataFrame
    """
    ...


def to_orc(self, fname, engine='auto', compression='snappy', index=None,
           partition_cols=None, **kwargs):
    """
    Write a DataFrame to the binary ORC format.

    This function writes the dataframe as an `ORC file
    <https://orc.apache.org/>`_. You can choose different ORC
    backends, and have the option of compression. See
    :ref:`the user guide <io.orc>` for more details.

    Parameters
    ----------
    fname : str
        File path or root directory path. Will be used as the root
        directory path while writing a partitioned dataset.
    engine : {'auto', 'pyarrow'}, default 'auto'
        ORC library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    compression : {'snappy', 'zlib', 'lzo', None}, default 'snappy'
        Name of the compression to use. Use ``None`` for no
        compression. (Note: ORC's codecs differ from Parquet's;
        ORC uses zlib/lzo rather than gzip/brotli.)
    index : bool, default None
        If ``True``, include the dataframe's index(es) in the file
        output. If ``False``, they will not be written to the file.
        If ``None``, the behavior depends on the chosen engine.
    partition_cols : list, optional, default None
        Column names by which to partition the dataset.
        Columns are partitioned in the order they are given.
    **kwargs
        Additional arguments passed to the ORC library. See
        :ref:`pandas io <io.orc>` for more details.
    """
    ...
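For concreteness, here is a hypothetical usage sketch of the proposed (not yet implemented) API; the function names, defaults, and file paths are all assumptions based on the docstrings above:

import pandas as pd

# Read only two columns from an ORC file (hypothetical pd.read_orc).
df = pd.read_orc('data.orc', columns=['a', 'b'])

# Write back out, partitioned by one column (hypothetical DataFrame.to_orc).
df.to_orc('out_dir', compression='snappy', partition_cols=['a'])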
@xhochy (Contributor) commented Feb 8, 2019

+1 for making it look like the Parquet API. Both formats are very similar and could be considered "competitors". They should also roughly match on the pyarrow side in the future (ORC is currently missing dataset support in the style of pyarrow.parquet.ParquetDataset, and we're missing a writer API, which makes testing hard).

We can skip the engine argument here, though, as there is only one implementation at the moment.

@jreback (Contributor) commented Feb 8, 2019

Is this "close enough" to Parquet in people's minds that we could just add a flavor='parquet'|'orc' argument to the Parquet reader/writer (in pandas) and then dispatch appropriately?
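A minimal sketch of what that flavor-based dispatch could look like (hypothetical; pandas' read_parquet has no flavor argument, and the ORC branch here falls back to pyarrow's existing reader):

import pandas as pd
import pyarrow.orc as orc

def read_table(path, flavor='parquet', **kwargs):
    # Hypothetical dispatcher: route to a format-specific reader
    # based on the requested flavor.
    if flavor == 'parquet':
        return pd.read_parquet(path, **kwargs)
    if flavor == 'orc':
        # Use pyarrow's ORC reader until pandas grows a read_orc.
        return orc.ORCFile(path).read(**kwargs).to_pandas()
    raise ValueError(f"unknown flavor: {flavor!r}")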

@mrocklin (Contributor) commented Feb 8, 2019

From a user perspective I think it might be better to have explicit read_parquet and read_orc functions. Though of course, on the implementation side, hopefully there is some reuse as Arrow's ORC reader becomes more consistent with its Parquet reader.

+1 to everything that @xhochy said

@kkraus14 (Contributor, Author) commented Feb 8, 2019

> From a user perspective I think it might be better to have explicit read_parquet and read_orc functions. Though of course, on the implementation side, hopefully there is some reuse as Arrow's ORC reader becomes more consistent with its Parquet reader.
>
> +1 to everything that @xhochy said

+1 to having separate functions for read_parquet and read_orc, and to everything that @xhochy suggested.

@gfyoung added the IO Data and API Design labels Feb 9, 2019
@kkraus14 mentioned this issue Nov 6, 2019
@jreback added this to the 1.0 milestone Nov 17, 2019
@jorisvandenbossche (Member) commented
To play the devil's advocate for a moment: do we think this is actually worth including in pandas as a top-level function?

I haven't seen much usage of ORC personally (though my view is limited, and it's of course a chicken-and-egg problem: having it in pandas would give it more exposure).

@mrocklin (Contributor) commented
In my experience ORC is less commonly used than Parquet, but it is still fairly common, at least among enterprise Hadoop shops. Broadly, everyone who bought a Hadoop/Spark cluster from Cloudera ended up using Parquet, while everyone who bought one from Hortonworks ended up using ORC (that's a generalization, though). I commonly find ORC in companies that historically used Hortonworks but are now increasing their use of Python.

I think ORC is less popular than Parquet, and so not a strong priority, but still common enough to be well in scope for a project like pandas.

@jorisvandenbossche (Member) commented
@mrocklin thanks for that context! Sounds good to me then

@voycey commented Dec 10, 2019

@mrocklin ORC has different use cases than Parquet, especially with its powerful predicate pushdown, block-level indexes, and bloom filters. Many people are using it with Presto due to the huge amount of work its developers invested in streamlining ORC. In our tests, ORC also massively outperformed Parquet for our use case (20%+ speed increases).

We are absolutely committed to ORC as a format, simply because of the amount of data we manage on a tiny budget, and because ORC has the features that let us do that within that budget.

With support in Spark and cuDF, and with BigQuery support recently added, I think this should be bumped up the roadmap!
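For context on the selective-read features mentioned above: pyarrow's existing ORC reader already exposes column pruning and stripe-level access (predicate pushdown itself is implemented by engines such as Presto, not by pyarrow). A rough sketch, assuming an iris.orc file like the one written later in this thread:

import pyarrow.orc as orc

f = orc.ORCFile('/tmp/iris.orc')  # assumed path

# Column pruning: read only the columns we need.
table = f.read(columns=['sepal_length', 'species'])

# Stripe-level access: ORC's block-level indexes let a reader skip
# whole stripes; here we read just the first stripe.
first_stripe = f.read_stripe(0)
print(f.nstripes, table.num_rows, first_stripe.num_rows)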

@KaiRoesner commented

...but what about the writer API (to_orc())?

@jreback (Contributor) commented Mar 4, 2020

There is no support for writing ORC in pyarrow.

@benjamincerigo commented
Looks like they have a ticket for it: https://issues.apache.org/jira/browse/ARROW-3014

@impredicative commented Jul 16, 2021

> There is no support for writing ORC in pyarrow.

@jreback I am now able to write ORC using pyarrow==4.0.1 + pandas. It would be great for pandas to implement pd.to_orc to make this more convenient. I currently write ORC as:

import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc
import smart_open

# Create dataframe
df = pd.read_csv('https://j.mp/iriscsv')
table = pa.Table.from_pandas(df, preserve_index=False)  # Ref: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas

# Write locally using pyarrow
orc.write_table(table, '/tmp/iris.orc')

# Write locally using smart_open+pyarrow
with smart_open.open('/tmp/iris2.orc', "wb") as output_file:
    orc.write_table(table, output_file)

# Write to cloud using smart_open+pyarrow
with smart_open.open('s3://my-bucket/iris.orc', "wb") as output_file:
    orc.write_table(table, output_file)
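And reading the file back with pyarrow to verify the round trip (same assumed /tmp path as above):

import pyarrow.orc as orc

# Read the ORC file back and convert to pandas.
df2 = orc.ORCFile('/tmp/iris.orc').read().to_pandas()
print(df2.head())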

As noted by @voycey, ORC is highly relevant with Presto, and potentially also with AWS Athena, which uses Presto.

@jreback (Contributor) commented Jul 16, 2021

The community can certainly put up a pull request.
