
DISCUSS: What would an ORC reader/writer API look like? #25229


Closed · kkraus14 opened this issue Feb 8, 2019 · 13 comments · Fixed by #29447
Labels: API Design, IO Data
Milestone: 1.0
Comments

@kkraus14 (Contributor) commented Feb 8, 2019

cc @mrocklin for dask.dataframe visibility

I'm one of the developers of https://github.com/rapidsai/cudf, and we're working on adding GPU-accelerated file readers/writers to our library. Most of the standard formats are covered quite nicely in the pandas API, but ORC isn't. Before we go off and define our own API, I wanted to open a discussion about what that API should look like, so we can stay consistent with the pandas and pandas-like community.

At the top level, I imagine it would look almost identical to the Parquet API, something like the following:

def read_orc(path, engine='auto', columns=None, **kwargs):
    """
    Load an ORC object from the file path, returning a DataFrame.

    Parameters
    ----------
    path : str
        File path.
    engine : {'auto', 'pyarrow'}, default 'auto'
        ORC library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    columns : list, default None
        If not None, only these columns will be read from the file.
    **kwargs
        Additional keyword arguments passed to the engine.

    Returns
    -------
    DataFrame
    """
    ...


def to_orc(self, fname, engine='auto', compression='snappy', index=None,
           partition_cols=None, **kwargs):
    """
    Write a DataFrame to the binary ORC format.

    This function writes the dataframe as an `ORC file
    <https://orc.apache.org/>`_. You can choose different ORC
    backends, and have the option of compression. See
    :ref:`the user guide <io.orc>` for more details.

    Parameters
    ----------
    fname : str
        File path or root directory path. Will be used as the root
        directory path while writing a partitioned dataset.
    engine : {'auto', 'pyarrow'}, default 'auto'
        ORC library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    compression : {'snappy', 'zlib', 'lzo', None}, default 'snappy'
        Name of the compression to use. Use ``None`` for no
        compression. (Note: ORC's codecs differ from Parquet's;
        ORC uses zlib/lzo rather than gzip/brotli.)
    index : bool, default None
        If ``True``, include the dataframe's index(es) in the file
        output. If ``False``, they will not be written to the file.
        If ``None``, the behavior depends on the chosen engine.
    partition_cols : list, optional, default None
        Column names by which to partition the dataset.
        Columns are partitioned in the order they are given.
    **kwargs
        Additional arguments passed to the ORC library. See
        :ref:`pandas io <io.orc>` for more details.
    """
    ...
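For concreteness, here is a hypothetical usage sketch of the proposed (not yet implemented) API; the function names, defaults, and file paths are all assumptions based on the docstrings above:

import pandas as pd

# Read only two columns from an ORC file (hypothetical pd.read_orc).
df = pd.read_orc('data.orc', columns=['a', 'b'])

# Write back out, partitioned by one column (hypothetical DataFrame.to_orc).
df.to_orc('out_dir', compression='snappy', partition_cols=['a'])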
@xhochy (Contributor) commented Feb 8, 2019

+1 for making it look like the Parquet API. Both formats are very similar and could be considered "competitors". They should also roughly match on the pyarrow side in the future (ORC is currently missing dataset support in the style of pyarrow.parquet.ParquetDataset, and we're missing a writer API, which makes testing hard).

We can skip the engine argument here, though, as there is only one implementation at the moment.

@jreback (Contributor) commented Feb 8, 2019

Is this "close enough" to Parquet in people's minds that we could just add a flavor='parquet'|'orc' argument to the Parquet reader/writer (in pandas) and then dispatch appropriately?
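A minimal sketch of what that flavor-based dispatch could look like (hypothetical; pandas' read_parquet has no flavor argument, and the ORC branch here falls back to pyarrow's existing reader):

import pandas as pd
import pyarrow.orc as orc

def read_table(path, flavor='parquet', **kwargs):
    # Hypothetical dispatcher: route to a format-specific reader
    # based on the requested flavor.
    if flavor == 'parquet':
        return pd.read_parquet(path, **kwargs)
    if flavor == 'orc':
        # Use pyarrow's ORC reader until pandas grows a read_orc.
        return orc.ORCFile(path).read(**kwargs).to_pandas()
    raise ValueError(f"unknown flavor: {flavor!r}")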

@mrocklin (Contributor) commented Feb 8, 2019

From a user perspective I think it might be better to have explicit read_parquet and read_orc functions. Though of course, on the implementation side, hopefully there is some reuse as Arrow's ORC reader becomes more consistent with its Parquet reader.

+1 to everything that @xhochy said

@kkraus14 (Contributor, Author) commented Feb 8, 2019

> From a user perspective I think it might be better to have explicit read_parquet and read_orc functions. Though of course, on the implementation side, hopefully there is some reuse as Arrow's ORC reader becomes more consistent with its Parquet reader.
>
> +1 to everything that @xhochy said

+1 to having separate functions for read_parquet and read_orc, and to everything that @xhochy suggested.

@gfyoung added the IO Data and API Design labels Feb 9, 2019
@kkraus14 mentioned this issue Nov 6, 2019
@jreback added this to the 1.0 milestone Nov 17, 2019
@jorisvandenbossche (Member) commented
To play the devil's advocate for a moment: do we think this is actually worth including in pandas as a top-level function?

I haven't seen much usage of ORC personally (though my view is limited, and it's of course a chicken-and-egg problem: having it in pandas would give it more exposure).

@mrocklin (Contributor) commented
In my experience ORC is less commonly used than Parquet, but it is still fairly common, at least among enterprise Hadoop shops. Broadly, everyone who bought a Hadoop/Spark cluster from Cloudera ended up using Parquet, while everyone who bought one from Hortonworks ended up using ORC (that's a generalization, though). I commonly find ORC in companies that historically used Hortonworks but are now increasing their use of Python.

I think ORC is less popular than Parquet, and so not a strong priority, but still common enough to be well in scope for a project like pandas.

@jorisvandenbossche (Member) commented
@mrocklin thanks for that context! Sounds good to me then

@voycey commented Dec 10, 2019

@mrocklin ORC has different use cases than Parquet, especially with its powerful predicate pushdown, block-level indexes, and bloom filters. Many people are using it with Presto due to the huge amount of work its developers invested in streamlining ORC. In our tests, ORC also massively outperformed Parquet for our use case (20%+ speed increases).

We are absolutely committed to ORC as a format, simply because of the amount of data we manage on a tiny budget, and because ORC has the features that let us do that within that budget.

With support in Spark and cuDF, and with BigQuery support recently added, I think this should be bumped up the roadmap!
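For context on the selective-read features mentioned above: pyarrow's existing ORC reader already exposes column pruning and stripe-level access (predicate pushdown itself is implemented by engines such as Presto, not by pyarrow). A rough sketch, assuming an iris.orc file like the one written later in this thread:

import pyarrow.orc as orc

f = orc.ORCFile('/tmp/iris.orc')  # assumed path

# Column pruning: read only the columns we need.
table = f.read(columns=['sepal_length', 'species'])

# Stripe-level access: ORC's block-level indexes let a reader skip
# whole stripes; here we read just the first stripe.
first_stripe = f.read_stripe(0)
print(f.nstripes, table.num_rows, first_stripe.num_rows)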

@KaiRoesner commented

...but what about the writer API (to_orc())?

@jreback (Contributor) commented Mar 4, 2020

There is no support for writing ORC in pyarrow.

@benjamincerigo commented
Looks like they have a ticket for it: https://issues.apache.org/jira/browse/ARROW-3014

@impredicative commented Jul 16, 2021

> There is no support for writing ORC in pyarrow.

@jreback I am now able to write ORC using pyarrow==4.0.1 + pandas. It would be great for pandas to implement pd.to_orc to make this more convenient. I currently write ORC as:

import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc
import smart_open

# Create dataframe
df = pd.read_csv('https://j.mp/iriscsv')
table = pa.Table.from_pandas(df, preserve_index=False)  # Ref: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas

# Write locally using pyarrow
orc.write_table(table, '/tmp/iris.orc')

# Write locally using smart_open+pyarrow
with smart_open.open('/tmp/iris2.orc', "wb") as output_file:
    orc.write_table(table, output_file)

# Write to cloud using smart_open+pyarrow
with smart_open.open('s3://my-bucket/iris.orc', "wb") as output_file:
    orc.write_table(table, output_file)
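And reading the file back with pyarrow to verify the round trip (same assumed /tmp path as above):

import pyarrow.orc as orc

# Read the ORC file back and convert to pandas.
df2 = orc.ORCFile('/tmp/iris.orc').read().to_pandas()
print(df2.head())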

As noted by @voycey, ORC is highly relevant with Presto, and potentially also with AWS Athena, which uses Presto.

@jreback (Contributor) commented Jul 16, 2021

The community can certainly put up a pull request.
