DISCUSS: What would an ORC reader/writer API look like? #25229
I'm one of the developers of https://github.com/rapidsai/cudf and we're working on adding GPU-accelerated file readers/writers to our library. It seems most of the standard formats are covered quite nicely in the pandas API, but ORC isn't. Before we went off defining our own API, I wanted to open a discussion to define what that API would look like, so we can be consistent with the pandas and pandas-like community.

At the top level, I imagine it would look almost identical to Parquet, something like the following:
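A minimal sketch, assuming the signatures simply mirror `read_parquet`/`to_parquet` (the `engine`, `columns`, and `compression` parameters below are assumptions carried over from the Parquet API, not a settled design):

```python
# Hypothetical top-level API, modeled on pandas' Parquet functions.

def read_orc(path, engine='pyarrow', columns=None, **kwargs):
    """Read an ORC file into a DataFrame, analogous to pd.read_parquet."""
    raise NotImplementedError

def to_orc(df, path, engine='pyarrow', compression=None, **kwargs):
    """Write a DataFrame to ORC, analogous to DataFrame.to_parquet."""
    raise NotImplementedError

# Intended usage would match the Parquet pattern:
#   df = pd.read_orc('data.orc', columns=['a', 'b'])
#   df.to_orc('out.orc')
```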
+1 for making it look like the Parquet API. Both formats are very similar and could be considered "competitors". They should also roughly match on the …. We can skip the ….
is this 'close' enough to parquet in people's minds, that we could just add a …?
From a user perspective I think that it might be better to have explicit …

+1 to everything that @xhochy said
+1 to having separate functions for reading and writing ORC
To play the devil's advocate for a moment: do we think this is actually worth including in pandas as a top-level function? I haven't seen much usage of ORC personally (but my view is also limited, and it's of course a chicken-and-egg problem: having it in pandas would give it more exposure).
In my experience ORC is less commonly used than Parquet, but it is still fairly common, at least among enterprise Hadoop shops. I think that everyone who bought a Hadoop/Spark cluster from Cloudera ended up using Parquet, while everyone who bought one from Hortonworks ended up using ORC (that's a generalization, though). I commonly find ORC at companies that historically used Hortonworks but are now increasing their use of Python. I think that ORC is less popular than Parquet, and so not a strong priority, but it is still common enough to be well in scope for a project like pandas.
@mrocklin thanks for that context! Sounds good to me then
@mrocklin ORC has different use cases than Parquet, especially with its powerful predicate pushdown, block-level indexes, and Bloom filters. We are absolutely committed to ORC as a format, simply because of the amount of data we manage on a tiny budget, and ORC has the features required to let us do that within that budget. With support in Spark and cuDF, and BigQuery support recently added, I think this should be bumped up the roadmap!
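For context on what is already accessible from Python: pyarrow's ORC reader exposes some of this granularity today (column pruning and stripe-level access, though not predicate pushdown). A minimal sketch; the file path and column names here are hypothetical:

```python
import pyarrow.orc as orc

# Open an ORC file and inspect its layout (path is hypothetical).
f = orc.ORCFile('/data/events.orc')
print(f.nstripes, f.schema)

# Column pruning: read only the columns you need, as with Parquet.
table = f.read(columns=['user_id', 'ts'])

# Stripe-level access: read one stripe at a time for chunked processing.
first_stripe = f.read_stripe(0, columns=['user_id'])
```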
...but what about the writer API (`to_orc`)?
there is no support for writing ORC in pyarrow
Looks like they have a ticket for it: https://issues.apache.org/jira/browse/ARROW-3014
@jreback I am now able to write ORC using pyarrow==4.0.1 + pandas. It would be great for pandas to implement `to_orc`:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc
import smart_open

# Create a dataframe
df = pd.read_csv('https://j.mp/iriscsv')
# Ref: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas
table = pa.Table.from_pandas(df, preserve_index=False)

# Write locally using pyarrow
orc.write_table(table, '/tmp/iris.orc')

# Write locally using smart_open + pyarrow
with smart_open.open('/tmp/iris2.orc', 'wb') as output_file:
    orc.write_table(table, output_file)

# Write to cloud storage using smart_open + pyarrow
with smart_open.open('s3://my-bucket/iris.orc', 'wb') as output_file:
    orc.write_table(table, output_file)
```

As noted by @voycey, ORC is highly relevant for Presto, and potentially also for AWS Athena, which uses Presto.
the community can certainly put up a pull request
cc @mrocklin for dask.dataframe visibility