ENH: Delta Lake file format support #35017

Closed
jackwellsxyz opened this issue Jun 26, 2020 · 8 comments
Labels
Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

@jackwellsxyz

Is your feature request related to a problem?

No

Describe the solution you'd like

I'd love it if pandas could support Databricks' Delta Lake file format (https://github.com/delta-io/delta). It's a versioned storage format built on Parquet files that supports updates, inserts, and deletions.
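To make "versioned Parquet with updates/inserts/deletions" concrete, here is a minimal sketch of the idea behind a Delta-style transaction log: each commit records `add` and `remove` actions for data files, and replaying the commits up to a given version yields that version's snapshot. This is an illustration only, not the Delta Lake implementation. The `live_files` helper, the in-memory `commits` dictionary, and the file names are hypothetical, and the real log carries more action types and metadata than shown here.

```python
import json


def live_files(commits, version=None):
    """Replay commits in order and return the set of data files that
    make up the table snapshot at `version` (latest if None)."""
    files = set()
    for v in sorted(commits):
        if version is not None and v > version:
            break
        # Each commit is newline-delimited JSON, one action per line.
        for line in commits[v].splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Hypothetical log: two inserts, then a rewrite that replaces the first file.
commits = {
    0: '{"add": {"path": "part-0000.parquet"}}',
    1: '{"add": {"path": "part-0001.parquet"}}',
    2: '{"remove": {"path": "part-0000.parquet"}}\n'
       '{"add": {"path": "part-0002.parquet"}}',
}

latest = live_files(commits)
as_of_v1 = live_files(commits, version=1)
```

Because old files are only marked as removed in the log rather than edited in place, an earlier version stays readable as long as its files are still on disk.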

API breaking implications

None that I'm aware of

Describe alternatives you've considered

N/A

Additional context

N/A

@jackwellsxyz added the Enhancement and Needs Triage labels Jun 26, 2020
@yehoshuadimarsky
Contributor

yehoshuadimarsky commented Jun 26, 2020

I'm pretty sure Delta Lake only supports access via Spark at the moment, not through any other tools or APIs. I'll post the link in the docs where it mentions it when I can find it, but I do remember reading that.

EDIT: Well, their README does say this:

The only stable public APIs, currently provided by Delta Lake, are through the DataFrameReader/Writer (i.e. spark.read, df.write, spark.readStream and df.writeStream). Options to these APIs will remain stable within a major release of Delta Lake (e.g., 1.x.x).

All other interfaces in this library are considered internal, and they are subject to change across minor/patch releases.

@jbrockmendel
Member

I doubt we would support that directly, but if there is an implementation we could link to it in the ecosystem docs.

@ggGibs

ggGibs commented Jul 2, 2020

FYI, Koalas can read from and write to Delta Lake (link), and it has easy Koalas <--> pandas conversion methods.
But it looks like a pandas-like wrapper on top of PySpark: you will need a Spark cluster somewhere to handle the I/O, so there is no hope of running it standalone.

@TomAugspurger
Contributor

Happy to take an addition to the ecosystem docs when a reader / writer is available. That won't be implemented in pandas, but perhaps in / using pyarrow.

@houqp
Contributor

houqp commented Mar 25, 2021

Happy to share that we now have native support for Delta Lake outside of the JVM and Spark. It's a full Delta Lake implementation in pure Rust. We also provide a thin Python wrapper for pandas integration, see https://github.com/delta-io/delta-rs/tree/main/python#usage.

@jbrockmendel perhaps we can link this to the ecosystem docs?

@jbrockmendel
Member

fine by me

@MrPowers

There is a delta-rs project that makes it easy to read Delta Lakes into pandas DataFrames as mentioned by @houqp. See this snippet:

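from deltalake import DeltaTable

# Open the Delta table stored at the given path
dt = DeltaTable("resources/delta/1")
# Load the table's current snapshot into a pandas DataFrame
df = dt.to_pandas()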

Python write access doesn't exist yet, but hopefully it'll be added soon because it'd be an awesome addition for the pandas community!
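Building on the snippet above, here is a hedged sketch of a small read helper that can pin a table version and restrict the columns loaded. The `read_delta` wrapper is hypothetical, not part of delta-rs. It assumes the `deltalake` package is installed and that the installed release accepts a `version` argument on `DeltaTable` and a `columns` argument on `to_pandas`, so check the delta-rs Python docs for your release.

```python
def read_delta(path, version=None, columns=None):
    """Read a Delta table into a pandas DataFrame.

    Hypothetical convenience wrapper around the deltalake package.
    `version=None` loads the latest snapshot; `columns=None` loads all columns.
    """
    # Imported lazily so that defining this helper does not require deltalake.
    from deltalake import DeltaTable

    dt = DeltaTable(path, version=version)
    return dt.to_pandas(columns=columns)
```

For example, `read_delta("resources/delta/1", version=0)` would load the first committed version of the table used in the snippet above, assuming that path exists locally.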

@houqp
Contributor

houqp commented Oct 13, 2021

Adding to @MrPowers's comment: delta-rs now has write support in the Rust core, and we are just waiting for someone to send us a PR to expose that write API to the Python shim ;)
