ENH: Add the support of the Delta Lake format in Pandas as an optional extra dependency #49692

Closed
wants to merge 18 commits into from
18 commits
74f6bac
Add the support of the Delta Lake format in Pandas
fvaleye Nov 14, 2022
bcf1879
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 15, 2022
0e4307e
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 15, 2022
511a2e0
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 15, 2022
d2d479c
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 15, 2022
1e1bddc
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 16, 2022
d211d1c
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 16, 2022
172ba73
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 16, 2022
fbc1223
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 17, 2022
8fc22ca
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 17, 2022
d6b0a5d
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 18, 2022
6e81a6b
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 19, 2022
614ea97
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 19, 2022
b5e3f58
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 20, 2022
389e465
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 21, 2022
3512a7b
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 22, 2022
271df4a
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 22, 2022
8165b6f
Merge branch 'main' into enhancement/deltalake-format-io-support
fvaleye Nov 30, 2022
12 changes: 12 additions & 0 deletions doc/source/user_guide/io.rst
@@ -36,6 +36,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
binary;`SPSS <https://en.wikipedia.org/wiki/SPSS>`__;:ref:`read_spss<io.spss_reader>`;
binary;`Python Pickle Format <https://docs.python.org/3/library/pickle.html>`__;:ref:`read_pickle<io.pickle>`;:ref:`to_pickle<io.pickle>`
SQL;`SQL <https://en.wikipedia.org/wiki/SQL>`__;:ref:`read_sql<io.sql>`;:ref:`to_sql<io.sql>`
SQL;`Delta Lake <https://en.wikipedia.org/wiki/Delta_Lake_(Software)>`__;:ref:`read_deltalake<io.deltalake>`;
SQL;`Google BigQuery <https://en.wikipedia.org/wiki/BigQuery>`__;:ref:`read_gbq<io.bigquery>`;:ref:`to_gbq<io.bigquery>`

:ref:`Here <io.perf>` is an informal performance comparison for some of these IO methods.
@@ -5901,6 +5902,17 @@ And then issue the following queries:
data.to_sql("data", con)
pd.read_sql_query("SELECT * FROM data", con)

.. _io.deltalake:

Delta Lake
----------
The ``deltalake`` package provides functionality to read from and write to Delta Lake tables.

pandas integrates with this external package. If ``deltalake`` is installed, you can
use the pandas function ``pd.read_deltalake``, which calls the corresponding
reader from ``deltalake``.

Full documentation can be found `here <https://delta-io.github.io/delta-rs/python/>`__.

.. _io.bigquery:
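A minimal sketch of how the documented behaviour could be exercised. Since `pd.read_deltalake` exists only on this PR branch, the example calls the `deltalake` package directly, which is what the proposed reader delegates to. The helper name `load_delta_table` and the path in the usage comment are hypothetical and only serve as illustration.

```python
import importlib.util


def load_delta_table(table_uri, version=None, columns=None):
    """Read a Delta table into a DataFrame via the optional ``deltalake`` package.

    Hypothetical helper that mirrors what the proposed ``pd.read_deltalake`` does.
    """
    # Fail early with a readable message when the optional package is missing.
    if importlib.util.find_spec("deltalake") is None:
        raise ImportError(
            "deltalake is required to load data from Delta Lake. "
            "See the docs: https://delta-io.github.io/delta-rs/python."
        )

    # Imported lazily so this module still loads without deltalake installed.
    from deltalake import DeltaTable

    table = DeltaTable(table_uri, version=version)
    return table.to_pandas(columns=columns)


# Usage (assumes a Delta table already exists at this local path):
# df = load_delta_table("./data/events", columns=["id"])
```

Keeping the import inside the function means that users who never touch Delta Lake pay no import cost and see no error.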

2 changes: 2 additions & 0 deletions pandas/__init__.py
@@ -162,6 +162,7 @@
read_parquet,
read_orc,
read_feather,
read_deltalake,
read_gbq,
read_html,
read_xml,
@@ -313,6 +314,7 @@
"read_excel",
"read_feather",
"read_fwf",
"read_deltalake",
"read_gbq",
"read_hdf",
"read_html",
2 changes: 2 additions & 0 deletions pandas/io/api.py
@@ -3,6 +3,7 @@
"""

from pandas.io.clipboards import read_clipboard
from pandas.io.deltalake import read_deltalake
from pandas.io.excel import (
ExcelFile,
ExcelWriter,
@@ -46,6 +47,7 @@
"read_excel",
"read_feather",
"read_fwf",
"read_deltalake",
"read_gbq",
"read_hdf",
"read_html",
89 changes: 89 additions & 0 deletions pandas/io/deltalake.py
@@ -0,0 +1,89 @@
""" Delta Lake support """
from __future__ import annotations

from typing import (
    TYPE_CHECKING,
    Any,
)

from pandas.compat._optional import import_optional_dependency

if TYPE_CHECKING:
    import pyarrow.fs as pa_fs

    from pandas import DataFrame


def _try_import():
    # since pandas is a dependency of deltalake
    # we need to import on first use
    msg = (
        "deltalake is required to load data from Delta Lake. "
        "See the docs: https://delta-io.github.io/delta-rs/python."
    )
    deltalake = import_optional_dependency("deltalake", extra=msg)
    return deltalake


def read_deltalake(
    table_uri: str,
    version: int | None = None,
    storage_options: dict[str, str] | None = None,
    without_files: bool = False,
    partitions: list[tuple[str, str, Any]] | None = None,
    columns: list[str] | None = None,
    filesystem: str | pa_fs.FileSystem | None = None,
) -> DataFrame:
    """
    Load data from a Delta Lake table.

    This function requires the `deltalake package
    <https://delta-io.github.io/delta-rs/python>`__.

    See the `How to load a Delta table
    <https://delta-io.github.io/delta-rs/python/usage.html#loading-a-delta-table>`__
    guide for loading instructions.

    Parameters
    ----------
    table_uri : str
        The path of the DeltaTable.
    version : int, optional
        The version of the DeltaTable to load. If None, the latest version is used.
    storage_options : dict[str, str], optional
        A dictionary of the options to use for the storage backend.
    without_files : bool, default False
        If True, load the table without tracking files.
        Some append-only applications have no need to track any files,
        so the DeltaTable can be loaded with significantly less memory.
    partitions : list[tuple[str, str, Any]], optional
        A list of partition filters, see help(DeltaTable.files_by_partitions)
        for filter syntax.
    columns : list[str], optional
        The columns to project. This can be a list of column names to include
        (order and duplicates will be preserved).
    filesystem : str or pyarrow.fs.FileSystem, optional
        A concrete implementation of the Pyarrow FileSystem or
        a fsspec-compatible interface. If None, the first file path will be used
        to determine the right FileSystem.

    Returns
    -------
    DataFrame
        The table data, restricted to the requested partitions and columns.

    See Also
    --------
    deltalake.DeltaTable : Create a DeltaTable instance with the deltalake library.
    """
    deltalake = _try_import()

    table = deltalake.DeltaTable(
        table_uri=table_uri,
        version=version,
        storage_options=storage_options,
        without_files=without_files,
    )
    return table.to_pandas(
        partitions=partitions, columns=columns, filesystem=filesystem
    )
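The delegation in `read_deltalake` can be checked without a real Delta table. The following is a sketch under stated assumptions, not part of the PR: `read_deltalake_sketch` is a local copy of the forwarding logic, and a fake `deltalake` module built with `unittest.mock` stands in for the real package. It shows which arguments go to the `DeltaTable` constructor and which go to `to_pandas`.

```python
import sys
import types
from unittest import mock


def read_deltalake_sketch(
    table_uri,
    version=None,
    storage_options=None,
    without_files=False,
    partitions=None,
    columns=None,
    filesystem=None,
):
    """Local copy of the forwarding logic, for illustration only."""
    import deltalake  # resolved lazily, so the fake module below is picked up

    table = deltalake.DeltaTable(
        table_uri=table_uri,
        version=version,
        storage_options=storage_options,
        without_files=without_files,
    )
    return table.to_pandas(
        partitions=partitions, columns=columns, filesystem=filesystem
    )


# Fake module standing in for the real package (assumption: only DeltaTable is used).
fake = types.ModuleType("deltalake")
fake.DeltaTable = mock.MagicMock(name="DeltaTable")
fake.DeltaTable.return_value.to_pandas.return_value = "frame"

# The URI and filter values are made up; nothing is read from storage.
with mock.patch.dict(sys.modules, {"deltalake": fake}):
    result = read_deltalake_sketch(
        "s3://bucket/table",
        version=3,
        columns=["id"],
        partitions=[("year", "=", "2022")],
    )

# Table-level options are passed to the constructor ...
fake.DeltaTable.assert_called_once_with(
    table_uri="s3://bucket/table",
    version=3,
    storage_options=None,
    without_files=False,
)
# ... while read-level options are passed to to_pandas.
fake.DeltaTable.return_value.to_pandas.assert_called_once_with(
    partitions=[("year", "=", "2022")], columns=["id"], filesystem=None
)
```

Patching `sys.modules` keeps the check independent of whether `deltalake` is installed, and `mock.patch.dict` restores the original entry afterwards.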
1 change: 1 addition & 0 deletions pandas/tests/api/test_api.py
@@ -146,6 +146,7 @@ class TestPDApi(Base):
"read_csv",
"read_excel",
"read_fwf",
"read_deltalake",
"read_gbq",
"read_hdf",
"read_html",
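The hunk above registers the new name in a hard-coded list of expected public functions. As a rough sketch of why that is needed, assuming the test compares the public names of a namespace against such a list, the idea can be reproduced with a stand-in module. `public_names` and `fake_pd` are hypothetical and are not taken from the pandas test suite.

```python
import types


def public_names(namespace):
    """Return the sorted public attribute names of a module-like object."""
    return sorted(name for name in dir(namespace) if not name.startswith("_"))


# Stand-in for the pandas namespace (assumption: a tiny fake module).
fake_pd = types.ModuleType("fake_pd")
fake_pd.read_csv = lambda *args, **kwargs: None
fake_pd.read_deltalake = lambda *args, **kwargs: None

# A newly exported reader has to appear in the expected list,
# otherwise a comparison like this one fails.
expected = ["read_csv", "read_deltalake"]
assert public_names(fake_pd) == expected
```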