ENH: Implement DataFrame.from_pyarrow and DataFrame.to_pyarrow #51769

Closed
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
@@ -281,6 +281,7 @@ Other enhancements
- Added new argument ``engine`` to :func:`read_json` to support parsing JSON with pyarrow by specifying ``engine="pyarrow"`` (:issue:`48893`)
- Added support for SQLAlchemy 2.0 (:issue:`40686`)
- :class:`Index` set operations :meth:`Index.union`, :meth:`Index.intersection`, :meth:`Index.difference`, and :meth:`Index.symmetric_difference` now support ``sort=True``, which will always return a sorted result, unlike the default ``sort=None`` which does not sort in some cases (:issue:`25151`)
- :class:`DataFrame` constructor supports PyArrow tables as input (:issue:`51760`)

.. ---------------------------------------------------------------------------
.. _whatsnew_200.notable_bug_fixes:
12 changes: 12 additions & 0 deletions pandas/core/frame.py
@@ -171,6 +171,7 @@
    PeriodArray,
    TimedeltaArray,
)
from pandas.core.arrays.arrow import ArrowDtype, ArrowExtensionArray
from pandas.core.arrays.sparse import SparseFrameAccessor
from pandas.core.construction import (
    ensure_wrapped_if_datetimelike,
@@ -665,6 +666,17 @@ def __init__(
            NDFrame.__init__(self, data)
            return

        try:
@phofl (Member) commented on Mar 3, 2023:
I'd prefer pointing users to to_pandas for now. This solution does not respect pandas_metadata, which seems unfortunate. It also adds quite significant overhead when pyarrow is not installed (arr is a numpy array):

%timeit pd.DataFrame(arr)
86.6 µs ± 23.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit pd.DataFrame(arr)
7.49 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

I'd prefer something like pd.DataFrame.from_arrow or similar.

The PR author (Member) replied:

Interesting finding. I'm not sure it's necessarily a problem: the overhead is constant, around 80 microseconds, and doesn't grow with the size or type of the data being loaded (on my machine it takes about the same when pyarrow is not installed, but double your numbers when it is). I understand the Python interpreter looks in more places when the module is not immediately found, but that's like 6 or 12 times more than when it's found.

In absolute terms a constant 80 microseconds doesn't seem that bad, but since it's 10 times the cost of a normal use case, it seems like a fair point to avoid, at least until PyArrow is a required dependency (or forever, since we should probably deprecate a constructor that accepts everything in favor of a factory approach).
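(For context, a minimal sketch of where that constant cost comes from; the module name below is made up and timings vary by machine. A failed import is not cached in sys.modules, so every attempt rescans the import path:)

import timeit

def attempt_import():
    # Each failed import searches sys.path again instead of hitting
    # the sys.modules cache, which is why the miss is so much slower
    # than a successful, cached import.
    try:
        import _not_a_real_module_  # assumed to be absent
    except ImportError:
        pass

print(timeit.timeit(attempt_import, number=10_000))  # total seconds for 10k misses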

Regarding the metadata, can you clarify which metadata is not being respected?

@phofl (Member) replied on Mar 3, 2023:

The following should preserve the index:

import pyarrow
from pandas import DataFrame

df = DataFrame({"a": [1, 2, 3]}, index=[1, 2, 3])
table = pyarrow.table(df)
DataFrame(table)

this returns

   a  __index_level_0__
0  1                  1
1  2                  2
2  3                  3

to_pandas handles this automatically; we have the same problem in read_parquet and friends, which is why I put up #51766
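(A minimal sketch of the difference, assuming pandas and pyarrow are installed: to_pandas consults the pandas metadata stored on the table, so the original index is restored rather than showing up as a column.)

import pyarrow
from pandas import DataFrame

df = DataFrame({"a": [1, 2, 3]}, index=[1, 2, 3])
table = pyarrow.table(df)

# The stored pandas metadata tells to_pandas which column is the index.
print(table.to_pandas())
#    a
# 1  1
# 2  2
# 3  3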

The PR author (Member) replied:

Yep, very good point. I didn't think about the index. I see there is some more logic in the transformation: https://github.com/apache/arrow/blob/68e89ac9ee5bf2c0c68493eb9405554162fdfab8/python/pyarrow/pandas_compat.py#L797

I guess it doesn't make sense to replicate all that logic in our code, since PyArrow already has it. I guess DataFrame.from_arrow as a wrapper around Table.to_pandas (using the same parameters) is what makes more sense, no?

@phofl (Member) replied:

Yep, agreed. Users can use set_index and rename if they want to add an index or rename columns anyway, so there's no real need for the constructor functionality. See the example below.
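(For illustration, a small example of that workflow; the column names here are made up:)

import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "val": [10, 20, 30]})
# Convert, then restore an index and rename columns explicitly.
df = table.to_pandas().set_index("id").rename(columns={"val": "value"})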

The PR author (Member) replied:

Now that I think about it, I'm not sure whether it's better to use from_arrow or from_pyarrow. My bet is that another Arrow implementation in Python is likely to be developed eventually. Polars must have reimplemented parts of PyArrow in some way, and whether that code is moved out of Polars or another project is created, I guess it'll end up happening.

Maybe I'm overthinking this, but at that point, is it more natural to have something like DataFrame.from_pyarrow and DataFrame.from_arrow2, or just one DataFrame.from_arrow that supports both? Small preference for the former, but happy to hear other opinions.

            import pyarrow as pa
        except ImportError:
            pass
        else:
            if isinstance(data, pa.Table):
                data = {
                    name: ArrowExtensionArray(array)
                    for name, array in zip(data.column_names, data.columns)
                }

        manager = get_option("mode.data_manager")

        # GH47215
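(For context, a minimal sketch of the behavior this diff adds on this branch; it is not necessarily available in released pandas, since this PR was ultimately closed:)

import pyarrow as pa
import pandas as pd

table = pa.table({"col": pa.array([1, 2, 3], type=pa.int64())})
df = pd.DataFrame(table)   # on this branch, columns become Arrow-backed
print(df["col"].dtype)     # expected: int64[pyarrow] (an ArrowDtype)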
26 changes: 26 additions & 0 deletions pandas/tests/frame/test_constructors.py
@@ -20,6 +20,7 @@
import numpy as np
from numpy import ma
from numpy.ma import mrecords
import pyarrow as pa
import pytest
import pytz

@@ -60,6 +61,7 @@
    SparseArray,
    TimedeltaArray,
)
from pandas.core.arrays.arrow import ArrowDtype

MIXED_FLOAT_DTYPES = ["float16", "float32", "float64"]
MIXED_INT_DTYPES = [
@@ -2340,6 +2342,30 @@ def test_constructor_with_extension_array(self, extension_arr):
        result = DataFrame(extension_arr)
        tm.assert_frame_equal(result, expected)

    @pytest.mark.parametrize(
        "data,dtype",
        [
            ([1, 2, 3], pa.uint8()),
            ([1.0, 2.0, 3.0], pa.float64()),
            (["foo", "bar", "foobar"], pa.string()),
        ],
    )
    def test_constructor_with_pyarrow_table(self, data, dtype):
        array = pa.array(data, type=dtype)
        table = pa.table([array], names=["col"])
        result = DataFrame(table)
        assert isinstance(result["col"].dtype, ArrowDtype)
        result_array = result["col"]._data.array._data.chunks[0]
        assert result_array == array

        for result_buffer, expected_buffer in zip(
            result_array.buffers(), array.buffers()
        ):
            if result_buffer is None and expected_buffer is None:
                continue
            assert result_buffer.address == expected_buffer.address
            assert result_buffer.size == expected_buffer.size
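(Note: comparing buffer addresses and sizes verifies that the construction is zero-copy: the Arrow array backing the DataFrame column shares the exact memory of the original pyarrow array.)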

    def test_datetime_date_tuple_columns_from_dict(self):
        # GH 10863
        v = date.today()