ENH: Add StringArray.__arrow_array__ for conversion to Arrow #29182


Merged
11 changes: 6 additions & 5 deletions doc/source/whatsnew/v1.0.0.rst
@@ -102,11 +102,12 @@ Other enhancements
- :meth:`DataFrame.to_string` added the ``max_colwidth`` parameter to control when wide columns are truncated (:issue:`9784`)
- :meth:`MultiIndex.from_product` infers level names from inputs if not explicitly provided (:issue:`27292`)
- :meth:`DataFrame.to_latex` now accepts ``caption`` and ``label`` arguments (:issue:`25436`)
-- The :ref:`integer dtype <integer_na>` with support for missing values can now be converted to
-  ``pyarrow`` (>= 0.15.0), which means that it is supported in writing to the Parquet file format
-  when using the ``pyarrow`` engine. It is currently not yet supported when converting back to
-  pandas (so it will become an integer or float dtype depending on the presence of missing data).
-  (:issue:`28368`)
+- The :ref:`integer dtype <integer_na>` with support for missing values and the
+  new :ref:`string dtype <text.types>` can now be converted to ``pyarrow`` (>=
+  0.15.0), which means that it is supported in writing to the Parquet file
+  format when using the ``pyarrow`` engine. It is currently not yet supported
+  when converting back to pandas, so it will become an integer or float
+  (depending on the presence of missing data) or object dtype column. (:issue:`28368`)
- :meth:`DataFrame.to_json` now accepts an ``indent`` integer argument to enable pretty printing of JSON output (:issue:`12004`)
- :meth:`read_stata` can read Stata 119 dta files. (:issue:`28250`)
- Added ``encoding`` argument to :func:`DataFrame.to_html` for non-ascii text (:issue:`28663`)
10 changes: 10 additions & 0 deletions pandas/core/arrays/string_.py
@@ -182,6 +182,16 @@ def _from_sequence(cls, scalars, dtype=None, copy=False):
def _from_sequence_of_strings(cls, strings, dtype=None, copy=False):
return cls._from_sequence(strings, dtype=dtype, copy=copy)

def __arrow_array__(self, type=None):
    """
    Convert myself into a pyarrow Array.
    """
    import pyarrow as pa

    if type is None:
        type = pa.string()
    return pa.array(self._ndarray, type=type, from_pandas=True)

def __setitem__(self, key, value):
    value = extract_array(value, extract_numpy=True)
    if isinstance(value, type(self)):

Review comments on `__arrow_array__`:

Contributor: Shouldn't we make a new base EA mixin class for this? This is pretty generic code.

Contributor: And have StringArray/IntegerArray use this? (Maybe actually just put it in the EA base class itself?)

Member (author): For now, this one is different from the one in IntegerArray, so it cannot directly be shared with that. I thought about putting this in the PandasArray class, but StringArray is the only subclass of it that is actually used, so I kept it here. But I am open to other options.

Member (author):

> maybe actually just put in the EA base class itself?

That's not possible, because it depends on the internals (e.g. StringArray uses `_ndarray` for the data, IntegerArray uses `_data` and `_mask`). The only thing that could be put in the base EA class is the signature with a `NotImplementedError`, but I need to check how that interacts with pyarrow (the code there does a `hasattr` check, so if the method is present, it expects it to work).

Contributor: OK, cool. Yeah, this can certainly be handled in a follow-up, but I think we do want to generally enable these types of conversions.
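The `hasattr` dispatch the author mentions can be sketched in isolation. This is a simplified stand-in for what `pa.array` does when handed an extension array — `to_arrow` and `FakeStringArray` are illustrative names, not pandas or pyarrow API:

```python
def to_arrow(obj, type=None):
    # Simplified sketch of pyarrow's dispatch: if the object exposes
    # __arrow_array__, pyarrow calls it and expects a valid result back.
    # This is why a base-class stub that raises NotImplementedError is risky:
    # the hasattr check would pass, but the call would then fail.
    if hasattr(obj, "__arrow_array__"):
        return obj.__arrow_array__(type=type)
    raise TypeError("object does not implement the __arrow_array__ protocol")


class FakeStringArray:
    # Stand-in for StringArray: wraps a plain list instead of an ndarray.
    def __init__(self, values):
        self._ndarray = list(values)

    def __arrow_array__(self, type=None):
        # A real implementation would return pa.array(self._ndarray, ...);
        # here we just return the underlying values to show the dispatch.
        return self._ndarray


print(to_arrow(FakeStringArray(["a", "b", None])))  # ['a', 'b', None]
```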
13 changes: 13 additions & 0 deletions pandas/tests/arrays/string_/test_string.py
@@ -3,6 +3,8 @@
import numpy as np
import pytest

import pandas.util._test_decorators as td

import pandas as pd
import pandas.util.testing as tm

@@ -158,3 +160,14 @@ def test_reduce_missing(skipna):
        assert result == "abc"
    else:
        assert pd.isna(result)


@td.skip_if_no("pyarrow", min_version="0.15.0")
def test_arrow_array():
    # protocol added in 0.15.0
    import pyarrow as pa

    data = pd.array(["a", "b", "c"], dtype="string")
    arr = pa.array(data)
    expected = pa.array(list(data), type=pa.string(), from_pandas=True)
    assert arr.equals(expected)
2 changes: 1 addition & 1 deletion pandas/tests/arrays/test_integer.py
@@ -819,7 +819,7 @@ def test_ufunc_reduce_raises(values):
        np.add.reduce(a)


-@td.skip_if_no("pyarrow", min_version="0.14.1.dev")
+@td.skip_if_no("pyarrow", min_version="0.15.0")
def test_arrow_array(data):
    # protocol added in 0.15.0
    import pyarrow as pa
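The discussion above notes that `IntegerArray` stores its values as a `_data` array plus a boolean `_mask`, which is why its `__arrow_array__` cannot share code with `StringArray`. Conceptually, a masked array has to combine the two before handing off to Arrow; a minimal pure-Python sketch of that masked-to-nullable conversion (`masked_to_values` is a hypothetical helper, not pandas API):

```python
def masked_to_values(data, mask):
    # Positions where mask is True are missing; everything else keeps its
    # value. This mirrors what a masked extension array must do before
    # building an Arrow array containing nulls.
    return [None if m else v for v, m in zip(data, mask)]


print(masked_to_values([1, 2, 3, 0], [False, False, False, True]))
# [1, 2, 3, None]
```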
19 changes: 13 additions & 6 deletions pandas/tests/io/test_parquet.py
@@ -504,15 +504,22 @@ def test_empty_dataframe(self, pa):
        df = pd.DataFrame()
        check_round_trip(df, pa)

-    @td.skip_if_no("pyarrow", min_version="0.14.1.dev")
-    def test_nullable_integer(self, pa):
-        df = pd.DataFrame({"a": pd.Series([1, 2, 3], dtype="Int64")})
-        # currently de-serialized as plain int
-        expected = df.assign(a=df.a.astype("int64"))
+    @td.skip_if_no("pyarrow", min_version="0.15.0")
+    def test_additional_extension_arrays(self, pa):
+        # test additional ExtensionArrays that are supported through the
+        # __arrow_array__ protocol
+        df = pd.DataFrame(
+            {
+                "a": pd.Series([1, 2, 3], dtype="Int64"),
+                "b": pd.Series(["a", None, "c"], dtype="string"),
+            }
+        )
+        # currently de-serialized as plain int / object
+        expected = df.assign(a=df.a.astype("int64"), b=df.b.astype("object"))
        check_round_trip(df, pa, expected=expected)

        df = pd.DataFrame({"a": pd.Series([1, 2, 3, None], dtype="Int64")})
-        # if missing values currently de-serialized as float
+        # if missing values in integer, currently de-serialized as float
        expected = df.assign(a=df.a.astype("float64"))
        check_round_trip(df, pa, expected=expected)
