Skip to content

[ArrowStringArray] API: StringDtype parameterized by storage (python or pyarrow) #39908

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
67 commits
Select commit Hold shift + click to select a range
4cb60e6
Implement BaseDtypeTests for ArrowStringDtype
xhochy Jul 10, 2020
d242f2d
Refactor to use parametrized StringDtype
TomAugspurger Sep 3, 2020
d39ab2c
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Feb 18, 2021
2367810
abs-imports
simonjayhawkins Feb 18, 2021
9166d3b
post merge fixup
simonjayhawkins Feb 19, 2021
8760705
StringDtype[python] -> string[python]
simonjayhawkins Feb 19, 2021
d5b3fec
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Mar 22, 2021
2c657df
pre-commit fix for inconsistent use of pandas namespace
simonjayhawkins Mar 22, 2021
647a6c2
fix typo
simonjayhawkins Mar 22, 2021
0596fd7
pre-commit fixup - undefined name 'ArrowStringDtype'
simonjayhawkins Mar 22, 2021
c5a19c5
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Mar 26, 2021
99680c9
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Mar 28, 2021
69a6cc1
"StringDtype[storage]" -> "string[storage]" misc
simonjayhawkins Mar 28, 2021
bd147ba
__from_arrow__
simonjayhawkins Mar 28, 2021
830275f
more testing (wip)
simonjayhawkins Mar 28, 2021
214e524
fix inference
simonjayhawkins Mar 28, 2021
c9ba03c
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Mar 29, 2021
7425536
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 1, 2021
68ac391
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 1, 2021
5cfa97a
post-merge fixup
simonjayhawkins Apr 1, 2021
74dbf96
remove changes to test_string_dtype - broken off in #40725
simonjayhawkins Apr 1, 2021
3985943
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 15, 2021
3bda421
post merge fix-up
simonjayhawkins Apr 15, 2021
0c108a4
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 15, 2021
523e24c
post merge fix-up
simonjayhawkins Apr 15, 2021
279624c
revert some changes made for pre-commit checks.
simonjayhawkins Apr 15, 2021
80d231e
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 16, 2021
c5ced5a
post merge fix-up
simonjayhawkins Apr 16, 2021
459812c
undo unrelated changes
simonjayhawkins Apr 16, 2021
d707b6b
undo changes to imports
simonjayhawkins Apr 16, 2021
71ccf24
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 17, 2021
daaac06
StringDtype.construct_array_type - add ref to issue
simonjayhawkins Apr 17, 2021
46626d1
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 19, 2021
3677bfa
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 1, 2021
42d382f
post merge fixup
simonjayhawkins May 1, 2021
4fb1a0d
add draft release note
simonjayhawkins May 1, 2021
5d4eac1
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 12, 2021
15efb2e
post merge fix-up
simonjayhawkins May 12, 2021
b53cfe0
docstrings
simonjayhawkins May 12, 2021
b7db53f
benchmarks
simonjayhawkins May 12, 2021
3399f08
pyarrow min
simonjayhawkins May 12, 2021
e365f01
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 26, 2021
71d1e6c
post merge fixup
simonjayhawkins May 26, 2021
9e23c35
misc clean
simonjayhawkins May 26, 2021
c69a611
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 27, 2021
64b3206
update construct_from_string docstring
simonjayhawkins May 27, 2021
d83a4ff
update whatsnew for dtype="string"
simonjayhawkins May 27, 2021
ef38660
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 30, 2021
aef1162
update release note
simonjayhawkins May 30, 2021
6247a5b
paramertize test for df.convert_dtypes()
simonjayhawkins May 30, 2021
a6d066c
fixup pd.array and more testing of string_storage option
simonjayhawkins May 31, 2021
8adb08d
use string_storage fixture more
simonjayhawkins May 31, 2021
3ad0638
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 31, 2021
56714c9
post merge fixup
simonjayhawkins May 31, 2021
6a1cc2b
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Jun 2, 2021
1761a84
remove accessor methods section from release note
simonjayhawkins Jun 2, 2021
3e26baa
consistent dtype naming in benchmark
simonjayhawkins Jun 2, 2021
6b470b1
Apply suggestions from code review
simonjayhawkins Jun 2, 2021
2ec6de0
name and str() change to "string"
simonjayhawkins Jun 2, 2021
a0b7a70
remove testing of sting dtype without storage specified.
simonjayhawkins Jun 2, 2021
d9dcd20
update StringDtype docstring
simonjayhawkins Jun 2, 2021
4a37470
add ArrowStringArray to pd.arrays namespace
simonjayhawkins Jun 2, 2021
1d59c7a
add common base class, BaseStringArray
simonjayhawkins Jun 2, 2021
e57c850
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Jun 4, 2021
51f1b1d
fixup roundtrip tests
simonjayhawkins Jun 4, 2021
fc95c06
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Jun 7, 2021
ef02a43
remove link
simonjayhawkins Jun 7, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 6 additions & 7 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -412,6 +412,8 @@ def __ne__(self, other: Any) -> ArrayLike:
"""
Return for `self != other` (element-wise in-equality).
"""
if isinstance(other, (ABCSeries, ABCDataFrame, ABCIndex)):
return NotImplemented
return ~(self == other)

def to_numpy(
Expand Down Expand Up @@ -516,7 +518,6 @@ def astype(self, dtype, copy=True):
NumPy ndarray with 'dtype' for its dtype.
"""
from pandas.core.arrays.string_ import StringDtype
from pandas.core.arrays.string_arrow import ArrowStringDtype

dtype = pandas_dtype(dtype)
if is_dtype_equal(dtype, self.dtype):
Expand All @@ -526,9 +527,7 @@ def astype(self, dtype, copy=True):
return self.copy()

# FIXME: Really hard-code here?
if isinstance(
dtype, (ArrowStringDtype, StringDtype)
): # allow conversion to StringArrays
if isinstance(dtype, StringDtype): # allow conversion to StringArrays
return dtype.construct_array_type()._from_sequence(self, copy=False)

return np.array(self, dtype=dtype, copy=copy)
Expand Down Expand Up @@ -1038,9 +1037,9 @@ def take(
from the right (the default). This is similar to
:func:`numpy.take`.

* True: negative values in `indices` indicate
missing values. These values are set to `fill_value`. Any other
other negative values raise a ``ValueError``.
* True: ``-1`` in `indices` indicate missing values.
These values are set to `fill_value`. Any other other negative
value raise a ``ValueError``.

fill_value : any, optional
Fill value to use for NA-indices when `allow_fill` is True.
Expand Down
98 changes: 90 additions & 8 deletions pandas/core/arrays/string_.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,16 @@

from typing import (
TYPE_CHECKING,
Any,
Optional,
Type,
Union,
)

import numpy as np

from pandas._config import get_option

from pandas._libs import (
lib,
missing as libmissing,
Expand Down Expand Up @@ -49,6 +52,8 @@
if TYPE_CHECKING:
import pyarrow

from pandas.core.arrays.string_arrow import ArrowStringArray


@register_extension_dtype
class StringDtype(ExtensionDtype):
Expand Down Expand Up @@ -79,37 +84,114 @@ class StringDtype(ExtensionDtype):
StringDtype
"""

name = "string"

#: StringDtype.na_value uses pandas.NA
na_value = libmissing.NA
_metadata = ("storage",)

def __init__(self, storage=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the docstring above for this new keyword?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in d9dcd20

if storage is None:
storage = get_option("mode.string_storage")
Comment on lines +101 to +102
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to delay looking up this option and allow the StringDtype to be initialized with storage=None. That could eg allow doing s.astype("string") preserving the original dtype if the series already is of string dtype (regardless of the default).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit (wip) started to add a parameterised fixture for more testing prior to implementing this.

since the dtype change is user facing this change worthwhile, there is still more to do as many tests use dtype="string" but does potentially make this PR harder to review. I will carry on and do the others if doing this makes sense.

Have added string[pyarrow] to this fixture. This gives some additional failures for outstanding parts of ArrowStringArray. It maybe that to keep this PR more cleanly scoped, that we defer adding that for a follow-up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I don't think this comment is resolved, so "unresolved" it in the UI)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure can you open an issue explaining the api you'd like. I have updated _from_seqenece to accept the string "string", but if you are writing code if you do a comparison as a conditional, you expect the storage type to be fixed. and not then do something different. If the only places where this could be usefull is astype, we could put the logic there.

if storage not in {"python", "pyarrow"}:
raise ValueError(
f"Storage must be 'python' or 'pyarrow'. Got {storage} instead."
)
self.storage = storage

@property
def name(self):
return f"StringDtype[{self.storage}]"

@property
def type(self) -> Type[str]:
return str

@classmethod
def construct_array_type(cls) -> Type[StringArray]:
def construct_from_string(cls, string):
"""
Construct a StringDtype from a string.

Parameters
----------
string : str
The type of the name. The storage type will be taking from `string`.
Valid options and their storage types are

========================== ==============
string result storage
========================== ==============
``'string'`` global default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

global default -> python (at least currently)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated docstring

``'string[python]'`` python
``'StringDtype[python]'`` python
``'string[pyarrow]'`` pyarrow
``'StringDtype[pyarrow]'`` pyarrow
========================== =============

Returns
-------
StringDtype

Raise
-----
TypeError
If the string is not a valid option.

"""
if not isinstance(string, str):
raise TypeError(
f"'construct_from_string' expects a string, got {type(string)}"
)
if string == "string":
# TODO: use global default
return cls()
elif string in {"string[python]", "StringDtype[python]"}:
return cls(storage="python")
elif string in {"string[pyarrow]", "StringDtype[pyarrow]"}:
return cls(storage="pyarrow")
else:
raise TypeError(f"Cannot construct a '{cls.__name__}' from '{string}'")

def __eq__(self, other: Any) -> bool:
if isinstance(other, str) and other == "string":
return True
return super().__eq__(other)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is actually the behaviour here? Are "string[python]" and "string[arrow]" seen as not equal, I suppose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. Not sure what the behavior should be if we delay looking up the storage. that could complicate things so will leave for a follow-on after better understanding of what we want, and when the lookup would be final.

>>> with pd.option_context("string_storage", "python"):
...     print(StringDtype() == "string")
... 
True
>>> with pd.option_context("string_storage", "pyarrow"):
...     print(StringDtype() == "string")
... 
True
>>> with pd.option_context("string_storage", "python"):
...     print(StringDtype() == "string[python]")
... 
True
>>> with pd.option_context("string_storage", "python"):
...     print(StringDtype() == "string[pyarrow]")
... 
False
>>> with pd.option_context("string_storage", "pyarrow"):
...     print(StringDtype() == "string[python]")
... 
False
>>> with pd.option_context("string_storage", "pyarrow"):
...     print(StringDtype() == "string[pyarrow]")
... 
True
>>> StringDtype(storage="python") == "string"
True
>>> 
>>> StringDtype(storage="pyarrow") == "string"
True
>>> 
>>> StringDtype(storage="python") == StringDtype(storage="python")
True
>>> 
>>> StringDtype(storage="pyarrow") == StringDtype(storage="pyarrow")
True
>>> 
>>> StringDtype(storage="python") == StringDtype(storage="pyarrow")
False
>>> 
>>> StringDtype(storage="pyarrow") == "string[pyarrow]"
True
>>> 
>>> StringDtype(storage="pyarrow") == "string[python]"
False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is ok, as we consider dtypes to be exactly equal. later on maybe we can allow equivalent tests (possibly open an issue about this)


def __hash__(self) -> int:
# custom __eq__ so have to override __hash__
return super().__hash__()

# TODO: this is a classmethod, but we need to know the storage type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue about this is #36126. Tom had a PR #36136 with some discussion about it. Would need to look back, but I think the conclusion was indeed that StringDtype can just make this a normal method for now.

# error: Signature of "construct_array_type" incompatible with supertype
# "ExtensionDtype"
def construct_array_type( # type: ignore[override]
self,
) -> Type[StringArray | ArrowStringArray]:
"""
Return the array type associated with this dtype.

Returns
-------
type
"""
return StringArray
from pandas.core.arrays.string_arrow import ArrowStringArray

def __repr__(self) -> str:
return "StringDtype"
if self.storage == "python":
return StringArray
else:
return ArrowStringArray

def __repr__(self):
return self.name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am think that we should add a __str__ that doesn't include the storage. This is used in eg the output of df.dtypes, and IMO for most users the fact that it's using "python" objects is an implementation detail, and should not be that visible (you can always check the actual dtype object to see what kind of storage it's using)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this would be fine, e.g. string is de-facto string[python], though this means we can essentially never change it and string[pyarrow] for non-default storage types. maybe string[object] is more instructive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, what about the global setting. the pyarrow backed string array should not be a second class citizen. At the moment if the global setting is changed , all tests pass.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string is de-facto string[python]

What "string" means depends on the global setting, which is by default (and not de-facto) "string[python]".

To be clear, I didn't want to suggest that "string" would always mean "string[python]". I just think that in the str representation (eg used in df.dtypes), we shouldn't show that detail.


def __from_arrow__(
self, array: Union[pyarrow.Array, pyarrow.ChunkedArray]
) -> StringArray:
) -> ArrowStringArray:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't change to always ArrowStringArray, but still depend on what self.storage is (even when the data is coming from arrow, we still want to create python objects-backed string array if that's the default, I think)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> arr
<pyarrow.lib.ChunkedArray object at 0x7f45ffd5e090>
[
  [
    "a",
    "b",
    "c"
  ]
]
>>> 
>>> StringDtype(storage="python").__from_arrow__(arr)
<StringArray>
['a', 'b', 'c']
Length: 3, dtype: string[python]
>>> 
>>> StringDtype(storage="pyarrow").__from_arrow__(arr)
<ArrowStringArray>
['a', 'b', 'c']
Length: 3, dtype: string[pyarrow]
>>> 

"""
Construct StringArray from pyarrow Array/ChunkedArray.
"""
import pyarrow

from pandas.core.arrays.string_arrow import ArrowStringArray

if isinstance(array, pyarrow.Array):
chunks = [array]
else:
Expand All @@ -122,7 +204,7 @@ def __from_arrow__(
str_arr = StringArray._from_sequence(np.array(arr))
results.append(str_arr)

return StringArray._concat_same_type(results)
return ArrowStringArray._concat_same_type(results)


class StringArray(PandasArray):
Expand Down
130 changes: 16 additions & 114 deletions pandas/core/arrays/string_arrow.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,24 +6,20 @@
Any,
Optional,
Sequence,
Type,
Union,
)

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

from pandas._libs import (
lib,
missing as libmissing,
)
from pandas._libs import lib
from pandas._typing import (
Dtype,
NpDtype,
)
from pandas.util._validators import validate_fillna_kwargs

from pandas.core.dtypes.base import ExtensionDtype
from pandas.core.dtypes.dtypes import register_extension_dtype
from pandas.core.dtypes.missing import isna

from pandas.api.types import (
Expand All @@ -35,121 +31,27 @@
)
from pandas.core.arraylike import OpsMixin
from pandas.core.arrays.base import ExtensionArray
from pandas.core.arrays.string_ import StringDtype
from pandas.core.indexers import (
check_array_indexer,
validate_indices,
)
from pandas.core.missing import get_fill_func

try:
import pyarrow as pa
except ImportError:
pa = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We indeed don't need to lazily import pyarrow anymore for the sake of the dtype object. But if we want to expose the ArrowStringArray in pandas.core.arrays like all other arrays, we are still going to need to keep this behind an ImportError ..

(not sure if we want to expose ArrowStringArray, but until that's decided I would maybe leave those imports as is)

else:
# PyArrow backed StringArrays are available starting at 1.0.0, but this
# file is imported from even if pyarrow is < 1.0.0, before pyarrow.compute
# and its compute functions existed. GH38801
if LooseVersion(pa.__version__) >= "1.0.0":
import pyarrow.compute as pc

ARROW_CMP_FUNCS = {
"eq": pc.equal,
"ne": pc.not_equal,
"lt": pc.less,
"gt": pc.greater,
"le": pc.less_equal,
"ge": pc.greater_equal,
}
ARROW_CMP_FUNCS = {
"eq": pc.equal,
"ne": pc.not_equal,
"lt": pc.less,
"gt": pc.greater,
"le": pc.less_equal,
"ge": pc.greater_equal,
}


if TYPE_CHECKING:
from pandas import Series


@register_extension_dtype
class ArrowStringDtype(ExtensionDtype):
"""
Extension dtype for string data in a ``pyarrow.ChunkedArray``.

.. versionadded:: 1.2.0

.. warning::

ArrowStringDtype is considered experimental. The implementation and
parts of the API may change without warning.

Attributes
----------
None

Methods
-------
None

Examples
--------
>>> from pandas.core.arrays.string_arrow import ArrowStringDtype
>>> ArrowStringDtype()
ArrowStringDtype
"""

name = "arrow_string"

#: StringDtype.na_value uses pandas.NA
na_value = libmissing.NA

@property
def type(self) -> Type[str]:
return str

@classmethod
def construct_array_type(cls) -> Type[ArrowStringArray]:
"""
Return the array type associated with this dtype.

Returns
-------
type
"""
return ArrowStringArray

def __hash__(self) -> int:
return hash("ArrowStringDtype")

def __repr__(self) -> str:
return "ArrowStringDtype"

def __from_arrow__(
self, array: Union[pa.Array, pa.ChunkedArray]
) -> ArrowStringArray:
"""
Construct StringArray from pyarrow Array/ChunkedArray.
"""
return ArrowStringArray(array)

def __eq__(self, other) -> bool:
"""Check whether 'other' is equal to self.

By default, 'other' is considered equal if
* it's a string matching 'self.name'.
* it's an instance of this type.

Parameters
----------
other : Any

Returns
-------
bool
"""
if isinstance(other, ArrowStringDtype):
return True
elif isinstance(other, str) and other == "arrow_string":
return True
else:
return False


class ArrowStringArray(OpsMixin, ExtensionArray):
"""
Extension array for string data in a ``pyarrow.ChunkedArray``.
Expand Down Expand Up @@ -188,13 +90,13 @@ class ArrowStringArray(OpsMixin, ExtensionArray):

Examples
--------
>>> pd.array(['This is', 'some text', None, 'data.'], dtype="arrow_string")
>>> pd.array(['This is', 'some text', None, 'data.'], dtype="string[arrow]")
<ArrowStringArray>
['This is', 'some text', <NA>, 'data.']
Length: 4, dtype: arrow_string
"""

_dtype = ArrowStringDtype()
_dtype = StringDtype(storage="pyarrow")

def __init__(self, values):
self._chk_pyarrow_available()
Expand Down Expand Up @@ -232,9 +134,9 @@ def _from_sequence_of_strings(
return cls._from_sequence(strings, dtype=dtype, copy=copy)

@property
def dtype(self) -> ArrowStringDtype:
def dtype(self) -> StringDtype:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no ArrowStringDtype

"""
An instance of 'ArrowStringDtype'.
An instance of 'StringDtype[pyarrow]'.
"""
return self._dtype

Expand Down
Loading