Skip to content

[ArrowStringArray] API: StringDtype parameterized by storage (python or pyarrow) #39908

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from 25 commits
Commits
Show all changes
67 commits
Select commit Hold shift + click to select a range
4cb60e6
Implement BaseDtypeTests for ArrowStringDtype
xhochy Jul 10, 2020
d242f2d
Refactor to use parametrized StringDtype
TomAugspurger Sep 3, 2020
d39ab2c
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Feb 18, 2021
2367810
abs-imports
simonjayhawkins Feb 18, 2021
9166d3b
post merge fixup
simonjayhawkins Feb 19, 2021
8760705
StringDtype[python] -> string[python]
simonjayhawkins Feb 19, 2021
d5b3fec
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Mar 22, 2021
2c657df
pre-commit fix for inconsistent use of pandas namespace
simonjayhawkins Mar 22, 2021
647a6c2
fix typo
simonjayhawkins Mar 22, 2021
0596fd7
pre-commit fixup - undefined name 'ArrowStringDtype'
simonjayhawkins Mar 22, 2021
c5a19c5
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Mar 26, 2021
99680c9
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Mar 28, 2021
69a6cc1
"StringDtype[storage]" -> "string[storage]" misc
simonjayhawkins Mar 28, 2021
bd147ba
__from_arrow__
simonjayhawkins Mar 28, 2021
830275f
more testing (wip)
simonjayhawkins Mar 28, 2021
214e524
fix inference
simonjayhawkins Mar 28, 2021
c9ba03c
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Mar 29, 2021
7425536
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 1, 2021
68ac391
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 1, 2021
5cfa97a
post-merge fixup
simonjayhawkins Apr 1, 2021
74dbf96
remove changes to test_string_dtype - broken off in #40725
simonjayhawkins Apr 1, 2021
3985943
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 15, 2021
3bda421
post merge fix-up
simonjayhawkins Apr 15, 2021
0c108a4
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 15, 2021
523e24c
post merge fix-up
simonjayhawkins Apr 15, 2021
279624c
revert some changes made for pre-commit checks.
simonjayhawkins Apr 15, 2021
80d231e
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 16, 2021
c5ced5a
post merge fix-up
simonjayhawkins Apr 16, 2021
459812c
undo unrelated changes
simonjayhawkins Apr 16, 2021
d707b6b
undo changes to imports
simonjayhawkins Apr 16, 2021
71ccf24
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 17, 2021
daaac06
StringDtype.construct_array_type - add ref to issue
simonjayhawkins Apr 17, 2021
46626d1
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 19, 2021
3677bfa
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 1, 2021
42d382f
post merge fixup
simonjayhawkins May 1, 2021
4fb1a0d
add draft release note
simonjayhawkins May 1, 2021
5d4eac1
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 12, 2021
15efb2e
post merge fix-up
simonjayhawkins May 12, 2021
b53cfe0
docstrings
simonjayhawkins May 12, 2021
b7db53f
benchmarks
simonjayhawkins May 12, 2021
3399f08
pyarrow min
simonjayhawkins May 12, 2021
e365f01
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 26, 2021
71d1e6c
post merge fixup
simonjayhawkins May 26, 2021
9e23c35
misc clean
simonjayhawkins May 26, 2021
c69a611
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 27, 2021
64b3206
update construct_from_string docstring
simonjayhawkins May 27, 2021
d83a4ff
update whatsnew for dtype="string"
simonjayhawkins May 27, 2021
ef38660
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 30, 2021
aef1162
update release note
simonjayhawkins May 30, 2021
6247a5b
paramertize test for df.convert_dtypes()
simonjayhawkins May 30, 2021
a6d066c
fixup pd.array and more testing of string_storage option
simonjayhawkins May 31, 2021
8adb08d
use string_storage fixture more
simonjayhawkins May 31, 2021
3ad0638
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 31, 2021
56714c9
post merge fixup
simonjayhawkins May 31, 2021
6a1cc2b
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Jun 2, 2021
1761a84
remove accessor methods section from release note
simonjayhawkins Jun 2, 2021
3e26baa
consistent dtype naming in benchmark
simonjayhawkins Jun 2, 2021
6b470b1
Apply suggestions from code review
simonjayhawkins Jun 2, 2021
2ec6de0
name and str() change to "string"
simonjayhawkins Jun 2, 2021
a0b7a70
remove testing of sting dtype without storage specified.
simonjayhawkins Jun 2, 2021
d9dcd20
update StringDtype docstring
simonjayhawkins Jun 2, 2021
4a37470
add ArrowStringArray to pd.arrays namespace
simonjayhawkins Jun 2, 2021
1d59c7a
add common base class, BaseStringArray
simonjayhawkins Jun 2, 2021
e57c850
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Jun 4, 2021
51f1b1d
fixup roundtrip tests
simonjayhawkins Jun 4, 2021
fc95c06
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Jun 7, 2021
ef02a43
remove link
simonjayhawkins Jun 7, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions pandas/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -1132,8 +1132,9 @@ def string_dtype(request):
@pytest.fixture(
params=[
"string",
"string[python]",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it needed to have both "string" and "string[python]" ? That would duplicate a lot of tests?

(and same for the ones below)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not, most of these tests are pretty fast. when running locally 7/8 cores finish in a few minutes and then you are waiting for the numba tests for another 5 minutes or so running on a single thread.

will remove anyway

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in a0b7a70

pytest.param(
"arrow_string", marks=td.skip_if_no("pyarrow", min_version="1.0.0")
"string[pyarrow]", marks=td.skip_if_no("pyarrow", min_version="1.0.0")
),
]
)
Expand All @@ -1142,10 +1143,9 @@ def nullable_string_dtype(request):
Parametrized fixture for string dtypes.

* 'string'
* 'arrow_string'
* 'string[python]'
* 'string[pyarrow]'
"""
from pandas.core.arrays.string_arrow import ArrowStringDtype # noqa: F401

return request.param


Expand Down
13 changes: 6 additions & 7 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -420,6 +420,8 @@ def __ne__(self, other: Any) -> ArrayLike: # type: ignore[override]
"""
Return for `self != other` (element-wise in-equality).
"""
if isinstance(other, (ABCSeries, ABCDataFrame, ABCIndex)):
return NotImplemented
return ~(self == other)

def to_numpy(
Expand Down Expand Up @@ -530,7 +532,6 @@ def astype(self, dtype, copy=True):
NumPy ndarray with 'dtype' for its dtype.
"""
from pandas.core.arrays.string_ import StringDtype
from pandas.core.arrays.string_arrow import ArrowStringDtype

dtype = pandas_dtype(dtype)
if is_dtype_equal(dtype, self.dtype):
Expand All @@ -540,9 +541,7 @@ def astype(self, dtype, copy=True):
return self.copy()

# FIXME: Really hard-code here?
if isinstance(
dtype, (ArrowStringDtype, StringDtype)
): # allow conversion to StringArrays
if isinstance(dtype, StringDtype): # allow conversion to StringArrays
return dtype.construct_array_type()._from_sequence(self, copy=False)

return np.array(self, dtype=dtype, copy=copy)
Expand Down Expand Up @@ -1053,9 +1052,9 @@ def take(
from the right (the default). This is similar to
:func:`numpy.take`.

* True: negative values in `indices` indicate
missing values. These values are set to `fill_value`. Any other
other negative values raise a ``ValueError``.
* True: ``-1`` in `indices` indicate missing values.
These values are set to `fill_value`. Any other negative
value raises a ``ValueError``.

fill_value : any, optional
Fill value to use for NA-indices when `allow_fill` is True.
Expand Down
124 changes: 105 additions & 19 deletions pandas/core/arrays/string_.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,14 @@
from __future__ import annotations

from typing import TYPE_CHECKING
from typing import (
TYPE_CHECKING,
Any,
)

import numpy as np

from pandas._config import get_option

from pandas._libs import (
lib,
missing as libmissing,
Expand Down Expand Up @@ -45,6 +50,8 @@
if TYPE_CHECKING:
import pyarrow

from pandas.core.arrays.string_arrow import ArrowStringArray


@register_extension_dtype
class StringDtype(ExtensionDtype):
Expand Down Expand Up @@ -75,50 +82,129 @@ class StringDtype(ExtensionDtype):
StringDtype
"""

name = "string"

#: StringDtype.na_value uses pandas.NA
na_value = libmissing.NA
_metadata = ("storage",)

def __init__(self, storage=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the docstring above for this new keyword?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in d9dcd20

if storage is None:
storage = get_option("mode.string_storage")
Comment on lines +101 to +102
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to delay looking up this option and allow the StringDtype to be initialized with storage=None. That could eg allow doing s.astype("string") preserving the original dtype if the series already is of string dtype (regardless of the default).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit (wip) started to add a parameterised fixture for more testing prior to implementing this.

since the dtype change is user facing this change worthwhile, there is still more to do as many tests use dtype="string" but does potentially make this PR harder to review. I will carry on and do the others if doing this makes sense.

Have added string[pyarrow] to this fixture. This gives some additional failures for outstanding parts of ArrowStringArray. It maybe that to keep this PR more cleanly scoped, that we defer adding that for a follow-up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I don't think this comment is resolved, so "unresolved" it in the UI)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure can you open an issue explaining the api you'd like. I have updated _from_seqenece to accept the string "string", but if you are writing code if you do a comparison as a conditional, you expect the storage type to be fixed. and not then do something different. If the only places where this could be usefull is astype, we could put the logic there.

if storage not in {"python", "pyarrow"}:
raise ValueError(
f"Storage must be 'python' or 'pyarrow'. Got {storage} instead."
)
self.storage = storage

@property
def name(self):
return f"string[{self.storage}]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep name to be simply "string"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good.


@property
def type(self) -> type[str]:
return str

@classmethod
def construct_array_type(cls) -> type_t[StringArray]:
def construct_from_string(cls, string):
"""
Construct a StringDtype from a string.

Parameters
----------
string : str
The type of the name. The storage type will be taking from `string`.
Valid options and their storage types are

========================== ==============
string result storage
========================== ==============
``'string'`` global default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

global default -> python (at least currently)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated docstring

``'string[python]'`` python
``'string[pyarrow]'`` pyarrow
========================== =============

Returns
-------
StringDtype

Raise
-----
TypeError
If the string is not a valid option.

"""
if not isinstance(string, str):
raise TypeError(
f"'construct_from_string' expects a string, got {type(string)}"
)
if string == "string":
# TODO: use global default
return cls()
elif string == "string[python]":
return cls(storage="python")
elif string == "string[pyarrow]":
return cls(storage="pyarrow")
else:
raise TypeError(f"Cannot construct a '{cls.__name__}' from '{string}'")

def __eq__(self, other: Any) -> bool:
if isinstance(other, str) and other == "string":
return True
return super().__eq__(other)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is actually the behaviour here? Are "string[python]" and "string[arrow]" seen as not equal, I suppose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. Not sure what the behavior should be if we delay looking up the storage. that could complicate things so will leave for a follow-on after better understanding of what we want, and when the lookup would be final.

>>> with pd.option_context("string_storage", "python"):
...     print(StringDtype() == "string")
... 
True
>>> with pd.option_context("string_storage", "pyarrow"):
...     print(StringDtype() == "string")
... 
True
>>> with pd.option_context("string_storage", "python"):
...     print(StringDtype() == "string[python]")
... 
True
>>> with pd.option_context("string_storage", "python"):
...     print(StringDtype() == "string[pyarrow]")
... 
False
>>> with pd.option_context("string_storage", "pyarrow"):
...     print(StringDtype() == "string[python]")
... 
False
>>> with pd.option_context("string_storage", "pyarrow"):
...     print(StringDtype() == "string[pyarrow]")
... 
True
>>> StringDtype(storage="python") == "string"
True
>>> 
>>> StringDtype(storage="pyarrow") == "string"
True
>>> 
>>> StringDtype(storage="python") == StringDtype(storage="python")
True
>>> 
>>> StringDtype(storage="pyarrow") == StringDtype(storage="pyarrow")
True
>>> 
>>> StringDtype(storage="python") == StringDtype(storage="pyarrow")
False
>>> 
>>> StringDtype(storage="pyarrow") == "string[pyarrow]"
True
>>> 
>>> StringDtype(storage="pyarrow") == "string[python]"
False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is ok, as we consider dtypes to be exactly equal. later on maybe we can allow equivalent tests (possibly open an issue about this)


def __hash__(self) -> int:
# custom __eq__ so have to override __hash__
return super().__hash__()

# TODO: this is a classmethod, but we need to know the storage type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue about this is #36126. Tom had a PR #36136 with some discussion about it. Would need to look back, but I think the conclusion was indeed that StringDtype can just make this a normal method for now.

# error: Signature of "construct_array_type" incompatible with supertype
# "ExtensionDtype"
def construct_array_type( # type: ignore[override]
self,
) -> type_t[StringArray | ArrowStringArray]:
"""
Return the array type associated with this dtype.

Returns
-------
type
"""
return StringArray
from pandas.core.arrays.string_arrow import ArrowStringArray

def __repr__(self) -> str:
return "StringDtype"
if self.storage == "python":
return StringArray
else:
return ArrowStringArray

def __repr__(self):
return self.name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am think that we should add a __str__ that doesn't include the storage. This is used in eg the output of df.dtypes, and IMO for most users the fact that it's using "python" objects is an implementation detail, and should not be that visible (you can always check the actual dtype object to see what kind of storage it's using)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this would be fine, e.g. string is de-facto string[python], though this means we can essentially never change it and string[pyarrow] for non-default storage types. maybe string[object] is more instructive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, what about the global setting. the pyarrow backed string array should not be a second class citizen. At the moment if the global setting is changed , all tests pass.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string is de-facto string[python]

What "string" means depends on the global setting, which is by default (and not de-facto) "string[python]".

To be clear, I didn't want to suggest that "string" would always mean "string[python]". I just think that in the str representation (eg used in df.dtypes), we shouldn't show that detail.


def __from_arrow__(
self, array: pyarrow.Array | pyarrow.ChunkedArray
) -> StringArray:
) -> StringArray | ArrowStringArray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't object to a typing for StringArrayTypes to encompass these

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, but we are likely to implement a base class shortly for these so it would be redundant, xref #40962

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok great

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 1d59c7a

"""
Construct StringArray from pyarrow Array/ChunkedArray.
"""
import pyarrow
if self.storage == "pyarrow":
from pandas.core.arrays.string_arrow import ArrowStringArray

if isinstance(array, pyarrow.Array):
chunks = [array]
return ArrowStringArray(array)
else:
# pyarrow.ChunkedArray
chunks = array.chunks

results = []
for arr in chunks:
# using _from_sequence to ensure None is converted to NA
str_arr = StringArray._from_sequence(np.array(arr))
results.append(str_arr)
import pyarrow

if isinstance(array, pyarrow.Array):
chunks = [array]
else:
# pyarrow.ChunkedArray
chunks = array.chunks

results = []
for arr in chunks:
# using _from_sequence to ensure None is converted to NA
str_arr = StringArray._from_sequence(np.array(arr))
results.append(str_arr)

return StringArray._concat_same_type(results)
return StringArray._concat_same_type(results)


class StringArray(PandasArray):
Expand Down
Loading