-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[ArrowStringArray] API: StringDtype parameterized by storage (python or pyarrow) #39908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 25 commits
4cb60e6
d242f2d
d39ab2c
2367810
9166d3b
8760705
d5b3fec
2c657df
647a6c2
0596fd7
c5a19c5
99680c9
69a6cc1
bd147ba
830275f
214e524
c9ba03c
7425536
68ac391
5cfa97a
74dbf96
3985943
3bda421
0c108a4
523e24c
279624c
80d231e
c5ced5a
459812c
d707b6b
71ccf24
daaac06
46626d1
3677bfa
42d382f
4fb1a0d
5d4eac1
15efb2e
b53cfe0
b7db53f
3399f08
e365f01
71d1e6c
9e23c35
c69a611
64b3206
d83a4ff
ef38660
aef1162
6247a5b
a6d066c
8adb08d
3ad0638
56714c9
6a1cc2b
1761a84
3e26baa
6b470b1
2ec6de0
a0b7a70
d9dcd20
4a37470
1d59c7a
e57c850
51f1b1d
fc95c06
ef02a43
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,14 @@ | ||
from __future__ import annotations | ||
|
||
from typing import TYPE_CHECKING | ||
from typing import ( | ||
TYPE_CHECKING, | ||
Any, | ||
) | ||
|
||
import numpy as np | ||
|
||
from pandas._config import get_option | ||
|
||
from pandas._libs import ( | ||
lib, | ||
missing as libmissing, | ||
|
@@ -45,6 +50,8 @@ | |
if TYPE_CHECKING: | ||
import pyarrow | ||
|
||
from pandas.core.arrays.string_arrow import ArrowStringArray | ||
|
||
|
||
@register_extension_dtype | ||
class StringDtype(ExtensionDtype): | ||
|
@@ -75,50 +82,129 @@ class StringDtype(ExtensionDtype): | |
StringDtype | ||
""" | ||
|
||
name = "string" | ||
|
||
#: StringDtype.na_value uses pandas.NA | ||
na_value = libmissing.NA | ||
_metadata = ("storage",) | ||
|
||
def __init__(self, storage=None): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you update the docstring above for this new keyword? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done in d9dcd20 |
||
if storage is None: | ||
storage = get_option("mode.string_storage") | ||
Comment on lines
+101
to
+102
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We might want to delay looking up this option and allow the StringDtype to be initialized with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The latest commit (wip) started to add a parameterised fixture for more testing prior to implementing this. since the dtype change is user facing this change worthwhile, there is still more to do as many tests use Have added There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (I don't think this comment is resolved, so "unresolved" it in the UI) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sure can you open an issue explaining the api you'd like. I have updated _from_seqenece to accept the string "string", but if you are writing code if you do a comparison as a conditional, you expect the storage type to be fixed. and not then do something different. If the only places where this could be usefull is astype, we could put the logic there. |
||
if storage not in {"python", "pyarrow"}: | ||
raise ValueError( | ||
f"Storage must be 'python' or 'pyarrow'. Got {storage} instead." | ||
) | ||
self.storage = storage | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
@property | ||
def name(self): | ||
return f"string[{self.storage}]" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we keep There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sounds good. |
||
|
||
@property | ||
def type(self) -> type[str]: | ||
return str | ||
|
||
@classmethod | ||
def construct_array_type(cls) -> type_t[StringArray]: | ||
def construct_from_string(cls, string): | ||
""" | ||
Construct a StringDtype from a string. | ||
|
||
Parameters | ||
---------- | ||
string : str | ||
The type of the name. The storage type will be taking from `string`. | ||
Valid options and their storage types are | ||
|
||
========================== ============== | ||
string result storage | ||
========================== ============== | ||
``'string'`` global default | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. global default -> python (at least currently) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. updated docstring |
||
``'string[python]'`` python | ||
``'string[pyarrow]'`` pyarrow | ||
========================== ============= | ||
|
||
Returns | ||
------- | ||
StringDtype | ||
|
||
Raise | ||
----- | ||
TypeError | ||
If the string is not a valid option. | ||
|
||
""" | ||
if not isinstance(string, str): | ||
raise TypeError( | ||
f"'construct_from_string' expects a string, got {type(string)}" | ||
) | ||
if string == "string": | ||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||
# TODO: use global default | ||
return cls() | ||
elif string == "string[python]": | ||
return cls(storage="python") | ||
elif string == "string[pyarrow]": | ||
return cls(storage="pyarrow") | ||
else: | ||
raise TypeError(f"Cannot construct a '{cls.__name__}' from '{string}'") | ||
|
||
def __eq__(self, other: Any) -> bool: | ||
if isinstance(other, str) and other == "string": | ||
return True | ||
return super().__eq__(other) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is actually the behaviour here? Are "string[python]" and "string[arrow]" seen as not equal, I suppose? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's correct. Not sure what the behavior should be if we delay looking up the storage. that could complicate things so will leave for a follow-on after better understanding of what we want, and when the lookup would be final.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think this is ok, as we consider dtypes to be exactly equal. later on maybe we can allow equivalent tests (possibly open an issue about this) |
||
|
||
def __hash__(self) -> int: | ||
# custom __eq__ so have to override __hash__ | ||
return super().__hash__() | ||
|
||
# TODO: this is a classmethod, but we need to know the storage type. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
# error: Signature of "construct_array_type" incompatible with supertype | ||
# "ExtensionDtype" | ||
def construct_array_type( # type: ignore[override] | ||
self, | ||
) -> type_t[StringArray | ArrowStringArray]: | ||
""" | ||
Return the array type associated with this dtype. | ||
|
||
Returns | ||
------- | ||
type | ||
""" | ||
return StringArray | ||
from pandas.core.arrays.string_arrow import ArrowStringArray | ||
|
||
def __repr__(self) -> str: | ||
return "StringDtype" | ||
if self.storage == "python": | ||
return StringArray | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
else: | ||
return ArrowStringArray | ||
|
||
def __repr__(self): | ||
return self.name | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am think that we should add a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think this would be fine, e.g. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hmm, what about the global setting. the pyarrow backed string array should not be a second class citizen. At the moment if the global setting is changed , all tests pass. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
What To be clear, I didn't want to suggest that "string" would always mean "string[python]". I just think that in the str representation (eg used in |
||
|
||
def __from_arrow__( | ||
self, array: pyarrow.Array | pyarrow.ChunkedArray | ||
) -> StringArray: | ||
) -> StringArray | ArrowStringArray: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. wouldn't object to a typing for StringArrayTypes to encompass these There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sure, but we are likely to implement a base class shortly for these so it would be redundant, xref #40962 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok great There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done in 1d59c7a |
||
""" | ||
Construct StringArray from pyarrow Array/ChunkedArray. | ||
""" | ||
import pyarrow | ||
if self.storage == "pyarrow": | ||
from pandas.core.arrays.string_arrow import ArrowStringArray | ||
|
||
if isinstance(array, pyarrow.Array): | ||
chunks = [array] | ||
return ArrowStringArray(array) | ||
else: | ||
# pyarrow.ChunkedArray | ||
chunks = array.chunks | ||
|
||
results = [] | ||
for arr in chunks: | ||
# using _from_sequence to ensure None is converted to NA | ||
str_arr = StringArray._from_sequence(np.array(arr)) | ||
results.append(str_arr) | ||
import pyarrow | ||
|
||
if isinstance(array, pyarrow.Array): | ||
chunks = [array] | ||
else: | ||
# pyarrow.ChunkedArray | ||
chunks = array.chunks | ||
|
||
results = [] | ||
for arr in chunks: | ||
# using _from_sequence to ensure None is converted to NA | ||
str_arr = StringArray._from_sequence(np.array(arr)) | ||
results.append(str_arr) | ||
|
||
return StringArray._concat_same_type(results) | ||
return StringArray._concat_same_type(results) | ||
|
||
|
||
class StringArray(PandasArray): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it needed to have both "string" and "string[python]" ? That would duplicate a lot of tests?
(and same for the ones below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably not, most of these tests are pretty fast. when running locally 7/8 cores finish in a few minutes and then you are waiting for the numba tests for another 5 minutes or so running on a single thread.
will remove anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in a0b7a70