ENH: Arrow backed string array - implement factorize() method without casting to objects #38007
Changes from 12 commits
c53a3c2
b7d0ab8
154496a
c545970
6e3aac8
73c7de9
42ca9c3
a251537
dbc8253
ea59c38
7d98727
0023f08
6a28414
c4db20d
88ab4f4
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,6 +6,7 @@ | |
Any, | ||
Optional, | ||
Sequence, | ||
Tuple, | ||
Type, | ||
Union, | ||
) | ||
|
@@ -20,6 +21,7 @@ | |
Dtype, | ||
NpDtype, | ||
) | ||
from pandas.util._decorators import doc | ||
from pandas.util._validators import validate_fillna_kwargs | ||
|
||
from pandas.core.dtypes.base import ExtensionDtype | ||
|
@@ -273,9 +275,22 @@ def __len__(self) -> int: | |
""" | ||
return len(self._data) | ||
|
||
@classmethod | ||
def _from_factorized(cls, values, original): | ||
return cls._from_sequence(values) | ||
@doc(ExtensionArray.factorize) | ||
def factorize(self, na_sentinel: int = -1) -> Tuple[np.ndarray, ExtensionArray]: | ||
encoded = self._data.dictionary_encode() | ||
indices = pa.chunked_array( | ||
[c.indices for c in encoded.chunks], type=encoded.type.index_type | ||
).to_pandas() | ||
if indices.dtype.kind == "f": | ||
indices[np.isnan(indices)] = na_sentinel | ||
indices = indices.astype(np.int64, copy=False) | ||
Wondering, is the I suppose that we always return
Refactored in 0023f08, partially to address comments. But yes, we seem to be getting an int32 from pyarrow. Also, could we maybe work with NumPy arrays directly for the indices here instead of a pandas Series?
||
|
||
if encoded.num_chunks: | ||
uniques = type(self)(encoded.chunk(0).dictionary) | ||
else: | ||
uniques = type(self)(pa.array([], type=encoded.type.value_type)) | ||
|
||
return indices.values, uniques | ||
|
||
@classmethod | ||
def _concat_same_type(cls, to_concat) -> ArrowStringArray: | ||
|
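To make the contract of the new `factorize()` easier to follow, here is a minimal pure-Python sketch of what the method is expected to return: integer codes with `na_sentinel` in place of missing values, plus the unique values in order of first appearance. It is not the PR's implementation, which relies on pyarrow's `dictionary_encode()`. The helper name `factorize_chunks` and the use of plain lists with `None` for nulls are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def factorize_chunks(
    chunks: Sequence[Sequence[Optional[str]]], na_sentinel: int = -1
) -> Tuple[List[int], List[str]]:
    """Hypothetical helper: return (codes, uniques) for chunked string data.

    None marks a missing value and is encoded as na_sentinel.
    """
    positions: Dict[str, int] = {}  # value -> index into uniques, shared by all chunks
    uniques: List[str] = []
    codes: List[int] = []
    for chunk in chunks:
        for value in chunk:
            if value is None:
                # Missing values get the sentinel and never enter uniques.
                codes.append(na_sentinel)
                continue
            if value not in positions:
                positions[value] = len(uniques)
                uniques.append(value)
            codes.append(positions[value])
    return codes, uniques


codes, uniques = factorize_chunks([["a", "b", None], ["b", "c"]])
print(codes)    # [0, 1, -1, 1, 2]
print(uniques)  # ['a', 'b', 'c']
```

With no chunks at all, the sketch returns two empty lists, which corresponds to the `else` branch in the diff that builds an empty `uniques` array.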
Can you do this in a try/except? (We still need to be able to run the benchmarks with slightly older pandas versions that might not have this import available.)
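A guarded import along the lines requested might look like the sketch below. The thread does not show which import is meant, so the module path `pandas.core.arrays.string_arrow`, the flag `HAVE_ARROW_STRING_ARRAY`, and the benchmark class are all assumptions for illustration.

```python
# Hypothetical guard for a benchmark module. The import path is an assumption;
# the review thread does not show the actual import being discussed.
try:
    from pandas.core.arrays.string_arrow import ArrowStringArray

    HAVE_ARROW_STRING_ARRAY = True
except ImportError:
    # Older pandas (or no pandas at all): keep the module importable.
    ArrowStringArray = None
    HAVE_ARROW_STRING_ARRAY = False


class FactorizeArrowString:
    """Hypothetical asv-style benchmark that skips itself when unsupported."""

    def setup(self):
        if not HAVE_ARROW_STRING_ARRAY:
            # asv treats NotImplementedError raised in setup() as "skip".
            raise NotImplementedError("ArrowStringArray is not available")
        self.values = ["a", "b", None, "a"] * 1000
```

The point of the pattern is that the benchmark file still imports cleanly on older pandas versions, and only the benchmarks that need the newer class are skipped.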