-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Arrow backed string array - implement factorize() method without casting to objects #38007
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Arrow backed string array - implement factorize() method without casting to objects #38007
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably should paramatrize data for the base extension tests with an array with one chunk and a array with multiple chunks.
That sounds as a good idea, yes (not sure how easy it is to do, since there are multiple data fixtures)
pandas/core/arrays/string_arrow.py
Outdated
return cls._from_sequence(values) | ||
@doc(ExtensionArray.factorize) | ||
def factorize(self, na_sentinel: int = -1) -> Tuple[np.ndarray, ExtensionArray]: | ||
if self._data.num_chunks == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nowadays, dictionary_encode
works fine for ChunkedArrays as well, so I am not sure this if
statement is actually needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tooks a stab at that in fletcher
to let CI verify this assumption: Seems to work with pyarrow
0.17-2.0 xhochy/fletcher#206
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
this should yield a perf improvement yes? can you add an asv for this. |
There is an existing benchmark in algorithms.py::Factorize.time_factorize() which has a "string" case -> can rename this to "str", and add an actual "string" dtype version. |
pandas/core/arrays/string_arrow.py
Outdated
if indices.dtype.kind == "f": | ||
indices[np.isnan(indices)] = na_sentinel | ||
indices = indices.astype(int) | ||
if not is_int64_dtype(indices): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just do
indices = indices.astype(np.int64, copy=False)
pandas/core/arrays/string_arrow.py
Outdated
indices = encoded.indices.to_pandas() | ||
if indices.dtype.kind == "f": | ||
indices[np.isnan(indices)] = na_sentinel | ||
indices = indices.astype(int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int -> np.int64
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
status here? |
benchmarks added but pyarrow not added to env. this is OK for local testing with extensions tests updated for arrays with more than 1 chunk. good news is that we now see the factorize failures these will be fixed by incorporating latest changes from fletcher next and other comments addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
asv_bench/benchmarks/algorithms.py
Outdated
@@ -5,6 +5,7 @@ | |||
from pandas._libs import lib | |||
|
|||
import pandas as pd | |||
from pandas.core.arrays.string_arrow import ArrowStringDtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do this in a try/except? (we need to be able to still run the benchmarks with slightly older pandas version that might not have this import available)
).to_pandas() | ||
if indices.dtype.kind == "f": | ||
indices[np.isnan(indices)] = na_sentinel | ||
indices = indices.astype(np.int64, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering, is the int64
needed here? (pyarrow will typically use int32 as default I think)
I suppose that we always return int64
from factorize
for the indices. Short-term, casting to int64 might be best then (to ensure nothing else breaks because of not doing that), but long term we should maybe check if internally we require int64 or would be fine with int32 as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering, is the
int64
needed here? (pyarrow will typically use int32 as default I think)
refactor in 0023f08 partially to address comments
but yes, we seem to be getting an int32 from pyarrow
also we could maybe work with numpy arrays here directly for the indices instead of pandas Series?
@simonjayhawkins did you check what difference it gives in performance for the benchmark case compared to object dtype? (just that single case, no need to run the full suite) |
with changes in this PR to compare with String(Index)
without
|
@jorisvandenbossche @jreback anything else to be done here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the ping. Can you merge latest master to be sure?
greenish. |
Thanks! |
xref #35169 (comment), follow-on to #35259
This is moreless copy/paste from https://github.com/xhochy/fletcher with a slight tidy for initial review feedback
still to do
we don't have failing tests, but the return type should be
Tuple[np.ndarray, ExtensionArray]
while we havereturn factorize(np_array, na_sentinel=na_sentinel)
if more than 1 chunk and the return type of pd.factorize isTuple[np.ndarray, Union[np.ndarray, ABCIndex]]
(mypy doesn't report this as an error since np.ndarray resolves toAny
)since we don't have failing tests, we probably should paramatrize data for the base extension tests with an array with one chunk and a array with multiple chunks.
cc @jorisvandenbossche @xhochy