Commit bf59a95

Backport PR #54856 on branch 2.1.x (BUG: ArrowExtensionArray.factorize with chunked dictionary array) (#54863)
Backport PR #54856: BUG: ArrowExtensionArray.factorize with chunked dictionary array

Co-authored-by: Matthew Roeschke <[email protected]>
1 parent 7f469dc commit bf59a95

3 files changed, +18 -2 lines changed

doc/source/whatsnew/v2.1.0.rst

+1
@@ -847,6 +847,7 @@ ExtensionArray
 - Bug in :meth:`Series.rank` returning wrong order for small values with ``Float64`` dtype (:issue:`52471`)
 - Bug in :meth:`Series.unique` for boolean ``ArrowDtype`` with ``NA`` values (:issue:`54667`)
 - Bug in :meth:`~arrays.ArrowExtensionArray.__iter__` and :meth:`~arrays.ArrowExtensionArray.__getitem__` returning python datetime and timedelta objects for non-nano dtypes (:issue:`53326`)
+- Bug in :meth:`~arrays.ArrowExtensionArray.factorize` returning incorrect uniques for a ``pyarrow.dictionary`` type ``pyarrow.chunked_array`` with more than one chunk (:issue:`54844`)
 - Bug when passing an :class:`ExtensionArray` subclass to ``dtype`` keywords. This will now raise a ``UserWarning`` to encourage passing an instance instead (:issue:`31356`, :issue:`54592`)
 - Bug where the :class:`DataFrame` repr would not work when a column had an :class:`ArrowDtype` with a ``pyarrow.ExtensionDtype`` (:issue:`54063`)
 - Bug where the ``__from_arrow__`` method of masked ExtensionDtypes (e.g. :class:`Float64Dtype`, :class:`BooleanDtype`) would not accept PyArrow arrays of type ``pyarrow.null()`` (:issue:`52223`)

pandas/core/arrays/arrow/array.py

+4 -2
@@ -1043,13 +1043,15 @@ def factorize(
             indices = np.array([], dtype=np.intp)
             uniques = type(self)(pa.chunked_array([], type=encoded.type.value_type))
         else:
-            pa_indices = encoded.combine_chunks().indices
+            # GH 54844
+            combined = encoded.combine_chunks()
+            pa_indices = combined.indices
             if pa_indices.null_count > 0:
                 pa_indices = pc.fill_null(pa_indices, -1)
             indices = pa_indices.to_numpy(zero_copy_only=False, writable=True).astype(
                 np.intp, copy=False
             )
-            uniques = type(self)(encoded.chunk(0).dictionary)
+            uniques = type(self)(combined.dictionary)
 
         if pa_version_under11p0 and pa.types.is_duration(pa_type):
             uniques = cast(ArrowExtensionArray, uniques.astype(self.dtype))
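
The hunk above stops pairing the combined indices with the dictionary of the first chunk only. As a rough illustration of why that pairing breaks, here is a small pure-Python model of dictionary-encoded chunks. It is only a sketch: `combine_chunks` and `decode` are hypothetical helpers written for this example, not pyarrow or pandas APIs, and the model assumes that combining chunks unifies the per-chunk dictionaries and remaps the indices, which is the behaviour the new test in this commit relies on.

```python
# Pure-Python model (an assumption for illustration, not pyarrow's implementation).
# Each chunk is a (dictionary, indices) pair; indices point into that chunk's
# own dictionary, and None stands for a null slot.


def combine_chunks(chunks):
    """Unify the per-chunk dictionaries and remap every index into the result."""
    unified = []    # unified dictionary, values in first-seen order
    position = {}   # value -> its index in `unified`
    combined_indices = []
    for dictionary, indices in chunks:
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
        for i in indices:
            combined_indices.append(None if i is None else position[dictionary[i]])
    return unified, combined_indices


def decode(dictionary, indices):
    """Map indices back to values using the given dictionary."""
    return [None if i is None else dictionary[i] for i in indices]


# Two chunks, mirroring the test data: ["a"] and ["b"], each encoded on its own.
chunks = [(["a"], [0]), (["b"], [0])]
unified, indices = combine_chunks(chunks)

# Pairing used after the fix: combined indices with the combined dictionary.
fixed = decode(unified, indices)

# Pairing used before the fix: combined indices with the first chunk's dictionary.
first_chunk_dictionary = chunks[0][0]
try:
    decode(first_chunk_dictionary, indices)
    old_pairing_ok = True
except IndexError:
    # Index 1 refers to "b", which the first chunk's dictionary does not contain.
    old_pairing_ok = False
```

In this model the combined indices refer to positions in the unified dictionary, so only the unified dictionary can serve as the uniques.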

pandas/tests/extension/test_arrow.py

+13
@@ -3059,3 +3059,16 @@ def test_duration_fillna_numpy(pa_type):
     result = ser1.fillna(ser2)
     expected = pd.Series([1, 2], dtype=ArrowDtype(pa_type))
     tm.assert_series_equal(result, expected)
+
+
+def test_factorize_chunked_dictionary():
+    # GH 54844
+    pa_array = pa.chunked_array(
+        [pa.array(["a"]).dictionary_encode(), pa.array(["b"]).dictionary_encode()]
+    )
+    ser = pd.Series(ArrowExtensionArray(pa_array))
+    res_indices, res_uniques = ser.factorize()
+    exp_indicies = np.array([0, 1], dtype=np.intp)
+    exp_uniques = pd.Index(ArrowExtensionArray(pa_array.combine_chunks()))
+    tm.assert_numpy_array_equal(res_indices, exp_indicies)
+    tm.assert_index_equal(res_uniques, exp_uniques)
