
Commit 8b6b5b4

Backport PR #48620 on branch 1.5.x (REGR: Performance decrease in factorize) (#48710)
Backport PR #48620: REGR: Performance decrease in factorize

Co-authored-by: Richard Shadrach <[email protected]>
1 parent 05ce6dc commit 8b6b5b4

2 files changed, +13 -11 lines changed

doc/source/whatsnew/v1.5.1.rst

+1
@@ -15,6 +15,7 @@ including other versions of pandas.
 Fixed regressions
 ~~~~~~~~~~~~~~~~~
 - Regression in :func:`.read_csv` causing an ``EmptyDataError`` when using an UTF-8 file handle that was already read from (:issue:`48646`)
+- Fixed performance regression in :func:`factorize` when ``na_sentinel`` is not ``None`` and ``sort=False`` (:issue:`48620`)
 -
 
 .. ---------------------------------------------------------------------------

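The release note above concerns calls of roughly the following shape. This is a minimal sketch, assuming pandas 1.5 or later, where `use_na_sentinel` is the keyword that replaces the deprecated `na_sentinel`; the sample array is made up for illustration.

```python
import numpy as np
import pandas as pd

# Object-dtype array mixing two different null values (None and NaN).
values = np.array(["a", None, "b", np.nan, "a"], dtype=object)

# Default behaviour: nulls are left out of `uniques` and coded with the
# sentinel -1. No null normalization is needed on this path.
codes, uniques = pd.factorize(values, sort=False)

# use_na_sentinel=False keeps nulls in the result and gives them a
# non-negative code. Per the comments in this commit, this is the case in
# which the different object-dtype nulls are mapped to a single null first.
codes_keep, uniques_keep = pd.factorize(values, sort=False, use_na_sentinel=False)

print(codes, uniques)
print(codes_keep, uniques_keep)
```

Comparing the two calls on a large array is one way to check locally whether the default path still pays for an extra null scan.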
pandas/core/algorithms.py

+12 -11
@@ -566,17 +566,6 @@ def factorize_array(
 
     hash_klass, values = _get_hashtable_algo(values)
 
-    # factorize can now handle differentiating various types of null values.
-    # However, for backwards compatibility we only use the null for the
-    # provided dtype. This may be revisited in the future, see GH#48476.
-    null_mask = isna(values)
-    if null_mask.any():
-        na_value = na_value_for_dtype(values.dtype, compat=False)
-        # Don't modify (potentially user-provided) array
-        # error: No overload variant of "where" matches argument types "Any", "object",
-        # "ndarray[Any, Any]"
-        values = np.where(null_mask, na_value, values)  # type: ignore[call-overload]
-
     table = hash_klass(size_hint or len(values))
     uniques, codes = table.factorize(
         values,
@@ -810,6 +799,18 @@ def factorize(
             na_sentinel_arg = None
         else:
             na_sentinel_arg = na_sentinel
+
+        if not dropna and not sort and is_object_dtype(values):
+            # factorize can now handle differentiating various types of null values.
+            # These can only occur when the array has object dtype.
+            # However, for backwards compatibility we only use the null for the
+            # provided dtype. This may be revisited in the future, see GH#48476.
+            null_mask = isna(values)
+            if null_mask.any():
+                na_value = na_value_for_dtype(values.dtype, compat=False)
+                # Don't modify (potentially user-provided) array
+                values = np.where(null_mask, na_value, values)
+
         codes, uniques = factorize_array(
             values,
             na_sentinel=na_sentinel_arg,
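
The idea behind moving the null handling can be sketched in plain Python: scan for nulls only when they will be kept and the data can hold more than one kind of null. This is an illustrative sketch, not the pandas implementation. `is_null` and `factorize_sketch` are hypothetical names, sorting is omitted, and `None` stands in for the dtype's null value.

```python
import math


def is_null(value):
    # Simplified null check for this sketch: None or a float NaN.
    return value is None or (isinstance(value, float) and math.isnan(value))


def factorize_sketch(values, dropna=True, is_object=True):
    """Return (codes, uniques) in order of first appearance."""
    if not dropna and is_object:
        # Only this combination can surface several distinct nulls in the
        # uniques, so only here is the extra scan over the data worthwhile.
        if any(is_null(v) for v in values):
            # Build a new list rather than modifying the caller's data.
            values = [None if is_null(v) else v for v in values]

    table = {}  # maps each value to its integer code
    codes = []
    for v in values:
        if dropna and is_null(v):
            codes.append(-1)  # sentinel for dropped nulls
        else:
            codes.append(table.setdefault(v, len(table)))
    return codes, list(table)


data = ["a", None, "b", float("nan"), "a"]
codes_drop, uniques_drop = factorize_sketch(data)
codes_keep, uniques_keep = factorize_sketch(data, dropna=False)
print(codes_drop, uniques_drop)  # [0, -1, 1, -1, 0] ['a', 'b']
print(codes_keep, uniques_keep)  # [0, 1, 2, 1, 0] ['a', None, 'b']
```

With `dropna=True` the guard is skipped entirely, which mirrors how the patched `factorize` avoids the `isna` pass on the path named in the release note.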
