PERF/ENH: add fast astyping for Categorical #37355

Merged
merged 50 commits into from
Nov 18, 2020
Changes from 46 commits
5d82b02
PERF/ENH: add fast astyping for categorical input
arw2019 Oct 22, 2020
c18ae4e
replace is_categorical -> is_categorical_dtype
arw2019 Oct 23, 2020
d7c0575
ASV: add astyping benchmark
arw2019 Oct 23, 2020
856995f
DOC: whatsnew
arw2019 Oct 23, 2020
57817a4
feedback: move change core/generic -> internals
arw2019 Oct 23, 2020
1050d9e
rewrite categorical check in Block.astype
arw2019 Oct 23, 2020
c8c05cc
rewrite the fix
arw2019 Oct 23, 2020
b8141c4
rewrite the fix
arw2019 Oct 23, 2020
3d3bcf1
fix handling of strings
arw2019 Oct 23, 2020
3714d09
improve readability
arw2019 Oct 23, 2020
f8f501f
add more special casing...
arw2019 Oct 23, 2020
f4b5952
Merge remote-tracking branch 'upstream/master' into GH8628
arw2019 Oct 26, 2020
2ec7ded
feedback: move changes to Categorical.astype
arw2019 Oct 26, 2020
cd110bc
feedback: add comment in Categorical.astype
arw2019 Oct 29, 2020
113a569
CLN: remove unnecessary line
arw2019 Oct 31, 2020
341ceb6
TST: error msg changed
arw2019 Nov 1, 2020
c5f3fd4
ASV: feedback
arw2019 Nov 1, 2020
9943bb9
ASV: more
arw2019 Nov 1, 2020
c720536
DOC: mention Series.astype
arw2019 Nov 1, 2020
f96a20d
Merge remote-tracking branch 'upstream/master' into GH8628
arw2019 Nov 4, 2020
37e3264
Merge remote-tracking branch 'upstream/master' into GH8628
arw2019 Nov 7, 2020
190c015
BUG: fix empty input cases
arw2019 Nov 8, 2020
f9a3040
BUG: fix array vs. np.array usage
arw2019 Nov 8, 2020
568aa7f
BUG: astype to same dtype; use na_value_for_dtype
arw2019 Nov 8, 2020
6860e48
TST/BUG: fix error message
arw2019 Nov 8, 2020
f2aa2ef
CLN: simplify elif
arw2019 Nov 8, 2020
d226d84
Merge branch 'master' of https://github.com/pandas-dev/pandas into GH…
arw2019 Nov 9, 2020
9a9e24a
Merge branch 'master' of https://github.com/pandas-dev/pandas into GH…
arw2019 Nov 9, 2020
a323544
TST: revert error message
arw2019 Nov 9, 2020
07b2a65
REF/ERR: use take_1d + catch casting error
arw2019 Nov 9, 2020
229bfc7
ERR: fix casting error message
arw2019 Nov 9, 2020
da12be0
ERR/TST: revert message change in method & test
arw2019 Nov 9, 2020
e5ede6d
Merge branch 'master' of https://github.com/pandas-dev/pandas into GH…
arw2019 Nov 9, 2020
93f3e1a
REF (feedback): astype categories then extract numpy array
arw2019 Nov 10, 2020
9cb5fe3
CLN (feedback): remove np.array from take_1d
arw2019 Nov 10, 2020
f55964e
CI (feedback): use ensure_platform_int in cast
arw2019 Nov 10, 2020
b342135
COMMENT: add TODO re consolidating EA/ndarray cases
arw2019 Nov 10, 2020
73e0442
CLN: shorten variable name
arw2019 Nov 10, 2020
19e22e2
Merge branch 'master' of https://github.com/pandas-dev/pandas into GH…
arw2019 Nov 11, 2020
d195d91
TST/CI (32bit): fix up int conversions
arw2019 Nov 11, 2020
3351cb1
fix merge error
arw2019 Nov 11, 2020
38696d9
merge with upstream/master
arw2019 Nov 13, 2020
071deec
merge with master
arw2019 Nov 17, 2020
13fa086
TST: use np.intp in dtype tests
arw2019 Nov 17, 2020
dda6804
Merge branch 'master' of https://github.com/pandas-dev/pandas into GH…
arw2019 Nov 18, 2020
1016894
CI: fix 32bit again
arw2019 Nov 18, 2020
a9544b3
DOC: add note re: CategoricalIndex TypeError catch
arw2019 Nov 18, 2020
527b15a
merge with upstream/master
arw2019 Nov 18, 2020
9c29946
Merge branch 'GH8628' of https://github.com/arw2019/pandas into GH8628
arw2019 Nov 18, 2020
7e9fc32
CI: fix merge error
arw2019 Nov 18, 2020
43 changes: 43 additions & 0 deletions asv_bench/benchmarks/categoricals.py
@@ -1,3 +1,5 @@
+import string
+import sys
 import warnings

 import numpy as np
@@ -67,6 +69,47 @@ def time_existing_series(self):
         pd.Categorical(self.series)


+class AsType:
+    def setup(self):
+        N = 10 ** 5
+
+        random_pick = np.random.default_rng().choice
+
+        categories = {
+            "str": list(string.ascii_letters),
+            "int": np.random.randint(2 ** 16, size=154),
+            "float": sys.maxsize * np.random.random((38,)),
+            "timestamp": [
+                pd.Timestamp(x, unit="s") for x in np.random.randint(2 ** 18, size=578)
+            ],
+        }
+
+        self.df = pd.DataFrame(
+            {col: random_pick(cats, N) for col, cats in categories.items()}
+        )
+
+        for col in ("int", "float", "timestamp"):
+            self.df[col + "_as_str"] = self.df[col].astype(str)
+
+        for col in self.df.columns:
+            self.df[col] = self.df[col].astype("category")
+
+    def astype_str(self):
+        [self.df[col].astype("str") for col in "int float timestamp".split()]
+
+    def astype_int(self):
+        [self.df[col].astype("int") for col in "int_as_str timestamp".split()]
+
+    def astype_float(self):
+        [
+            self.df[col].astype("float")
+            for col in "float_as_str int int_as_str timestamp".split()
+        ]
+
+    def astype_datetime(self):
+        self.df["float"].astype(pd.DatetimeTZDtype(tz="US/Pacific"))
+
+
 class Concat:
     def setup(self):
         N = 10 ** 5
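
The benchmark above builds columns with few distinct categories (at most 52, 154, 38 and 578) relative to N = 10 ** 5 rows, which is the situation the fast path targets. As a rough illustration only, the following pandas-free sketch counts how many conversions each strategy performs. The names `counting_int`, `slow` and `fast` are hypothetical and do not come from the PR, and plain Python lists stand in for the categories and codes.

```python
import random

N = 10 ** 5
rng = random.Random(0)

# 154 distinct string-encoded integers, loosely mirroring the "int_as_str" column.
categories = [str(i) for i in range(154)]
# One integer code per row, pointing into `categories`.
codes = [rng.randrange(len(categories)) for _ in range(N)]

calls = {"n": 0}


def counting_int(text):
    """Convert text to int while counting how often a conversion happens."""
    calls["n"] += 1
    return int(text)


# Element-wise: materialise every value, then convert each one.
calls["n"] = 0
slow = [counting_int(categories[c]) for c in codes]
slow_calls = calls["n"]

# Category-wise: convert each category once, then look values up by code.
calls["n"] = 0
converted = [counting_int(cat) for cat in categories]
fast = [converted[c] for c in codes]
fast_calls = calls["n"]

print(slow_calls, fast_calls)  # 100000 154
```

Both strategies produce the same values; only the number of conversions differs, and that difference is what the timing benchmark is meant to surface.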
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -498,6 +498,7 @@ Performance improvements
 - Reduced peak memory usage in :meth:`DataFrame.to_pickle` when using ``protocol=5`` in python 3.8+ (:issue:`34244`)
 - faster ``dir`` calls when many index labels, e.g. ``dir(ser)`` (:issue:`37450`)
 - Performance improvement in :class:`ExpandingGroupby` (:issue:`37064`)
+- Performance improvement in :meth:`Series.astype` and :meth:`DataFrame.astype` for :class:`Categorical` (:issue:`8628`)
 - Performance improvement in :meth:`pd.DataFrame.groupby` for ``float`` ``dtype`` (:issue:`28303`), changes of the underlying hash-function can lead to changes in float based indexes sort ordering for ties (e.g. :meth:`pd.Index.value_counts`)
 - Performance improvement in :meth:`pd.isin` for inputs with more than 1e6 elements

38 changes: 29 additions & 9 deletions pandas/core/arrays/categorical.py
@@ -402,20 +402,40 @@ def astype(self, dtype: Dtype, copy: bool = True) -> ArrayLike:
             If copy is set to False and dtype is categorical, the original
             object is returned.
         """
-        if is_categorical_dtype(dtype):
+        if self.dtype is dtype:
+            result = self.copy() if copy else self
+
+        elif is_categorical_dtype(dtype):
             dtype = cast(Union[str, CategoricalDtype], dtype)

             # GH 10696/18593/18630
             dtype = self.dtype.update_dtype(dtype)
-            result = self.copy() if copy else self
-            if dtype == self.dtype:
-                return result
-            return result._set_dtype(dtype)
-        if is_extension_array_dtype(dtype):
-            return array(self, dtype=dtype, copy=copy)
-        if is_integer_dtype(dtype) and self.isna().any():
+            self = self.copy() if copy else self
+            result = self._set_dtype(dtype)
+
+        # TODO: consolidate with ndarray case?
+        elif is_extension_array_dtype(dtype):
+            result = array(self, dtype=dtype, copy=copy)
+
+        elif is_integer_dtype(dtype) and self.isna().any():
             raise ValueError("Cannot convert float NaN to integer")
-        return np.array(self, dtype=dtype, copy=copy)
+
+        elif len(self.codes) == 0 or len(self.categories) == 0:
+            result = np.array(self, dtype=dtype, copy=copy)
+
+        else:
+            # GH8628 (PERF): astype category codes instead of astyping array
+            try:
+                astyped_cats = self.categories.astype(dtype=dtype, copy=copy)
+            except (TypeError, ValueError):
+                raise ValueError(
Member

why change TypeError?

Member Author

It's to fix the error message for CategoricalIndex. If we don't catch TypeError, we end up with "TypeError: Cannot cast Index to dtype float64" (see the traceback below) instead of something like "TypeError: Cannot cast object to dtype float64".

In [2]: idx = pd.CategoricalIndex(["a", "b", "c", "a", "b", "c"])

In [3]: idx.astype('float')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/workspaces/pandas-arw2019/pandas/core/indexes/base.py in astype(self, dtype, copy)
    700         try:
--> 701             casted = self._values.astype(dtype, copy=copy)
    702         except (TypeError, ValueError) as err:

ValueError: could not convert string to float: 'a'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-4-38d56ec15c36> in <module>
----> 1 idx.astype('float')

/workspaces/pandas-arw2019/pandas/core/indexes/category.py in astype(self, dtype, copy)
    369     @doc(Index.astype)
    370     def astype(self, dtype, copy=True):
--> 371         res_data = self._data.astype(dtype, copy=copy)
    372         return Index(res_data, name=self.name)
    373 

/workspaces/pandas-arw2019/pandas/core/arrays/categorical.py in astype(self, dtype, copy)
    427             # GH8628 (PERF): astype category codes instead of astyping array
    428             try:
--> 429                 astyped_cats = self.categories.astype(dtype=dtype, copy=copy)
    430             except (ValueError):
    431                 raise ValueError(

/workspaces/pandas-arw2019/pandas/core/indexes/base.py in astype(self, dtype, copy)
    701             casted = self._values.astype(dtype, copy=copy)
    702         except (TypeError, ValueError) as err:
--> 703             raise TypeError(
    704                 f"Cannot cast {type(self).__name__} to dtype {dtype}"
    705             ) from err

TypeError: Cannot cast Index to dtype float64

Contributor

OK, this is fine, but can you add a comment about this right here (so future readers understand)?

+                    f"Cannot cast {self.categories.dtype} dtype to {dtype}"
+                )
+
+            astyped_cats = extract_array(astyped_cats, extract_numpy=True)
+            result = take_1d(astyped_cats, libalgos.ensure_platform_int(self._codes))
+
+        return result

     @cache_readonly
     def itemsize(self) -> int:
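
The new `else` branch casts the categories once and then expands them to full length with the integer codes. The following is a minimal NumPy-only sketch of that idea, not the pandas implementation: `astype_via_categories` is a hypothetical helper, it uses `ndarray.take` in place of the internal `take_1d`, it assumes at least one category, and it only fills missing values (code `-1`) for float targets.

```python
import numpy as np


def astype_via_categories(categories, codes, dtype):
    """Cast each category once, then expand to full length via the codes.

    Hypothetical helper for illustration; not part of pandas.
    """
    cats = np.asarray(categories)
    codes = np.asarray(codes, dtype=np.intp)  # platform-sized integer indexer
    missing = codes == -1                     # -1 marks a missing value

    # Mirror the guard in the diff: integers cannot represent NaN.
    if missing.any() and np.issubdtype(np.dtype(dtype), np.integer):
        raise ValueError("Cannot convert float NaN to integer")

    try:
        cast_cats = cats.astype(dtype)        # one conversion per category
    except (TypeError, ValueError):
        # Report the categories' dtype rather than the container type.
        raise ValueError(f"Cannot cast {cats.dtype} dtype to {dtype}") from None

    result = cast_cats.take(codes)            # -1 temporarily picks the last category
    if missing.any():
        # Fine for float targets; other dtypes would need their own NA value.
        result[missing] = np.nan
    return result


print(astype_via_categories(["1.5", "2.5"], [0, 1, 1, 0], float))
```

Catching both `TypeError` and `ValueError` and re-raising a single `ValueError` follows the review discussion above: the caller sees a message phrased in terms of the categories' dtype whichever exception the underlying cast raised.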
4 changes: 2 additions & 2 deletions pandas/tests/arrays/categorical/test_dtypes.py
@@ -127,7 +127,7 @@ def test_astype(self, ordered):
         expected = np.array(cat)
         tm.assert_numpy_array_equal(result, expected)

-        msg = "could not convert string to float"
+        msg = r"Cannot cast object dtype to <class 'float'>"
         with pytest.raises(ValueError, match=msg):
             cat.astype(float)

@@ -138,7 +138,7 @@ def test_astype(self, ordered):
         tm.assert_numpy_array_equal(result, expected)

         result = cat.astype(int)
-        expected = np.array(cat, dtype=int)
+        expected = np.array(cat, dtype="int64")
         tm.assert_numpy_array_equal(result, expected)

         result = cat.astype(float)
4 changes: 2 additions & 2 deletions pandas/tests/series/test_dtypes.py
@@ -60,15 +60,15 @@ def test_astype_categorical_to_other(self):
         expected = ser
         tm.assert_series_equal(ser.astype("category"), expected)
         tm.assert_series_equal(ser.astype(CategoricalDtype()), expected)
-        msg = r"could not convert string to float|invalid literal for float\(\)"
+        msg = r"Cannot cast object dtype to float64"
         with pytest.raises(ValueError, match=msg):
             ser.astype("float64")

         cat = Series(Categorical(["a", "b", "b", "a", "a", "c", "c", "c"]))
         exp = Series(["a", "b", "b", "a", "a", "c", "c", "c"])
         tm.assert_series_equal(cat.astype("str"), exp)
         s2 = Series(Categorical(["1", "2", "3", "4"]))
-        exp2 = Series([1, 2, 3, 4]).astype(int)
+        exp2 = Series([1, 2, 3, 4]).astype("int64")
         tm.assert_series_equal(s2.astype("int"), exp2)

         # object don't sort correctly, so just compare that we have the same
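
The behaviour these tests pin down can also be checked through the public API. The snippet below is a sketch that assumes a pandas version containing this change (1.2 or later). It deliberately avoids asserting on the exact error text, since the wording depends on the categories' dtype and may differ between versions. The variable names are illustrative.

```python
import pandas as pd

# String categories that parse as integers cast cleanly to int64.
ser = pd.Series(pd.Categorical(["1", "2", "3", "4"]))
as_int = ser.astype("int64")
print(as_int.tolist(), as_int.dtype)

# Categories that cannot be parsed surface as a ValueError.
letters = pd.Series(pd.Categorical(["a", "b", "b", "a"]))
try:
    letters.astype("float64")
except ValueError as err:
    message = str(err)
else:
    message = None
print(message)
```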