-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: removed coercion to int64 for arrays of ints in Categorical.from_codes #21000
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/arrays/categorical.py
Outdated
@@ -578,7 +578,11 @@ def from_codes(cls, codes, categories, ordered=False): | |||
unordered. | |||
""" | |||
try: | |||
codes = np.asarray(codes, np.int64) | |||
if (type(codes) == np.ndarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder, is it sufficient to call coerce_indexer_dtype(codes, categories)
? If so, how's the performance of that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arr = np.ones(10000000, 'int8')
# patch
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
9.13 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# coerce_indexer_dtype
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
9.14 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But for lists,
arr = [1 for x in range(10000000)]
# patch
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
600 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# coerce_indexer_dtype
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
2.32 s ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I don't think coerce_indexer_dtype
turns codes
into a Numpy array.
diff --git a/pandas/core/arrays/categorical.py b/pandas/core/arrays/categorical.py
index 93e7aa7ca..8a096e4c8 100644
--- a/pandas/core/arrays/categorical.py
+++ b/pandas/core/arrays/categorical.py
@@ -578,18 +578,14 @@ class Categorical(ExtensionArray, PandasObject):
unordered.
"""
try:
- if (type(codes) == np.ndarray
- and np.issubdtype(codes.dtype, np.integer)):
- codes = np.asarray(codes)
- else:
- codes = np.asarray(codes, np.int64)
+ coerce_indexer_dtype(codes, categories)
except (ValueError, TypeError):
raise ValueError(
"codes need to be convertible to an arrays of integers")
categories = CategoricalDtype.validate_categories(categories)
- if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):
+ if len(codes) and (np.max(codes) >= len(categories) or np.min(codes) < -1):
raise ValueError("codes need to be between -1 and "
"len(categories)-1")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. And if you combine the two approaches like
codes = coerce_indexer_dtype(np.asarray(codes), categories)
does that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, it works and performance seems on par.
In [3]: arr = np.ones(10000000, 'int8')
# patch
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
9.06 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# coerce_indexer_dtype + np.asarray
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
9.03 ms ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: lst = [1 for x in range(10000000)]
# patch
In [6]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
588 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# coerce_indexer_dtype + np.asarray
In [6]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
583 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on that implementation? I slightly prefer it to the if / else
version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer it as well. I pushed a commit with your implementation.
Could you also add a release note (performance improvement)? |
Codecov Report
@@ Coverage Diff @@
## master #21000 +/- ##
==========================================
- Coverage 91.82% 91.82% -0.01%
==========================================
Files 153 153
Lines 49505 49502 -3
==========================================
- Hits 45460 45457 -3
Misses 4045 4045
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u add an asv for this
Updated ASV:
|
I think @jbreack was requesting a new ASV that measures It can go in |
Sorry, I misunderstood. Thanks for the clarification. Is the following asv sufficient?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, thanks.
git diff upstream/master -u -- "*.py" | flake8 --diff