Skip to content

PERF: removed coercion to int64 for arrays of ints in Categorical.from_codes #21000

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 15, 2018
Merged

Conversation

nlee737
Copy link
Contributor

@nlee737 nlee737 commented May 10, 2018

In [3]: arr = np.ones(10000000,dtype='int8')

# master
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
44.2 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# after patch
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
9 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
       before           after         ratio
     [6d5d7015]       [fb1f7b84]
         9.24±1ms       9.94±0.6ms     1.08  categoricals.Concat.time_concat
       5.52±0.1ms      5.41±0.05ms     0.98  categoricals.Concat.time_union
       32.0±0.3ms       32.3±0.3ms     1.01  categoricals.Constructor.time_all_nan
      1.32±0.02ms      1.28±0.01ms     0.97  categoricals.Constructor.time_datetimes
      1.26±0.01ms      1.29±0.02ms     1.02  categoricals.Constructor.time_datetimes_with_nat
          354±3μs          349±7μs     0.99  categoricals.Constructor.time_fastpath
      20.0±0.08ms       20.1±0.3ms     1.01  categoricals.Constructor.time_regular
          185±1ms        186±0.6ms     1.01  categoricals.Constructor.time_with_nan
           10.1ms           10.1ms     0.99  categoricals.Isin.time_isin_categorical('int64')
      10.7±0.08ms      10.8±0.07ms     1.00  categoricals.Isin.time_isin_categorical('object')
       9.11±0.1ms       9.04±0.2ms     0.99  categoricals.Rank.time_rank_int
       9.33±0.1ms       9.37±0.1ms     1.00  categoricals.Rank.time_rank_int_cat
       9.13±0.1ms      8.97±0.05ms     0.98  categoricals.Rank.time_rank_int_cat_ordered
        141±0.9ms          136±1ms     0.97  categoricals.Rank.time_rank_string
       11.2±0.2ms       11.1±0.1ms     0.99  categoricals.Rank.time_rank_string_cat
       9.04±0.1ms       9.23±0.1ms     1.02  categoricals.Rank.time_rank_string_cat_ordered
          592±5μs          586±3μs     0.99  categoricals.Repr.time_rendering
         32.8±2ms       28.4±0.6ms    ~0.86  categoricals.SetCategories.time_set_categories
         31.8±2ms       29.6±0.1ms     0.93  categoricals.ValueCounts.time_value_counts(False)
       30.7±0.1ms       29.3±0.2ms     0.96  categoricals.ValueCounts.time_value_counts(True)

@@ -578,7 +578,11 @@ def from_codes(cls, codes, categories, ordered=False):
unordered.
"""
try:
codes = np.asarray(codes, np.int64)
if (type(codes) == np.ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, is it sufficient to call coerce_indexer_dtype(codes, categories)? If so, how's the performance of that?

Copy link
Contributor Author

@nlee737 nlee737 May 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arr = np.ones(10000000, 'int8')
# patch
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
9.13 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# coerce_indexer_dtype
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
9.14 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But for lists,

arr = [1 for x in range(10000000)]
# patch
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
600 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# coerce_indexer_dtype
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
2.32 s ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I don't think coerce_indexer_dtype turns codes into a Numpy array.

diff --git a/pandas/core/arrays/categorical.py b/pandas/core/arrays/categorical.py
index 93e7aa7ca..8a096e4c8 100644
--- a/pandas/core/arrays/categorical.py
+++ b/pandas/core/arrays/categorical.py
@@ -578,18 +578,14 @@ class Categorical(ExtensionArray, PandasObject):
             unordered.
         """
         try:
-            if (type(codes) == np.ndarray
-                    and np.issubdtype(codes.dtype, np.integer)):
-                codes = np.asarray(codes)
-            else:
-                codes = np.asarray(codes, np.int64)
+            coerce_indexer_dtype(codes, categories)
         except (ValueError, TypeError):
             raise ValueError(
                 "codes need to be convertible to an arrays of integers")

         categories = CategoricalDtype.validate_categories(categories)

-        if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):
+        if len(codes) and (np.max(codes) >= len(categories) or np.min(codes) < -1):
             raise ValueError("codes need to be between -1 and "
                              "len(categories)-1")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. And if you combine the two approaches like

codes = coerce_indexer_dtype(np.asarray(codes), categories)

does that work?

Copy link
Contributor Author

@nlee737 nlee737 May 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it works and performance seems on par.

In [3]: arr = np.ones(10000000, 'int8')
# patch
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
9.06 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# coerce_indexer_dtype + np.asarray
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
9.03 ms ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: lst = [1 for x in range(10000000)]
# patch
In [6]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
588 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# coerce_indexer_dtype + np.asarray
In [6]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
583 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on that implementation? I slightly prefer it to the if / else version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer it as well. I pushed a commit with your implementation.

@TomAugspurger
Copy link
Contributor

Could you also add a release note (performance improvement)?

@codecov
Copy link

codecov bot commented May 12, 2018

Codecov Report

Merging #21000 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21000      +/-   ##
==========================================
- Coverage   91.82%   91.82%   -0.01%     
==========================================
  Files         153      153              
  Lines       49505    49502       -3     
==========================================
- Hits        45460    45457       -3     
  Misses       4045     4045
Flag Coverage Δ
#multiple 90.22% <100%> (-0.01%) ⬇️
#single 41.88% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.67% <100%> (ø) ⬆️
pandas/core/dtypes/concat.py 99.17% <0%> (-0.01%) ⬇️
pandas/core/indexes/base.py 96.64% <0%> (-0.01%) ⬇️
pandas/io/parsers.py 95.46% <0%> (ø) ⬆️
pandas/core/internals.py 95.59% <0%> (ø) ⬆️
pandas/core/window.py 96.28% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d03fdb...75da8a3. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add an asv for this

@jreback jreback added Performance Memory or execution speed performance Categorical Categorical Data Type labels May 12, 2018
@nlee737
Copy link
Contributor Author

nlee737 commented May 12, 2018

Updated ASV:

       before           after         ratio
     [3d03fdb2]       [d83abca4]
       9.27±0.6ms       9.13±0.3ms     0.98  categoricals.Concat.time_concat
       5.80±0.1ms      6.22±0.03ms     1.07  categoricals.Concat.time_union
       32.7±0.6ms       32.8±0.4ms     1.00  categoricals.Constructor.time_all_nan
      1.26±0.03ms      1.24±0.01ms     0.99  categoricals.Constructor.time_datetimes
         1.28±0ms      1.28±0.01ms     1.00  categoricals.Constructor.time_datetimes_with_nat
          364±8μs          348±5μs     0.96  categoricals.Constructor.time_fastpath
      19.9±0.08ms      20.0±0.06ms     1.00  categoricals.Constructor.time_regular
        181±0.5ms          182±2ms     1.00  categoricals.Constructor.time_with_nan
           10.2ms           10.3ms     1.01  categoricals.Isin.time_isin_categorical('int64')
       10.8±0.1ms      10.5±0.09ms     0.97  categoricals.Isin.time_isin_categorical('object')
      8.89±0.08ms       8.94±0.1ms     1.01  categoricals.Rank.time_rank_int
       9.33±0.2ms      9.44±0.05ms     1.01  categoricals.Rank.time_rank_int_cat
      8.95±0.08ms      8.95±0.05ms     1.00  categoricals.Rank.time_rank_int_cat_ordered
          135±2ms          140±2ms     1.03  categoricals.Rank.time_rank_string
      11.0±0.06ms       11.1±0.2ms     1.01  categoricals.Rank.time_rank_string_cat
       9.14±0.2ms       9.11±0.1ms     1.00  categoricals.Rank.time_rank_string_cat_ordered
          590±7μs         599±20μs     1.01  categoricals.Repr.time_rendering
       27.7±0.5ms       29.1±0.2ms     1.05  categoricals.SetCategories.time_set_categories
       30.0±0.4ms       29.6±0.4ms     0.99  categoricals.ValueCounts.time_value_counts(False)
       29.6±0.4ms       29.6±0.2ms     1.00  categoricals.ValueCounts.time_value_counts(True)

@TomAugspurger
Copy link
Contributor

I think @jbreack was requesting a new ASV that measures Categorical.from_codes with an int8 type array.

It can go in asv_bench/benchmarks/categoricals.py under Constructor.

@nlee737
Copy link
Contributor Author

nlee737 commented May 12, 2018

Sorry, I misunderstood. Thanks for the clarification. Is the following asv sufficient?

diff --git a/asv_bench/benchmarks/categoricals.py b/asv_bench/benchmarks/categoricals.py
index 0ffd5f881..ae1d70292 100644
--- a/asv_bench/benchmarks/categoricals.py
+++ b/asv_bench/benchmarks/categoricals.py
@@ -51,6 +51,7 @@ class Constructor(object):

         self.values_some_nan = list(np.tile(self.categories + [np.nan], N))
         self.values_all_nan = [np.nan] * len(self.values)
+        self.values_all_int8 = np.ones(N, 'int8')

     def time_regular(self):
         pd.Categorical(self.values, self.categories)
@@ -70,6 +71,9 @@ class Constructor(object):
     def time_all_nan(self):
         pd.Categorical(self.values_all_nan)

+    def time_from_codes_all_int8(self):
+        pd.Categorical.from_codes(self.values_all_int8, self.categories)
+

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thanks.

@TomAugspurger TomAugspurger added this to the 0.23.0 milestone May 15, 2018
@TomAugspurger TomAugspurger merged commit 363426f into pandas-dev:master May 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Categorical.from_codes shouldn't coerce to int64
3 participants