PERF: removed coercion to int64 for arrays of ints in Categorical.from_codes #21000

nlee737 · 2018-05-10T14:38:06Z

closes Categorical.from_codes shouldn't coerce to int64 #18501
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

In [3]: arr = np.ones(10000000,dtype='int8')

# master
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
44.2 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# after patch
In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
9 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

       before           after         ratio
     [6d5d7015]       [fb1f7b84]
         9.24±1ms       9.94±0.6ms     1.08  categoricals.Concat.time_concat
       5.52±0.1ms      5.41±0.05ms     0.98  categoricals.Concat.time_union
       32.0±0.3ms       32.3±0.3ms     1.01  categoricals.Constructor.time_all_nan
      1.32±0.02ms      1.28±0.01ms     0.97  categoricals.Constructor.time_datetimes
      1.26±0.01ms      1.29±0.02ms     1.02  categoricals.Constructor.time_datetimes_with_nat
          354±3μs          349±7μs     0.99  categoricals.Constructor.time_fastpath
      20.0±0.08ms       20.1±0.3ms     1.01  categoricals.Constructor.time_regular
          185±1ms        186±0.6ms     1.01  categoricals.Constructor.time_with_nan
           10.1ms           10.1ms     0.99  categoricals.Isin.time_isin_categorical('int64')
      10.7±0.08ms      10.8±0.07ms     1.00  categoricals.Isin.time_isin_categorical('object')
       9.11±0.1ms       9.04±0.2ms     0.99  categoricals.Rank.time_rank_int
       9.33±0.1ms       9.37±0.1ms     1.00  categoricals.Rank.time_rank_int_cat
       9.13±0.1ms      8.97±0.05ms     0.98  categoricals.Rank.time_rank_int_cat_ordered
        141±0.9ms          136±1ms     0.97  categoricals.Rank.time_rank_string
       11.2±0.2ms       11.1±0.1ms     0.99  categoricals.Rank.time_rank_string_cat
       9.04±0.1ms       9.23±0.1ms     1.02  categoricals.Rank.time_rank_string_cat_ordered
          592±5μs          586±3μs     0.99  categoricals.Repr.time_rendering
         32.8±2ms       28.4±0.6ms    ~0.86  categoricals.SetCategories.time_set_categories
         31.8±2ms       29.6±0.1ms     0.93  categoricals.ValueCounts.time_value_counts(False)
       30.7±0.1ms       29.3±0.2ms     0.96  categoricals.ValueCounts.time_value_counts(True)

TomAugspurger · 2018-05-10T15:52:10Z

pandas/core/arrays/categorical.py

@@ -578,7 +578,11 @@ def from_codes(cls, codes, categories, ordered=False):
            unordered.
        """
        try:
-            codes = np.asarray(codes, np.int64)
+            if (type(codes) == np.ndarray


I wonder, is it sufficient to call coerce_indexer_dtype(codes, categories)? If so, how's the performance of that?

arr = np.ones(10000000, 'int8') # patch In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 9.13 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # coerce_indexer_dtype In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 9.14 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But for lists,

arr = [1 for x in range(10000000)] # patch In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 600 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # coerce_indexer_dtype In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 2.32 s ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I don't think coerce_indexer_dtype turns codes into a Numpy array.

diff --git a/pandas/core/arrays/categorical.py b/pandas/core/arrays/categorical.py index 93e7aa7ca..8a096e4c8 100644 --- a/pandas/core/arrays/categorical.py +++ b/pandas/core/arrays/categorical.py @@ -578,18 +578,14 @@ class Categorical(ExtensionArray, PandasObject): unordered. """ try: - if (type(codes) == np.ndarray - and np.issubdtype(codes.dtype, np.integer)): - codes = np.asarray(codes) - else: - codes = np.asarray(codes, np.int64) + coerce_indexer_dtype(codes, categories) except (ValueError, TypeError): raise ValueError( "codes need to be convertible to an arrays of integers") categories = CategoricalDtype.validate_categories(categories) - if len(codes) and (codes.max() >= len(categories) or codes.min() < -1): + if len(codes) and (np.max(codes) >= len(categories) or np.min(codes) < -1): raise ValueError("codes need to be between -1 and " "len(categories)-1")

Indeed. And if you combine the two approaches like

codes = coerce_indexer_dtype(np.asarray(codes), categories)

does that work?

Yep, it works and performance seems on par.

In [3]: arr = np.ones(10000000, 'int8') # patch In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 9.06 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # coerce_indexer_dtype + np.asarray In [4]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 9.03 ms ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [5]: lst = [1 for x in range(10000000)] # patch In [6]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 588 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # coerce_indexer_dtype + np.asarray In [6]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar']) 583 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Thoughts on that implementation? I slightly prefer it to the if / else version.

I prefer it as well. I pushed a commit with your implementation.

TomAugspurger · 2018-05-11T21:18:38Z

Could you also add a release note (performance improvement)?

…m_codes

codecov · 2018-05-12T12:53:47Z

Codecov Report

Merging #21000 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21000      +/-   ##
==========================================
- Coverage   91.82%   91.82%   -0.01%     
==========================================
  Files         153      153              
  Lines       49505    49502       -3     
==========================================
- Hits        45460    45457       -3     
  Misses       4045     4045

Flag	Coverage Δ
#multiple	`90.22% <100%> (-0.01%)`	⬇️
#single	`41.88% <100%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/arrays/categorical.py	`95.67% <100%> (ø)`	⬆️
pandas/core/dtypes/concat.py	`99.17% <0%> (-0.01%)`	⬇️
pandas/core/indexes/base.py	`96.64% <0%> (-0.01%)`	⬇️
pandas/io/parsers.py	`95.46% <0%> (ø)`	⬆️
pandas/core/internals.py	`95.59% <0%> (ø)`	⬆️
pandas/core/window.py	`96.28% <0%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d03fdb...75da8a3. Read the comment docs.

jreback

can u add an asv for this

nlee737 · 2018-05-12T14:06:06Z

Updated ASV:

       before           after         ratio
     [3d03fdb2]       [d83abca4]
       9.27±0.6ms       9.13±0.3ms     0.98  categoricals.Concat.time_concat
       5.80±0.1ms      6.22±0.03ms     1.07  categoricals.Concat.time_union
       32.7±0.6ms       32.8±0.4ms     1.00  categoricals.Constructor.time_all_nan
      1.26±0.03ms      1.24±0.01ms     0.99  categoricals.Constructor.time_datetimes
         1.28±0ms      1.28±0.01ms     1.00  categoricals.Constructor.time_datetimes_with_nat
          364±8μs          348±5μs     0.96  categoricals.Constructor.time_fastpath
      19.9±0.08ms      20.0±0.06ms     1.00  categoricals.Constructor.time_regular
        181±0.5ms          182±2ms     1.00  categoricals.Constructor.time_with_nan
           10.2ms           10.3ms     1.01  categoricals.Isin.time_isin_categorical('int64')
       10.8±0.1ms      10.5±0.09ms     0.97  categoricals.Isin.time_isin_categorical('object')
      8.89±0.08ms       8.94±0.1ms     1.01  categoricals.Rank.time_rank_int
       9.33±0.2ms      9.44±0.05ms     1.01  categoricals.Rank.time_rank_int_cat
      8.95±0.08ms      8.95±0.05ms     1.00  categoricals.Rank.time_rank_int_cat_ordered
          135±2ms          140±2ms     1.03  categoricals.Rank.time_rank_string
      11.0±0.06ms       11.1±0.2ms     1.01  categoricals.Rank.time_rank_string_cat
       9.14±0.2ms       9.11±0.1ms     1.00  categoricals.Rank.time_rank_string_cat_ordered
          590±7μs         599±20μs     1.01  categoricals.Repr.time_rendering
       27.7±0.5ms       29.1±0.2ms     1.05  categoricals.SetCategories.time_set_categories
       30.0±0.4ms       29.6±0.4ms     0.99  categoricals.ValueCounts.time_value_counts(False)
       29.6±0.4ms       29.6±0.2ms     1.00  categoricals.ValueCounts.time_value_counts(True)

TomAugspurger · 2018-05-12T18:43:22Z

I think @jbreack was requesting a new ASV that measures Categorical.from_codes with an int8 type array.

It can go in asv_bench/benchmarks/categoricals.py under Constructor.

nlee737 · 2018-05-12T19:51:13Z

Sorry, I misunderstood. Thanks for the clarification. Is the following asv sufficient?

diff --git a/asv_bench/benchmarks/categoricals.py b/asv_bench/benchmarks/categoricals.py
index 0ffd5f881..ae1d70292 100644
--- a/asv_bench/benchmarks/categoricals.py
+++ b/asv_bench/benchmarks/categoricals.py
@@ -51,6 +51,7 @@ class Constructor(object):

         self.values_some_nan = list(np.tile(self.categories + [np.nan], N))
         self.values_all_nan = [np.nan] * len(self.values)
+        self.values_all_int8 = np.ones(N, 'int8')

     def time_regular(self):
         pd.Categorical(self.values, self.categories)
@@ -70,6 +71,9 @@ class Constructor(object):
     def time_all_nan(self):
         pd.Categorical(self.values_all_nan)

+    def time_from_codes_all_int8(self):
+        pd.Categorical.from_codes(self.values_all_int8, self.categories)
+

TomAugspurger

Perfect, thanks.

nlee737 mentioned this pull request May 10, 2018

PERF: removed coercion to int64 in Categorical.from_codes #20961

Closed

4 tasks

TomAugspurger reviewed May 10, 2018

View reviewed changes

PERF: removed coercion to int64 for arrays of ints in Categorical.fro…

49ffb8f

…m_codes

CLN: using coerce_indexer_dtype for codes in Categorical.from_codes

d83abca

jreback requested changes May 12, 2018

View reviewed changes

jreback added Performance Memory or execution speed performance Categorical Categorical Data Type labels May 12, 2018

PERF: added asv for Categorical.from_codes (#18501)

75da8a3

TomAugspurger approved these changes May 15, 2018

View reviewed changes

TomAugspurger added this to the 0.23.0 milestone May 15, 2018

TomAugspurger merged commit 363426f into pandas-dev:master May 15, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: removed coercion to int64 for arrays of ints in Categorical.from_codes #21000

PERF: removed coercion to int64 for arrays of ints in Categorical.from_codes #21000

nlee737 commented May 10, 2018 •

edited

Loading

TomAugspurger May 10, 2018

nlee737 May 10, 2018 •

edited

Loading

TomAugspurger May 11, 2018

nlee737 May 11, 2018 •

edited

Loading

TomAugspurger May 11, 2018

nlee737 May 12, 2018

TomAugspurger commented May 11, 2018

codecov bot commented May 12, 2018 •

edited

Loading

jreback left a comment

nlee737 commented May 12, 2018 •

edited

Loading

TomAugspurger commented May 12, 2018

nlee737 commented May 12, 2018

TomAugspurger left a comment

PERF: removed coercion to int64 for arrays of ints in Categorical.from_codes #21000

PERF: removed coercion to int64 for arrays of ints in Categorical.from_codes #21000

Conversation

nlee737 commented May 10, 2018 • edited Loading

TomAugspurger May 10, 2018

Choose a reason for hiding this comment

nlee737 May 10, 2018 • edited Loading

Choose a reason for hiding this comment

TomAugspurger May 11, 2018

Choose a reason for hiding this comment

nlee737 May 11, 2018 • edited Loading

Choose a reason for hiding this comment

TomAugspurger May 11, 2018

Choose a reason for hiding this comment

nlee737 May 12, 2018

Choose a reason for hiding this comment

TomAugspurger commented May 11, 2018

codecov bot commented May 12, 2018 • edited Loading

Codecov Report

jreback left a comment

Choose a reason for hiding this comment

nlee737 commented May 12, 2018 • edited Loading

TomAugspurger commented May 12, 2018

nlee737 commented May 12, 2018

TomAugspurger left a comment

Choose a reason for hiding this comment

nlee737 commented May 10, 2018 •

edited

Loading

nlee737 May 10, 2018 •

edited

Loading

nlee737 May 11, 2018 •

edited

Loading

codecov bot commented May 12, 2018 •

edited

Loading

nlee737 commented May 12, 2018 •

edited

Loading