
CLN: avoid _internal_get_values in Categorical.__iter__ #32555


Conversation

jbrockmendel
Member

No description provided.

@topper-123 added the Clean and Categorical labels Mar 9, 2020
@topper-123 added this to the 1.1 milestone Mar 9, 2020
@topper-123
Contributor

LGTM. Isn't there a performance improvement here, because we avoid creating the list?

@jbrockmendel
Member Author

Isn't there a performance improvement here, because we avoid creating the list?

Definitely should be better on memory; speed should be a wash.

@topper-123
Contributor

topper-123 commented Mar 9, 2020

Still, memory usage and speed are usually two sides of the same thing in numerical computing, e.g.

>>> cat = pd.Categorical(range(100_000))
>>> for i, x in enumerate(cat):
...     if i > 10:
...         break
...     # do something with x

should be considerably faster in the new implementation, i.e. partial iteration of categoricals becomes much cheaper. Worth a whatsnew note?
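
As a minimal sketch (illustrative only, not from the PR), the partial-iteration benefit could be timed like this; the element count and the cutoff of ten are assumptions:

# Illustrative timing sketch: pull only the first ten elements.
# With a lazy __iter__ this never materializes the remaining ~100k values.
import itertools
import timeit

import pandas as pd

cat = pd.Categorical(range(100_000))

def first_ten():
    return list(itertools.islice(iter(cat), 10))

print(timeit.timeit(first_ten, number=100))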

@topper-123 added the Performance label Mar 9, 2020
@jbrockmendel
Member Author

It's appreciably faster when iterating over only a few elements, but slower when iterating over everything.

@jorisvandenbossche left a comment
Member

Can you check the performance of this for some different dtypes for the categories? (e.g. numerical and datetime-like)

I remember previous issues where trying to optimize this for either memory (making it actually lazy, as this PR does) or speed made the other much worse. E.g. here, converting scalar by scalar is probably slower than a vectorized conversion.

@topper-123
Copy link
Contributor

IMO converting scalar by scalar is the reasonable thing to do in __iter__, and accepting that this is slower than a full-array vectorized op is fine.

So I'd be ok with this even if it is significantly slower than a full-array vectorized version; IMO __iter__ should be considered a specialty method, to be used only in specific use cases where full-array performance is not a concern.

@jorisvandenbossche
Member

There are still some use cases where __iter__ is actually used for full-array conversions, like conversion to list (although it is questionable whether conversion to list should be our prime target to optimize __iter__ for).

E.g.:

cat = pd.Categorical.from_codes(np.random.randint(0,100,100000), categories=pd.date_range("2000", periods=100))
%timeit list(cat)

goes from 220 ms to 1.6 s based on a quick test.

(this specific example might be a reason to have a tolist method, but I assume there are similar other use cases)

In some other __iter__ implementations, we actually do it in chunks, e.g.

def __iter__(self):
    """
    Return an iterator over the boxed values

    Yields
    ------
    tstamp : Timestamp
    """
    # convert in chunks of 10k for efficiency
    data = self.asi8
    length = len(self)
    chunksize = 10000
    chunks = int(length / chunksize) + 1
    for i in range(chunks):
        start_i = i * chunksize
        end_i = min((i + 1) * chunksize, length)
        converted = tslib.ints_to_pydatetime(
            data[start_i:end_i], tz=self.tz, freq=self.freq, box="timestamp"
        )
        for v in converted:
            yield v

which I suppose keeps a bit of a middle ground between the memory benefit of lazy iteration and the performance of vectorized conversion.

@topper-123
Contributor

Interesting, list calls __iter__, so I assume it's not possible to optimize that, though Categorical.tolist could have an optimized path.

Also, iterating over arrays is in a lot of cases a code smell, so I would not mind prioritizing clean code over performance. @jbrockmendel's PR is certainly more readable than the old code, and I'd recommend people use cat.to_list() over list(cat) if performance when creating lists is a concern.

@jbrockmendel
Member Author

Can you check the performance of this for some different dtypes for the categories?

In [3]: arr = np.arange(10**6)
In [4]: cat = pd.Categorical(arr)
In [5]: %timeit res = list(cat)
4.38 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <-- PR
55.6 ms ± 676 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- master

In [6]: cat2 = pd.Categorical(arr.astype(object))
In [7]: %timeit res = list(cat2)
4.34 s ± 178 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <-- PR
61.7 ms ± 4.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- master

In [8]: dti = pd.date_range('2016-01-01', periods=10**6, freq='S')
In [9]: cat3 = pd.Categorical(dti)
In [11]: %timeit res = list(cat3)
14.6 s ± 484 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <-- PR
2.73 s ± 35.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <-- master

That's pretty rough. I'm open to suggestions, as long as they don't involve _internal_get_values.

@topper-123
Contributor

topper-123 commented Mar 10, 2020

How about chunking, i.e. converting e.g. 100 or 1000 values at a time, then yielding one item at a time?
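
A minimal sketch of what such chunked iteration could look like for Categorical, modeled on the datetime-like __iter__ quoted above; the helper name, the chunk size, and the handling of the -1 codes are illustrative assumptions, not the PR's code:

import numpy as np

def iter_categorical_chunked(cat, chunksize=10_000):
    # Sketch only: convert one block of codes at a time with a vectorized
    # take on the categories, then yield scalars from that block.
    codes = cat.codes              # integer codes; -1 marks a missing value
    categories = cat.categories
    for start in range(0, len(codes), chunksize):
        chunk = codes[start : start + chunksize]
        converted = categories.take(chunk.clip(min=0))  # plain positional take
        for code, value in zip(chunk, converted):
            # a real implementation would pick the dtype-appropriate NA here
            yield np.nan if code == -1 else value

The chunk size is the main knob: larger chunks get closer to the vectorized full-array timings above, while smaller chunks keep the memory overhead of partial iteration low.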

@jorisvandenbossche
Member

That's pretty rough. I'm open to suggestions, as long as they don't involve _internal_get_values.

I think we need a method to get the "dense" version of a Categorical anyway, so if not _internal_get_values, we should ideally have a public method that basically does this, IMO.
Currently, you can only do np.array(cat) (or cat.to_dense()), but that of course loses information for extension arrays.
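
For illustration, the kind of "dense" conversion being asked for here could be sketched with a fill-aware take on the categories, which keeps the extension dtype that np.asarray has to drop (a hypothetical sketch, not an existing pandas method):

import numpy as np
import pandas as pd

cat = pd.Categorical(pd.date_range("2016-01-01", periods=3, tz="UTC"))

# np.asarray must return a NumPy array, and no NumPy dtype can carry the
# timezone of the datetime-with-tz categories.
print(np.asarray(cat).dtype)

# A fill-aware take on the categories keeps the extension dtype; a public
# method along these lines is roughly what this comment is asking for.
dense = cat.categories.take(cat.codes, allow_fill=True, fill_value=pd.NaT)
print(dense.dtype)  # datetime64[ns, UTC]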

@jorisvandenbossche
Member

Which is basically #26410

@jbrockmendel
Member Author

Closing.

We're going to have to keep something resembling Categorical._internal_get_values (I'm leaning towards "_ensure_dense", open to suggestions), and will update __iter__ to use that once it becomes a reality.

@jbrockmendel deleted the no-internal_get_values-item branch March 11, 2020 23:45