
CLN: avoid _internal_get_values in Categorical.__iter__ #32555


Conversation

jbrockmendel
Member

No description provided.

@topper-123 added the Clean and Categorical labels Mar 9, 2020
@topper-123 added this to the 1.1 milestone Mar 9, 2020
@topper-123
Contributor

LGTM. Isn't there a performance improvement here, because we avoid creating the list?

@jbrockmendel
Member Author

Isn't there a performance improvement here, because we avoid creating the list?

Definitely should be better on memory; speed should be a wash.

@topper-123
Contributor

topper-123 commented Mar 9, 2020

Still, memory usage and speed are usually two sides of the same thing in numerical computing, e.g.

>>> cat = pd.Categorical(range(100_000))
>>> for i, x in enumerate(cat):
...     if i > 10:
...         break
...     # do something with x

should be considerably faster in the new implementation, i.e. partial iteration of categoricals becomes much cheaper. Worth a whatsnew note?
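
As a minimal sketch (illustrative only, not from the PR), the partial-iteration benefit could be timed like this; the element count and the cutoff of ten are assumptions:

# Illustrative timing sketch: pull only the first ten elements.
# With a lazy __iter__ this never materializes the remaining ~100k values.
import itertools
import timeit

import pandas as pd

cat = pd.Categorical(range(100_000))

def first_ten():
    return list(itertools.islice(iter(cat), 10))

print(timeit.timeit(first_ten, number=100))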

@topper-123 added the Performance label Mar 9, 2020
@jbrockmendel
Member Author

It's appreciably faster when iterating over only a few elements, but slower when iterating over everything.

@jorisvandenbossche left a comment
Member

Can you check the performance of this for some different dtypes for the categories? (e.g. numerical and datetime-like)

I remember previous issues where trying to optimize this for either memory (making it actually lazy, as this PR does) or speed made the other much worse. E.g. here, converting scalar by scalar is probably slower than a vectorized conversion.

@topper-123
Copy link
Contributor

IMO converting scalar by scalar is the reasonable thing to do in __iter__, and accepting that this is slower than a full-array vectorized op is fine.

So I'd be ok with this even if it is significantly slower than a full-array vectorized version; IMO __iter__ should be considered a specialty method, to be used only in specific use cases where full-array performance is not a concern.

@jorisvandenbossche
Member

There are still some use cases where __iter__ is actually used for full-array conversions, like conversion to list (although it is questionable whether conversion to list should be our prime target to optimize __iter__ for).

E.g.:

cat = pd.Categorical.from_codes(np.random.randint(0,100,100000), categories=pd.date_range("2000", periods=100))
%timeit list(cat)

goes from 220 ms to 1.6 s based on a quick test.

(this specific example might be a reason to have a tolist method, but I assume there are similar other use cases)

In some other __iter__ implementations, we actually do it in chunks, e.g.

def __iter__(self):
    """
    Return an iterator over the boxed values

    Yields
    ------
    tstamp : Timestamp
    """
    # convert in chunks of 10k for efficiency
    data = self.asi8
    length = len(self)
    chunksize = 10000
    chunks = int(length / chunksize) + 1
    for i in range(chunks):
        start_i = i * chunksize
        end_i = min((i + 1) * chunksize, length)
        converted = tslib.ints_to_pydatetime(
            data[start_i:end_i], tz=self.tz, freq=self.freq, box="timestamp"
        )
        for v in converted:
            yield v

which I suppose keeps a bit of a middle ground between the memory benefit of lazy iteration and the performance of vectorized conversion.

@topper-123
Contributor

Interesting, list calls __iter__, so I assume it's not possible to optimize that, though Categorical.tolist could have an optimized path.

Also, iterating over arrays is in a lot of cases a code smell, so I would not mind prioritizing clean code over performance. @jbrockmendel's PR is certainly more readable than the old code, and I'd recommend people use cat.to_list() over list(cat) if performance when creating lists is a concern.

@jbrockmendel
Member Author

Can you check the performance of this for some different dtypes for the categories?

In [3]: arr = np.arange(10**6)
In [4]: cat = pd.Categorical(arr)
In [5]: %timeit res = list(cat)
4.38 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <-- PR
55.6 ms ± 676 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- master

In [6]: cat2 = pd.Categorical(arr.astype(object))
In [7]: %timeit res = list(cat2)
4.34 s ± 178 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <-- PR
61.7 ms ± 4.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- master

In [8]: dti = pd.date_range('2016-01-01', periods=10**6, freq='S')
In [9]: cat3 = pd.Categorical(dti)
In [11]: %timeit res = list(cat3)
14.6 s ± 484 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <-- PR
2.73 s ± 35.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <-- master

That's pretty rough. I'm open to suggestions, as long as they don't involve _internal_get_values.

@topper-123
Contributor

topper-123 commented Mar 10, 2020

How about chunking, i.e. converting e.g. 100 or 1000 values at a time, then yielding one item at a time?
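
A minimal sketch of what such chunked iteration could look like for Categorical, modeled on the datetime-like __iter__ quoted above; the helper name, the chunk size, and the handling of the -1 codes are illustrative assumptions, not the PR's code:

import numpy as np

def iter_categorical_chunked(cat, chunksize=10_000):
    # Sketch only: convert one block of codes at a time with a vectorized
    # take on the categories, then yield scalars from that block.
    codes = cat.codes              # integer codes; -1 marks a missing value
    categories = cat.categories
    for start in range(0, len(codes), chunksize):
        chunk = codes[start : start + chunksize]
        converted = categories.take(chunk.clip(min=0))  # plain positional take
        for code, value in zip(chunk, converted):
            # a real implementation would pick the dtype-appropriate NA here
            yield np.nan if code == -1 else value

The chunk size is the main knob: larger chunks get closer to the vectorized full-array timings above, while smaller chunks keep the memory overhead of partial iteration low.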

@jorisvandenbossche
Member

That's pretty rough. I'm open to suggestions, as long as they don't involve _internal_get_values.

I think we need a method to get the "dense" version of a Categorical anyway, so if not _internal_get_values, we should ideally have a public method that basically does this, IMO.
Currently, you can only do np.array(cat) (or cat.to_dense()), but that of course loses information for extension arrays.
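
For illustration, the kind of "dense" conversion being asked for here could be sketched with a fill-aware take on the categories, which keeps the extension dtype that np.asarray has to drop (a hypothetical sketch, not an existing pandas method):

import numpy as np
import pandas as pd

cat = pd.Categorical(pd.date_range("2016-01-01", periods=3, tz="UTC"))

# np.asarray must return a NumPy array, and no NumPy dtype can carry the
# timezone of the datetime-with-tz categories.
print(np.asarray(cat).dtype)

# A fill-aware take on the categories keeps the extension dtype; a public
# method along these lines is roughly what this comment is asking for.
dense = cat.categories.take(cat.codes, allow_fill=True, fill_value=pd.NaT)
print(dense.dtype)  # datetime64[ns, UTC]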

@jorisvandenbossche
Member

Which is basically #26410

@jbrockmendel
Member Author

Closing.

We're going to have to keep something resembling Categorical._internal_get_values (I'm leaning towards "_ensure_dense", open to suggestions), and will update __iter__ to use that once it becomes a reality.

@jbrockmendel deleted the no-internal_get_values-item branch March 11, 2020 23:45