-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
CLN: avoid _internal_get_values in Categorical.__iter__ #32555
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLN: avoid _internal_get_values in Categorical.__iter__ #32555
Conversation
LGTM. Isn't there a performance improvement here, because we avoid creating the list? |
Definitely should be better on memory, speed should be a wash |
Still, memory usage and speed are usually two sides of the same thing in numerical computing, e.g. >>> cat = pd.Categorical(range(100_000))
>>> for i, x in enumerate(cat):
... if i> 10: break
... else: # do something should be considerably faster in the new impl, i.e. partial iteration of categoricals will be considerable faster. Worth a whatsnew note? |
Its appreciably faster when iterating over only a few elements, but slower when iterating over everything |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check the performance of this for some different dtypes for the categories? (eg numerical and datetime-like)
I remember previous issues with trying to optimize this for either memory (making this actually lazy, as what this is doing) or speed, making the other much worse. Eg here, converting scalar by scalar is probably slower than a vectorized conversion.
IMO converting scalar by scalar is the reasonable thing to do in So I'd be ok with this, even if this is significantly slower than a full-array vector version, and IMO |
There are still some use cases where iter is actually used for full-array conversions, like conversion to list (although it can also be questionable if conversion to list should be our prime target to optimize for iter). Eg:
goes from 220 ms to 1.6 s based on a quick test. (this specific example might be a reason to have a In some other pandas/pandas/core/arrays/datetimes.py Lines 554 to 575 in db9a50c
which I suppose keeps a bit a middle ground between memory benefit of lazy iteration vs performance of vectorized conversion. |
Interesting, Also, iterating over arrays is in a lot of cases a code smell, so I would not mind prioritizing clean code over performance and @jbrockmendel PR is certainly more readable the the old code and recommend people use |
Thats pretty rough. I'm open to suggestions, as long as they don't involve _internal_get_values. |
How about chunking , i.e. Converting e.g 100 or 1000 at a time, then yield 1 item at a time? |
I think we need a method to get the "dense" version of a Categorical, anyway, so if not |
Which is basically #26410 |
Closing. We're going to have to keep something resembling Categorical._internal_get_values (I'm leaning towards "_ensure_dense", open to suggestions), will update |
No description provided.