-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: improves performance and memory usage of DataFrame.duplicated #9398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Nice! I noticed the other day that duplicated did not use the klib hashtable, which seemed strange, but I didn't have the time to dig into it. Can you safely get ride of |
|
||
size_hint = min(len(self), _SIZE_HINT_LIMIT) | ||
|
||
def factorize(vals): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't you just use pandas.core.algorithms.factorize
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's because you don't want to bother with some calculations involving uniques, I would say:
- benchmark to see if it really matters
- if necessary, separate
factorize
out into two parts and leave all the private methods access inalgorithms.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more so because of simplicity. it is only 4 lines of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would still rather reuse factorize
here than duplicate these four lines which use a private API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"private" is with respect to public user api, not the library itself.
for example see the top of the same file where many private functions are imported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but I still find it clearer to use public APIs internally when possible (especially to avoid duplicated code).
|
(hash_klass, vec_klass), vals = \ | ||
algos._get_data_algo(vals, algos._hashtables) | ||
|
||
uniques, table = vec_klass(), hash_klass(size_hint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@behzadnouri I think what @shoyer means as you are using the private cython impl of indexes. This is currently only used in algos
. and does not need to be exposed to general reader of frame.py
. So I would make a helper private function in algos.py
that does these 3 lines (that return the labels/uniques).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that should be refactored into a private function as well
@behzadnouri if you could refactor this a bit to make the use of the hashtables to be a separate function isolated in core/algos.py would be gr8 |
@jreback honestly I do not see how to do this in a way i am comfortable with. this is a very special usage where only the number of unique values matter not their actual values; |
I meant someting like this: jreback@923e35c |
merged via 7da9178 thanks @behzadnouri |
on master:
on branch:
benchmarks: