PERF: improves performance and memory usage of DataFrame.duplicated #9398
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2734,30 +2734,32 @@ def duplicated(self, subset=None, take_last=False): | |
------- | ||
duplicated : Series | ||
""" | ||
# kludge for #1833 | ||
def _m8_to_i8(x): | ||
if issubclass(x.dtype.type, np.datetime64): | ||
return x.view(np.int64) | ||
return x | ||
from pandas.core.groupby import get_group_index | ||
from pandas.hashtable import duplicated_int64, _SIZE_HINT_LIMIT | ||
|
||
size_hint = min(len(self), _SIZE_HINT_LIMIT) | ||
|
||
def factorize(vals): | ||
(hash_klass, vec_klass), vals = \ | ||
algos._get_data_algo(vals, algos._hashtables) | ||
|
||
uniques, table = vec_klass(), hash_klass(size_hint) | ||
@behzadnouri I think what @shoyer means is that you are using the private cython impl of indexes. This is currently only used in
that should be refactored into a private function as well |
||
labels = table.get_labels(vals, uniques, 0, -1) | ||
|
||
return labels.astype('i8', copy=False), len(uniques) | ||
|
||
# if we are only duplicating on Categoricals this can be much faster | ||
if subset is None: | ||
values = list(_m8_to_i8(self.get_values().T)) | ||
else: | ||
if np.iterable(subset) and not isinstance(subset, compat.string_types): | ||
if isinstance(subset, tuple): | ||
if subset in self.columns: | ||
values = [self[subset].get_values()] | ||
else: | ||
values = [_m8_to_i8(self[x].get_values()) for x in subset] | ||
else: | ||
values = [_m8_to_i8(self[x].get_values()) for x in subset] | ||
else: | ||
values = [self[subset].get_values()] | ||
subset = self.columns | ||
elif not np.iterable(subset) or \ | ||
not really important, but per PEP8 I usually prefer a style of enclosing parentheses over explicit continuations with |
||
isinstance(subset, compat.string_types) or \ | ||
isinstance(subset, tuple) and subset in self.columns: | ||
subset = subset, | ||
|
||
vals = (self[col].values for col in subset) | ||
labels, shape = map(list, zip( * map(factorize, vals))) | ||
|
||
keys = lib.fast_zip_fillna(values) | ||
duplicated = lib.duplicated(keys, take_last=take_last) | ||
return Series(duplicated, index=self.index) | ||
ids = get_group_index(labels, shape, sort=False, xnull=False) | ||
return Series(duplicated_int64(ids, take_last), index=self.index) | ||
|
||
#---------------------------------------------------------------------- | ||
# Sorting | ||
|
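The new code path in the diff factorizes each column into integer labels, combines the labels into one `int64` id per row, and then looks for repeated ids. The following is a minimal NumPy sketch of the last two steps only. `combine_labels` and `duplicated_ids` are hypothetical names used for illustration; they are not the `get_group_index` or `duplicated_int64` implementations from the PR, and the sketch assumes all labels are non-negative and that the product of the sizes fits in `int64`.

```python
import numpy as np


def combine_labels(labels, shape):
    """Hypothetical stand-in: mixed-radix encoding of per-column labels.

    labels: list of integer arrays, one per column, values in [0, size).
    shape:  list with the number of distinct values in each column.
    """
    ids = np.zeros(len(labels[0]), dtype=np.int64)
    for lab, size in zip(labels, shape):
        # Each column contributes one "digit" in base `size`.
        ids = ids * size + np.asarray(lab, dtype=np.int64)
    return ids


def duplicated_ids(ids, take_last=False):
    """Mark every occurrence of an id except the first (or last) one."""
    ids = np.asarray(ids)
    if take_last:
        # Keeping the last occurrence is keeping the first of the reversal.
        return duplicated_ids(ids[::-1])[::-1]
    _, first_pos = np.unique(ids, return_index=True)
    mask = np.ones(len(ids), dtype=bool)
    mask[first_pos] = False
    return mask


# Two columns, four rows: rows 0 and 1 carry the same pair of labels.
labels = [np.array([0, 0, 1, 0]), np.array([0, 0, 1, 1])]
shape = [2, 2]
ids = combine_labels(labels, shape)
print(ids.tolist())                  # [0, 0, 3, 1]
print(duplicated_ids(ids).tolist())  # [False, True, False, False]
```

Because equal rows always map to equal ids, the row-wise comparison reduces to a one-dimensional integer problem, which is what lets a single integer hash table do the work.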
why can't you just use `pandas.core.algorithms.factorize` here?
If it's because you don't want to bother with some calculations involving uniques, I would say: split `factorize` out into two parts and leave all the private method access in `algorithms.py`.
more so because of simplicity. it is only 4 lines of code.
I would still rather reuse `factorize` here than duplicate these four lines which use a private API.
"private" is with respect to public user api, not the library itself.
for example see the top of the same file where many private functions are imported.
True, but I still find it clearer to use public APIs internally when possible (especially to avoid duplicated code).
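For readers following the thread, here is a rough sketch of what the reviewer's suggestion could look like when written against the public `pd.factorize` function instead of the private hash table classes. It is an illustration only, not the code that was merged: `factorize_column` and `duplicated_rows` are hypothetical names, the `keep` argument assumes a pandas version newer than the `take_last` API shown in the diff, and the id computation can overflow `int64` for frames with many high-cardinality columns.

```python
import numpy as np
import pandas as pd


def factorize_column(values):
    """Hypothetical helper: integer labels plus the number of distinct values."""
    labels, uniques = pd.factorize(values)  # missing values get label -1
    return labels.astype("int64", copy=False), len(uniques)


def duplicated_rows(frame, keep="first"):
    """Flag repeated rows by factorizing each column, then combining labels."""
    ids = np.zeros(len(frame), dtype="int64")
    for i in range(frame.shape[1]):
        labels, size = factorize_column(frame.iloc[:, i])
        # Shift by one so the -1 missing-value label becomes 0; the radix is
        # therefore size + 1. Assumption: the running product fits in int64.
        ids = ids * (size + 1) + (labels + 1)
    return pd.Series(ids, index=frame.index).duplicated(keep=keep)


frame = pd.DataFrame({"a": [1, 1, 2, 1], "b": ["x", "x", "y", "x"]})
print(duplicated_rows(frame).tolist())  # [False, True, False, True]
```

The trade-off discussed in the thread is visible here: the public function also builds the array of uniques, which this use case discards apart from its length.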