-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG/PERF: Sort mixed-int in Py3, fix Index.difference #13514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -163,6 +163,104 @@ def isin(comps, values): | |
return f(comps, values) | ||
|
||
|
||
def safe_sort(values, labels=None, na_sentinel=-1, assume_unique=False): | ||
""" | ||
Sort ``values`` and reorder corresponding ``labels``. | ||
``values`` should be unique if ``labels`` is not None. | ||
Safe for use with mixed types (int, str), orders ints before strs. | ||
|
||
.. versionadded:: 0.19.0 | ||
|
||
Parameters | ||
---------- | ||
values : list-like | ||
Sequence; must be unique if ``labels`` is not None. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we check this? (the check is pretty cheap in any event) |
||
labels : list_like | ||
Indices to ``values``. All out of bound indices are treated as | ||
"not found" and will be masked with ``na_sentinel``. | ||
na_sentinel : int, default -1 | ||
Value in ``labels`` to mark "not found". | ||
Ignored when ``labels`` is None. | ||
assume_unique : bool, default False | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if we are checking do we need this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added vals = np.arange(10000)
np.random.shuffle(vals)
%timeit -n1 -r1 pd.core.algorithms.safe_sort(vals, labels=[1, 2, 3])
1 loops, best of 1: 3.82 ms per loop
%timeit -n1 -r1 pd.core.algorithms.safe_sort(vals, labels=[1, 2, 3], assume_unique=True)
1 loops, best of 1: 2.93 ms per loop
%timeit pd.core.algorithms.safe_sort(vals, labels=[1, 2, 3])
100 loops, best of 3: 1.85 ms per loop
%timeit pd.core.algorithms.safe_sort(vals, labels=[1, 2, 3], assume_unique=True)
1000 loops, best of 3: 816 µs per loop A line profiler says the line If I use Should I remove it or let it stay? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess that's fine. Just document that in the doc-string (e.g. so a reader knows that when |
||
When True, ``values`` are assumed to be unique, which can speed up | ||
the calculation. Ignored when ``labels`` is None. | ||
|
||
Returns | ||
------- | ||
ordered : ndarray | ||
Sorted ``values`` | ||
new_labels : ndarray | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no spaces after the label e.g. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are you sure about it? All docstrings seem to have spaces there. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. doc string params should be like:
e.g. no space before the colon, 1 after; certainly possible other doc-strings are wrong |
||
Reordered ``labels``; returned when ``labels`` is not None. | ||
|
||
Raises | ||
------ | ||
TypeError | ||
* If ``values`` is not list-like or if ``labels`` is neither None | ||
nor list-like | ||
* If ``values`` cannot be sorted | ||
ValueError | ||
* If ``labels`` is not None and ``values`` contain duplicates. | ||
""" | ||
if not is_list_like(values): | ||
raise TypeError("Only list-like objects are allowed to be passed to" | ||
"safe_sort as values") | ||
values = np.array(values, copy=False) | ||
|
||
def sort_mixed(values): | ||
# order ints before strings, safe in py3 | ||
str_pos = np.array([isinstance(x, string_types) for x in values], | ||
dtype=bool) | ||
nums = np.sort(values[~str_pos]) | ||
strs = np.sort(values[str_pos]) | ||
return _ensure_object(np.concatenate([nums, strs])) | ||
|
||
sorter = None | ||
if compat.PY3 and lib.infer_dtype(values) == 'mixed-integer': | ||
# unorderable in py3 if mixed str/int | ||
ordered = sort_mixed(values) | ||
else: | ||
try: | ||
sorter = values.argsort() | ||
ordered = values.take(sorter) | ||
except TypeError: | ||
# try this anyway | ||
ordered = sort_mixed(values) | ||
|
||
# labels: | ||
|
||
if labels is None: | ||
return ordered | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 |
||
if not is_list_like(labels): | ||
raise TypeError("Only list-like objects or None are allowed to be" | ||
"passed to safe_sort as labels") | ||
labels = _ensure_platform_int(np.asarray(labels)) | ||
|
||
from pandas import Index | ||
if not assume_unique and not Index(values).is_unique: | ||
raise ValueError("values should be unique if labels is not None") | ||
|
||
if sorter is None: | ||
# mixed types | ||
(hash_klass, _), values = _get_data_algo(values, _hashtables) | ||
t = hash_klass(len(values)) | ||
t.map_locations(values) | ||
sorter = _ensure_platform_int(t.lookup(ordered)) | ||
|
||
reverse_indexer = np.empty(len(sorter), dtype=np.int_) | ||
reverse_indexer.put(sorter, np.arange(len(sorter))) | ||
|
||
mask = (labels < -len(values)) | (labels >= len(values)) | \ | ||
(labels == na_sentinel) | ||
|
||
# (Out of bound indices will be masked with `na_sentinel` next, so we may | ||
# deal with them here without performance loss using `mode='wrap'`.) | ||
new_labels = reverse_indexer.take(labels, mode='wrap') | ||
np.putmask(new_labels, mask, na_sentinel) | ||
|
||
return ordered, new_labels | ||
|
||
|
||
def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None): | ||
""" | ||
Encode input values as an enumerated type or categorical variable | ||
|
@@ -210,33 +308,10 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None): | |
uniques = uniques.to_array() | ||
|
||
if sort and len(uniques) > 0: | ||
try: | ||
sorter = uniques.argsort() | ||
except: | ||
# unorderable in py3 if mixed str/int | ||
t = hash_klass(len(uniques)) | ||
t.map_locations(_ensure_object(uniques)) | ||
|
||
# order ints before strings | ||
ordered = np.concatenate([ | ||
np.sort(np.array([e for i, e in enumerate(uniques) if f(e)], | ||
dtype=object)) for f in | ||
[lambda x: not isinstance(x, string_types), | ||
lambda x: isinstance(x, string_types)]]) | ||
sorter = _ensure_platform_int(t.lookup( | ||
_ensure_object(ordered))) | ||
|
||
reverse_indexer = np.empty(len(sorter), dtype=np.int_) | ||
reverse_indexer.put(sorter, np.arange(len(sorter))) | ||
|
||
mask = labels < 0 | ||
labels = reverse_indexer.take(labels) | ||
np.putmask(labels, mask, -1) | ||
|
||
uniques = uniques.take(sorter) | ||
uniques, labels = safe_sort(uniques, labels, na_sentinel=na_sentinel, | ||
assume_unique=True) | ||
|
||
if is_datetimetz_type: | ||
|
||
# reset tz | ||
uniques = DatetimeIndex(uniques.astype('M8[ns]')).tz_localize( | ||
values.tz) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1773,7 +1773,7 @@ def _get_consensus_name(self, other): | |
else: | ||
name = None | ||
if self.name != name: | ||
return other._shallow_copy(name=name) | ||
return self._shallow_copy(name=name) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. really? must have not been tested well There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's right. Only names are tested. |
||
return self | ||
|
||
def union(self, other): | ||
|
@@ -1920,7 +1920,8 @@ def difference(self, other): | |
Return a new Index with elements from the index that are not in | ||
`other`. | ||
|
||
This is the sorted set difference of two Index objects. | ||
This is the set difference of two Index objects. | ||
It's sorted if sorting is possible. | ||
|
||
Parameters | ||
---------- | ||
|
@@ -1946,14 +1947,27 @@ def difference(self, other): | |
|
||
other, result_name = self._convert_can_do_setop(other) | ||
|
||
theDiff = sorted(set(self) - set(other)) | ||
return Index(theDiff, name=result_name) | ||
this = self._get_unique_index() | ||
|
||
indexer = this.get_indexer(other) | ||
indexer = indexer.take((indexer != -1).nonzero()[0]) | ||
|
||
label_diff = np.setdiff1d(np.arange(this.size), indexer, | ||
assume_unique=True) | ||
the_diff = this.values.take(label_diff) | ||
try: | ||
the_diff = algos.safe_sort(the_diff) | ||
except TypeError: | ||
pass | ||
|
||
return this._shallow_copy(the_diff, name=result_name) | ||
|
||
diff = deprecate('diff', difference) | ||
|
||
def symmetric_difference(self, other, result_name=None): | ||
""" | ||
Compute the sorted symmetric difference of two Index objects. | ||
Compute the symmetric difference of two Index objects. | ||
It's sorted if sorting is possible. | ||
|
||
Parameters | ||
---------- | ||
|
@@ -1970,9 +1984,6 @@ def symmetric_difference(self, other, result_name=None): | |
``idx1`` or ``idx2`` but not both. Equivalent to the Index created by | ||
``(idx1 - idx2) + (idx2 - idx1)`` with duplicates dropped. | ||
|
||
The sorting of a result containing ``NaN`` values is not guaranteed | ||
across Python versions. See GitHub issue #6444. | ||
|
||
Examples | ||
-------- | ||
>>> idx1 = Index([1, 2, 3, 4]) | ||
|
@@ -1990,8 +2001,26 @@ def symmetric_difference(self, other, result_name=None): | |
if result_name is None: | ||
result_name = result_name_update | ||
|
||
the_diff = sorted(set((self.difference(other)). | ||
union(other.difference(self)))) | ||
this = self._get_unique_index() | ||
other = other._get_unique_index() | ||
indexer = this.get_indexer(other) | ||
|
||
# {this} minus {other} | ||
common_indexer = indexer.take((indexer != -1).nonzero()[0]) | ||
left_indexer = np.setdiff1d(np.arange(this.size), common_indexer, | ||
assume_unique=True) | ||
left_diff = this.values.take(left_indexer) | ||
|
||
# {other} minus {this} | ||
right_indexer = (indexer == -1).nonzero()[0] | ||
right_diff = other.values.take(right_indexer) | ||
|
||
the_diff = _concat._concat_compat([left_diff, right_diff]) | ||
try: | ||
the_diff = algos.safe_sort(the_diff) | ||
except TypeError: | ||
pass | ||
|
||
attribs = self._get_attributes_dict() | ||
attribs['name'] = result_name | ||
if 'freq' in attribs: | ||
|
@@ -2000,6 +2029,36 @@ def symmetric_difference(self, other, result_name=None): | |
|
||
sym_diff = deprecate('sym_diff', symmetric_difference) | ||
|
||
def _get_unique_index(self, dropna=False): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. add a doc-string (even if this is private). Is this testsed? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not yet. |
||
""" | ||
Returns an index containing unique values. | ||
|
||
Parameters | ||
---------- | ||
dropna : bool | ||
If True, NaN values are dropped. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, I no longer need a |
||
|
||
Returns | ||
------- | ||
uniques : index | ||
""" | ||
if self.is_unique and not dropna: | ||
return self | ||
|
||
values = self.values | ||
|
||
if not self.is_unique: | ||
values = self.unique() | ||
|
||
if dropna: | ||
try: | ||
if self.hasnans: | ||
values = values[~isnull(values)] | ||
except NotImplementedError: | ||
pass | ||
|
||
return self._shallow_copy(values) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. seems useful. Think about if we can use this elsewhere in the index code as well (in future). |
||
|
||
def get_loc(self, key, method=None, tolerance=None): | ||
""" | ||
Get integer location for requested label | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a versionadded tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Should I also add an entry in whatsnew, New Features?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no this is an internal function.