PERF: fast non-unique indexing #15468
Conversation
else:
    tgt_values = target._values
    src_values = self._values
this is quite messy, what are you trying to do here?
Agreed! Just factorised that.
you are adding much more code. Where exactly does this routine fail? The only scenario I can see is with actual mixed types (which I am happy to raise on when you have a non-unique index).
Further, it looks like you are duplicating a lot of functionality that already exists; see pandas/core/algorithms.py w.r.t. counting / sorting / factorizing.
I didn't find routines in pandas/core/algorithms.py that I could use. I have now documented the functions in index.pyx, so I hope it is clearer.
Yes, I was having problems with mixed types. For example,

v = np.array([1, 'danilo'], object)
v[0] < v[1]

raises a TypeError exception.
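A minimal, self-contained reproduction of that orderability problem (the try/except-around-argsort pattern below mirrors the one used in this PR; the helper name is illustrative):

```python
import numpy as np

def is_orderable(arr):
    """Return True if the array can be argsorted, False otherwise.

    Under Python 3, comparing int and str raises TypeError, so a
    mixed-type object array cannot be sorted.
    """
    try:
        np.argsort(arr, kind='mergesort')
        return True
    except TypeError:
        return False

print(is_orderable(np.array([1, 'danilo'], dtype=object)))  # False: mixed types
print(is_orderable(np.array([3, 1, 2], dtype=object)))      # True: homogeneous ints
```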
@@ -371,6 +482,16 @@ cdef class IndexEngine:

        return result[0:count], missing[0:count_missing]

    def get_indexer_non_unique_orderable(self, ndarray targets,
                                         int64_t[:] idx0,
                                         int64_t[:] idx1):
this is a massive amount of added code, what exactly are you doing?
Notice that I have removed the code regarding lists. Most of the code consists of two functions: _count and _map. They implement two very specific algorithms that I couldn't find in the above-mentioned file.
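As a rough illustration only (the semantics are inferred from the docstrings visible in the diff, not from the actual Cython code), here is a pure-Python sketch of the two operations over sorted arrays, done as merge-style linear scans:

```python
import numpy as np

def count_occurrences(values, targets):
    """For each element of sorted `targets`, count how many times it
    appears in sorted `values` (single linear scan over both arrays)."""
    counts = np.zeros(len(targets), dtype=np.int64)
    i = 0
    for j, t in enumerate(targets):
        # advance past values smaller than the current target
        while i < len(values) and values[i] < t:
            i += 1
        k = i
        while k < len(values) and values[k] == t:
            counts[j] += 1
            k += 1
    return counts

def map_targets_to_values(values, targets):
    """For each element of sorted `targets`, collect every position at
    which it occurs in sorted `values`, or -1 if it is missing."""
    result = []
    i = 0
    for t in targets:
        while i < len(values) and values[i] < t:
            i += 1
        k = i
        found = False
        while k < len(values) and values[k] == t:
            result.append(k)
            found = True
            k += 1
        if not found:
            result.append(-1)
    return np.asarray(result, dtype=np.int64)
```

For example, with values = [1, 1, 2, 4] and targets = [1, 3, 4], counting gives [2, 0, 1] and mapping gives [0, 1, -1, 3].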
Codecov Report
@@            Coverage Diff            @@
##           master    #15468    +/-  ##
==========================================
+ Coverage   90.37%    90.37%   +<.01%
==========================================
  Files         135       135
  Lines       49473     49494      +21
==========================================
+ Hits        44709     44730      +21
  Misses       4764      4764
Continue to review full report at Codecov.
Benchmark:

import numpy as np
import pandas as pd
from time import time
from pandas import Index

n = 10000
r = 100
for j in range(1, 11):
    i = Index(list(range(j*n)) * r)
    slic = i[0:j*n]
    start = time()
    i.get_indexer_non_unique(slic)
    print(time() - start)

Elapsed time in seconds:
            int64_t[:] mapping_count, int64_t[:] missing_count):
    """
    Compute the number of times a `targets` value is in `values`.
this is exactly value_counts
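The reviewer's point, sketched: pandas already exposes this kind of frequency counting as `value_counts` on an Index (assuming the counting in `_count` is plain per-value frequency, which the docstring above suggests):

```python
import pandas as pd

# value_counts returns a Series mapping each distinct value
# to the number of times it appears in the index
idx = pd.Index([1, 1, 2, 2, 2, 4])
counts = idx.value_counts()
print(counts[2])  # 3
print(counts[1])  # 2
print(counts[4])  # 1
```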
@horta you are adding more and more code. I want LESS code. you are rewriting the world here.
pls add an asv.
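An asv (airspeed velocity) benchmark for this path might look like the following sketch, based on the timing script above; the class and method names are illustrative, not the ones eventually merged. asv discovers classes with a `setup()` method and times any `time_*` methods:

```python
import numpy as np
from pandas import Index

class NonUniqueIndexing:
    # hypothetical asv benchmark class for the non-unique indexer
    def setup(self):
        n, r = 10000, 100
        # a non-unique index: 0..n-1 repeated r times
        self.idx = Index(np.tile(np.arange(n), r))
        self.target = self.idx[:n]

    def time_get_indexer_non_unique(self):
        self.idx.get_indexer_non_unique(self.target)
```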
@cython.initializedcheck(False)
cdef _map(ndarray values, ndarray targets, int64_t[:] idx0, int64_t[:] idx1,
          int64_t[:] start_mapping, int64_t[:] start_missing,
          int64_t[:] mapping, int64_t[:] missing):
this signature is WAY too complicated.
def _map_targets_to_values(values, targets, idx0, idx1):
    """
    Map `targets` values to `values` positions.
this looks like a take, but really not sure what you are doing
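For reference, the take operation the reviewer is alluding to simply gathers elements at given positions:

```python
import numpy as np

values = np.array([10, 20, 30, 40])
positions = np.array([2, 0, 3])

# np.take (and pandas' Index.take) gathers values[positions]
print(np.take(values, positions))  # [30 10 40]
```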
cdef:
    ndarray values
you can add 1 function in cython, anything else is WAY too complicated.
is_mono0 = self.is_monotonic_increasing
is_mono1 = target.is_monotonic_increasing
(orderable, idx0, idx1) = _order_them(is_mono0, src_values, is_mono1,
if you insist on ordering, then simply order an unordered index, get the results and take the original. this is so much complexity.
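The sort-then-take pattern the reviewer suggests can be sketched like this (a toy illustration, assuming one result per target element; the `lookup` helper is hypothetical, not part of the PR):

```python
import numpy as np

def indexer_via_sort(compute_sorted, target):
    """Run `compute_sorted` (which requires sorted input) on an
    unsorted `target` by sorting first and scattering results back."""
    order = np.argsort(target, kind='mergesort')   # stable sort
    result_sorted = compute_sorted(target[order])
    # scatter results back to the original (unsorted) positions
    result = np.empty_like(result_sorted)
    result[order] = result_sorted
    return result

# toy compute_sorted: position of each element in a sorted reference
ref = np.array([5, 7, 9])
def lookup(sorted_target):
    return np.searchsorted(ref, sorted_target)

target = np.array([9, 5, 7])
print(indexer_via_sort(lookup, target))  # [2 0 1]
```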
    indices = np.argsort(x, kind='mergesort')
except TypeError:
    return (False, None)
return (True, indices)
do ALL of this in the cython function. you are needlessly splitting things up.
@horta we don't tolerate that type of language
closes #15364
This is essentially the same pull request as #15372 but now from a proper branch.