get_indexer_non_unique for orderable indexes #15372
Conversation
pandas/indexes/base.py
Outdated
@@ -2509,7 +2509,18 @@ def get_indexer_non_unique(self, target):
        else:
            tgt_values = target._values

        indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
        try:
            if self.is_all_dates:
You just need to check is_monotonic_increasing to see if you can do this. Using lists here is not really great (not that the original is much better). Need an asv benchmark to see if this is actually faster.
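A minimal sketch of the dispatch being suggested here, assuming (illustratively; this is not the PR's actual code) that the merge-style path is only safe when both sides are monotonic increasing:

import pandas as pd

def get_indexer_non_unique_sketch(source: pd.Index, target: pd.Index):
    # Fast path: both indexes sorted, so a single merge-style pass works.
    if source.is_monotonic_increasing and target.is_monotonic_increasing:
        # merge_loop is a hypothetical helper; a sketch of it appears
        # near the PR description at the bottom of this page.
        return merge_loop(source._values, target._values)
    # Fallback: the existing hashtable-based engine.
    return source._engine.get_indexer_non_unique(target._values)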
pandas/index.pyx
Outdated
cdef _indexer_non_unique_orderable_loop(ndarray values, ndarray targets,
                                        int64_t[:] idx0,
                                        int64_t[:] idx1,
                                        list[:] result, list[:] missing):
Int64Vector from https://github.com/pandas-dev/pandas/blob/master/pandas/src/hashtable_class_helper.pxi.in would be a much faster way to handle accumulating these arrays.
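For context, Int64Vector is pandas' C-level growable int64 buffer with amortized O(1) appends. A rough Python model of the idea (illustrative only; the real type is generated Cython, not this class):

import numpy as np

class Int64VectorSketch:
    def __init__(self):
        self.n = 0
        self.buf = np.empty(128, dtype=np.int64)

    def append(self, x):
        if self.n == len(self.buf):
            # Double the capacity so appends stay amortized O(1).
            self.buf = np.resize(self.buf, 2 * self.n)
        self.buf[self.n] = x
        self.n += 1

    def to_array(self):
        # Trim to the used length.
        return self.buf[:self.n].copy()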
The last commit sped things up quite a lot:

import numpy as np
import pandas as pd
from pandas import Index
from IPython import get_ipython

ipython = get_ipython()
i = Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())
ipython.magic("timeit i.get_indexer_non_unique(i[0:1000000])")

Previous implementation: 1 loop, best of 3: 3.42 s per loop
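Since an asv benchmark was requested above, here is a hedged sketch of what one could look like (the class name and setup are illustrative, not taken from the PR):

import numpy as np
from pandas import Index

class GetIndexerNonUnique:
    def setup(self):
        arr = np.arange(1000000)
        # Two copies back to back: non-unique, mirroring the snippet above.
        self.idx = Index(np.concatenate([arr, arr]))
        self.target = self.idx[:1000000]

    def time_get_indexer_non_unique(self):
        self.idx.get_indexer_non_unique(self.target)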
@horta Can you simply reuse Int64Vector?
@shoyer I've tried and couldn't make it work. Also, I don't think a contiguous array is the way to go for such a problem, as some indices might not repeat, or might repeat a lot. But I'm working on another implementation that relies only on fixed-size arrays, which means that I won't need either of those two types (Int64List or Int64Vector).
Can you clarify? You couldn't get it to compile, or you couldn't understand how to use it?

Couldn't we simply accumulate the indexes of matching elements, e.g., by repeatedly appending to a Python list of integers? That's effectively what Int64Vector does. I would rather not add a new low-level data structure for this specific method unless we absolutely need it. My understanding is that vectors are usually faster than linked lists anyway. It's certainly more memory efficient.
I could not import it. It looks like it is private (only in the pyx). I tried exporting it by defining it in a pxd and had some difficulties. Then I tried to copy/paste Int64Vector and still had difficulties. Basically it needs documentation...

The implementation I'm working on makes no use of new types: I will be using basic C types. It is a matter of looping over the src and tgt values twice, calling malloc twice and realloc once.
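A plain Python/NumPy sketch of that two-pass idea, assuming both inputs are sorted ascending (searchsorted stands in for the first counting loop, and both outputs are allocated exactly once, mirroring the "malloc twice" description; this is an illustration, not the PR's code):

import numpy as np

def indexer_non_unique_two_pass(values, targets):
    # Pass 1: count matches per target via binary-search bounds.
    left = np.searchsorted(values, targets, side="left")
    right = np.searchsorted(values, targets, side="right")
    counts = right - left                        # 0 => target is missing
    # Allocate both outputs once, at their final sizes.
    result = np.empty(int(np.maximum(counts, 1).sum()), dtype=np.int64)
    missing = np.empty(int((counts == 0).sum()), dtype=np.int64)
    # Pass 2: fill the preallocated arrays.
    pos = m = 0
    for j in range(len(targets)):
        if counts[j] == 0:
            result[pos] = -1                     # not-found convention
            missing[m] = j
            pos += 1
            m += 1
        else:
            result[pos:pos + counts[j]] = np.arange(left[j], right[j])
            pos += counts[j]
    return result, missing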
Take a look at the IntervalIndex PR: https://github.com/pandas-dev/pandas/pull/15309/files#diff-cdfa55be00a117a8466e28b72f60cfca You should be able to just copy the lines from that change for pandas/hashtable.pxd and then use Int64Vector.
I may actually push some of the low-level code into master later / tomorrow if I get a chance, to make this easier anyhow.
I just pushed these to master: 010393c |
pls update when you can
New commits: "C level list", "no gil", "capture index error", "wrong exception handling"
Codecov Report

@@            Coverage Diff            @@
##           master   #15372     +/-  ##
=========================================
+ Coverage   90.32%   90.43%    +0.1%
=========================================
  Files         135      134       -1
  Lines       49413    49376      -37
=========================================
+ Hits        44633    44651      +18
+ Misses       4780     4725      -55

Continue to review full report at Codecov.
import numpy as np
import pandas as pd
from pandas import Index
from IPython import get_ipython

ipython = get_ipython()
i = Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())
ipython.magic("timeit i.get_indexer_non_unique(i[0:1000000])")

The above test now runs in 714 ms. The implementation as of the last commit bf4b3f5 uses fixed-size arrays only, so no resizing is needed. I think that is the best we can do in terms of data structure, but it can be further improved by, for example, getting rid of the PyObject_RichCompare calls that come from Python object comparisons. Maybe with Cython fused types?
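To see why those PyObject_RichCompare calls matter, here is a quick illustrative timing (numbers vary by machine): an object-dtype array compares element by element through the Python object protocol, while an int64 array compares at C speed.

import numpy as np
import timeit

a = np.arange(1000000)   # int64: typed C-level comparisons
b = a.astype(object)     # object dtype: PyObject_RichCompare per element

print(timeit.timeit(lambda: a < 500000, number=10))
print(timeit.timeit(lambda: b < 500000, number=10))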
@horta please rebase on origin/master.
@jreback after doing
Not sure what to do now...
so the issue is you are using your
I think I need to start a new pull request because I don't know what is going on anymore... (I've tried the above command)
superseded by #15468
Tackles the performance issue raised in #15364 by looping simultaneously over the source and target indexes, avoiding dictionary and set lookups inside the for/while loops.

I guess it can be further improved in terms of static typing: for example, if both source and target indexes hold simple C types, the function _indexer_non_unique_orderable_loop should be way faster if we provide that information.
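A minimal Python sketch of that simultaneous loop, assuming both indexes are sorted ascending (names are illustrative; the PR's Cython version works on typed arrays and preallocated outputs):

def merge_loop(values, targets):
    # Advance a single cursor over `values` instead of doing a
    # dict/set lookup per target element.
    result, missing = [], []
    i = 0
    for j, t in enumerate(targets):
        while i < len(values) and values[i] < t:
            i += 1
        if i < len(values) and values[i] == t:
            k = i
            while k < len(values) and values[k] == t:
                result.append(k)        # every duplicate position matches
                k += 1
        else:
            result.append(-1)           # target not present
            missing.append(j)
    return result, missing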