get_indexer_non_unique for orderable indexes #15372

Closed
54 commits
c1b657e
get_indexer_non_unique for orderable indexes
horta Feb 12, 2017
2f971a2
BUG: Avoid grafting missing examples directory (#15373)
neirbowj Feb 12, 2017
1bad601
CLN: remove pandas/io/auth.py, from ga.py (now removed) (#15374)
jreback Feb 12, 2017
5fb5228
TST: consolidate remaining tests under pandas.tests
jreback Feb 12, 2017
1bcc10d
TST: fix locations for github based url tests
jreback Feb 12, 2017
f87db63
DOC: fix path in whatsnew
jreback Feb 12, 2017
1190ac6
TST: use xdist for multiple cpu testing
jreback Feb 11, 2017
0915857
Typo (#15377)
andrewkittredge Feb 12, 2017
a0f7fc0
TST: control skipping of numexpr tests if its installed / used
jreback Feb 12, 2017
dda3c42
TST: make test_gbq single cpu
jreback Feb 12, 2017
47f7ce3
C level list
horta Feb 12, 2017
09dd91b
no gil
horta Feb 12, 2017
010393c
ENH: expose Int64VectorData in hashtable.pxd
jreback Feb 13, 2017
d9e75c7
TST: xfail most test_gbq tests for now
jreback Feb 13, 2017
2e55efc
capture index error
horta Feb 13, 2017
6916dad
wrong exception handling
horta Feb 13, 2017
86ca84d
TST: Fix gbq integration tests. gbq._Dataset.dataset() would not retu…
parthea Feb 14, 2017
ff0deec
Bug: Raise ValueError with interpolate & fillna limit = 0 (#9217)
mroeschke Feb 14, 2017
5959fe1
CLN: create core/sorting.py
jreback Feb 14, 2017
4b97db4
TST: disable gbq tests again
jreback Feb 15, 2017
25fb173
TST: fix incorrect url in compressed url network tests in parser
jreback Feb 15, 2017
03bb900
TST: incorrect skip in when --skip-network is run
jreback Feb 15, 2017
bbb583c
TST: fix test_nework.py fixture under py27
jreback Feb 15, 2017
2372d27
BLD: Numexpr 2.4.6 required
Feb 15, 2017
b261dfe
TST: print skipped tests files
jreback Feb 15, 2017
e351ed0
PERF: high memory in MI
jreback Feb 15, 2017
93f5e3a
STYLE: flake8 upgraded to 3.3 on conda (#15412)
jreback Feb 15, 2017
86ef3ca
DOC: use shared_docs for Index.get_indexer, get_indexer_non_unique (#…
jreback Feb 15, 2017
d6f8b46
BLD: use latest conda version with latest miniconda installer on appv…
jreback Feb 15, 2017
f2246cf
TST: convert yield based test_pickle.py to parametrized to remove war…
jreback Feb 16, 2017
ddb22f5
TST: Parametrize simple yield tests
QuLogic Feb 16, 2017
5a8883b
BUG: Ensure the right values are set in SeriesGroupBy.nunique
Feb 16, 2017
c7300ea
BUG: Concat with inner join and empty DataFrame
abaldenko Feb 16, 2017
9b5d848
ENH: Added ability to freeze panes from DataFrame.to_excel() (#15160)
jeffcarey Feb 16, 2017
c588dd1
Documents touch-up for DataFrame.to_excel() freeze_panes option (#15436)
jeffcarey Feb 17, 2017
f4e672c
BUG: to_sql convert index name to string (#15404) (#15423)
redbullpeter Feb 17, 2017
54b6c6e
DOC: add whatsnew for #15423
jorisvandenbossche Feb 17, 2017
763f42f
TST: remove yielding tests from test_msgpacks.py (#15427)
jreback Feb 17, 2017
f65a641
ENH: Don't add rowspan/colspan if it's 1.
QuLogic Feb 17, 2017
a17a03a
DOC: correct rpy2 examples (GH15142) (#15450)
jorisvandenbossche Feb 18, 2017
29aeffb
BUG: rolling not accepting Timedelta-like window args (#15443)
mroeschke Feb 18, 2017
be4a63f
BUG: testing on windows
jreback Feb 18, 2017
c7a1e00
get_indexer_non_unique for orderable indexes
horta Feb 12, 2017
34545d4
Merge branch 'master' of https://github.com/Horta/pandas
horta Feb 20, 2017
390bfb2
get_indexer_non_unique for orderable indexes
horta Feb 12, 2017
f38cf52
C level list
horta Feb 12, 2017
9dabf34
no gil
horta Feb 12, 2017
f61b98f
capture index error
horta Feb 13, 2017
6afb8c9
wrong exception handling
horta Feb 13, 2017
5494a4c
Merge branch 'master' of https://github.com/Horta/pandas
horta Feb 20, 2017
bf4b3f5
fixed-size arrays for get_index mapping
horta Feb 20, 2017
0f37a64
dtype=np.int64
horta Feb 20, 2017
3c218ce
empty and zeros with np.int64
horta Feb 20, 2017
74ce239
as array
horta Feb 20, 2017
69 changes: 69 additions & 0 deletions pandas/index.pyx
@@ -45,6 +45,47 @@ cdef extern from "Python.h":
     int PySlice_Check(object)


+@cython.boundscheck(False)
+@cython.wraparound(False)
+@cython.initializedcheck(False)
+cdef _indexer_non_unique_orderable_loop(ndarray values, ndarray targets,
+                                        int64_t[:] idx0,
+                                        int64_t[:] idx1,
+                                        list[:] result, list[:] missing):

Review comment (Member): Int64Vector from https://github.com/pandas-dev/pandas/blob/master/pandas/src/hashtable_class_helper.pxi.in would be a much faster way to handle accumulating these arrays.

+    cdef:
+        Py_ssize_t i = 0, j = 0, n = idx0.shape[0], n_t = idx1.shape[0]
+
+    while i < n and j < n_t:
+
+        val0 = values[idx0[i]]
+        val1 = targets[idx1[j]]
+
+        if val0 == val1:
+
+            while i < n and values[idx0[i]] == val1:
+                result[idx1[j]].append(idx0[i])
+                i += 1
+
+            j += 1
+            while j < n_t and val0 == targets[idx1[j]]:
+                result[idx1[j]] = result[idx1[j-1]]
+                j += 1
+
+        elif val0 > val1:
+
+            result[idx1[j]].append(-1)
+            missing[idx1[j]].append(idx1[j])
+            j += 1
+
+        else:
+            i += 1
+
+    while j < n_t:
+        result[idx1[j]].append(-1)
+        missing[idx1[j]].append(idx1[j])
+        j += 1
+
+
 cdef inline is_definitely_invalid_key(object val):
     if PyTuple_Check(val):
         try:
@@ -372,6 +413,34 @@ cdef class IndexEngine:

         return result[0:count], missing[0:count_missing]

+    def get_indexer_non_unique_orderable(self, ndarray targets,
+                                         int64_t[:] idx0,
+                                         int64_t[:] idx1):
+
+        cdef:
+            ndarray values
+            object val0, val1
+            Py_ssize_t i, n_t
+
+        self._ensure_mapping_populated()
+        values = self._get_index_values()
+        n_t = len(targets)
+
+        result = np.empty((n_t,), dtype=np.object_)
+        result.fill([])
+        result = np.frompyfunc(list,1,1)(result)
+
+        missing = np.empty((n_t,), dtype=np.object_)
+        missing.fill([])
+        missing = np.frompyfunc(list,1,1)(missing)
+
+        _indexer_non_unique_orderable_loop(values, targets, idx0, idx1,
+                                           result, missing)
+
+        result = np.concatenate(result)
+        missing = np.asarray(np.concatenate(missing), np.int64)
+
+        return result, missing

 cdef Py_ssize_t _bin_search(ndarray values, object val) except -1:
     cdef:
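
Note on the index.pyx change: the merge loop and its wrapper can be sketched in pure Python for readers who want to trace the algorithm. This is an illustrative sketch under stated assumptions, not the PR's code. The function name `indexer_non_unique_sorted` is hypothetical, plain Python lists stand in for the ndarrays and memoryviews, and the built-in `sorted` (which is stable) stands in for `np.argsort(..., kind='mergesort')`.

```python
def indexer_non_unique_sorted(values, targets):
    """Hypothetical pure-Python sketch of the sorted-merge lookup.

    Returns (indexer, missing): for each target, in target order, every
    position in `values` that equals it (or -1 if there is none), plus the
    target positions that had no match.
    """
    # Stable argsort of both sides (stand-in for np.argsort with mergesort).
    idx0 = sorted(range(len(values)), key=values.__getitem__)
    idx1 = sorted(range(len(targets)), key=targets.__getitem__)

    n, n_t = len(idx0), len(idx1)
    result = [[] for _ in range(n_t)]   # one list of matches per target
    missing = [[] for _ in range(n_t)]  # one list of misses per target

    i = j = 0
    while i < n and j < n_t:
        val0 = values[idx0[i]]
        val1 = targets[idx1[j]]
        if val0 == val1:
            # Collect every position in `values` equal to this target.
            while i < n and values[idx0[i]] == val1:
                result[idx1[j]].append(idx0[i])
                i += 1
            j += 1
            # Duplicate targets share the list just built.
            while j < n_t and targets[idx1[j]] == val0:
                result[idx1[j]] = result[idx1[j - 1]]
                j += 1
        elif val0 > val1:
            # The target is smaller than every remaining value: no match.
            result[idx1[j]].append(-1)
            missing[idx1[j]].append(idx1[j])
            j += 1
        else:
            i += 1

    # Targets left over once `values` is exhausted are all missing.
    while j < n_t:
        result[idx1[j]].append(-1)
        missing[idx1[j]].append(idx1[j])
        j += 1

    # Flatten in target order (stand-in for np.concatenate).
    indexer = [pos for group in result for pos in group]
    missing_flat = [pos for group in missing for pos in group]
    return indexer, missing_flat


indexer, missing = indexer_non_unique_sorted(["b", "a", "b", "c"],
                                             ["b", "d", "a", "b"])
print(indexer, missing)
```

Because the per-target lists are flattened in target order at the end, the output order follows the targets even though the loop visits them in sorted order.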
13 changes: 12 additions & 1 deletion pandas/indexes/base.py
@@ -2509,7 +2509,18 @@ def get_indexer_non_unique(self, target):
         else:
             tgt_values = target._values

-        indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
+        try:
+            if self.is_all_dates:

Review comment (Contributor): you just need to check is_monotonic_increasing to see if you can do this.

+                idx0 = np.argsort(self.asi8, kind='mergesort')
+            else:
+                idx0 = np.argsort(self._values, kind='mergesort')
+
+            idx1 = np.argsort(tgt_values, kind='mergesort')
+            indexer, missing = self._engine.get_indexer_non_unique_orderable(tgt_values, idx0, idx1)
+
+        except TypeError:
+            indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
+
         return Index(indexer), missing

     def get_indexer_for(self, target, **kwargs):
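
Note on the is_monotonic_increasing review comment: one possible reading (an assumption, since the reviewer does not spell it out) is that an index whose values are already non-decreasing needs no sort, because a stable argsort of such a sequence is the identity permutation. A minimal pure-Python sketch of that idea follows. Both helper names are hypothetical and are not pandas API.

```python
def is_non_decreasing(values):
    """Hypothetical helper: True when each element is >= the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


def sort_order(values):
    """Hypothetical helper: stable argsort that skips the sort when possible."""
    if is_non_decreasing(values):
        # Already ordered, so the stable sort order is just 0..n-1.
        return list(range(len(values)))
    # Python's sorted is stable, like np.argsort(..., kind='mergesort').
    return sorted(range(len(values)), key=values.__getitem__)


print(sort_order([1, 1, 2, 3]))
print(sort_order([3, 1, 2, 1]))
```

The check is a single linear pass, which is cheaper than the sort it avoids when the input is already ordered.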