get_indexer_non_unique for orderable indexes #15372

Closed · wants to merge 54 commits

Conversation

@horta (Contributor) commented Feb 12, 2017

Tackles the performance issue raised in #15364 by looping simultaneously over the source and target indexes, avoiding dictionary and set lookups inside the for/while loops.

I guess it can be further improved in terms of static typing. For example, if both the source and target indexes hold simple C types, the function _indexer_non_unique_orderable_loop should be considerably faster if we provide that type information.
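For illustration, here is a minimal pure-Python sketch of that simultaneous loop (not the Cython code in this PR), assuming both inputs are already sorted; the helper name indexer_non_unique_sorted is made up for this example:

import numpy as np

def indexer_non_unique_sorted(values, targets):
    # For each element of targets, collect the positions of matching
    # elements in values (or a single -1 if absent), plus the target
    # positions that had no match: the same contract as
    # get_indexer_non_unique, restricted to sorted inputs.
    result, missing = [], []
    i, n = 0, len(values)
    for j, t in enumerate(targets):
        # advance the source pointer past anything smaller than the target
        while i < n and values[i] < t:
            i += 1
        if i < n and values[i] == t:
            # emit every duplicate of the matching value
            k = i
            while k < n and values[k] == t:
                result.append(k)
                k += 1
        else:
            result.append(-1)
            missing.append(j)
    return np.asarray(result, dtype=np.intp), np.asarray(missing, dtype=np.intp)

For example, indexer_non_unique_sorted(np.array([1, 1, 2, 4]), np.array([1, 3, 4])) gives ([0, 1, -1, 3], [1]).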

@@ -2509,7 +2509,18 @@ def get_indexer_non_unique(self, target):
else:
tgt_values = target._values

indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
try:
if self.is_all_dates:
Review comment (Contributor):

you just need to check is_monotonic_increasing to see if you can do this.
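As a rough sketch of that dispatch (a hypothetical wrapper reusing indexer_non_unique_sorted from the sketch above; not code from this PR):

def get_indexer_non_unique_fast_path(index, target):
    # take the merge-based path only when both indexes are known to be
    # sorted; otherwise keep the existing hashtable engine call
    if index.is_monotonic_increasing and target.is_monotonic_increasing:
        return indexer_non_unique_sorted(index._values, target._values)
    return index._engine.get_indexer_non_unique(target._values)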

@jreback (Contributor) commented Feb 12, 2017

using lists here is not really great (not that the original is much better).

@jreback (Contributor) commented Feb 12, 2017

need an asv benchmark to see if this is actually better
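For reference, an asv benchmark for this case might look roughly like the following; the file location, class name and sizes are illustrative, not taken from the PR:

# e.g. a benchmark module under asv_bench/benchmarks/ (illustrative)
import numpy as np
from pandas import Index


class GetIndexerNonUnique(object):

    def setup(self):
        # a non-unique index: every value appears twice
        base = np.arange(1000000)
        self.idx = Index(np.concatenate([base, base]))
        self.target = self.idx[:1000000]

    def time_get_indexer_non_unique(self):
        self.idx.get_indexer_non_unique(self.target)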

pandas/index.pyx (outdated)

cdef _indexer_non_unique_orderable_loop(ndarray values, ndarray targets,
                                        int64_t[:] idx0,
                                        int64_t[:] idx1,
                                        list[:] result, list[:] missing):
Review comment (Member):

Int64Vector from https://github.com/pandas-dev/pandas/blob/master/pandas/src/hashtable_class_helper.pxi.in would be a much faster way to handle accumulating these arrays.

@jreback added the Indexing (related to indexing on series/frames, not to indexes themselves) and Performance (memory or execution speed) labels on Feb 12, 2017
jreback and others added 8 commits February 12, 2017 09:51
move all remaining tests so that ALL tests are now
under pandas/tests

Author: Jeff Reback <[email protected]>

Closes #15371 from jreback/tests and squashes the following commits:

43039e4 [Jeff Reback] add in data
118127b [Jeff Reback] wip
bfa6a9c [Jeff Reback] fix data locations
79a79e6 [Jeff Reback] fix import
57437bf [Jeff Reback] fixes
b407586 [Jeff Reback] move io
e13bfe3 [Jeff Reback] move tools
0194e31 [Jeff Reback] move computation
0e6bcb4 [Jeff Reback] rename test_msgpack -> msgpack
c5e4ab8 [Jeff Reback] move sparse
42e60e2 [Jeff Reback] move api tests
@horta (Contributor, Author) commented Feb 12, 2017

The last commit speeds things up quite a lot:

import numpy as np
import pandas as pd
from pandas import Index

from IPython import get_ipython
ipython = get_ipython()

i = Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())

ipython.magic("timeit i.get_indexer_non_unique(i[0:1000000])")

Previous implementation: 1 loop, best of 3: 3.42 s per loop
Pull request implementation: 1 loop, best of 3: 681 ms per loop

@shoyer (Member) commented Feb 12, 2017

@horta Can you simply reuse Int64Vector instead of making a new Int64List type?

@horta (Contributor, Author) commented Feb 12, 2017

@shoyer I've tried and couldn't make it work. Also, I don't think a contiguous array is the way to go for such a problem, as some indices might not repeat at all while others might repeat a lot.

But I'm working on another implementation that relies only on fixed-size arrays, which means I won't need either of those two types (Int64List or Int64Vector).

@shoyer (Member) commented Feb 12, 2017

I've tried and couldn't make it work

Can you clarify? You couldn't get it to compile or you couldn't understand how to use it?

Also, I don't think a contiguous array is the way to go for such a problem, as some indices might not repeat, or might repeat a lot.

Couldn't we simply accumulate the indexes of matching elements, e.g. by repeatedly appending to a Python list of integers? That's effectively what Int64Vector is for, except without the overhead of boxing in Python objects. It's essentially a C++ vector written in Cython.

I would rather not add a new low-level data structure for this specific method unless we absolutely need it. My understanding is that vectors are usually faster than linked lists anyway, and they're certainly more memory efficient.
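In plain Python terms, the accumulate-then-materialize pattern described here is simply the following; Int64Vector does the same at C level without boxing each integer:

import numpy as np

# append matching positions one at a time, then materialize a single
# contiguous int64 array at the end
positions = []
for i, v in enumerate([1, 1, 2, 4]):
    if v == 1:
        positions.append(i)
indexer = np.asarray(positions, dtype=np.int64)  # array([0, 1])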

@horta (Contributor, Author) commented Feb 12, 2017

Can you clarify? You couldn't get it to compile or you couldn't understand how to use it?

I could not import it. It looks like it is private (defined only in the .pyx). I tried to export it by declaring it in a .pxd and ran into some difficulties. Then I tried to copy/paste Int64Vector and still had difficulties. Basically, it needs documentation...

I would rather not add a new low-level data structure for this specific method unless we absolutely need it.

The implementation I'm working on makes no use of new types: I will be using basic C types. It is a matter of looping over the source and target values twice, calling malloc twice and realloc once.
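As an illustration of that two-pass idea in plain Python/NumPy (count first, allocate the exact sizes, then fill; again assuming sorted inputs, with a made-up function name, not the C-level code from the commit):

import numpy as np

def indexer_non_unique_two_pass(values, targets):
    n = len(values)
    # pass 1: count how many output slots are needed (the "malloc" sizes)
    n_hits = n_miss = 0
    i = 0
    for t in targets:
        while i < n and values[i] < t:
            i += 1
        k = i
        while k < n and values[k] == t:
            k += 1
        if k == i:
            n_miss += 1
            n_hits += 1  # a miss still emits one -1 entry
        else:
            n_hits += k - i
    # allocate the exact sizes up front
    result = np.empty(n_hits, dtype=np.intp)
    missing = np.empty(n_miss, dtype=np.intp)
    # pass 2: fill the preallocated arrays
    r = m = 0
    i = 0
    for j, t in enumerate(targets):
        while i < n and values[i] < t:
            i += 1
        k = i
        while k < n and values[k] == t:
            result[r] = k
            r += 1
            k += 1
        if k == i:
            result[r] = -1
            r += 1
            missing[m] = j
            m += 1
    return result, missing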

@shoyer (Member) commented Feb 12, 2017

Take a look at the IntervalIndex PR: https://github.com/pandas-dev/pandas/pull/15309/files#diff-cdfa55be00a117a8466e28b72f60cfca

You should be able to just copy the lines from that change for pandas/hashtable.pxd and then use from hashtable cimport Int64Vector

@jreback (Contributor) commented Feb 12, 2017

I may actually push some of the low-level code into master later or tomorrow, if I get a chance, to make this easier anyhow.

@jreback (Contributor) commented Feb 13, 2017

I just pushed these to master: 010393c

@jreback (Contributor) commented Feb 16, 2017

pls update when you can

jeffcarey and others added 10 commits February 17, 2017 09:17
* Converted index name to string to fix issue #15404 - BUG: to_sql errors with numeric index name - needs conversion to string

* Additional int to string conversion added. Associated test cases added.

* PEP 8 compliance edits

* Removed extraneous brackets
Just a small thing I noticed in a [footnote
here](https://danluu.com/web-bloat/#appendix-irony). Probably can't do
much about the extra classes, but rowspan/colspan seem like easy fixes
to save a few bytes per row/col and it's already done in the other
code path.

Author: Elliott Sales de Andrade <[email protected]>

Closes #15403 from QuLogic/no-extra-span and squashes the following commits:

9a8fcee [Elliott Sales de Andrade] Don't add rowspan/colspan if it's 1.
- we are passing builds which actually have an error
- fix the small dtype issues

Author: Jeff Reback <[email protected]>

Closes #15445 from jreback/windows and squashes the following commits:

a5b7fb3 [Jeff Reback] change integer to power comparisions
eab15c4 [Jeff Reback] don't force remove pandas
cf3b9bd [Jeff Reback] more windows fixing
efe6a76 [Jeff Reback] add cython to build
8194e63 [Jeff Reback] don't use appveyor recipe, just build inplace
e064825 [Jeff Reback] TST: resample dtype issue xref #15418
10d9b26 [Jeff Reback] TST: run windows tests so failures show up in appeveyor
C level list

no gil

capture index error

wrong exception handling
@codecov-io commented Feb 20, 2017

Codecov Report

Merging #15372 into master will increase coverage by 0.1%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #15372     +/-   ##
=========================================
+ Coverage   90.32%   90.43%   +0.1%     
=========================================
  Files         135      134      -1     
  Lines       49413    49376     -37     
=========================================
+ Hits        44633    44651     +18     
+ Misses       4780     4725     -55
Impacted Files Coverage Δ
pandas/indexes/base.py 96.27% <100%> (+0.04%)
pandas/core/common.py 91.02% <ø> (-0.34%)
pandas/tseries/offsets.py 97.13% <ø> (-0.13%)
pandas/core/frame.py 97.87% <ø> (ø)
pandas/io/auth.py
pandas/util/terminal.py 47.82% <ø> (+2.89%)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61a243b...74ce239.

@horta (Contributor, Author) commented Feb 20, 2017

import numpy as np
import pandas as pd
from pandas import Index

from IPython import get_ipython
ipython = get_ipython()

i = Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())

ipython.magic("timeit i.get_indexer_non_unique(i[0:1000000])")

The above test now runs in 714 ms. The implementation as of the last commit (bf4b3f5) uses fixed-size arrays only; no resizing is needed.

I think that is the best we can do in terms of data structures. But it can be further improved by, for example, getting rid of the PyObject_RichCompare calls caused by Python object comparisons. Maybe Cython fused types?

@jreback (Contributor) commented Feb 20, 2017

@horta please rebase on origin/master.

git rebase -i origin/master and git push yourremote thisbranch -f

@horta (Contributor, Author) commented Feb 20, 2017

@jreback after doing git rebase -i origin/master now I got

> git status
interactive rebase in progress; onto be4a63fe7
Last commands done (6 commands done):
   pick 6afb8c910 wrong exception handling
   pick c7a1e0090 get_indexer_non_unique for orderable indexes
  (see more in file .git/rebase-merge/done)
Next commands to do (9 remaining commands):
   pick c1b657e19 get_indexer_non_unique for orderable indexes
   pick 47f7ce325 C level list
  (use "git rebase --edit-todo" to view and edit)
You are currently editing a commit while rebasing branch 'master' on 'be4a63fe7'.
  (use "git commit --amend" to amend the current commit)
  (use "git rebase --continue" once you are satisfied with your changes)

nothing to commit, working tree clean

Not sure what to do now...

@jreback (Contributor) commented Feb 21, 2017

The issue is that you are using your master branch as if it were a feature branch, which is generally not recommended. To fix it, you can do:

git checkout yourbranch
git fetch origin
git rebase -i origin/master

@horta (Contributor, Author) commented Feb 21, 2017

I think I need to start a new pull request because I don't know what is going on anymore... (I've tried the above command)

@jreback (Contributor) commented Feb 21, 2017

superseded by #15468

@jreback closed this on Feb 21, 2017
Labels: Indexing (related to indexing on series/frames, not to indexes themselves), Performance (memory or execution speed)