get_indexer_non_unique for orderable indexes #15372

Closed · wants to merge 54 commits

Conversation

@horta (Contributor) commented Feb 12, 2017

Tackles the performance issue raised in #15364 by looping simultaneously over the source and target indexes, avoiding dictionary and set lookups inside the for/while loops.

I guess it can be further improved in terms of static typing. For example, if both the source and target indexes hold simple C types, the function _indexer_non_unique_orderable_loop should be considerably faster if we provide that type information.
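For illustration, here is a minimal pure-Python sketch of that simultaneous loop (not the Cython code in this PR), assuming both inputs are already sorted; the helper name indexer_non_unique_sorted is made up for this example:

import numpy as np

def indexer_non_unique_sorted(values, targets):
    # For each element of targets, collect the positions of matching
    # elements in values (or a single -1 if absent), plus the target
    # positions that had no match: the same contract as
    # get_indexer_non_unique, restricted to sorted inputs.
    result, missing = [], []
    i, n = 0, len(values)
    for j, t in enumerate(targets):
        # advance the source pointer past anything smaller than the target
        while i < n and values[i] < t:
            i += 1
        if i < n and values[i] == t:
            # emit every duplicate of the matching value
            k = i
            while k < n and values[k] == t:
                result.append(k)
                k += 1
        else:
            result.append(-1)
            missing.append(j)
    return np.asarray(result, dtype=np.intp), np.asarray(missing, dtype=np.intp)

For example, indexer_non_unique_sorted(np.array([1, 1, 2, 4]), np.array([1, 3, 4])) gives ([0, 1, -1, 3], [1]).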

@@ -2509,7 +2509,18 @@ def get_indexer_non_unique(self, target):
else:
tgt_values = target._values

indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
try:
if self.is_all_dates:
Review comment (Contributor):

you just need to check is_monotonic_increasing to see if you can do this.
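As a rough sketch of that dispatch (a hypothetical wrapper reusing indexer_non_unique_sorted from the sketch above; not code from this PR):

def get_indexer_non_unique_fast_path(index, target):
    # take the merge-based path only when both indexes are known to be
    # sorted; otherwise keep the existing hashtable engine call
    if index.is_monotonic_increasing and target.is_monotonic_increasing:
        return indexer_non_unique_sorted(index._values, target._values)
    return index._engine.get_indexer_non_unique(target._values)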

@jreback (Contributor) commented Feb 12, 2017

using lists here is not really great (not that the original is much better).

@jreback (Contributor) commented Feb 12, 2017

need an asv benchmark to see if this is actually better
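For reference, an asv benchmark for this case might look roughly like the following; the file location, class name and sizes are illustrative, not taken from the PR:

# e.g. a benchmark module under asv_bench/benchmarks/ (illustrative)
import numpy as np
from pandas import Index


class GetIndexerNonUnique(object):

    def setup(self):
        # a non-unique index: every value appears twice
        base = np.arange(1000000)
        self.idx = Index(np.concatenate([base, base]))
        self.target = self.idx[:1000000]

    def time_get_indexer_non_unique(self):
        self.idx.get_indexer_non_unique(self.target)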

pandas/index.pyx (outdated)

cdef _indexer_non_unique_orderable_loop(ndarray values, ndarray targets,
                                        int64_t[:] idx0,
                                        int64_t[:] idx1,
                                        list[:] result, list[:] missing):
Review comment (Member):

Int64Vector from https://github.com/pandas-dev/pandas/blob/master/pandas/src/hashtable_class_helper.pxi.in would be a much faster way to handle accumulating these arrays.

@jreback added the Indexing (related to indexing on series/frames, not to indexes themselves) and Performance (memory or execution speed) labels on Feb 12, 2017
jreback and others added 8 commits February 12, 2017 09:51
move all remaining tests so that ALL tests are now
under pandas/tests

Author: Jeff Reback <[email protected]>

Closes #15371 from jreback/tests and squashes the following commits:

43039e4 [Jeff Reback] add in data
118127b [Jeff Reback] wip
bfa6a9c [Jeff Reback] fix data locations
79a79e6 [Jeff Reback] fix import
57437bf [Jeff Reback] fixes
b407586 [Jeff Reback] move io
e13bfe3 [Jeff Reback] move tools
0194e31 [Jeff Reback] move computation
0e6bcb4 [Jeff Reback] rename test_msgpack -> msgpack
c5e4ab8 [Jeff Reback] move sparse
42e60e2 [Jeff Reback] move api tests
@horta (Contributor, Author) commented Feb 12, 2017

The last commit speeds things up quite a lot:

import numpy as np
import pandas as pd
from pandas import Index

from IPython import get_ipython
ipython = get_ipython()

i = Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())

ipython.magic("timeit i.get_indexer_non_unique(i[0:1000000])")

Previous implementation: 1 loop, best of 3: 3.42 s per loop
Pull request implementation: 1 loop, best of 3: 681 ms per loop

@shoyer (Member) commented Feb 12, 2017

@horta Can you simply reuse Int64Vector instead of making a new Int64List type?

@horta (Contributor, Author) commented Feb 12, 2017

@shoyer I've tried and couldn't make it work. Also, I don't think a contiguous array is the way to go for such a problem, as some indices might not repeat at all while others might repeat a lot.

But I'm working on another implementation that relies only on fixed-size arrays, which means I won't need either of those two types (Int64List or Int64Vector).

@shoyer (Member) commented Feb 12, 2017

I've tried and couldn't make it work

Can you clarify? You couldn't get it to compile or you couldn't understand how to use it?

Also, I don't think a contiguous array is the way to go for such a problem, as some indices might not repeat, or might repeat a lot.

Couldn't we simply accumulate the indexes of matching elements, e.g. by repeatedly appending to a Python list of integers? That's effectively what Int64Vector is for, except without the overhead of boxing in Python objects. It's essentially a C++ vector written in Cython.

I would rather not add a new low-level data structure for this specific method unless we absolutely need it. My understanding is that vectors are usually faster than linked lists anyway, and they're certainly more memory efficient.
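In plain Python terms, the accumulate-then-materialize pattern described here is simply the following; Int64Vector does the same at C level without boxing each integer:

import numpy as np

# append matching positions one at a time, then materialize a single
# contiguous int64 array at the end
positions = []
for i, v in enumerate([1, 1, 2, 4]):
    if v == 1:
        positions.append(i)
indexer = np.asarray(positions, dtype=np.int64)  # array([0, 1])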

@horta (Contributor, Author) commented Feb 12, 2017

Can you clarify? You couldn't get it to compile or you couldn't understand how to use it?

I could not import it. It looks like it is private (defined only in the .pyx). I tried to export it by declaring it in a .pxd and ran into some difficulties. Then I tried to copy/paste Int64Vector and still had difficulties. Basically, it needs documentation...

I would rather not add a new low-level data structure for this specific method unless we absolutely need it.

The implementation I'm working on makes no use of new types: I will be using basic C types. It is a matter of looping over the source and target values twice, calling malloc twice and realloc once.
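As an illustration of that two-pass idea in plain Python/NumPy (count first, allocate the exact sizes, then fill; again assuming sorted inputs, with a made-up function name, not the C-level code from the commit):

import numpy as np

def indexer_non_unique_two_pass(values, targets):
    n = len(values)
    # pass 1: count how many output slots are needed (the "malloc" sizes)
    n_hits = n_miss = 0
    i = 0
    for t in targets:
        while i < n and values[i] < t:
            i += 1
        k = i
        while k < n and values[k] == t:
            k += 1
        if k == i:
            n_miss += 1
            n_hits += 1  # a miss still emits one -1 entry
        else:
            n_hits += k - i
    # allocate the exact sizes up front
    result = np.empty(n_hits, dtype=np.intp)
    missing = np.empty(n_miss, dtype=np.intp)
    # pass 2: fill the preallocated arrays
    r = m = 0
    i = 0
    for j, t in enumerate(targets):
        while i < n and values[i] < t:
            i += 1
        k = i
        while k < n and values[k] == t:
            result[r] = k
            r += 1
            k += 1
        if k == i:
            result[r] = -1
            r += 1
            missing[m] = j
            m += 1
    return result, missing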

@shoyer (Member) commented Feb 12, 2017

Take a look at the IntervalIndex PR: https://github.com/pandas-dev/pandas/pull/15309/files#diff-cdfa55be00a117a8466e28b72f60cfca

You should be able to just copy the lines from that change for pandas/hashtable.pxd and then use from hashtable cimport Int64Vector

@jreback (Contributor) commented Feb 12, 2017

I may actually push some of the low-level code into master later or tomorrow, if I get a chance, to make this easier anyhow.

@jreback (Contributor) commented Feb 13, 2017

I just pushed these to master: 010393c

@jreback (Contributor) commented Feb 16, 2017

pls update when you can

jeffcarey and others added 10 commits February 17, 2017 09:17
* Converted index name to string to fix issue #15404 - BUG: to_sql errors with numeric index name - needs conversion to string

* Additional int to string conversion added. Associated test cases added.

* PEP 8 compliance edits

* Removed extraneous brackets
Just a small thing I noticed in a [footnote
here](https://danluu.com/web-bloat/#appendix-irony). Probably can't do
much about the extra classes, but rowspan/colspan seem like easy fixes
to save a few bytes per row/col and it's already done in the other
code path.

Author: Elliott Sales de Andrade <[email protected]>

Closes #15403 from QuLogic/no-extra-span and squashes the following commits:

9a8fcee [Elliott Sales de Andrade] Don't add rowspan/colspan if it's 1.
- we are passing builds which actually have an error
- fix the small dtype issues

Author: Jeff Reback <[email protected]>

Closes #15445 from jreback/windows and squashes the following commits:

a5b7fb3 [Jeff Reback] change integer to power comparisions
eab15c4 [Jeff Reback] don't force remove pandas
cf3b9bd [Jeff Reback] more windows fixing
efe6a76 [Jeff Reback] add cython to build
8194e63 [Jeff Reback] don't use appveyor recipe, just build inplace
e064825 [Jeff Reback] TST: resample dtype issue xref #15418
10d9b26 [Jeff Reback] TST: run windows tests so failures show up in appeveyor
C level list

no gil

capture index error

wrong exception handling
@codecov-io commented Feb 20, 2017

Codecov Report

Merging #15372 into master will increase coverage by 0.1%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #15372     +/-   ##
=========================================
+ Coverage   90.32%   90.43%   +0.1%     
=========================================
  Files         135      134      -1     
  Lines       49413    49376     -37     
=========================================
+ Hits        44633    44651     +18     
+ Misses       4780     4725     -55
Impacted Files Coverage Δ
pandas/indexes/base.py 96.27% <100%> (+0.04%)
pandas/core/common.py 91.02% <ø> (-0.34%)
pandas/tseries/offsets.py 97.13% <ø> (-0.13%)
pandas/core/frame.py 97.87% <ø> (ø)
pandas/io/auth.py
pandas/util/terminal.py 47.82% <ø> (+2.89%)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61a243b...74ce239.

@horta (Contributor, Author) commented Feb 20, 2017

import numpy as np
import pandas as pd
from pandas import Index

from IPython import get_ipython
ipython = get_ipython()

i = Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())

ipython.magic("timeit i.get_indexer_non_unique(i[0:1000000])")

The above test now runs in 714 ms. The implementation as of the last commit (bf4b3f5) uses fixed-size arrays only; no resizing is needed.

I think that is the best we can do in terms of data structures. But it can be further improved by, for example, getting rid of the PyObject_RichCompare calls caused by Python object comparisons. Maybe Cython fused types?

@jreback (Contributor) commented Feb 20, 2017

@horta please rebase on origin/master.

git rebase -i origin/master and git push yourremote thisbranch -f

@horta (Contributor, Author) commented Feb 20, 2017

@jreback after doing git rebase -i origin/master now I got

> git status
interactive rebase in progress; onto be4a63fe7
Last commands done (6 commands done):
   pick 6afb8c910 wrong exception handling
   pick c7a1e0090 get_indexer_non_unique for orderable indexes
  (see more in file .git/rebase-merge/done)
Next commands to do (9 remaining commands):
   pick c1b657e19 get_indexer_non_unique for orderable indexes
   pick 47f7ce325 C level list
  (use "git rebase --edit-todo" to view and edit)
You are currently editing a commit while rebasing branch 'master' on 'be4a63fe7'.
  (use "git commit --amend" to amend the current commit)
  (use "git rebase --continue" once you are satisfied with your changes)

nothing to commit, working tree clean

Not sure what to do now...

@jreback (Contributor) commented Feb 21, 2017

The issue is that you are using your master branch as if it were a feature branch, which is generally not recommended. To fix it, you can do:

git checkout yourbranch
git fetch origin
git rebase -i origin/master

@horta (Contributor, Author) commented Feb 21, 2017

I think I need to start a new pull request because I don't know what is going on anymore... (I've tried the above command)

@jreback (Contributor) commented Feb 21, 2017

superseded by #15468

@jreback closed this on Feb 21, 2017
Labels: Indexing (related to indexing on series/frames, not to indexes themselves), Performance (memory or execution speed)