Skip to content

PERF: fast non-unique indexing #15468

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

PERF: fast non-unique indexing #15468

wants to merge 4 commits into from

Conversation

horta
Copy link
Contributor

@horta horta commented Feb 21, 2017

closes #15364

This is essentially the same pull request as #15372 but now from a proper branch.

@jreback jreback changed the title fast non-unique indexing PERF: fast non-unique indexing Feb 21, 2017
else:
tgt_values = target._values
src_values = self._values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is quite messy, what are you trying to do here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Just factorised that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are adding much more code. Where exactly does this routine fail? The only scenario I could see is with actual mixed types (which I am happy to raise on when you have a non-unique index).

Further it looks like you are duplicating lots of functionailty that already exists, see pandas/core/algorithms.py w.r.t. counting / sorting / factorizing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find routines un pandas/core/algorithms.py that I could use. I have now described the functions in index.pyx so I hope it is now clearer.

Yes, I was having problem with mixed types. For example,

v = np.array([1, 'danilo'], object)
v[0] < v[1]

raises a TypeError exception.

@@ -371,6 +482,16 @@ cdef class IndexEngine:

return result[0:count], missing[0:count_missing]

def get_indexer_non_unique_orderable(self, ndarray targets,
int64_t[:] idx0,
int64_t[:] idx1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a massive amount of added code, what exactly are you doing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice that I have removed the code regarding lists. Most of the code consists in two functions: _count and _map. They implement two very specific algorithms that I couldn't find in the above-mentioned file.

@codecov-io
Copy link

codecov-io commented Feb 22, 2017

Codecov Report

Merging #15468 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15468      +/-   ##
==========================================
+ Coverage   90.37%   90.37%   +<.01%     
==========================================
  Files         135      135              
  Lines       49473    49494      +21     
==========================================
+ Hits        44709    44730      +21     
  Misses       4764     4764
Impacted Files Coverage Δ
pandas/indexes/base.py 96.28% <100%> (+0.04%)
pandas/core/common.py 91.02% <ø> (-0.34%)
pandas/tseries/index.py 95.42% <ø> (+0.09%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e1d5407...62cd783. Read the comment docs.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Feb 22, 2017
@horta
Copy link
Contributor Author

horta commented Feb 24, 2017

Benchmark:

import numpy as np
import pandas as pd
from time import time
from pandas import Index

n = 10000
r = 100


for j in range(1, 11):
    i = Index(list(range(j*n)) * r)
    slic = i[0:j*n]
    start = time()
    i.get_indexer_non_unique(slic)
    print(time() - start)

Elapsed time in seconds:

Previous | New
---------------
    0.92 | 0.18
    3.03 | 0.37
    6.25 | 0.64
   10.90 | 0.79
   17.82 | 1.00
   25.07 | 1.16
   31.49 | 1.33
   40.97 | 1.53
   51.54 | 2.03
   60.64 | 2.18

int64_t[:] mapping_count, int64_t[:] missing_count):
"""
Compute the number of times a `targets` value is in `values`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is exactly value_counts

@jreback
Copy link
Contributor

jreback commented Feb 24, 2017

@horta you are adding more and more code. I want LESS code. you are rewriting the world here.

@jreback
Copy link
Contributor

jreback commented Feb 24, 2017

pls add an asv.

@cython.initializedcheck(False)
cdef _map(ndarray values, ndarray targets, int64_t[:] idx0, int64_t[:] idx1,
int64_t[:] start_mapping, int64_t[:] start_missing,
int64_t[:] mapping, int64_t[:] missing):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this signature is WAY too complicated.

def _map_targets_to_values(values, targets, idx0, idx1):
"""
Map `targets` values to `values` positions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a take, but really not sure what you are doing


cdef:
ndarray values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can add 1 function in cython, anything else is WAY too complicated.


is_mono0 = self.is_monotonic_increasing
is_mono1 = target.is_monotonic_increasing
(orderable, idx0, idx1) = _order_them(is_mono0, src_values, is_mono1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you insist on ordering, then simply order an unordered index, get the results and take the original. this is so much complexity.

indices = np.argsort(x, kind='mergesort')
except TypeError:
return (False, None)
return (True, indices)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do ALL of this in the cython function. you are needlessly splitting things up.

@horta horta closed this Feb 25, 2017
@jreback
Copy link
Contributor

jreback commented Feb 25, 2017

@horta we don't tolerate that type of language

@pandas-dev pandas-dev locked and limited conversation to collaborators Feb 25, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Huge performance hit in get_indexer_non_unique
3 participants