PERF: fast non-unique indexing #15468

horta · 2017-02-21T10:15:44Z

This is essentially the same pull request as #15372 but now from a proper branch.

jreback · 2017-02-21T12:59:35Z

pandas/indexes/base.py

        else:
            tgt_values = target._values
+            src_values = self._values
+


this is quite messy, what are you trying to do here?

Agreed! Just factorised that.

you are adding much more code. Where exactly does this routine fail? The only scenario I could see is with actual mixed types (which I am happy to raise on when you have a non-unique index).

Further it looks like you are duplicating lots of functionailty that already exists, see pandas/core/algorithms.py w.r.t. counting / sorting / factorizing.

I didn't find routines un pandas/core/algorithms.py that I could use. I have now described the functions in index.pyx so I hope it is now clearer.

Yes, I was having problem with mixed types. For example,

v = np.array([1, 'danilo'], object) v[0] < v[1]

raises a TypeError exception.

jreback · 2017-02-21T13:00:05Z

pandas/index.pyx

@@ -371,6 +482,16 @@ cdef class IndexEngine:

        return result[0:count], missing[0:count_missing]

+    def get_indexer_non_unique_orderable(self, ndarray targets,
+                                         int64_t[:] idx0,
+                                         int64_t[:] idx1):


this is a massive amount of added code, what exactly are you doing?

Notice that I have removed the code regarding lists. Most of the code consists in two functions: _count and _map. They implement two very specific algorithms that I couldn't find in the above-mentioned file.

codecov-io · 2017-02-22T12:19:40Z

Codecov Report

Merging #15468 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15468      +/-   ##
==========================================
+ Coverage   90.37%   90.37%   +<.01%     
==========================================
  Files         135      135              
  Lines       49473    49494      +21     
==========================================
+ Hits        44709    44730      +21     
  Misses       4764     4764

Impacted Files	Coverage Δ
pandas/indexes/base.py	`96.28% <100%> (+0.04%)`	✅
pandas/core/common.py	`91.02% <ø> (-0.34%)`	❌
pandas/tseries/index.py	`95.42% <ø> (+0.09%)`	✅

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e1d5407...62cd783. Read the comment docs.

horta · 2017-02-24T18:38:11Z

Benchmark:

import numpy as np
import pandas as pd
from time import time
from pandas import Index

n = 10000
r = 100


for j in range(1, 11):
    i = Index(list(range(j*n)) * r)
    slic = i[0:j*n]
    start = time()
    i.get_indexer_non_unique(slic)
    print(time() - start)

Elapsed time in seconds:

Previous | New
---------------
    0.92 | 0.18
    3.03 | 0.37
    6.25 | 0.64
   10.90 | 0.79
   17.82 | 1.00
   25.07 | 1.16
   31.49 | 1.33
   40.97 | 1.53
   51.54 | 2.03
   60.64 | 2.18

jreback · 2017-02-24T20:24:34Z

pandas/index.pyx

+            int64_t[:] mapping_count, int64_t[:] missing_count):
+    """
+    Compute the number of times a `targets` value is in `values`.
+


this is exactly value_counts

jreback · 2017-02-24T20:26:12Z

@horta you are adding more and more code. I want LESS code. you are rewriting the world here.

jreback · 2017-02-24T20:30:09Z

pls add an asv.

jreback · 2017-02-24T20:26:34Z

pandas/index.pyx

+@cython.initializedcheck(False)
+cdef _map(ndarray values, ndarray targets, int64_t[:] idx0, int64_t[:] idx1,
+            int64_t[:] start_mapping, int64_t[:] start_missing,
+            int64_t[:] mapping, int64_t[:] missing):


this signature is WAY too complicated.

jreback · 2017-02-24T20:27:16Z

pandas/index.pyx

+def _map_targets_to_values(values, targets, idx0, idx1):
+    """
+    Map `targets` values to `values` positions.
+


this looks like a take, but really not sure what you are doing

jreback · 2017-02-24T20:27:38Z

pandas/index.pyx

+
+        cdef:
+            ndarray values
+


you can add 1 function in cython, anything else is WAY too complicated.

jreback · 2017-02-24T20:28:36Z

pandas/indexes/base.py

+
+        is_mono0 = self.is_monotonic_increasing
+        is_mono1 = target.is_monotonic_increasing
+        (orderable, idx0, idx1) = _order_them(is_mono0, src_values, is_mono1,


if you insist on ordering, then simply order an unordered index, get the results and take the original. this is so much complexity.

jreback · 2017-02-24T20:29:25Z

pandas/indexes/base.py

+        indices = np.argsort(x, kind='mergesort')
+    except TypeError:
+        return (False, None)
+    return (True, indices)


do ALL of this in the cython function. you are needlessly splitting things up.

jreback · 2017-02-25T00:41:27Z

@horta we don't tolerate that type of language

fast non-unique indexing

d6bcef8

jreback mentioned this pull request Feb 21, 2017

get_indexer_non_unique for orderable indexes #15372

Closed

jreback changed the title ~~fast non-unique indexing~~ PERF: fast non-unique indexing Feb 21, 2017

jreback requested changes Feb 21, 2017

View reviewed changes

order them

dd14ddf

jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Feb 22, 2017

horta added 2 commits February 24, 2017 15:44

zero length issue

e394829

doc

62cd783

jreback reviewed Feb 24, 2017

View reviewed changes

jreback requested changes Feb 24, 2017

View reviewed changes

horta closed this Feb 25, 2017

pandas-dev locked and limited conversation to collaborators Feb 25, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: fast non-unique indexing #15468

PERF: fast non-unique indexing #15468

horta commented Feb 21, 2017 •

edited by jreback

Loading

jreback Feb 21, 2017

horta Feb 22, 2017

jreback Feb 22, 2017

horta Feb 24, 2017

jreback Feb 21, 2017

horta Feb 24, 2017

codecov-io commented Feb 22, 2017 •

edited

Loading

horta commented Feb 24, 2017

jreback Feb 24, 2017

jreback commented Feb 24, 2017

jreback commented Feb 24, 2017

jreback Feb 24, 2017

jreback Feb 24, 2017

jreback Feb 24, 2017

jreback Feb 24, 2017

jreback Feb 24, 2017

jreback commented Feb 25, 2017

PERF: fast non-unique indexing #15468

PERF: fast non-unique indexing #15468

Conversation

horta commented Feb 21, 2017 • edited by jreback Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-io commented Feb 22, 2017 • edited Loading

Codecov Report

horta commented Feb 24, 2017

Choose a reason for hiding this comment

jreback commented Feb 24, 2017

jreback commented Feb 24, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Feb 25, 2017

horta commented Feb 21, 2017 •

edited by jreback

Loading

codecov-io commented Feb 22, 2017 •

edited

Loading