ENH: fastpath indexer API proposal (draft) #6328

Closed
immerrr opened this issue Feb 12, 2014 · 7 comments
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

Comments

@immerrr
Contributor

immerrr commented Feb 12, 2014

The discussion in #6134 has inspired an idea that I'm writing down for
discussion. The idea is obvious enough that it has probably been considered before,
but I still think pandas as it stands can benefit from it.

My main complaint about pandas when using it non-interactively is that
lookups are significantly slower than with plain ndarray containers. I do realize
that this happens because of the many ways the indexing may be done, but at some
point I seriously started thinking about ditching pandas in some
performance-critical paths of my project and replacing it with the dreadful
dict/ndarray combo. Not only does writing arr = df.values[df.index.get_loc(key)]
get old pretty fast, it's also slower when the frame contains mixed
dtypes, and then you need to go deeper still to fix that.
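For concreteness, the manual workaround described above might look like this (a toy frame with hypothetical data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["x", "y"])

# Dropping to numpy for speed: .values materializes a single ndarray,
# which silently upcasts when the frame holds mixed dtypes.
row = df.values[df.index.get_loc("y")]
print(row)  # [2. 4.]
```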

Now, what if this slowdown could be reduced by creating fastpath
indexers that look like the IndexSlice from #6134 and convey a
message to the pandas indexing machinery: "trust me, I've done all the
preprocessing, just look it up already". I'm talking about something like this
(the names are arbitrary and chosen for illustrative purposes only):

masked_rows = df.fastloc[pd.bool_slice[bool_array]]
# or
masked_rows = df.fastloc[pd.bool_series_slice[bool_series]]
# or
rows_3_and_10 = df.fastloc[pd.pos_slice[3, 10]]
# or
rows_3_through_10 = df.fastloc[pd.range_slice[3:10]]
# or
rows_for_two_days = df.fastloc[pd.tpos_slice['2014-01-01', '2014-01-08']]

Given that the actual slice objects will share a common base class, the
implementation could be as easy as:

class FastLocAttribute(object):
    def __init__(self, container):
        self._container = container

    def __getitem__(self, smth):
        if not isinstance(smth, FastpathIndexer):
            raise TypeError("Indexing object is not a FastpathIndexer")

        # open to custom FastpathIndexer implementations
        return smth.getitem(self._container)
        # or, better encapsulated but less open:
        # return self._container._index_method[type(smth)](smth)
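
A self-contained toy version of this dispatch might look like the following (all names here are illustrative stand-ins for the proposal, not pandas API; the container is a mock holding a single ndarray):

```python
import numpy as np

class FastpathIndexer:
    """Base class: subclasses carry preprocessed indexing info and apply it."""
    def getitem(self, container):
        raise NotImplementedError

class BoolSlice(FastpathIndexer):
    """'Trust me' wrapper for an already-validated boolean mask."""
    def __init__(self, mask):
        self.mask = np.asarray(mask, dtype=bool)

    def getitem(self, container):
        # No inference of what the indexer means -- just apply the mask.
        return container.data[self.mask]

class FastLocAttribute:
    def __init__(self, container):
        self._container = container

    def __getitem__(self, smth):
        if not isinstance(smth, FastpathIndexer):
            raise TypeError("Indexing object is not a FastpathIndexer")
        return smth.getitem(self._container)

class Container:
    """Stand-in for an NDFrame holding a single ndarray."""
    def __init__(self, data):
        self.data = np.asarray(data)

    @property
    def fastloc(self):
        return FastLocAttribute(self)

c = Container([10, 20, 30, 40])
print(c.fastloc[BoolSlice([True, False, True, False])])  # [10 30]
```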

Cons:

  • a change in public API
  • one more lookup type
  • inconvenient to use interactively

Pros:

  • adheres to the Zen of Python (explicit is better than implicit)
  • when used in programs, most of the time you know what the indexing
    object will look like and how you want to use its contents (e.g. no guessing
    whether np.array([0,1,0,1]) is a boolean mask or a series of "takeable" indices)
  • lengthier than existing lookup schemes, but still shorter than jumping through
    the hoops of NDFrame and Index internals to avoid premature
    pessimization (and more reliable across new releases)
  • the fastpath indexing API could be used inside pandas itself for speed (and
    clarity, as in "interesting, what does this function pass to df.loc[...]?
    let's find out")
@immerrr
Contributor Author

immerrr commented Feb 12, 2014

Here are some starting thoughts:

  • Each fastpath wrapper should convert its wrappee (is that a real word?) to the closest variant accepted by numpy (with one exception, see below). Luckily, numpy containers allow only a few indexing options (basic, boolean, integer).
  • Inferring the block type from an ndarray is complicated and very slow due to the timestamp impedance mismatch with numpy, and I don't intend to tackle it at the moment. This means the actual slicing should happen at the Block or BlockManager level to avoid losing type information. Which leads to the following.
  • Due to the BlockManager structure, all containers except Series have one axis (items/columns) that can only be integer-indexed and should be handled accordingly. Basically, there should be two steps: one that selects a subset of (or all) items from the BlockManager, and another that slices the Blocks as necessary.
  • It would be great to implement (or find, if it already exists) a single entry point to the BlockManager API that, given slices for all ndim axes in numpy-acceptable formats (with items restricted to integer indexers only), returns the resulting BlockManager or a scalar. The implementation seems reasonably simple, and this would also restrict the fastpath wrappers' responsibilities to operating on axes and converting slices/indices as necessary.
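The two-step scheme from the bullets above can be sketched with plain numpy, using toy "blocks" that each hold one dtype plus the item (column) positions they cover (all names and shapes here are hypothetical, not pandas internals):

```python
import numpy as np

# Two homogeneous blocks, each remembering which item positions
# of the parent container it holds.
blocks = {
    "float": (np.arange(10, dtype=float).reshape(5, 2), [0, 2]),
    "int":   (np.arange(15).reshape(5, 3), [1, 3, 4]),
}

def fast_getitem(blocks, items, rows):
    """Step 1: pick the blocks/columns holding the requested item positions.
    Step 2: apply the (already numpy-acceptable) row indexer per block."""
    out = {}
    for dtype, (arr, held) in blocks.items():
        cols = [i for i, item in enumerate(held) if item in items]
        if cols:
            out[dtype] = arr[rows][:, cols]
    return out

# Select items 0 and 1, rows 0 and 4; type information survives because
# each dtype's block is sliced separately.
res = fast_getitem(blocks, items={0, 1}, rows=np.array([0, 4]))
```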

@jreback
Contributor

jreback commented Feb 12, 2014

@immerrr

There need to be a couple of benchmark cases here, e.g. use cases that are too slow with a straight indexer and hence need a fast one.

Can you post some of these cases with the current benchmark, then a 'hacked' version (using whatever tricks fast-path that case)?

This will help narrow down which cases matter and what needs to be tackled.

@immerrr
Contributor Author

immerrr commented Feb 13, 2014

Here's a quick proof-of-concept hack (UPD: disregard the dataframe example, because on bigger datasets full slicing [:, mask] is actually slower than take(mask.nonzero()[0], axis=1)):

benchmarking series
s.floc[bool_slice[mask]]
10000 loops, best of 3: 46.8 µs per loop
s[mask]
10000 loops, best of 3: 64.5 µs per loop
s.loc[mask]
10000 loops, best of 3: 85.9 µs per loop
s.iloc[mask]
10000 loops, best of 3: 86.4 µs per loop
benchmarking dataframe
df.floc[bool_slice[mask]]
10000 loops, best of 3: 83.3 µs per loop
df.loc[mask]
1000 loops, best of 3: 227 µs per loop
df.iloc[mask]
1000 loops, best of 3: 219 µs per loop

Here's the benchmark code:

import numpy as np
import pandas as pd

from pandas.core.generic import NDFrame
from pandas.core.indexing import _NDFrameIndexer

class _FastpathSlice(object):
    def __init__(self, arg=None):
        self.arg = arg

    def __repr__(self):
        return "%s(arg=%r)" % (self.__class__.__name__, self.arg)

    def __getitem__(self, arg):
        return type(self)(arg)

    def _fastpath_dispatch(self, obj):
        raise NotImplementedError


class _FastpathBoolSlice(_FastpathSlice):
    def __init__(self, arg=None):
        if not isinstance(arg, np.ndarray):
            arg = np.array(arg, dtype=np.bool_)
        def dispatch(obj):
            return (obj._constructor(obj._data.get_slice(arg))
                    .__finalize__(obj))

        self.arg = arg
        self._fastpath_dispatch = dispatch

class _FastpathIndexer(_NDFrameIndexer):
    def __getitem__(self, key):
        try:
            dispatch = key._fastpath_dispatch
        except AttributeError:
            raise TypeError("Indexing object is not a fastpath slice")
        return dispatch(self.obj)
def floc(self): return _FastpathIndexer(self, 'floc')
NDFrame.floc = property(floc)

bool_slice = _FastpathBoolSlice()

mask = np.random.rand(100) > 0.5

print("benchmarking series")
s = pd.Series(np.arange(100), index=['a%s' % s for s in np.arange(100)])
print("s.floc[bool_slice[mask]]")
%timeit s.floc[bool_slice[mask]]
print("s[mask]")
%timeit s[mask]
print("s.loc[mask]")
%timeit s.loc[mask]
print("s.iloc[mask]")
%timeit s.iloc[mask]


print("benchmarking dataframe")
mask = mask[:20] # axis=0 for df is columns, not index
df = pd.DataFrame(np.random.rand(100, 20),
                  index=['a%s' % s for s in np.arange(100)],
                  columns=['c%s' % s for s in np.arange(20)])

print("df.floc[bool_slice[mask]]")
%timeit df.floc[bool_slice[mask]]
print("df.loc[:,mask]")
%timeit df.loc[:,mask]
print("df.iloc[:,mask]")
%timeit df.iloc[:,mask]

@jreback
Contributor

jreback commented Feb 13, 2014

Can you make the benchmarks with a larger n? Saving microseconds is very hard to measure; in practice it's often the difference of a couple of function calls, which is not really meaningful. And if you ARE doing this in a loop, well, that's a problem in and of itself.

The benchmark should be semi-realistic, i.e. take a non-trivial time (>1ms, <1000ms) for the current case.
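A sketch of what a larger-scale benchmark along these lines might look like, using timeit instead of the IPython %timeit magic (the size n is a hypothetical choice, just big enough that each lookup measures indexing work rather than call overhead):

```python
import timeit

import numpy as np
import pandas as pd

n = 1_000_000
s = pd.Series(np.arange(n))
mask = np.random.rand(n) > 0.5

# Time each indexing variant; min-of-repeats damps scheduler noise.
for label, func in [
    ("s[mask]", lambda: s[mask]),
    ("s.loc[mask]", lambda: s.loc[mask]),
    ("s.iloc[mask]", lambda: s.iloc[mask]),
]:
    per_loop = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{label}: {per_loop * 1e3:.2f} ms per loop")
```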

@immerrr
Contributor Author

immerrr commented Feb 16, 2014

FTR: here's a bit of benchmarking I've done: https://gist.github.com/immerrr/310a60850721e4ae6e84

Hackish, but it works (it uncovered issue #6370, for example).

@jreback
Contributor

jreback commented Feb 16, 2014

It would be nice to create another vbench for all of these indexers.
You could use your script to generate, say, indexers.py in the vbench format;
then it would be easy to compare improvements.

@jreback
Contributor

jreback commented Oct 20, 2015

closing as stale. pls reopen if still an issue.

@jreback jreback closed this as completed Oct 20, 2015