ENH: fastpath indexer API proposal (draft) #6328

Closed
immerrr opened this issue Feb 12, 2014 · 7 comments
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

Comments

@immerrr
Contributor

immerrr commented Feb 12, 2014

The discussion in #6134 has inspired an idea that I'm writing down for
discussion. The idea is obvious enough that it has probably been considered before,
but I still think pandas as it stands can benefit from it.

My main complaint about pandas when using it non-interactively is that
lookups are significantly slower than with plain ndarray containers. I do realize
that this happens because of the many ways the indexing may be done, but at some
point I seriously started thinking about ditching pandas in some
performance-critical paths of my project and replacing it with the dreadful
dict/ndarray combo. Not only does writing arr = df.values[df.index.get_loc(key)]
get old pretty fast, it's also slower when the frame contains mixed
dtypes, and then you need to go deeper still to fix that.
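For concreteness, the manual workaround described above might look like this (a toy frame with hypothetical data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["x", "y"])

# Dropping to numpy for speed: .values materializes a single ndarray,
# which silently upcasts when the frame holds mixed dtypes.
row = df.values[df.index.get_loc("y")]
print(row)  # [2. 4.]
```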

Now, what if this slowdown could be reduced by creating fastpath
indexers that look like the IndexSlice from #6134 and convey a
message to the pandas indexing machinery: "trust me, I've done all the
preprocessing, just look it up already". I'm talking about something like this
(the names are arbitrary and chosen for illustrative purposes only):

masked_rows = df.fastloc[pd.bool_slice[bool_array]]
# or
masked_rows = df.fastloc[pd.bool_series_slice[bool_series]]
# or
rows_3_and_10 = df.fastloc[pd.pos_slice[3, 10]]
# or
rows_3_through_10 = df.fastloc[pd.range_slice[3:10]]
# or
rows_for_two_days = df.fastloc[pd.tpos_slice['2014-01-01', '2014-01-08']]

Given that the actual slice objects will share a common base class, the
implementation could be as easy as:

class FastLocAttribute(object):
    def __init__(self, container):
        self._container = container

    def __getitem__(self, smth):
        if not isinstance(smth, FastpathIndexer):
            raise TypeError("Indexing object is not a FastpathIndexer")

        # open to custom FastpathIndexer implementations
        return smth.getitem(self._container)
        # or, better encapsulated but less open:
        # return self._container._index_method[type(smth)](smth)
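
A self-contained toy version of this dispatch might look like the following (all names here are illustrative stand-ins for the proposal, not pandas API; the container is a mock holding a single ndarray):

```python
import numpy as np

class FastpathIndexer:
    """Base class: subclasses carry preprocessed indexing info and apply it."""
    def getitem(self, container):
        raise NotImplementedError

class BoolSlice(FastpathIndexer):
    """'Trust me' wrapper for an already-validated boolean mask."""
    def __init__(self, mask):
        self.mask = np.asarray(mask, dtype=bool)

    def getitem(self, container):
        # No inference of what the indexer means -- just apply the mask.
        return container.data[self.mask]

class FastLocAttribute:
    def __init__(self, container):
        self._container = container

    def __getitem__(self, smth):
        if not isinstance(smth, FastpathIndexer):
            raise TypeError("Indexing object is not a FastpathIndexer")
        return smth.getitem(self._container)

class Container:
    """Stand-in for an NDFrame holding a single ndarray."""
    def __init__(self, data):
        self.data = np.asarray(data)

    @property
    def fastloc(self):
        return FastLocAttribute(self)

c = Container([10, 20, 30, 40])
print(c.fastloc[BoolSlice([True, False, True, False])])  # [10 30]
```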

Cons:

  • a change in public API
  • one more lookup type
  • inconvenient to use interactively

Pros:

  • adheres to the Zen of Python (explicit is better than implicit)
  • when used in programs, most of the time you know what the indexing
    object will look like and how you want to use its contents (e.g. no guessing
    whether np.array([0,1,0,1]) is a boolean mask or a series of "takeable" indices)
  • lengthier than existing lookup schemes, but still shorter than jumping through
    the hoops of NDFrame and Index internals to avoid premature
    pessimization (and more reliable across new releases)
  • the fastpath indexing API could be used inside pandas itself for speed (and
    clarity, as in "interesting, what does this function pass to df.loc[...]?
    let's find out")
@immerrr
Contributor Author

immerrr commented Feb 12, 2014

Here are some starting thoughts:

  • Each fastpath wrapper should convert its wrappee (is that a real word?) to the closest variant accepted by numpy (with one exception, see below). Luckily, numpy containers allow only a few indexing options (basic, boolean, integer).
  • Inferring the block type from an ndarray is complicated and very slow due to the timestamp impedance mismatch with numpy, and I don't intend to tackle it at the moment. This means the actual slicing should happen at the Block or BlockManager level to avoid losing type information. Which leads to the following.
  • Due to the BlockManager structure, all containers except Series have one axis (items/columns) that can only be integer-indexed and should be handled accordingly. Basically, there should be two steps: one that selects a subset of (or all) items from the BlockManager, and another that slices the Blocks as necessary.
  • It would be great to implement (or find, if it already exists) a single entry point to the BlockManager API that, given slices for all ndim axes in numpy-acceptable formats (with items restricted to integer indexers only), returns the resulting BlockManager or a scalar. The implementation seems reasonably simple, and this would also restrict the fastpath wrappers' responsibilities to operating on axes and converting slices/indices as necessary.
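The two-step scheme from the bullets above can be sketched with plain numpy, using toy "blocks" that each hold one dtype plus the item (column) positions they cover (all names and shapes here are hypothetical, not pandas internals):

```python
import numpy as np

# Two homogeneous blocks, each remembering which item positions
# of the parent container it holds.
blocks = {
    "float": (np.arange(10, dtype=float).reshape(5, 2), [0, 2]),
    "int":   (np.arange(15).reshape(5, 3), [1, 3, 4]),
}

def fast_getitem(blocks, items, rows):
    """Step 1: pick the blocks/columns holding the requested item positions.
    Step 2: apply the (already numpy-acceptable) row indexer per block."""
    out = {}
    for dtype, (arr, held) in blocks.items():
        cols = [i for i, item in enumerate(held) if item in items]
        if cols:
            out[dtype] = arr[rows][:, cols]
    return out

# Select items 0 and 1, rows 0 and 4; type information survives because
# each dtype's block is sliced separately.
res = fast_getitem(blocks, items={0, 1}, rows=np.array([0, 4]))
```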

@jreback
Contributor

jreback commented Feb 12, 2014

@immerrr

There need to be a couple of benchmark cases here, e.g. use cases that are too slow with a straight indexer and hence need a fast one.

Can you post some of these cases with the current benchmark, then a 'hacked' version (using whatever tricks fast-path that case)?

This will help narrow down which cases matter and what needs to be tackled.

@immerrr
Contributor Author

immerrr commented Feb 13, 2014

Here's a quick proof-of-concept hack (UPD: disregard the dataframe example, because on bigger datasets full slicing [:, mask] is actually slower than take(mask.nonzero()[0], axis=1)):

benchmarking series
s.floc[bool_slice[mask]]
10000 loops, best of 3: 46.8 µs per loop
s[mask]
10000 loops, best of 3: 64.5 µs per loop
s.loc[mask]
10000 loops, best of 3: 85.9 µs per loop
s.iloc[mask]
10000 loops, best of 3: 86.4 µs per loop
benchmarking dataframe
df.floc[bool_slice[mask]]
10000 loops, best of 3: 83.3 µs per loop
df.loc[mask]
1000 loops, best of 3: 227 µs per loop
df.iloc[mask]
1000 loops, best of 3: 219 µs per loop

Here's the benchmark code:

import numpy as np
import pandas as pd

from pandas.core.generic import NDFrame
from pandas.core.indexing import _NDFrameIndexer

class _FastpathSlice(object):
    def __init__(self, arg=None):
        self.arg = arg

    def __repr__(self):
        return "%s(arg=%r)" % (self.__class__.__name__, self.arg)

    def __getitem__(self, arg):
        return type(self)(arg)

    def _fastpath_dispatch(self, obj):
        raise NotImplementedError


class _FastpathBoolSlice(_FastpathSlice):
    def __init__(self, arg=None):
        if not isinstance(arg, np.ndarray):
            arg = np.array(arg, dtype=np.bool_)
        def dispatch(obj):
            return (obj._constructor(obj._data.get_slice(arg))
                    .__finalize__(obj))

        self.arg = arg
        self._fastpath_dispatch = dispatch

class _FastpathIndexer(_NDFrameIndexer):
    def __getitem__(self, key):
        try:
            dispatch = key._fastpath_dispatch
        except AttributeError:
            raise TypeError("Indexing object is not a fastpath slice")
        return dispatch(self.obj)
def floc(self): return _FastpathIndexer(self, 'floc')
NDFrame.floc = property(floc)

bool_slice = _FastpathBoolSlice()

mask = np.random.rand(100) > 0.5

print("benchmarking series")
s = pd.Series(np.arange(100), index=['a%s' % s for s in np.arange(100)])
print("s.floc[bool_slice[mask]]")
%timeit s.floc[bool_slice[mask]]
print("s[mask]")
%timeit s[mask]
print("s.loc[mask]")
%timeit s.loc[mask]
print("s.iloc[mask]")
%timeit s.iloc[mask]


print("benchmarking dataframe")
mask = mask[:20] # axis=0 for df is columns, not index
df = pd.DataFrame(np.random.rand(100, 20),
                  index=['a%s' % s for s in np.arange(100)],
                  columns=['c%s' % s for s in np.arange(20)])

print("df.floc[bool_slice[mask]]")
%timeit df.floc[bool_slice[mask]]
print("df.loc[:,mask]")
%timeit df.loc[:,mask]
print("df.iloc[:,mask]")
%timeit df.iloc[:,mask]

@jreback
Contributor

jreback commented Feb 13, 2014

Can you make the benchmarks with a larger n? Saving microseconds is very hard to measure; in practice it's often the difference of a couple of function calls, which is not really meaningful. And if you ARE doing this in a loop, well, that's a problem in and of itself.

The benchmark should be semi-realistic, i.e. take a non-trivial time (>1ms, <1000ms) for the current case.
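A sketch of what a larger-scale benchmark along these lines might look like, using timeit instead of the IPython %timeit magic (the size n is a hypothetical choice, just big enough that each lookup measures indexing work rather than call overhead):

```python
import timeit

import numpy as np
import pandas as pd

n = 1_000_000
s = pd.Series(np.arange(n))
mask = np.random.rand(n) > 0.5

# Time each indexing variant; min-of-repeats damps scheduler noise.
for label, func in [
    ("s[mask]", lambda: s[mask]),
    ("s.loc[mask]", lambda: s.loc[mask]),
    ("s.iloc[mask]", lambda: s.iloc[mask]),
]:
    per_loop = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{label}: {per_loop * 1e3:.2f} ms per loop")
```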

@immerrr
Contributor Author

immerrr commented Feb 16, 2014

FTR: here's a bit of benchmarking I've done: https://gist.github.com/immerrr/310a60850721e4ae6e84

Hackish, but it works (it uncovered issue #6370, for example).

@jreback
Contributor

jreback commented Feb 16, 2014

It would be nice to create another vbench for all of these indexers.
You could use your script to generate, say, indexers.py in the vbench format;
then it would be easy to compare improvements.

@jreback
Contributor

jreback commented Oct 20, 2015

closing as stale. pls reopen if still an issue.

@jreback jreback closed this as completed Oct 20, 2015