ENH: fastpath indexer API proposal (draft) #6328
Comments
Here are some starting thoughts:
|
There need to be a couple of benchmark cases here, e.g. use cases that are too slow using a straight indexer and so need a fast one. Can you post some of these cases with the current benchmark, then a 'hacked' version (use whatever tricks to fast-path that case)? This will help narrow down which cases matter and what needs to be tackled. |
Here's a quick proof-of-concept hack (UPD: disregard the dataframe example because on bigger datasets full slicing
Here's the benchmark code:

import numpy as np
import pandas as pd
from pandas.core.generic import NDFrame
from pandas.core.indexing import _NDFrameIndexer


class _FastpathSlice(object):
    # Base class: wraps a pre-validated key; indexing the instance
    # (e.g. bool_slice[mask]) creates a new wrapper around that key.
    def __init__(self, arg=None):
        self.arg = arg

    def __repr__(self):
        return "%s(arg=%r)" % (self.__class__.__name__, self.arg)

    def __getitem__(self, arg):
        return type(self)(arg)

    def _fastpath_dispatch(self, obj):
        raise NotImplementedError


class _FastpathBoolSlice(_FastpathSlice):
    def __init__(self, arg=None):
        if not isinstance(arg, np.ndarray):
            arg = np.array(arg, dtype=np.bool_)

        def dispatch(obj):
            # Slice the internal block manager directly, skipping key checks.
            return (obj._constructor(obj._data.get_slice(arg))
                    .__finalize__(obj))

        self.arg = arg
        self._fastpath_dispatch = dispatch


class _FastpathIndexer(_NDFrameIndexer):
    def __getitem__(self, key):
        try:
            return key._fastpath_dispatch(self.obj)
        except Exception:
            raise TypeError("Something's not right")


def floc(self): return _FastpathIndexer(self, 'floc')
NDFrame.floc = property(floc)

bool_slice = _FastpathBoolSlice()
mask = np.random.rand(100) > 0.5

print("benchmarking series")
s = pd.Series(np.arange(100), index=['a%s' % s for s in np.arange(100)])
print("s.floc[bool_slice[mask]]")
%timeit s.floc[bool_slice[mask]]
print("s[mask]")
%timeit s[mask]
print("s.loc[mask]")
%timeit s.loc[mask]
print("s.iloc[mask]")
%timeit s.iloc[mask]

print("benchmarking dataframe")
mask = mask[:20]  # axis=0 for df is columns, not index
df = pd.DataFrame(np.random.rand(100, 20),
                  index=['a%s' % s for s in np.arange(100)],
                  columns=['c%s' % s for s in np.arange(20)])
print("df.floc[bool_slice[mask]]")
%timeit df.floc[bool_slice[mask]]
print("df.loc[:,mask]")
%timeit df.loc[:,mask]
print("df.iloc[:,mask]")
%timeit df.iloc[:,mask]
|
Can you make the benchmarks for a larger number? Saving The benchmark should be semi-realistic, that is, take a non-trivial time (>1ms, <1000ms) for the current case. |
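For reference, a rough sketch of what a larger-scale version of the series benchmark might look like outside IPython, using the standard `timeit` module instead of `%timeit`. The size `n`, the `bench` helper and the default integer index are assumptions made for illustration, and only the public indexers are timed, since `floc` is a proof-of-concept that exists only in the hack above.

```python
import timeit

import numpy as np
import pandas as pd

n = 1_000_000  # assumed size, picked so one lookup is no longer trivially short
rng = np.random.default_rng(0)
mask = rng.random(n) > 0.5
s = pd.Series(np.arange(n))


def bench(func, number=10):
    """Best-of-3 mean time per call, in milliseconds (hypothetical helper)."""
    timer = timeit.Timer(func)
    return min(timer.repeat(repeat=3, number=number)) / number * 1e3


cases = {
    "s[mask]": lambda: s[mask],
    "s.loc[mask]": lambda: s.loc[mask],
    "s.iloc[mask]": lambda: s.iloc[mask],
}
results = {name: bench(func) for name, func in cases.items()}
for name, ms in results.items():
    print("%-14s %8.3f ms" % (name, ms))
```

The absolute numbers depend on the machine and the pandas version, so they are only useful for comparing the indexers against each other.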
FTR: here's a bit of benchmark I've made: https://gist.github.com/immerrr/310a60850721e4ae6e84 Hackish, but works (uncovered #6370 issue for example). |
would be nice to create another vbench for all of these indexers |
closing as stale. pls reopen if still an issue. |
The discussion in #6134 has inspired an idea that I'm writing down for
discussion. The idea is pretty obvious so it should've been considered before,
but I still think pandas as it is right now can benefit from it.
My main complaint about pandas when using it in a non-interactive way is that
lookups are significantly slower than with ndarray containers. I do realize
that this happens because of the many ways the indexing may be done, but at some
point I've really started thinking about ditching pandas in some
performance-critical paths of my project and replacing them with the dreadful
dict/ndarray combo. Not only does doing arr = df.values[df.idx.get_loc[key]]
get old pretty fast, but it's also slower when the frame contains different
dtypes and then you need to go deeper to fix that.
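A minimal sketch of that workaround, using only public pandas calls; the frame, its column names and the key are made up for illustration. It also hints at why mixed dtypes hurt: `.values` on such a frame has to build a single `object` array, so the per-column lookup at the end is one way of "going deeper".

```python
import numpy as np
import pandas as pd

# Hypothetical frame with mixed dtypes (names are illustrative only).
df = pd.DataFrame(
    {"x": np.arange(3, dtype=np.int64), "name": ["foo", "bar", "baz"]},
    index=["a0", "a1", "a2"],
)

pos = df.index.get_loc("a1")   # label -> integer position (unique index)
row = df.values[pos]           # positional lookup on the raw ndarray

# The int64 and string columns are interleaved into one object array,
# which costs a conversion of the whole frame on every .values access.
mixed_dtype = df.values.dtype

# Going through a single column keeps that column's native dtype.
x_value = df["x"].values[pos]
x_dtype = df["x"].values.dtype
```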
Now I thought: what if this slowdown can be reduced by creating fastpath
indexers that look like the IndexSlice from #6134 and would convey a
message to pandas indexing facilities, like "trust me, I've done all the
preprocessing, just look it up already"? I'm talking about something like that
(the names are arbitrary and chosen for illustrative purposes only):
Given the actual slice objects will have a common base class, the
implementation could be as easy as:
Cons:
Pros:
object look like and how do you want to use its contents (e.g. no guessing if
np.array([0,1,0,1]) is a boolean mask or a series of "takeable" indices)
the hoops of NDFrame and Index internals to avoid premature
pessimization (also, more reliable w.r.t. new releases)
pandas internally for the speed (and clarity, as in "interesting, what does
this function pass to df.loc[...], let's find this out")
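To make the "no guessing" point concrete, here is a small sketch of keys that carry their own interpretation. All names (`FastKey`, `BoolMask`, `Positions`, `fast_lookup`) are hypothetical and not part of pandas; the sketch goes through the public `take` method, so it only illustrates the disambiguation, not the speedup a real fastpath would aim for.

```python
import numpy as np
import pandas as pd


class FastKey:
    """Hypothetical common base class for pre-validated keys."""

    def __init__(self, arg):
        self.arg = arg

    def apply(self, obj):
        raise NotImplementedError


class BoolMask(FastKey):
    """The wrapped array is a boolean mask over positions."""

    def __init__(self, arg):
        super().__init__(np.asarray(arg, dtype=bool))

    def apply(self, obj):
        # Convert the mask to integer positions and take them directly.
        return obj.take(np.flatnonzero(self.arg))


class Positions(FastKey):
    """The wrapped array holds integer ("takeable") positions."""

    def __init__(self, arg):
        super().__init__(np.asarray(arg, dtype=np.intp))

    def apply(self, obj):
        return obj.take(self.arg)


def fast_lookup(obj, key):
    # The key's type says how to read its contents, so nothing is inferred.
    if not isinstance(key, FastKey):
        raise TypeError("expected a FastKey, got %r" % type(key).__name__)
    return key.apply(obj)


s = pd.Series([10, 20, 30, 40], index=list("abcd"))
raw = [0, 1, 0, 1]
masked = fast_lookup(s, BoolMask(raw))   # elements at positions 1 and 3
taken = fast_lookup(s, Positions(raw))   # elements at positions 0, 1, 0, 1
```

The same raw list yields two different results depending on the wrapper, which is the ambiguity a plain `np.array([0,1,0,1])` key would otherwise leave to the indexer.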