PERF: non-info axes slicing on Panels is slow #6484
Comments
Slicing etc should be just fine as long as you use pandas methods. |
I actually ran into the whole datetime64 thing when I was pondering where the 100x speed drop comes from in the case below (the example is for a pure float64 panel) and whether it can be related to the datetime64 field (as in the previous example):
>>> pn = pd.Panel(np.random.random((12, 5412, 162)))
>>> ix = np.ones(pn.shape[1], dtype=bool)
>>> ix[np.random.random(ix.size) > 0.5] = 0
>>> %timeit pn.loc[0, ix] # one field, as fast as dataframe
1000 loops, best of 3: 690 µs per loop
>>> %timeit pn.loc[[0, 1], ix] # two fields, 233x slower
10 loops, best of 3: 161 ms per loop
>>> %timeit pn.loc[:, ix]
1 loops, best of 3: 666 ms per loop
>>> pn_t = pn.swapaxes(0, 1).copy() # try .loc on the first axis
>>> %timeit pn_t.loc[ix]
1 loops, best of 3: 1.27 s per loop
>>> def pn_slice_major(pn, ix):  # a really stupid way of doing this
..:     slices = dict((item, pn.loc[item, ix]) for item in pn.items)
..:     pn_s = pd.Panel(major_axis=slices[0].index, minor_axis=slices[0].columns)
..:     for item in pn.items:
..:         pn_s[item] = slices[item]
..:     return pn_s
..:
>>> %timeit pn_slice_major(pn, ix) # 13.5x faster than .loc[:, ix]
10 loops, best of 3: 49.5 ms per loop
Still have no idea what's the exact reason for this ^. |
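For readers who want to see the two strategies side by side without a Panel, here is a minimal sketch on a plain NumPy array shaped items x major x minor. It is an assumption-laden stand-in: it does not go through pandas' indexing code at all, the function names `slice_all_at_once` and `slice_per_item` are hypothetical, and the array is smaller than the one in the timings above. It only shows that slicing each item separately and reassembling gives the same result as masking the major axis in one call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the Panel's values: items x major x minor (sizes are arbitrary).
values = rng.random((12, 541, 16))
# Boolean mask over the major axis, like `ix` in the session above.
mask = rng.random(values.shape[1]) > 0.5


def slice_all_at_once(values, mask):
    """Mask the major axis for every item in a single indexing call."""
    return values[:, mask]


def slice_per_item(values, mask):
    """Mask the major axis one item at a time, then reassemble."""
    out = np.empty(
        (values.shape[0], int(mask.sum()), values.shape[2]), dtype=values.dtype
    )
    for i in range(values.shape[0]):
        out[i] = values[i][mask]
    return out


a = slice_all_at_once(values, mask)
b = slice_per_item(values, mask)
print(a.shape, np.array_equal(a, b))
```

Wrapping either function in `%timeit` would give a baseline for how much of the gap is raw array work as opposed to pandas overhead, though the numbers will depend on the NumPy version and array sizes.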
#6440 can probably help with this. The indexing code is very tricky; if you can implement this in a generic way, go for it. Please look into contributing a fix for this. You have to be really careful with this, because if, for example, there are a lot of items, this would be way slower. Indexing is optimized for 0th-axis access (otherwise you are doing a lot of cross-section indexing). |
I can sure write hacks and workarounds that work faster for specific use cases, but solving it generally is quite a bit more complicated in pandas, especially when you're not as familiar with the entire internal api :/ I'll try looking into it a bit later, maybe it's something more or less obvious. I find Panels generally very useful (esp for financial market data, where you often have date/symbol/field/etc) and tried using them to avoid indexing similarly indexed data multiple times.. but ironically that's exactly what I have to do now because it's faster. As for a lot of items: see above where I index the transposed panel. |
@aldanor great...I use them for the same reason! I DO think there is a perf degradation from 0.12 to 0.13.1...but as you noted the slicing is pretty tricky. It should be pretty straightforward in this case to at least see where it's coming from; then we can address it. As an aside, it is sometimes more efficient to transpose, then slice and transpose back (but it's a bit tricky to make this work correctly), as the 0th axis vs the -1th axis have different slicing characteristics because of how numpy aligns memory. I generally line the panels up like I use them (and sometimes this is different from how I store them in HDF5), e.g. I generally do something like: items x dates x symbols |
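The transpose, slice, transpose-back idea can be sketched on a raw NumPy array. This is only an illustration under assumptions: it uses a tiny C-ordered array rather than a Panel, and the variable names are made up for the example. It shows that a block taken along the 0th axis is contiguous in memory while a block taken along the last axis is not, and that the round trip returns the same values as slicing the last axis directly.

```python
import numpy as np

# Small C-ordered array: axis 0 varies slowest, axis -1 varies fastest.
arr = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)
mask = np.array([True, False, True, True])  # selects along the last axis

# Direct boolean slicing on the last axis.
direct = arr[:, :, mask]

# Transpose so the last axis becomes the first, make that layout contiguous,
# slice along the (new) 0th axis, then transpose back.
arr_t = np.ascontiguousarray(arr.transpose(2, 1, 0))
roundtrip = arr_t[mask].transpose(2, 1, 0)

# Memory layout: one block along axis 0 is contiguous, one along axis -1 is not.
first_axis_block_contiguous = arr[0].flags["C_CONTIGUOUS"]
last_axis_block_contiguous = arr[:, :, 0].flags["C_CONTIGUOUS"]

print(np.array_equal(direct, roundtrip))
print(first_axis_block_contiguous, last_axis_block_contiguous)
```

Whether the round trip is actually faster depends on the array sizes and on the cost of the extra copy made by `np.ascontiguousarray`, so it is worth timing on real data before relying on it.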
Good point, I'll try to bench all the above from 0.12 through to 0.13.1 -- I bet it wasn't that slow before. Btw! More weirdness (data from the very first example):
>>> %timeit pn2.ix[['volumes', 'discounts']] # panel w/o timestamps; both fields are floats
100 loops, best of 3: 16.9 ms per loop
>>> %timeit pn1.ix[['volumes', 'discounts']] # panel w/ timestamps; both fields are floats
1 loops, best of 3: 1.42 s per loop # WTF?
Note that I'm not calling |
Panels are very tricky with multi-dtypes. Look at |
Your data is prob lined up cross sectionally which causes conversion to object
|
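The object-conversion effect mentioned above can be reproduced on a small scale with a DataFrame instead of a Panel. This is a sketch under assumptions: the column names are borrowed from the thread for illustration, and a two-column frame stands in for a panel whose items have different dtypes. Taking a cross-section (a row) across a float column and a datetime column has no common numeric dtype, so pandas falls back to `object`, while the same cross-section over float-only columns stays `float64`.

```python
import numpy as np
import pandas as pd

# One float column and one datetime column, as in a mixed-dtype panel.
df = pd.DataFrame(
    {
        "volumes": np.array([1.0, 2.0, 3.0]),
        "timestamps": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03"]),
    }
)

# A row is a cross-section over columns of different dtypes.
mixed_row_dtype = df.iloc[0].dtype

# The same cross-section restricted to float columns keeps a numeric dtype.
float_row_dtype = df[["volumes"]].iloc[0].dtype

print(df.dtypes)
print(mixed_row_dtype, float_row_dtype)
```

Object arrays hold Python objects rather than packed numbers, which is one plausible reason cross-sectional operations on mixed-dtype data are so much slower than on homogeneous float data.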
@jreback Hey, sorry, I won't have time to look into this until the weekend. Mind if I add a couple of edge cases to the vbench regarding panels? |
np submit a pr at your leisure |
Assume we have two panels:
12 fields are float64 while one field in pn1 is datetime64[ns]. This slows pretty much all operations (slicing, querying, anything) down by a huge factor:
Is there an unofficial rule of not using datetime64 in the first place, is it some weird coercion bug (it seems to try and coerce everything down to floats), or does it have anything to do with panels?