Using cython in pandas tutorial #3923
Comments
fyi it's not entirely true that you must "manually dispatch" in the "f_float64", "f_int32" etc. way. you can use fused types, which do the dispatch for you, but they can be very difficult to debug and kind of awkward to use. |
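For readers unfamiliar with the pattern being discussed: "manual dispatch" means writing one specialised function per dtype and choosing between them by hand. A minimal plain-Python sketch of that idea follows. The names `total_float64`, `total_int32` and `_DISPATCH` are made up for illustration and are not pandas internals; Cython fused types generate the equivalent specialisations at compile time instead.

```python
import numpy as np


def total_float64(values):
    # Stand-in for a routine specialised to float64 input (hypothetical)
    return float(values.sum())


def total_int32(values):
    # Stand-in for a routine specialised to int32 input (hypothetical)
    return int(values.sum())


# Hypothetical dispatch table keyed on the dtype name
_DISPATCH = {"float64": total_float64, "int32": total_int32}


def total(values):
    # Pick the specialisation by hand, based on the array's dtype
    values = np.asarray(values)
    try:
        impl = _DISPATCH[values.dtype.name]
    except KeyError:
        raise TypeError(f"no specialisation for dtype {values.dtype}") from None
    return impl(values)
```

Every new dtype needs another function and another table entry, which is the bookkeeping that fused types are meant to remove.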
I propose using the example from the cython docs:
We want to apply that to DataFrame:
It has ints and float columns, so may require the blocks trick... |
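The snippets originally pasted here did not survive. As a rough reconstruction, assuming the example meant is the integration one from the Cython tutorial, the pure-Python starting point and a row-wise `apply` could look like the sketch below. The frame contents and the column names `a`, `b` and `N` are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def f(x):
    # Integrand used in the Cython tutorial example
    return x ** 2 - x


def integrate_f(a, b, N):
    # Left Riemann sum of f over [a, b] using N rectangles
    s = 0.0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


# Hypothetical frame: two float columns and one int column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.random(1000),
    "b": rng.random(1000),
    "N": rng.integers(100, 1000, size=1000),
})

# Row-wise apply. Each row is upcast to float64, so N is cast back to int.
result = df.apply(
    lambda row: integrate_f(row["a"], row["b"], int(row["N"])), axis=1
)
```

This is the slow baseline that the Cython versions discussed in the rest of the thread try to beat.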
I think this is a good example of using cython (I can put something together for this) - it shows a big speed improvement, but I'm not sure if it's a good example for leveraging numpy arrays.... ? |
ideally prob have an extended example of solving this problem using apply kind of like the cython ndarray example |
So essentially do the apply yourself (all in cython)? |
I think that would be a nice non-trivial example |
Created working cython f and integrate_f (plain and typed), working great. Any ideas why this might compile but not import (is this the kind of thing you meant?):
It comes up with a lovely message part way through the stacktrace :)
|
never seen that one |
These are direct copies from the cython example. :s |
@hayd is that error the first one out of the compiler? |
So I compile it just like this
and import in ipython like this:
This method works for the other functions (when apply_integrate_f is not in the pyx file)... |
why not just paste into ipython |
@cpcloud ? I think I'm missing something fundamental here. I just tried using |
oh i used |
Somewhat confusingly this worked first time...!! :) So, I guess nothing was wrong with the functions!
I think this probably makes quite an ok example. It doesn't make use of an ndarray (only 1D), but nonetheless I think it's not too bad. Definitely shows the benefits! Oh... maybe I can grab the float blocks using |
this

    cpdef apply_integrate_f(np.ndarray col_a, np.ndarray col_b, np.ndarray col_N):
        assert (col_a.dtype == np.float and col_b.dtype == np.float and col_N.dtype == np.int)
        assert (len(col_a) == len(col_b) == len(col_N))
        cdef np.ndarray res = np.zeros(len(col_a), dtype=np.float)
        # cdef np.ndarray dx = col_a * col_b / col_N
        for i in range(len(col_a)):
            res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
        return res

could be changed to

    cimport cython
    @cython.boundscheck(False)
    @cython.wraparound(False)
    cpdef np.ndarray[double] apply_integrate_f(np.ndarray[double] col_a, np.ndarray[double] col_b, np.ndarray[Py_ssize_t] col_N):
        cdef Py_ssize_t i, n = len(col_N)
        assert len(col_a) == len(col_b) == n  # only because of above decorators
        cdef np.ndarray[double] res = np.empty(n)  # does float by default
        for i in range(n):
            res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
        return res

for some more speedup |
Wowza, that's looking swish! For me though: |
u need to do |
fyi make sure u give the loop variable a type if it makes sense since i think cython will use an |
ah of course! It's getting quite late, my brain has stopped working. What's the |
Wow!
|
u might be able to squeeze even more out if u use cython memoryviews

    cimport cython
    @cython.boundscheck(False)
    @cython.wraparound(False)
    cpdef np.ndarray[double] apply_integrate_f(double[:] col_a, double[:] col_b, Py_ssize_t[:] col_N):
        cdef Py_ssize_t i, n = len(col_N)
        assert len(col_a) == len(col_b) == n  # only because of above decorators
        cdef np.ndarray[double] res = np.empty(n)  # does float by default
        for i in range(n):
            res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
        return res |
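When trying variants like the ones above, a slow pure-Python reference can help confirm that a compiled version still returns the same numbers. The sketch below is one possible reference, not code from the thread. The name `apply_integrate_f_reference` is hypothetical, the integrand is assumed to be the Cython tutorial's `x**2 - x`, and explicit `np.float64` is used because the bare `np.float` and `np.int` aliases no longer exist in current NumPy.

```python
import numpy as np


def f(x):
    # Integrand assumed from the Cython tutorial example
    return x ** 2 - x


def integrate_f(a, b, N):
    # Left Riemann sum of f over [a, b] using N rectangles
    s = 0.0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


def apply_integrate_f_reference(col_a, col_b, col_N):
    # Same checks as the Cython versions, written with explicit NumPy dtypes
    assert col_a.dtype == np.float64 and col_b.dtype == np.float64
    assert np.issubdtype(col_N.dtype, np.integer)
    n = len(col_N)
    assert len(col_a) == len(col_b) == n
    res = np.empty(n)  # float64 by default
    for i in range(n):
        res[i] = integrate_f(col_a[i], col_b[i], int(col_N[i]))
    return res
```

A compiled `apply_integrate_f` could then be compared against it with `np.allclose` on the same three arrays.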
Was wondering if we could have had an example using the float block, but I can only make it twice as slow as yours...
Barking up the wrong tree here? (I think this is already looking like it's going to be a nice thing to write up!) |
i suppose. but u could also just do `apply_integrate_f(*df.blocks['float64'].values.T, col_N=df['N'])`. a bit terse, but it gets the job done. |
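The `df.blocks` accessor used in that one-liner is, as far as I know, not available in recent pandas releases. A roughly equivalent way to pull out the float columns, sketched here with a made-up three-row frame, is `select_dtypes`. This is an assumption about how one might write it today, not what the thread used.

```python
import pandas as pd

# Hypothetical frame with two float64 columns and one integer column
df = pd.DataFrame({"a": [0.0, 0.5, 1.0], "b": [1.0, 2.0, 3.0], "N": [100, 200, 300]})

# Select the float64 columns, then transpose so each column is one row to unpack
float_cols = df.select_dtypes(include="float64")
col_a, col_b = float_cols.to_numpy().T
col_N = df["N"].to_numpy()

# The arrays can then be passed to the compiled function, for example:
# apply_integrate_f(col_a, col_b, col_N)
```

Unpacking relies on the float columns appearing in the order `a`, `b`, which `select_dtypes` preserves from the original frame.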
Is there a neat way to get the stdout from things like prun and timeit into the docs:
seems to only capture printed things... (i.e. nothing in this case) |
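If only printed output is captured, one workaround is to use the standard-library counterparts of those magics and print the results explicitly. The sketch below shows that idea with `timeit` and `cProfile`. It is an assumption about what might help, not the approach the thread settled on, and `work` is a placeholder function.

```python
import cProfile
import io
import pstats
import timeit


def work():
    # Placeholder workload standing in for the function being measured
    return sum(i * i for i in range(10_000))


# timeit.timeit returns the elapsed seconds, so the figure can be printed
loops = 100
elapsed = timeit.timeit(work, number=loops)
print(f"{elapsed / loops * 1e6:.1f} us per loop")

# cProfile collects the stats; pstats can write the report to any text stream
profiler = cProfile.Profile()
profiler.runcall(work)
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Because both results end up going through `print`, anything that captures stdout should pick them up.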
very WIP, but here's an initial draft (see PR) |
@cpcloud Thanks for catching that, now I can see all the other errors... :s |
no prob. doc builds are finicky... |
@cpcloud it somehow came together magically at the end, it is insanely sensitive to spacing.. et al. Thanks for your help and being so patient! |
glad it worked out! |
@hayd what editor do you use? I use vim and it makes it eas(ier) to see restructuredtext errors (not perfect though, it can be frustratingly sensitive). |
(and clearly I know very little about cython right now...) |
@jtratner i don't think so. i'm not sure if there's any extra metadata in a cython function that would allow u to tell the difference between it and a python function. @jreback probably knows more. you can pass a cythonized function anyway, but if in fact there's a cython function being called at some lower level that would call your cythonized function it will be typed as object and might not give much of a performance gain. assuming your cythonized function doesn't have all sorts of loops, e.g., a polynomial, then you'll probably gain some constant factor which of course may still be useful. |
@jtratner cython is roughly python with types. it's useful for 2 things: making array looping faster and interfacing with other C code in a sane way. it does all the refcounting for u and also has some limited generic typing abilities among other things... the loops are actually rewritten almost exactly as u would hand code the c loops which really gives a lot of speedup. it also has the ability to execute some code in parallel and bypass the GIL which so far i've only found to be useful in one situation (unrelated to pandas) @hayd's tutorial is a nice starting point and then if u want more u can read the cython docs :) |
@cpcloud The toctree I think I've had issues with is for to_pickle and read_pickle, I'm sure I switched all uses of save/load with to_pickle/read_pickle (and removed the deprecated ways of calling them). Guess I missed something... I've added in cython at the end of the toctree (I think it warrants its own section?). @jtratner Once we worked out the correct syntax (and what it was caring about) it came out ok (I went through a whack-a-mole of indentation choices before that though). :( |
Please see #3965 for current draft.
Re: this topic https://groups.google.com/forum/?fromgroups#!topic/pydata/aLxALYqosOU (cc @jreback).
I think this is one of the killer features of pandas, so I think it merits being in the docs. (I'd definitely be interested in reading it!) Maybe something structured like this with a good example:
I'm not sure what would make a good toy example for this (and I think that choosing a good one is crucial).
Thoughts?