Using cython in pandas tutorial #3923

Closed
hayd opened this issue Jun 16, 2013 · 38 comments

@hayd
Contributor

hayd commented Jun 16, 2013

Please see #3965 for current draft.

Re: this topic https://groups.google.com/forum/?fromgroups#!topic/pydata/aLxALYqosOU (cc @jreback).

I think this is one of the killer features of pandas, so I think it merits being in the docs. (I'd definitely be interested in reading it!) Maybe something structured like this with a good example:

  • write in python first (unit-test, and check for speed, it may be good enough!)
  • try and rewrite in python to be more efficient (now, it may be good enough!)
  • profile to work out which part is slow (and needs cython love)
  • writing and calling a cython function (to do that slow bit faster)
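
The profiling step in the list above could look something like the following minimal sketch, which uses only the standard library's cProfile and pstats. The functions `slow_part`, `fast_part`, `pipeline` and `profile_report` are hypothetical stand-ins invented for illustration, not anything from pandas.

```python
import cProfile
import io
import pstats


def slow_part(n):
    # Deliberately slow pure-Python loop (the hypothetical hot spot).
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_part(n):
    # Cheap closed-form computation, for contrast.
    return n * (n - 1) // 2


def pipeline(n):
    return slow_part(n) + fast_part(n)


def profile_report(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep the five most expensive entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


result, report = profile_report(pipeline, 100_000)
print(report)
```

In the report, `slow_part` should account for nearly all of the cumulative time, which marks it as the candidate for a Cython rewrite.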

I'm not sure what would make a good toy example for this (and I think that choosing a good one is crucial).

Thoughts?

@cpcloud
Member

cpcloud commented Jun 16, 2013

fyi it's not entirely true that you must "manually dispatch" in the "f_float64", "f_int32", etc. way. you can use fused types, which do the dispatch for you, but they can be very difficult to debug and kind of awkward to use.
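
For readers who haven't met the pattern: "manual dispatch" means writing one specialised function per element type and choosing between them at run time. Below is a rough pure-Python sketch of the idea. It uses the type codes of the standard library's `array` module in place of NumPy dtypes, and every name in it (`total_float64`, `total_int32`, `_DISPATCH`, `total`) is hypothetical. Cython's fused types generate both the specialisations and the selection step for you.

```python
from array import array


def total_float64(values):
    # Specialised for C doubles (typecode 'd').
    return float(sum(values))


def total_int32(values):
    # Specialised for C ints (typecode 'i').
    return int(sum(values))


# Hypothetical dispatch table keyed on the array's element type code.
_DISPATCH = {"d": total_float64, "i": total_int32}


def total(values):
    """Pick the specialised implementation for this element type."""
    try:
        impl = _DISPATCH[values.typecode]
    except KeyError:
        raise TypeError("unsupported element type: %r" % values.typecode)
    return impl(values)


print(total(array("d", [1.5, 2.5])))  # 4.0
print(total(array("i", [1, 2, 3])))   # 6
```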

@hayd
Contributor Author

hayd commented Jun 16, 2013

I propose using the example from the cython docs:

def f(x):
    return x**2-x

def integrate_f(a, b, N):
    s = 0
    dx = (b-a)/N
    for i in range(int(N)):  # annoyingly int seems to be required here:  #3928
        s += f(a+i*dx)
    return s * dx
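
As a quick sanity check on the function above (a sketch, not part of the original proposal): `integrate_f` is a left Riemann sum, so for a case with a known answer it should land close to the exact integral. Over [0, 1] the integral of x**2 - x is 1/3 - 1/2 = -1/6.

```python
def f(x):
    return x**2 - x


def integrate_f(a, b, N):
    # Left Riemann sum of f over [a, b] with N equal-width strips.
    s = 0
    dx = (b - a) / N
    for i in range(int(N)):
        s += f(a + i * dx)
    return s * dx


# Exact integral of x**2 - x over [0, 1] is 1/3 - 1/2 = -1/6.
exact = -1.0 / 6.0
approx = integrate_f(0.0, 1.0, 1000)
print(abs(approx - exact) < 1e-3)  # True
```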

We want to apply that to DataFrame:

import pandas as pd
from numpy.random import randn, randint

df = pd.DataFrame({'a': randn(100), 'b': randn(100), 'N': randint(10, 100, (100))})
df.head()
    N         a         b
0  93 -0.017216  0.329569
1  84  0.354537  0.314897
2  39  2.948030 -0.263055
3  57  0.751853  1.753032
4  42 -0.378684  2.685732
df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1).head()
0   -0.041781
1    0.008825
2   -4.461920
3    0.386990
4    2.797359
dtype: float64

It has ints and float columns, so may require the blocks trick...

@hayd
Contributor Author

hayd commented Jun 16, 2013

I think this is a good example of using cython (I can put something together for this) - it shows a big speed improvement, but I'm not sure if it's a good example for leveraging numpy arrays.... ?

@jreback
Contributor

jreback commented Jun 16, 2013

ideally prob have an extended example of solving this problem using apply
then maybe using a function passed to cython (which is a cython function) which operates on and returns ndarrays (which are then wrapped in frames)

kind of like the cython ndarray example

@hayd
Contributor Author

hayd commented Jun 16, 2013

So essentially do the apply yourself (all in cython)?

@jreback
Contributor

jreback commented Jun 16, 2013

I think that would be a nice non-trivial example
maybe pass in the floats, ints
supply the integrate and f as cython functions and return the final ndarray
and provide a wrapping frame
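
The shape of that suggestion, sketched in plain Python under some loud assumptions: lists stand in for the ndarrays, the row loop stays in Python here rather than Cython, and the wrapping back into a frame is left out. The name `apply_integrate_f` matches the one used later in the thread; the body is only an illustration.

```python
def f(x):
    return x**2 - x


def integrate_f(a, b, N):
    # Same left Riemann sum as the example earlier in the thread.
    dx = (b - a) / N
    return sum(f(a + i * dx) for i in range(N)) * dx


def apply_integrate_f(col_a, col_b, col_N):
    """Integrate row by row over three equal-length columns."""
    if not (len(col_a) == len(col_b) == len(col_N)):
        raise ValueError("columns must have the same length")
    return [integrate_f(a, b, n) for a, b, n in zip(col_a, col_b, col_N)]


col_a = [0.0, 0.0, 1.0]
col_b = [1.0, 2.0, 3.0]
col_N = [100, 200, 300]
result = apply_integrate_f(col_a, col_b, col_N)
print(result)
```

The point of the design is that the function takes whole columns and returns a whole column, so the per-row Python call overhead of `df.apply` can move into compiled code.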

@hayd
Contributor Author

hayd commented Jun 16, 2013

Created working cython f and integrate f (plain and typed), working great.

Any ideas why this might compile but not import (is this the kind of thing you meant?):

import numpy as np
cimport numpy as np

cpdef apply_integrate_f(np.ndarray col_a, np.ndarray col_b, np.ndarray col_N):
    # np.float / np.int were removed in NumPy 1.24; use explicit dtypes.
    assert (col_a.dtype == np.float64 and col_b.dtype == np.float64 and col_N.dtype == np.int64)
    assert (len(col_a) == len(col_b) == len(col_N))
    cdef np.ndarray res = np.zeros(len(col_a), dtype=np.float64)
    # cdef np.ndarray dx = col_a * col_b / col_N
    for i in range(len(col_a)):
        res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    return res

It comes up with a lovely message part way through the stacktrace :)

# XXX -- this is a Vile HACK!
...lots like this
Users/234BroadWalk/.pyxbld/temp.macosx-10.6-intel-2.7/pyrex/integrate.c:4069: error: ‘PyUFuncObject’ undeclared (first use in this function)
lipo: can't figure out the architecture type of: /var/folders/hc/qwq7bjd535xgr4_vl7kjkxsw0000gp/T//cc3pLpao.out
---------------------------------------------------------------------------
... can post all if helpful?
ImportError: Building module integrate failed: ["CompileError: command 'gcc-4.2' failed with exit status 1\n"]

@jreback
Contributor

jreback commented Jun 16, 2013

never seen that one
can u show integrate_f_typed?

@hayd
Contributor Author

hayd commented Jun 16, 2013

cdef double f_typed(double x) except? -2:
    return x**2-x

cpdef integrate_f_typed(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b-a)/N
    for i in range(N):
        s += f_typed(a+i*dx)
    return s * dx

These are direct copies from the cython example. :s

@cpcloud
Member

cpcloud commented Jun 16, 2013

@hayd is that error the first one out of the compiler?

@hayd
Contributor Author

hayd commented Jun 16, 2013

So I compile it just like this

[~/pandas]$ cython integrate.pyx
[~/pandas]$

and import in ipython like this:

In [3]: import pyximport; pyximport.install()
Out[3]: (None, <pyximport.pyximport.PyxImporter at 0x1042a5dd0>)

In [4]: import integrate
# ImportError: Building module integrate failed: ["CompileError: command 'gcc-4.2' failed with exit status 1\n"]

This method works for the other functions (when apply_integrate_f is not in the pyx file)...

@cpcloud
Member

cpcloud commented Jun 16, 2013

why not just paste into ipython

@hayd
Contributor Author

hayd commented Jun 16, 2013

@cpcloud ? I think I'm missing something fundamental here.

I just tried using %%cython_inline but I get a CompilerCrash, from AssertionError: Not yet supporting any cimports/includes from string code snippets on the cimport numpy line. :S

@cpcloud
Member

cpcloud commented Jun 17, 2013

oh i used %%cython and then copypasted each function separately

@hayd
Contributor Author

hayd commented Jun 17, 2013

Somewhat confusingly this worked first time...!! :) So, I guess nothing was wrong with the functions!

In [13]: %timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)  # python
10 loops, best of 3: 37.6 ms per loop

In [14]: %timeit df.apply(lambda x: integrate_f_plain(x['a'], x['b'], x['N']), axis=1)  # cythonised
100 loops, best of 3: 11.8 ms per loop

In [15]: %timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1) # cythonised with type
100 loops, best of 3: 3.57 ms per loop

In [16]: %timeit apply_integrate_f(df['a'], df['b'], df['N']) # cythonised apply
1000 loops, best of 3: 1.2 ms per loop
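
To put those numbers side by side, here is a small sketch that turns the timings quoted above into speedup factors relative to the pure-Python apply. The dictionary labels are just shorthand for the four runs.

```python
# Best-of-3 timings quoted above, in milliseconds per loop.
timings_ms = {
    "python apply": 37.6,
    "cythonised (plain)": 11.8,
    "cythonised (typed)": 3.57,
    "cythonised apply": 1.2,
}

baseline = timings_ms["python apply"]
speedups = {name: baseline / ms for name, ms in timings_ms.items()}

for name, factor in speedups.items():
    print("%-20s %5.1fx" % (name, factor))
```

That works out to roughly 3x for plain cythonising, about 10x with types, and about 31x once the apply loop itself moves into Cython.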

I think this probably makes quite an ok example. It doesn't make use of a multi-dimensional ndarray (only 1D), but nonetheless I think it's not too bad. Definitely shows the benefits!

Oh... maybe I can grab the float blocks using .blocks; will see if that makes it even faster.

@cpcloud
Member

cpcloud commented Jun 17, 2013

this

cpdef apply_integrate_f(np.ndarray col_a, np.ndarray col_b, np.ndarray col_N):
    # np.float / np.int were removed in NumPy 1.24; use explicit dtypes.
    assert (col_a.dtype == np.float64 and col_b.dtype == np.float64 and col_N.dtype == np.int64)
    assert (len(col_a) == len(col_b) == len(col_N))
    cdef np.ndarray res = np.zeros(len(col_a), dtype=np.float64)
    # cdef np.ndarray dx = col_a * col_b / col_N
    for i in range(len(col_a)):
        res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    return res

could be changed to

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.ndarray[double] apply_integrate_f(np.ndarray[double] col_a, np.ndarray[double] col_b, np.ndarray[Py_ssize_t] col_N):
    cdef Py_ssize_t i, n = len(col_N)
    assert len(col_a) == len(col_b) == n  # only because of above decorators
    cdef np.ndarray[double] res = np.empty(n)  # does float by default
    for i in range(n):
        res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    return res

for some more speedup

@hayd
Contributor Author

hayd commented Jun 17, 2013

Wowza, that's looking swish! For me though: Cdef functions/classes cannot take arbitrary decorators. :s

@cpcloud
Member

cpcloud commented Jun 17, 2013

u need to do cimport cython, sorry, i will correct.

@cpcloud
Member

cpcloud commented Jun 17, 2013

fyi make sure u give the loop variable a type if it makes sense since i think cython will use an object if u don't

@hayd
Contributor Author

hayd commented Jun 17, 2013

ah of course! It's getting quite late, my brain has stopped working.

What's the Py_ssize_t i stuff about?

@cpcloud
Member

cpcloud commented Jun 17, 2013

python indexing type
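
To expand on that a little: `Py_ssize_t` is the signed integer type CPython uses for sizes and indices, and `sys.maxsize` is documented as the largest value it can hold. A small sketch for inspecting it from Python, assuming `ctypes.c_ssize_t` has the same width as `Py_ssize_t` on your platform (it does on the usual ones):

```python
import ctypes
import sys

# Width of the C ssize_t type, in bits (typically 32 or 64).
bits = 8 * ctypes.sizeof(ctypes.c_ssize_t)
print("ssize_t width: %d bits" % bits)

# sys.maxsize is the largest value a Py_ssize_t can hold.
matches = sys.maxsize == 2 ** (bits - 1) - 1
print("sys.maxsize == 2**(bits-1) - 1:", matches)

# The type is signed, so it can also represent the negative
# indices that Python allows.
items = [10, 20, 30]
print(items[-1])  # 30
```

Using it for loop counters and lengths in Cython means the index variable is a plain C integer wide enough for any array you can actually allocate.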

@hayd
Contributor Author

hayd commented Jun 17, 2013

Wow!

In [35]: %timeit apply_integrate_f_wrap(df['a'], df['b'], df['N'])
1000 loops, best of 3: 354 us per loop

@cpcloud
Member

cpcloud commented Jun 17, 2013

u might be able to squeeze even more out if u use cython memoryviews

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.ndarray[double] apply_integrate_f(double[:] col_a, double[:] col_b, Py_ssize_t[:] col_N):
    cdef Py_ssize_t i, n = len(col_N)
    assert len(col_a) == len(col_b) == n  # only because of above decorators
    cdef np.ndarray[double] res = np.empty(n)  # does float by default
    for i in range(n):
        res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    return res
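
Cython's typed memoryviews (`double[:]`) are built on the buffer protocol, which is the same protocol behind Python's built-in `memoryview`. The sketch below uses the built-in, not Cython's typed memoryview, so it only illustrates the sharing semantics (no copy, writes go through to the underlying buffer) and says nothing about speed.

```python
from array import array

data = array("d", [1.0, 2.0, 3.0])   # contiguous buffer of C doubles
view = memoryview(data)              # no copy: shares data's memory

print(view.format, view.itemsize, view.shape)

view[0] = 10.0                       # writes through to the array
print(data[0])

total = 0.0
for i in range(len(view)):           # index it like a 1-D double[:]
    total += view[i]
print(total)
```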

@hayd
Contributor Author

hayd commented Jun 17, 2013

Was wondering if we could have had an example using the float block, but I can only make it twice as slow as yours...

apply_integrate_f_wrap_blocks(df.blocks['float64'].values, df['N'])

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.ndarray[double] apply_integrate_f_wrap_blocks(np.ndarray[double, ndim=2] cols_ab, np.ndarray[Py_ssize_t] col_N):
    cdef Py_ssize_t i, n = len(col_N)
    # assert shape
    assert len(cols_ab) == n  # only because of above decorators
    cdef np.ndarray[double] res = np.empty(n)  # does float by default
    for i in range(n):
        res[i] = integrate_f_typed(cols_ab[i][0], cols_ab[i][1], col_N[i])
    return res

Barking up the wrong tree here?

(I think this is already looking like it's going to be a nice thing to write up!)

@cpcloud
Member

cpcloud commented Jun 17, 2013

i suppose. but u could also just do

apply_integrate_f(*df.blocks['float64'].values.T, col_N=df['N'])

a bit terse, but it gets the job done.
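
What that one-liner is doing, sketched without pandas: transposing a row-major block gives one sequence per column, and the leading `*` passes those columns as positional arguments. Here `zip(*rows)` plays the role of `.values.T`, and `step_sizes` is a hypothetical stand-in for `apply_integrate_f`. Note the trick relies on the float block's columns being in the order (a, b).

```python
def step_sizes(col_a, col_b, col_N):
    # Hypothetical stand-in for apply_integrate_f: one dx per row.
    return [(b - a) / n for a, b, n in zip(col_a, col_b, col_N)]


# Row-major float block: each row is (a, b).
float_block = [(0.0, 1.0), (1.0, 3.0), (2.0, 6.0)]
col_N = [10, 20, 40]

# zip(*rows) transposes rows into columns; the outer * passes the two
# columns as the first two positional arguments.
result = step_sizes(*zip(*float_block), col_N=col_N)
print(result)  # [0.1, 0.1, 0.1]
```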

@hayd
Contributor Author

hayd commented Jun 17, 2013

Is there a neat way to get the stdout from things like prun and timeit into the docs:

.. ipython:: python
   :verbatim:

   %timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)

seems to only capture printed things... (i.e. nothing in this case)
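
Since only printed output seems to be captured, one possible workaround (a sketch, not necessarily what the docs ended up doing) is to time with the standard library's `timeit` module and print the result explicitly. `work` and `best_time` are hypothetical names for illustration.

```python
import timeit


def work():
    # Hypothetical stand-in for the expression being timed.
    return sum(i * i for i in range(1000))


def best_time(func, number=100, repeat=3):
    """Return the best per-call time in seconds over `repeat` runs."""
    runs = timeit.repeat(func, number=number, repeat=repeat)
    return min(runs) / number


seconds = best_time(work)
# Printing explicitly means anything that captures stdout sees the result.
print("best of 3: %.1f us per loop" % (seconds * 1e6))
```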

@hayd
Contributor Author

hayd commented Jun 17, 2013

very WIP, but here's an initial draft (see PR)

@hayd
Contributor Author

hayd commented Jun 17, 2013

@cpcloud Thanks for catching that, now I can see all the other errors... :s

@cpcloud
Member

cpcloud commented Jun 17, 2013

no prob. doc builds are finicky...

@hayd
Contributor Author

hayd commented Jun 19, 2013

@cpcloud it somehow came together magically at the end, it is insanely sensitive to spacing.. et al. Thanks for your help and being so patient!

@cpcloud
Member

cpcloud commented Jun 19, 2013

glad it worked out!

@jtratner
Contributor

@hayd what editor do you use? I use vim and it makes it eas(ier) to see restructuredtext errors (not perfect though, it can be frustratingly sensitive).

@jtratner
Contributor

@hayd @cpcloud Does pandas do anything special if you pass a cythonized function to things like groupby or apply? It'd be cool to be able to stay on the other side of the C ABI if you can pass a cythonized function.

@jtratner
Contributor

(and clearly I know very little about cython right now...)

@cpcloud
Member

cpcloud commented Jun 20, 2013

toctree isn't that complicated :) it's basically an index that allows you to refer to other documents without having to use paths and what not and also allows customization of the table of contents output. @hayd u should put cython.rst in the toctree if you want it to show up in the navbar in the docs

@cpcloud
Member

cpcloud commented Jun 20, 2013

@jtratner i don't think so. i'm not sure if there's any extra metadata in a cython function that would allow u to tell the difference between it and a python function. @jreback probably knows more. you can pass a cythonized function anyway, but if in fact there's a cython function being called at some lower level that would call your cythonized function it will be typed as object and might not give much of a performance gain. assuming your cythonized function doesn't have all sorts of loops, e.g., a polynomial, then you'll probably gain some constant factor which of course may still be useful.

@cpcloud
Member

cpcloud commented Jun 20, 2013

@jtratner cython is roughly python with types. it's useful for 2 things: making array looping faster and interfacing with other C code in a sane way. it does all the refcounting for u and also has some limited generic typing abilities among other things... the loops are actually rewritten almost exactly as u would hand code the c loops which really gives a lot of speedup. it also has the ability to execute some code in parallel and bypass the GIL which so far i've only found to be useful in one situation (unrelated to pandas) @hayd's tutorial is a nice starting point and then if u want more u can read the cython docs :)

@hayd
Contributor Author

hayd commented Jun 20, 2013

@cpcloud The toctree I think I've had issues with is for to_pickle and read_pickle, I'm sure I switched all uses of save/load with to_pickle/read_pickle (and removed the deprecated ways of calling them). Guess I missed something...

I've added in cython at the end of the toctree (I think it warrants its own section?).

@jtratner Once we worked out the correct syntax (and what it was caring about) it came out ok (I went through a whack-a-mole of indentation choices before that though). :(

hayd closed this as completed Jun 21, 2013

4 participants