DataFrame.mean takes a very long time with mixed dtype columns #6662

cpcloud · 2014-03-18T16:11:24Z

If I want to compute the mean over some numeric columns and the frame also has at least one string column, the computation takes far too long. With a large enough frame I can't even hit Ctrl-C to interrupt the process. Interestingly, the string columns are discarded in the final result. The perf difference is about a factor of 800 when the frame has 10000 rows.

In [18]: n = 10000

In [19]: df = DataFrame(randn(n, 2), columns=list('ab'))

In [20]: df['c'] = [pd.util.testing.rands(5) for _ in xrange(n)]

In [21]: df.head(10)
Out[21]:
        a       b      c
0  1.0393  0.5719  AVi6V
1  0.6642  0.7441  mtqXk
2 -1.1552  0.1583  euUoo
3  0.7759  0.7647  cAAk2
4 -0.4958  0.4079  TYRRj
5 -0.7168 -1.1523  YT34i
6  1.5557 -1.7054  vXtgM
7  0.2898 -0.4858  2Rs1P
8  0.3752  0.2802  4UUz1
9 -0.2449 -2.3170  Bbue3

[10 rows x 3 columns]

In [22]: timeit df.mean()
10 loops, best of 3: 48.6 ms per loop

In [23]: dfnum = df[['a', 'b']]

In [24]: timeit dfnum.mean()
10000 loops, best of 3: 61.6 µs per loop

In [25]: 48.6 * 1000 / 61.6
Out[25]: 788.961038961039 # this is huge

cpcloud · 2014-03-18T16:14:39Z

Hm, I see the numeric_only flag. I guess that's the solution.
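For reference, a minimal sketch of that workaround, assuming a reasonably recent pandas and NumPy. The frame mirrors the one in the report, but the string values and the seed are made up for illustration, and no timings are claimed here.

```python
import numpy as np
import pandas as pd

n = 10000
rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Two float columns plus one string (object) column, as in the report.
df = pd.DataFrame(rng.standard_normal((n, 2)), columns=list("ab"))
df["c"] = [f"s{i:05d}" for i in range(n)]  # illustrative string values

# Ask mean() to consider only numeric columns, skipping the string column.
means_flag = df.mean(numeric_only=True)

# Alternative: select the numeric columns explicitly before reducing.
means_select = df.select_dtypes(include="number").mean()

# Baseline from the report: slice out the numeric columns by name.
means_slice = df[["a", "b"]].mean()

print(means_flag)
```

All three results should contain only the a and b columns and agree with each other; which one is fastest depends on the pandas version, so it is worth timing them on the actual data.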

jreback · 2014-03-18T17:18:25Z

See #4787; we should probably change the default.

jreback · 2014-03-18T17:19:02Z

@cpcloud do you want to take that issue? Pretty straightforward, I think.

cpcloud · 2014-03-18T17:20:01Z

Sure

cpcloud closed this as completed Mar 18, 2014
