If a frame has some numeric columns whose mean I want to compute and at least one string column, df.mean() takes far too long. If the frame is large enough, I can't even hit Ctrl-C to interrupt the process. Interestingly, the string columns are discarded in the final result anyway. The performance difference is about a factor of 800 when the frame has 10000 rows.
In [18]: n = 10000
In [19]: df = DataFrame(randn(n, 2), columns=list('ab'))
In [20]: df['c'] = [pd.util.testing.rands(5) for _ in xrange(n)]
In [21]: df.head(10)
Out[21]:
a b c
0 1.0393 0.5719 AVi6V
1 0.6642 0.7441 mtqXk
2 -1.1552 0.1583 euUoo
3 0.7759 0.7647 cAAk2
4 -0.4958 0.4079 TYRRj
5 -0.7168 -1.1523 YT34i
6 1.5557 -1.7054 vXtgM
7 0.2898 -0.4858 2Rs1P
8 0.3752 0.2802 4UUz1
9 -0.2449 -2.3170 Bbue3
[10 rows x 3 columns]
In [22]: timeit df.mean()
10 loops, best of 3: 48.6 ms per loop
In [23]: dfnum = df[['a', 'b']]
In [24]: timeit dfnum.mean()
10000 loops, best of 3: 61.6 µs per loop
In [25]: 48.6 * 1000 / 61.6
Out[25]: 788.961038961039 # this is huge
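The timings above suggest a workaround: drop the non-numeric columns before reducing, as dfnum does. Below is a minimal sketch of that idea, assuming a recent pandas and NumPy on Python 3. The random-string construction is only a stand-in for pd.util.testing.rands, and select_dtypes is one way to pick the numeric columns without naming them. It does not reproduce the timings.

```python
import numpy as np
import pandas as pd

n = 10_000
rng = np.random.default_rng(0)

# Two numeric columns plus one string column, mirroring the frame in the report.
df = pd.DataFrame(rng.standard_normal((n, 2)), columns=list("ab"))
letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))
# Stand-in for pd.util.testing.rands(5): random 5-letter strings.
df["c"] = ["".join(row) for row in rng.choice(letters, size=(n, 5))]

# Select numeric columns by dtype, then reduce only those.
numeric_means = df.select_dtypes(include="number").mean()

# Equivalent to naming the columns explicitly, as dfnum does above.
explicit_means = df[["a", "b"]].mean()

print(numeric_means)
```

Selecting by dtype keeps the call independent of the column names, so it still works if more string columns are added later.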