Pandas quantile function very slow #11623

Closed
pckapps opened this issue Nov 17, 2015 · 11 comments
Labels: Numeric Operations, Performance

Comments

@pckapps

pckapps commented Nov 17, 2015

The quantile function is almost 10,000 times slower than the equivalent percentile function in NumPy. See the code below:

import time
import pandas as pd
import numpy as np

q = np.array([0.1, 0.4, 0.6, 0.9])
data = np.random.randn(10000, 4)
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])

time1 = time.time()
pandas_quantiles = df.quantile(q, axis=1)
time2 = time.time()
print('Pandas took %0.3f ms' % ((time2 - time1) * 1000.0))

time1 = time.time()
numpy_quantiles = np.percentile(data, q * 100, axis=1)
time2 = time.time()
print('Numpy took %0.3f ms' % ((time2 - time1) * 1000.0))

print((pandas_quantiles.values == numpy_quantiles).all())
# Output:
# Pandas took 15337.531 ms
# Numpy took 1.653 ms
# True
@jreback jreback added Performance Memory or execution speed performance Numeric Operations Arithmetic, Comparison, and Logical operations labels Nov 17, 2015
@jreback jreback added this to the Next Major Release milestone Nov 17, 2015
@bashtage
Contributor

This seems to be because it computes the quantiles series by series -- so computing 10k quantiles, as this example does, incurs a lot of per-call overhead. This was presumably done as a simplification to handle different types such as Timestamp. It also handles nulls by default (as most pandas functions do), which adds further cost (lots of notnull() calls) that NumPy doesn't pay.

Ultimately df.quantile is just calling np.percentile N times, where N is the length of the axis. The simplest improvement would be a fast path for numeric_only=True when there are no nulls, although the requirement to drop nulls can't easily be pushed down into a single NumPy call on the block.
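
For illustration only, a minimal sketch of what such a numeric fast path could look like; the name fast_quantile is hypothetical, and this ignores mixed dtypes and the interpolation options:

import numpy as np
import pandas as pd

def fast_quantile(df, q, axis=0):
    # Hypothetical fast path: one vectorised np.percentile call when the
    # frame is all-numeric and contains no nulls; otherwise fall back to
    # the existing per-series path.
    values = df.values
    q = np.asarray(q)
    if np.issubdtype(values.dtype, np.number) and not np.isnan(values).any():
        result = np.percentile(values, q * 100, axis=axis)
        labels = df.columns if axis == 0 else df.index
        return pd.DataFrame(result, index=q, columns=labels)
    return df.quantile(q, axis=axis)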

@jreback
Contributor

jreback commented Nov 19, 2015

these can simply be done block-by-block. we do this with almost all other functions already.

@bashtage
Contributor

I think the null-handling prevents trivial application even block-by-block. See my revised comment.

@max-sixty
Contributor

There's an np.nanpercentile which could be used to go block by block
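
A rough sketch of what that could look like on a single block (version caveat noted in the next comment):

import numpy as np

q = np.array([0.1, 0.4, 0.6, 0.9])
values = np.random.randn(10000, 4)
values[::7, 0] = np.nan          # inject some nulls into one column

# np.nanpercentile ignores NaNs, so nulls no longer force a
# series-by-series loop; the whole block is handled in one call.
result = np.nanpercentile(values, q * 100, axis=0)   # shape (len(q), 4)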

@bashtage
Contributor

np.nanpercentile requires NumPy >= 1.9, so it would need some special-casing.

@sinhrks
Member

sinhrks commented Apr 2, 2016

After #12752

# Output:
# Pandas took 20180.231 ms
# Numpy took 6.843 ms

Compared to numpy, #12752 brings some improvement, but further work is needed.

# current:
# 15337.531 / 1.653 = 9278.603145795523

# after #12752:
# 20180 / 6.843 = 2948.9989770568463

@jreback
Contributor

jreback commented Apr 2, 2016

@sinhrks it needs to be done on a block-basis. Then it will be the same.

@sinhrks
Member

sinhrks commented Apr 2, 2016

Another bottleneck is the transposition caused by axis=1. I think this transposition can be skipped under some conditions (numeric_only, etc.).

@jreback
Contributor

jreback commented Apr 2, 2016

numpy handles the 2-d case just fine, so when it is done by blocks we just transpose, quantile, and transpose back (kind of like what .eval does). But again, this can be fixed afterwards.
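
For illustration, a small check (not pandas internals) that NumPy reduces along either axis of a 2-d block, so the transpose/compute/transpose-back pattern gives the same values:

import numpy as np

q = np.array([0.1, 0.4, 0.6, 0.9])
block = np.random.randn(10000, 4)      # one homogeneous float block

# Let NumPy reduce along axis=1 directly...
direct = np.percentile(block, q * 100, axis=1)

# ...or transpose first and reduce along axis=0; the values are identical.
via_transpose = np.percentile(block.T, q * 100, axis=0)
assert np.allclose(direct, via_transpose)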

@jreback jreback modified the milestones: 0.18.1, Next Major Release Apr 3, 2016
@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016
@jreback
Contributor

jreback commented May 10, 2016

after #13122

In [16]: %timeit df.quantile(q, axis=1)
100 loops, best of 3: 2.06 ms per loop

In [17]: %timeit np.percentile(data, q*100, axis=1)
1000 loops, best of 3: 1.23 ms per loop

jreback added a commit to jreback/pandas that referenced this issue May 12, 2016
@ashesh-0

ashesh-0 commented Feb 5, 2019

On a similar note, I see that describe() seems unnecessarily slow. Computing individual components is much faster.


In [30]: import pandas as pd
    ...: import numpy as np
    ...: df = pd.DataFrame(np.random.rand(5000,50))
    ...: print('Time taken for mean')
    ...: %timeit df.mean(axis=0)
    ...: print('Time taken for std')
    ...: %timeit df.std(axis=0)
    ...: print('Time taken for quantiles')
    ...: %timeit df.quantile([0, 0.25,0.5,0.75,1])
    ...: print('Time taken for count(just for completeness sake)')
    ...: %timeit df.shape
    ...: print('Time taken for describe')
    ...: %timeit df.describe()
    ...: 
Time taken for mean
1.81 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for std
4.72 ms ± 73.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for quantiles
11.3 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for count(just for completeness sake)
1.69 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Time taken for describe
167 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
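
For comparison, a minimal sketch (not the actual describe() implementation) that assembles the same summary from the individual, faster calls:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5000, 50))

# Hand-rolled describe(): the same statistics computed with the individual
# calls above and stacked into one frame.
quantiles = df.quantile([0.25, 0.5, 0.75])
summary = pd.DataFrame({
    'count': df.count(),
    'mean': df.mean(),
    'std': df.std(),
    'min': df.min(),
    '25%': quantiles.loc[0.25],
    '50%': quantiles.loc[0.5],
    '75%': quantiles.loc[0.75],
    'max': df.max(),
}).T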
