Pandas quantile function very slow #11623

Closed
pckapps opened this issue Nov 17, 2015 · 11 comments
Labels: Numeric Operations, Performance

Comments

@pckapps

pckapps commented Nov 17, 2015

The quantile function is almost 10,000 times slower than the equivalent percentile function in NumPy. See the code below:

import time
import pandas as pd
import numpy as np

q = np.array([0.1, 0.4, 0.6, 0.9])
data = np.random.randn(10000, 4)
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])

time1 = time.time()
pandas_quantiles = df.quantile(q, axis=1)
time2 = time.time()
print('Pandas took %0.3f ms' % ((time2 - time1) * 1000.0))

time1 = time.time()
numpy_quantiles = np.percentile(data, q * 100, axis=1)
time2 = time.time()
print('Numpy took %0.3f ms' % ((time2 - time1) * 1000.0))

print((pandas_quantiles.values == numpy_quantiles).all())
# Output:
# Pandas took 15337.531 ms
# Numpy took 1.653 ms
# True
@jreback jreback added Performance Memory or execution speed performance Numeric Operations Arithmetic, Comparison, and Logical operations labels Nov 17, 2015
@jreback jreback added this to the Next Major Release milestone Nov 17, 2015
@bashtage
Contributor

This seems to be because it computes the quantiles series by series -- so computing 10k quantiles, as this example does, incurs a lot of per-call overhead. This was presumably done as a simplification to handle different types such as Timestamp. It also handles nulls by default (as most pandas functions do), which adds further cost (lots of notnull() calls) that NumPy doesn't pay.

Ultimately df.quantile is just calling np.percentile N times, where N is the length of the axis. The simplest improvement would be a fast path for numeric_only=True when there are no nulls, although the requirement to drop nulls can't easily be pushed down into a single NumPy call on the block.
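
For illustration only, a minimal sketch of what such a numeric fast path could look like; the name fast_quantile is hypothetical, and this ignores mixed dtypes and the interpolation options:

import numpy as np
import pandas as pd

def fast_quantile(df, q, axis=0):
    # Hypothetical fast path: one vectorised np.percentile call when the
    # frame is all-numeric and contains no nulls; otherwise fall back to
    # the existing per-series path.
    values = df.values
    q = np.asarray(q)
    if np.issubdtype(values.dtype, np.number) and not np.isnan(values).any():
        result = np.percentile(values, q * 100, axis=axis)
        labels = df.columns if axis == 0 else df.index
        return pd.DataFrame(result, index=q, columns=labels)
    return df.quantile(q, axis=axis)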

@jreback
Contributor

jreback commented Nov 19, 2015

these can simply be done block-by-block. we do this with almost all other functions already.

@bashtage
Contributor

I think the null-handling prevents trivial application even block-by-block. See my revised comment.

@max-sixty
Contributor

There's an np.nanpercentile which could be used to go block by block
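
A rough sketch of what that could look like on a single block (version caveat noted in the next comment):

import numpy as np

q = np.array([0.1, 0.4, 0.6, 0.9])
values = np.random.randn(10000, 4)
values[::7, 0] = np.nan          # inject some nulls into one column

# np.nanpercentile ignores NaNs, so nulls no longer force a
# series-by-series loop; the whole block is handled in one call.
result = np.nanpercentile(values, q * 100, axis=0)   # shape (len(q), 4)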

@bashtage
Contributor

np.nanpercentile requires NumPy >= 1.9, so it would need some special-casing.

@sinhrks
Member

sinhrks commented Apr 2, 2016

After #12752

# Output:
# Pandas took 20180.231 ms
# Numpy took 6.843 ms

Compared to numpy, #12752 brings some improvement, but further work is needed.

# current:
# 15337.531 / 1.653 = 9278.603145795523

# after #12752:
# 20180 / 6.843 = 2948.9989770568463

@jreback
Contributor

jreback commented Apr 2, 2016

@sinhrks it needs to be done on a block-basis. Then it will be the same.

@sinhrks
Member

sinhrks commented Apr 2, 2016

Another bottleneck is the transposition caused by axis=1. I think this transposition can be skipped under some conditions (numeric_only, etc.).

@jreback
Contributor

jreback commented Apr 2, 2016

numpy handles the 2-d case just fine, so when it is done by blocks we just transpose, quantile, and transpose back (kind of like what .eval does). But again, this can be fixed afterwards.
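
For illustration, a small check (not pandas internals) that NumPy reduces along either axis of a 2-d block, so the transpose/compute/transpose-back pattern gives the same values:

import numpy as np

q = np.array([0.1, 0.4, 0.6, 0.9])
block = np.random.randn(10000, 4)      # one homogeneous float block

# Let NumPy reduce along axis=1 directly...
direct = np.percentile(block, q * 100, axis=1)

# ...or transpose first and reduce along axis=0; the values are identical.
via_transpose = np.percentile(block.T, q * 100, axis=0)
assert np.allclose(direct, via_transpose)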

@jreback jreback modified the milestones: 0.18.1, Next Major Release Apr 3, 2016
@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016
@jreback
Contributor

jreback commented May 10, 2016

after #13122

In [16]: %timeit df.quantile(q, axis=1)
100 loops, best of 3: 2.06 ms per loop

In [17]: %timeit np.percentile(data, q*100, axis=1)
1000 loops, best of 3: 1.23 ms per loop

jreback added a commit to jreback/pandas that referenced this issue May 12, 2016
@ashesh-0

ashesh-0 commented Feb 5, 2019

On a similar note, I see that describe() seems unnecessarily slow. Computing individual components is much faster.


In [30]: import pandas as pd
    ...: import numpy as np
    ...: df = pd.DataFrame(np.random.rand(5000,50))
    ...: print('Time taken for mean')
    ...: %timeit df.mean(axis=0)
    ...: print('Time taken for std')
    ...: %timeit df.std(axis=0)
    ...: print('Time taken for quantiles')
    ...: %timeit df.quantile([0, 0.25,0.5,0.75,1])
    ...: print('Time taken for count(just for completeness sake)')
    ...: %timeit df.shape
    ...: print('Time taken for describe')
    ...: %timeit df.describe()
    ...: 
Time taken for mean
1.81 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for std
4.72 ms ± 73.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for quantiles
11.3 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for count(just for completeness sake)
1.69 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Time taken for describe
167 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
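
For comparison, a minimal sketch (not the actual describe() implementation) that assembles the same summary from the individual, faster calls:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5000, 50))

# Hand-rolled describe(): the same statistics computed with the individual
# calls above and stacked into one frame.
quantiles = df.quantile([0.25, 0.5, 0.75])
summary = pd.DataFrame({
    'count': df.count(),
    'mean': df.mean(),
    'std': df.std(),
    'min': df.min(),
    '25%': quantiles.loc[0.25],
    '50%': quantiles.loc[0.5],
    '75%': quantiles.loc[0.75],
    'max': df.max(),
}).T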
