
pandas sort_values significantly slower on Python 3.5.2 vs. Python 2.7.12 #14103

Closed
samlalwani opened this issue Aug 28, 2016 · 2 comments
Labels: Duplicate Report (duplicate issue or pull request), Performance (memory or execution speed performance)

Comments

@samlalwani

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
from time import time
import sys

df_data = pd.DataFrame(np.random.randint(0,int(1e6),int(20e6)), columns=['pop_id'])
df_data['PL_dB'] = 50 + np.random.random(df_data.shape[0]) * 100
df_data['Rx_dBm'] = 23 - df_data.PL_dB
df_data['noise_mW'] = (10.**(df_data.Rx_dBm / 10.)).astype('float32')

start = time()
df_data.sort_values(by=['pop_id', 'Rx_dBm'], ascending=[True, False], inplace=True)
df_data.reset_index(drop=True, inplace=True)

print("Sort took {:0.2f} seconds".format(time() - start))
print('Python version ' + sys.version)
print('pandas version ' + pd.__version__)

output of pd.show_versions()

For Python 2.7

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

For Python 3.5

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: None
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.1
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Results with Python 2.7

Sort took 40.91 seconds
Python version 2.7.12 |Anaconda custom (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]
pandas version 0.18.1

Results with Python 3.5

Sort took 81.30 seconds
Python version 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
pandas version 0.18.1

@jreback
Contributor

jreback commented Aug 28, 2016

This looks the same as the issue fixed by #13436, if someone could confirm.

@jreback
Contributor

jreback commented Aug 28, 2016

Note that using inplace is fairly non-idiomatic, as it promotes less readable and more error-prone code.
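
As a minimal sketch of the non-inplace style, the sort and index reset from the original report can be chained and bound to a new name. The row count, the seed, and the name df_sorted are illustrative assumptions, not part of the report; the column names and sort keys are taken from it.

```python
import numpy as np
import pandas as pd

# Small, seeded stand-in for the reported data (the report used 20e6 rows).
rng = np.random.RandomState(0)
n_rows = 1000

df_data = pd.DataFrame({"pop_id": rng.randint(0, 100, n_rows)})
df_data["PL_dB"] = 50 + rng.random_sample(n_rows) * 100
df_data["Rx_dBm"] = 23 - df_data["PL_dB"]

# Chain the two steps and keep the result under a new name;
# df_data itself is left untouched.
df_sorted = (
    df_data
    .sort_values(by=["pop_id", "Rx_dBm"], ascending=[True, False])
    .reset_index(drop=True)
)

print(df_sorted.head())
```

Because df_data is not mutated, the unsorted frame stays available for re-running or timing the sort.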

2.7

In [2]: import pandas as pd
   ...: import numpy as np
   ...: from time import time
   ...: import sys
   ...: 
   ...: df_data = pd.DataFrame(np.random.randint(0,int(1e6),int(20e5)), columns=['pop_id'])
   ...: df_data['PL_dB'] = 50 + np.random.random(df_data.shape[0]) * 100
   ...: df_data['Rx_dBm'] = 23 - df_data.PL_dB
   ...: df_data['noise_mW'] = (10.**(df_data.Rx_dBm / 10.)).astype('float32')

In [3]: %timeit df_data.sort_values(by=['pop_id', 'Rx_dBm'], ascending=[True, False])
1 loop, best of 3: 1.86 s per loop

In [4]: pd.__version__
Out[4]: '0.18.1+403.ga0151a7'

In [5]: sys.version
Out[5]: '2.7.11 |Continuum Analytics, Inc.| (default, Dec  6 2015, 18:57:58) \n[GCC 4.2.1 (Apple Inc. build 5577)]'

3.5

In [2]: %timeit df_data.sort_values(by=['pop_id', 'Rx_dBm'], ascending=[True, False])
1 loop, best of 3: 1.76 s per loop

In [3]:  pd.__version__
   ...: 
Out[3]: '0.18.1+403.ga0151a7'

In [4]:  sys.version
   ...: 
Out[4]: '3.5.1 |Continuum Analytics, Inc.| (default, Dec  7 2015, 11:24:55) \n[GCC 4.2.1 (Apple Inc. build 5577)]'
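
For anyone wanting to confirm this outside IPython, a rough sketch of an equivalent check is below: run the same script under each interpreter and compare the best of several runs. The helper names build_frame and best_sort_time, the seed, and the reduced row count are assumptions made for illustration; they are not from the thread.

```python
import sys
import timeit

import numpy as np
import pandas as pd


def build_frame(n_rows, seed=0):
    """Build a frame shaped like the one in the report (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    df = pd.DataFrame({"pop_id": rng.randint(0, int(1e6), n_rows)})
    df["PL_dB"] = 50 + rng.random_sample(n_rows) * 100
    df["Rx_dBm"] = 23 - df["PL_dB"]
    df["noise_mW"] = (10.0 ** (df["Rx_dBm"] / 10.0)).astype("float32")
    return df


def best_sort_time(df, repeats=3):
    """Return the fastest of `repeats` non-inplace sorts, in seconds."""
    def run():
        df.sort_values(by=["pop_id", "Rx_dBm"], ascending=[True, False])
    return min(timeit.repeat(run, number=1, repeat=repeats))


n_rows = int(1e5)  # assumption: kept small so the check runs quickly
df_data = build_frame(n_rows)
best = best_sort_time(df_data)

print("Best of 3: {:0.3f} s".format(best))
print("Python version " + sys.version)
print("pandas version " + pd.__version__)
```

Taking the minimum over a few repeats mirrors what %timeit reports and reduces noise from other processes. Because the sort is not inplace, every repeat sorts the same unsorted input.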

jreback closed this as completed on Aug 28, 2016
jreback added the Performance and Duplicate Report labels on Aug 28, 2016
jreback added this to the No action milestone on Aug 28, 2016