Pandas groupby extremely slow in python3 for certain sets of single precision floating point data #13335
Labels: Dtype Conversions, Numeric Operations, Performance
In python3, with certain sets of single precision floating point data, pandas groupby is up to ~150 times slower than with the same data in python2.
Code Sample, a copy-pastable example if possible
On my machine, running
python2 this_file.py
the groupby takes around 0.1s, whereas running
python3 this_file.py
the groupby takes around 11s.
After some investigation, the run time discrepancy between python versions varies hugely with the actual data, but it seems to be largest when half the data is roughly 100 times smaller than the other half.
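For anyone wanting to try this kind of data shape, here is a minimal timing sketch. It is not the original script from this report: the helper names `make_frame` and `time_groupby`, the column names, the row count, and the choice to group directly on the float32 column are all assumptions made for illustration, so the timings it gives may not reproduce the numbers above.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows=20000, scale_ratio=100.0, seed=0):
    """Build a float32 frame where the first half of 'key' is ~scale_ratio times smaller.

    Hypothetical helper: the layout is a guess at the data shape described above.
    """
    rng = np.random.RandomState(seed)
    values = rng.rand(n_rows).astype(np.float32)
    # Shrink the first half so the two halves differ by roughly scale_ratio.
    values[: n_rows // 2] /= np.float32(scale_ratio)
    payload = np.ones(n_rows, dtype=np.float32)
    return pd.DataFrame({"key": values, "payload": payload})


def time_groupby(frame):
    """Return (elapsed_seconds, result) for a groupby-sum on the float32 key."""
    start = timeit.default_timer()
    result = frame.groupby("key")["payload"].sum()
    elapsed = timeit.default_timer() - start
    return elapsed, result


if __name__ == "__main__":
    elapsed, _ = time_groupby(make_frame())
    print("groupby took %.3fs" % elapsed)
```

Running the same file under both interpreters and varying `scale_ratio` is one way to check whether the gap depends on the spread of magnitudes.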
Having profiled this, it seems that almost all of the 11s in the python3 version is spent in this function:
https://github.com/pydata/pandas/blob/master/pandas/hashtable.pyx#L538
However, I have no idea what is causing the run time discrepancy.
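A profile along these lines can be collected with the standard library's `cProfile`. This is only a sketch of one way to do it, not necessarily how the profile above was taken. The helper name `profile_groupby` and the sample frame are hypothetical, and it is written for python3 only. Note that `cProfile` reports time at the Python call level, so time spent inside compiled Cython code such as the hashtable is attributed to whichever call entered it.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_groupby(frame, key, top=10):
    """Profile frame.groupby(key).sum() and return the report as text.

    Hypothetical helper: lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    frame.groupby(key).sum()
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


if __name__ == "__main__":
    # Illustrative float32 data only; not the data from this report.
    rng = np.random.RandomState(0)
    sample = pd.DataFrame({
        "key": rng.rand(20000).astype(np.float32),
        "payload": np.ones(20000, dtype=np.float32),
    })
    print(profile_groupby(sample, "key"))
```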
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-22-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 21.2.2
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.2.0
numexpr: 2.5.2
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.9
pymysql: 0.7.4.None
psycopg2: None
jinja2: 2.8
boto: 2.38.0
pandas_datareader: None