
Pandas groupby extremely slow in python3 for certain sets of single precision floating point data #13335

Closed
RogerThomas opened this issue May 31, 2016 · 4 comments
Labels: Dtype Conversions (Unexpected or buggy dtype conversions), Numeric Operations (Arithmetic, Comparison, and Logical operations), Performance (Memory or execution speed performance)
Milestone: 0.18.2

RogerThomas commented May 31, 2016

In Python 3, with certain sets of single-precision floating-point data, pandas groupby is up to ~150x slower than on the same data in Python 2.

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
from numpy.random import random
from time import time


def do_groupby(df):
    df.groupby(['a'])['b'].sum()


def main():
    # Two halves of data: one half roughly 100x smaller than the other,
    # both cast down to single precision.
    tmp1 = (random(10000) * 0.1).astype(np.float32)
    tmp2 = (random(10000) * 10.0).astype(np.float32)
    tmp = np.concatenate((tmp1, tmp2))
    # Repeat each value 100 times: 2,000,000 rows over ~20,000 distinct keys.
    arr = np.repeat(tmp, 100)
    df = pd.DataFrame(dict(a=arr, b=arr))
    t1 = time()
    do_groupby(df)
    print("Took: %s" % (time() - t1,))

main()

On my machine:

python2 this_file.py: the groupby takes around 0.1s
python3 this_file.py: the groupby takes around 11s

With some investigation, the run-time discrepancy between Python versions varies hugely with the actual data, but it appears to be largest when half the data is roughly 100 times smaller than the other half.
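
A rough sketch of one way to check this, varying the ratio between the two halves (the scale values here are illustrative, not the exact experiments from the report):

import numpy as np
import pandas as pd
from numpy.random import random
from time import time

# Time the same groupby while varying the ratio between the two halves of
# the data; scale=1.0 makes both halves comparable in magnitude, larger
# scales make one half proportionally smaller.
for scale in (1.0, 10.0, 100.0, 1000.0):
    tmp1 = (random(10000) * 10.0 / scale).astype(np.float32)
    tmp2 = (random(10000) * 10.0).astype(np.float32)
    arr = np.repeat(np.concatenate((tmp1, tmp2)), 100)
    df = pd.DataFrame(dict(a=arr, b=arr))
    t1 = time()
    df.groupby(['a'])['b'].sum()
    print("scale=%g took %.3fs" % (scale, time() - t1))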

Having profiled this, almost all of the 11s in the Python 3 version is spent in this method:
https://github.com/pydata/pandas/blob/master/pandas/hashtable.pyx#L538
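
For reference, a minimal profiling sketch (assuming df and do_groupby from the snippet above are defined at module level):

import cProfile
import pstats

# Profile the groupby from the reproduction script; on the affected Python 3
# setup the cumulative time is dominated by the hashtable/factorize code.
cProfile.run("do_groupby(df)", "groupby.prof")
pstats.Stats("groupby.prof").sort_stats("cumulative").print_stats(15)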

However I have no idea what is causing the run time discrepancy.

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-22-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 21.2.2
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.2.0
numexpr: 2.5.2
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.9
pymysql: 0.7.4.None
psycopg2: None
jinja2: 2.8
boto: 2.38.0
pandas_datareader: None

jreback commented May 31, 2016

This is the same issue, I think, as here: #13166

from @pitrou

PyHash_double is doing this so that hash(float(x)) == hash(x) for every integer x that's exactly representable as a double
Actually, as the comment suggests, it also ensures that the invariant holds if x is a Fraction
>>> x = Fraction(5, 4)
>>> hash(x)
576460752303423489
>>> hash(float(x))
576460752303423489

So I think we could simply change the hashing as suggested in #13166 and see. We are probably seeing LOTS of hash collisions.
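
As a rough sanity check of the collision theory (an illustration only: the builtin hash() stands in here for the Python-3-compatible float hash, and khash-style tables index buckets with the low bits of a C-level hash rather than Python's):

import numpy as np
from numpy.random import random

# The ~20,000 distinct float32-derived keys from the report.
tmp1 = (random(10000) * 0.1).astype(np.float32)
tmp2 = (random(10000) * 10.0).astype(np.float32)
keys = np.concatenate((tmp1, tmp2)).astype(np.float64)

# Bucket the keys by the low 20 bits of their hash, as a table with 2**20
# slots effectively would.
mask = (1 << 20) - 1
buckets = {hash(float(k)) & mask for k in keys}
print("distinct keys:    %d" % len(set(keys.tolist())))
print("distinct buckets: %d" % len(buckets))
# Python 3 hashes a float written as M/2**E (in lowest terms) as
# M * 2**(61 - E) reduced mod 2**61 - 1, so for these float32-derived keys
# the low bits carry almost no information and the keys collapse into a
# handful of buckets; Python 2's float hash mixes the mantissa into the low
# bits, so the same keys spread out far more evenly.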

jreback commented May 31, 2016

I suspect the upcasting from float32 -> float64 might contribute to this as well.
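
One rough way to check that (a sketch, not from the issue) is to time the same groupby with keys generated as float64 from the start versus keys round-tripped through float32:

import numpy as np
import pandas as pd
from numpy.random import random
from time import time

# Same shape of data as the report, comparing native float64 keys against
# keys round-tripped through float32 (which the groupby upcasts to float64).
tmp = np.concatenate((random(10000) * 0.1, random(10000) * 10.0))
arr64 = np.repeat(tmp, 100)
arr32 = arr64.astype(np.float32)

for name, arr in (("float64", arr64), ("float32", arr32)):
    df = pd.DataFrame(dict(a=arr, b=arr))
    t1 = time()
    df.groupby(['a'])['b'].sum()
    print("%s keys took %.3fs" % (name, time() - t1))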

@jreback added the Dtype Conversions, Numeric Operations, and Performance labels on May 31, 2016
@jreback added this to the 0.18.2 milestone on May 31, 2016
RogerThomas (Author) commented

@jreback I've been looking into this, but I'm not getting very far. Any advice on why we're seeing this issue in Python 3 and not Python 2? Since both are compiled into C code, I don't see how the Python version can have an effect.

jreback commented Jun 1, 2016

See the issue I referenced. I think you can fix it by changing the hashing code as indicated there.

I think that Python 3 is taking great care with the hashing, somewhat to the detriment of performance; that hashing is ultimately what the Float64HashTable does via the kh_* functions. The adjustment makes it much faster (and, for all intents and purposes, works the same).
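
For intuition only (this is not the actual patch; it just illustrates one direction the hashing change could take): mixing the raw 64-bit pattern of each double, rather than using the Python-compatible hash value, spreads the same keys back out over the buckets. A rough numpy sketch, using a khash-style 64-bit integer mix as an assumption:

import numpy as np
from numpy.random import random

# Rebuild the ~20,000 distinct float32-derived keys from the report.
keys = np.concatenate(((random(10000) * 0.1).astype(np.float32),
                       (random(10000) * 10.0).astype(np.float32))).astype(np.float64)

# View each key as its raw 64-bit pattern, apply a khash-style integer mix,
# and bucket by the low 20 bits.
bits = keys.view(np.uint64)
mixed = (bits >> np.uint64(33)) ^ bits ^ (bits << np.uint64(11))
buckets = np.unique(mixed & np.uint64((1 << 20) - 1))
print("distinct keys: %d, distinct buckets: %d" % (len(np.unique(keys)), len(buckets)))
# Unlike the Python-3-style hash, the bit-pattern hash varies in its low
# bits from key to key, so the keys spread over many buckets and lookups
# stay O(1) on average.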
