Float64Index is very slow in some condition. #13166

Closed
ruoyu0088 opened this issue May 13, 2016 · 8 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

@ruoyu0088

ruoyu0088 commented May 13, 2016

The following code is very slow:

import pandas as pd
import numpy as np

dt = 4.8000000418824129e-08
data = np.random.rand(1000000, 4)
df = pd.DataFrame(data, columns=list("ABCD"))
df.index *= dt
print(df.loc[0:0.001].shape)

After debugging it, I found that Float64Engine.get_loc() is slow. Here is a demo:

import pandas as pd
import numpy as np

a = np.arange(1000000)
ind1 = pd.Float64Index(a * 4.8e-08)
ind2 = pd.Float64Index(a * 4.8000000418824129e-08)

%time ind1._engine.get_loc(0)
%time ind2._engine.get_loc(0)

outputs:

Wall time: 295 ms
Wall time: 9.9 s
@jreback
Contributor

jreback commented May 13, 2016

This might have been true on older versions of pandas (maybe < 0.16.0, I don't recall the exact version), as these indexes were object-based, but they have since moved to true float-based (typed) hash tables.

In [1]: pd.__version__
Out[1]: u'0.18.1'

In [2]: a = np.arange(1000000)

In [3]: ind1 = pd.Float64Index(a * 4.8e-08)

In [4]: ind2 = pd.Float64Index(a * 4.8000000418824129e-08)

In [5]: %timeit -n 1 -r 1 ind1._engine.get_loc(0)
1 loop, best of 1: 116 ms per loop

In [6]: %timeit -n 1 -r 1 ind2._engine.get_loc(0)
1 loop, best of 1: 107 ms per loop

Closing, but please show your versions.

Note that you have to time these with a single iteration, because the first call builds the hash table, which is then cached. Repeated iterations would only tell you the lookup time, but you want the build time as well.
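The build-then-cache effect described here can be sketched without pandas. The `LazyHashIndex` class and the `time_once` helper below are hypothetical stand-ins written for illustration, not the pandas engine: the first lookup pays for building the whole table, and later lookups reuse it, which is why a single timed iteration is needed to see the build cost.

```python
import time


class LazyHashIndex:
    """Hypothetical stand-in for an index engine; this is not pandas code."""

    def __init__(self, values):
        self._values = list(values)
        self._table = None  # built on the first lookup, then reused
        self.builds = 0     # how many times the table was built

    def get_loc(self, key):
        if self._table is None:
            # The first call pays for building the whole table.
            self._table = {v: i for i, v in enumerate(self._values)}
            self.builds += 1
        return self._table[key]


def time_once(func, *args):
    """Time a single call, in the spirit of `%timeit -n 1 -r 1`."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


index = LazyHashIndex(i * 4.8e-08 for i in range(200_000))
loc, first = time_once(index.get_loc, 0.0)  # build + lookup
_, second = time_once(index.get_loc, 0.0)   # lookup only
print(f"first call: {first:.6f} s, second call: {second:.6f} s")
```

Averaging over many iterations would be dominated by the cached lookups and hide the build time that this issue is about.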

@jreback jreback closed this as completed May 13, 2016
@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Dtype Conversions Unexpected or buggy dtype conversions labels May 13, 2016
@jreback jreback added this to the No action milestone May 13, 2016
@ruoyu0088
Author

ruoyu0088 commented May 13, 2016

@jreback

I am using pandas 0.18.1 on 64-bit Python 3.5, and I confirmed this problem on both Linux and Windows 7.
pd.hashtable.Float64HashTable is slow for ind2 on my system. I think this is due to the khash library.

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.18
machine: x86_64
processor: Intel(R) Celeron(R) CPU N2840 @ 2.16GHz
byteorder: little
LC_ALL: en_US.utf8
LANG: None

@ruoyu0088
Author

@jreback It seems that Python 3.5 has the problem, but Python 2.7 has no problem. Can you reopen this issue? You can confirm this on https://try.jupyter.org/.

@jreback
Contributor

jreback commented May 13, 2016

Hmm, interesting.

Do you want to try profiling the Cython?

@jreback jreback reopened this May 13, 2016
@ruoyu0088
Author

I think the problem is here, but I don't know why:

khash_python.h

#define kh_float64_hash_func _Py_HashDouble

I changed the line to the following code, which views the double value as an int64 and uses the same formula as kh_int64_hash_func:

inline khint64_t asint64(double key)
{
  return *(khint64_t *)(&key);
}

#define kh_float64_hash_func(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)

The %time result is almost the same, and it's even 2x faster for ind0.
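For readers who want to experiment outside C, here is a minimal Python sketch of the same idea. The names `float64_bits_hash` and `distinct_buckets` are made up for illustration, and so are the table size and the assumption that a bucket is chosen as `hash & (n_buckets - 1)`. The first function reinterprets the double's bytes as an unsigned 64-bit integer and applies the mixing formula from the macro above. The second counts how many buckets a key set occupies, which is one rough way to compare it against the interpreter's own `hash()` for floats.

```python
import struct


def float64_bits_hash(key):
    """Hash a double from its raw bits, mirroring the macro above."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", key))
    # Same mixing as the macro: (k >> 33) ^ k ^ (k << 11), kept to 32 bits.
    return ((bits >> 33) ^ bits ^ (bits << 11)) & 0xFFFFFFFF


def distinct_buckets(hash_func, keys, n_buckets):
    """Count occupied buckets, ASSUMING bucket = hash & (n_buckets - 1)."""
    mask = n_buckets - 1
    return len({hash_func(k) & mask for k in keys})


keys = [i * 4.8000000418824129e-08 for i in range(100_000)]
n_buckets = 1 << 18  # hypothetical table size, a power of two

print("bit-pattern hash:", distinct_buckets(float64_bits_hash, keys, n_buckets))
print("Python hash():   ", distinct_buckets(hash, keys, n_buckets))
```

One caveat of hashing the raw bits: 0.0 and -0.0 compare equal but have different bit patterns, so they get different hash values here.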

@jreback
Contributor

jreback commented May 13, 2016

Does that break any tests? Can you run the asv suite as well (and add a benchmark for this)?

@jreback
Copy link
Contributor

jreback commented May 13, 2016

https://github.com/python/cpython/blob/master/Python/pyhash.c#L85 is the existing _Py_HashDouble.

It's probably generating 'better' hashes than your change, but at the end of the day I don't see why that's preferable.
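One property of Python's float hash that is easy to check from the interpreter: numbers that compare equal hash equal across `int`, `float` and `Fraction`. The short sketch below only demonstrates that property. Whether it matters for a hash table that holds nothing but float64 keys is the open question in this thread.

```python
import sys
from fractions import Fraction

# Numbers that compare equal hash equal, regardless of numeric type.
print(hash(1.0) == hash(1))                # True
print(hash(0.5) == hash(Fraction(1, 2)))   # True

# The modulus used by the interpreter's numeric hash on this build.
print("numeric hash modulus:", sys.hash_info.modulus)
```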

@jreback
Copy link
Contributor

jreback commented May 31, 2016

@ruoyu0088 see also #13335
