-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Introducing hash tables for complex64 and complex128 #38179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
FYI @jbrockmendel Some polishing is still needed and probably additional testing, but this is how it could look like. Maybe you know a better solution for the whole numpy's <->cython's complex-business. There are also some details, which maybe need some tinkering:
|
@@ -606,14 +681,15 @@ cdef class {{name}}HashTable(HashTable): | |||
ignore_na=True, return_inverse=True) | |||
return labels | |||
|
|||
{{if dtype == 'int64'}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why adding this only for int64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is used just in int64-version:
Lines 598 to 603 in aad85ad
table = hashtable.Int64HashTable(size_hint) | |
group_index = ensure_int64(group_index) | |
# note, group labels come out ascending (ie, 1,2,3 etc) | |
comp_ids, obs_group_ids = table.get_labels_groupby(group_index) |
The logic in this function:
pandas/pandas/_libs/hashtable_class_helper.pxi.in
Lines 628 to 633 in 1fc5efd
# specific for groupby | |
{{if dtype != 'uint64'}} | |
if val < 0: | |
labels[i] = -1 | |
continue | |
{{endif}} |
IIUC, negative numbers are special and should be skipped, but this is only a convention for the meaning of values group_index
. For other types there is no such convention. Instead of inventing something, I've decided to delete unused functions/versions.
Failures seem to be unrelated to my changes... |
Yes. The npdev failure I think should be fixed by #38209 and the other one just annoyingly happens sometimes because hypothesis is slow. Don't worry about either. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good a few questions.
pandas/tests/libs/test_hashtable.py
Outdated
@@ -80,6 +82,8 @@ def test_map(self, table_type, dtype): | |||
table = table_type() | |||
keys = np.arange(N).astype(dtype) | |||
vals = np.arange(N).astype(np.int64) + N | |||
keys.flags.writeable = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize all of these on writeable True/False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: 073222e
|
||
{{for c_type in float_types}} | ||
|
||
cdef bint is_nan_{{c_type}}({{c_type}} val) nogil: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't define this for int types (as a no-op)? would it make the logic simpler if we did this (then we can always call a parameterized is_nan ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we already do. I've reshuffled the code, so it becomes more obvious.
e73bb2c
to
0cce398
Compare
can you merge master and will look again |
…ill not be dtype_t
…2, so making a workaround
…nly for int64. What should happen for other types (encoding nans as negative values is not possible with all types) depends on how it is used (but it is not used right now)
0cce398
to
9930a55
Compare
ok looks good. this complex 64/128 are available on all platforms? |
@jreback It should work on platforms which have 64bit- and 32bit floating numbers. As we don't really use complex arithmetic it is not important whether C-compiler supports complex-numbers or not. |
thanks @realead |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Next step for #33287
The complex case is somewhat less straight forward, because Cython defines
complex128
/complex64
as:where
double complex
actually means a Cython-datatype defined here https://github.com/cython/cython/blob/master/Cython/Utility/Complex.c#L56Thus a conversion between numpy's complex and Cython's complex must happens somewhere.