Skip to content

CLN: Use signed integers in khash maps for signed integer keys #38882

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jan 3, 2021

Conversation

realead
Copy link
Contributor

@realead realead commented Jan 1, 2021

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

The goal of this PR is to make this comment obsolete:

// we implicitly convert signed int to unsigned int, thus potential overflows
// for operations (<<,*,+) don't trigger undefined behavior, also >>-operator
// is implementation defined for signed ints if sign-bit is set.
// because we never really "get" the keys, there will be no convertion from
// unsigend int to (signed) int (which would be implementation defined behavior)
// this holds also for 64-, 16- and 8-bit integers

see also this discussion for even more details.

The issue is:

  • the current way (inherited from khash-original) is too clever (and suprising)
  • it is wrong as the (unsigned) keys are converted back to signed values (which is implementation defined behavior) for example here:
    result_keys[i] = <{{dtype}}>table.keys[k]

Using signed integer, one must take into account that hash-function with signed integers might have undefined/implementation defined behavior, thus we cast signed integers to unsigned counterpart in the (64bit-) hash function now.

@jreback jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jan 1, 2021
@jreback jreback added this to the 1.3 milestone Jan 1, 2021
@jreback
Copy link
Contributor

jreback commented Jan 1, 2021

this looks fine. I assume that nothing in the current impl breaks because of this (e.g. converting int -> uint is safe as far as our tests go). any perf implications?

@realead
Copy link
Contributor Author

realead commented Jan 2, 2021

@jreback I expect no perf changes and ran asv only for a subset of tests ( -b hash_functions -b ^algorithms -b ^indexing) and there were no changes (albeit hash_functions.NumericSeriesIndexing seems to be somewhat flacky on my machine - but I have observed this behavior in other PRs as well).

@jreback jreback merged commit 39e1261 into pandas-dev:master Jan 3, 2021
@jreback
Copy link
Contributor

jreback commented Jan 3, 2021

thanks @realead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants