-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[ENH] introducing IntpHashMap and making unique_label_indices use intp #40653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jbrockmendel FYI: this is implementation of your idea in #40635 (#40635 (comment)). I think it should work, but let's see... |
Can the kh_foo do the same thing? In particular unique_label_indices i think should take and return intp |
In my understanding
Here we need |
@jbrockmendel do you really need |
If unique_label_indices was changed to intp i would find that immediately useful. IntpHashMap I expect would become useful, but I don't have anything immediately in mind. |
@jbrockmendel I have introduced You either can replace |
unique_label_indices is only called in one non-test place, so we only need the intp version. Does that simplify things? The solution I was going to try before you got to it first was going to look something like:
And the part I hadn't figured out would be figuring out intp at compile-time so these could all be made cdef |
@jbrockmendel now unique_label_indices uses intp. Not sure, the failures are due to my changes. |
failures look unrelated, yah |
tangential: would it be feasible to make a kh_base_t or something that all the other kh_foo_t subclass? I think we could de-duplicate a bunch of code if we could declare |
I don't see, how this could be easily done. E.g. in c++ the typical solution is to use templates, i.e. A different example are Java's Maps (at least prior to 1.5), but Java is somewhat different because everything is an Object and dynamic cast is pretty good supported by the language itself (and yet I always fellt c++ solution to be "clearer" and also faster). Trying to use type erasion (e.g. casting key-pointer to void* and creating virtual table for kh_-function depending on type) would probably work but I doubt the code will be clearer and simpler/less error prone than now. A slightly tangential remark: Because tempita is quite powerful, I think it would not be that easy to switch to fused types or at least some artificial tricks will be needed to achieve something similar without tempita (so the question is really whether the resulting code will be really easier). For me the only real downside of using tempita are the big resulting object files - there is not really a code duplication (because different code versions are created by tempita) as far as I can see. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM cc @jreback
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
@realead can you merge master to resolve conflicts |
833d1d1
to
625fa98
Compare
pandas/_libs/hashtable.pyx
Outdated
ismember_intp = ismember_int32 | ||
mode_intp = mode_int32 | ||
unique_label_indices = _unique_label_indices_int64 | ||
unique_label_indices_intp = _unique_label_indices_int32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value_count_intp = value_count_int32 & Co no longer needed, now there is just value_count which works as supposed to work.
@simonjayhawkins Done. I think the failing tests are unrelated. |
625fa98
to
0cba866
Compare
Is this PR still needed/should be pursued further? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
0cba866
to
66c4727
Compare
you good here @jreback? |
will look later any resolution on the numpy pin? |
simon has a PR that i think will be ready soonish to handle that. this PR should be unaffected |
@realead is this still relevant? |
66c4727
to
9a44401
Compare
@jreback it was a request by @jbrockmendel (#40635 (comment)) not sure, whether it is still needed/wanted. |
the motivation hasn't changed: the one place where we use unique_label_indices we have ndarray[intp] and have to use In practice, it's not that big of a deal since on most platforms ensure_int64 is a no-op. So there's a reasonable case to be made that it isn't worth the effort or build-size increase. |
ok this seems reasonable for completeness. |
ok this seems reasonable for completeness. |
Introduces IntpHashMap. Instead of making it from the scratch, we map the functionality either to Int32HashMap or Int64HashMap depending on the plattform. This has the advantage, that the resulting so/pyd-files don't get bigger.