Skip to content

[ENH] introducing IntpHashMap and making unique_label_indices use intp #40653

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 7, 2021

Conversation

realead
Copy link
Contributor

@realead realead commented Mar 27, 2021

  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them

Introduces IntpHashMap. Instead of making it from the scratch, we map the functionality either to Int32HashMap or Int64HashMap depending on the plattform. This has the advantage, that the resulting so/pyd-files don't get bigger.

@realead
Copy link
Contributor Author

realead commented Mar 27, 2021

@jbrockmendel FYI: this is implementation of your idea in #40635 (#40635 (comment)). I think it should work, but let's see...

@jbrockmendel
Copy link
Member

Can the kh_foo do the same thing? In particular unique_label_indices i think should take and return intp

@realead
Copy link
Contributor Author

realead commented Mar 28, 2021

Can the kh_foo do the same thing?

In my understanding np.intp is just an alias for np.int64 on 64bit system and for np.int32 on 32bit system, thus calling right kh_foo_intXX (XX=64 or 32) with np.intp_t would do the right thing.

In particular unique_label_indices i think should take and return intp.

Here we need unique_label_indices_int64 (we have) and unique_label_indices_int32 (we don't have yet), then we could map unique_label_indiceseither to int64- or int32-version, which would become the intp-version.

@realead
Copy link
Contributor Author

realead commented Mar 28, 2021

@jbrockmendel do you really need IntpHashMap or just unique_label_indices which takes intp and returns intp?

@jbrockmendel
Copy link
Member

@jbrockmendel do you really need IntpHashMap or just unique_label_indices which takes intp and returns intp?

If unique_label_indices was changed to intp i would find that immediately useful. IntpHashMap I expect would become useful, but I don't have anything immediately in mind.

@realead realead changed the title [ENH] introducing IntpHashMap [ENH] introducing IntpHashMap and unique_label_indices_intp Mar 28, 2021
@realead
Copy link
Contributor Author

realead commented Mar 28, 2021

@jbrockmendel I have introduced unique_label_indices_intp. If this solution is too clever, we could do the "boring" dance and introduce IntpHashMap similar to the way other maps are done.

You either can replace unique_label_indices one after another through unique_label_indices_intp or redefine unique_label_indices in one go.

@jbrockmendel
Copy link
Member

unique_label_indices is only called in one non-test place, so we only need the intp version. Does that simplify things? The solution I was going to try before you got to it first was going to look something like:

if np.dtype(np.intp) == np.dtype(np.int64):
    IntpHashTable = Int64HashTable
    IntpVector = Int64Vector
    IntpVectorData = Int64VectorData
    kh_resize_intp = ky_resize_int64
    [... whatever else is needed for unique_label_indices]
elif np.dtype(np.intp) == np.dtype(np.int32):
    [... same as above but with int32]
else:
    raise ValueError(np.dtype(np.intp))

And the part I hadn't figured out would be figuring out intp at compile-time so these could all be made cdef

@realead realead changed the title [ENH] introducing IntpHashMap and unique_label_indices_intp [ENH] introducing IntpHashMap and making unique_label_indices use intp Mar 29, 2021
@realead
Copy link
Contributor Author

realead commented Mar 29, 2021

@jbrockmendel now unique_label_indices uses intp.

Not sure, the failures are due to my changes.

@jbrockmendel
Copy link
Member

failures look unrelated, yah

@jbrockmendel
Copy link
Member

tangential: would it be feasible to make a kh_base_t or something that all the other kh_foo_t subclass? I think we could de-duplicate a bunch of code if we could declare cdef readonly kh_base_t table on cdef class HashTable

@realead
Copy link
Contributor Author

realead commented Apr 14, 2021

tangential: would it be feasible to make a kh_base_t or something that all the other kh_foo_t subclass? I think we could de-duplicate a bunch of code if we could declare cdef readonly kh_base_t table on cdef class HashTable

I don't see, how this could be easily done. E.g. in c++ the typical solution is to use templates, i.e. std::map<K,T> and I haven't seen any other solutions really getting any kind of traction. Here we have a very similar situation - just using tempita rather than templates.

A different example are Java's Maps (at least prior to 1.5), but Java is somewhat different because everything is an Object and dynamic cast is pretty good supported by the language itself (and yet I always fellt c++ solution to be "clearer" and also faster).

Trying to use type erasion (e.g. casting key-pointer to void* and creating virtual table for kh_-function depending on type) would probably work but I doubt the code will be clearer and simpler/less error prone than now.

A slightly tangential remark: Because tempita is quite powerful, I think it would not be that easy to switch to fused types or at least some artificial tricks will be needed to achieve something similar without tempita (so the question is really whether the resulting code will be really easier). For me the only real downside of using tempita are the big resulting object files - there is not really a code duplication (because different code versions are created by tempita) as far as I can see.

Copy link
Member

@jbrockmendel jbrockmendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM cc @jreback

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 16, 2021
@simonjayhawkins
Copy link
Member

@realead can you merge master to resolve conflicts

@realead realead force-pushed the add_IntpHashTable branch 2 times, most recently from 833d1d1 to 625fa98 Compare May 28, 2021 05:42
ismember_intp = ismember_int32
mode_intp = mode_int32
unique_label_indices = _unique_label_indices_int64
unique_label_indices_intp = _unique_label_indices_int32
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value_count_intp = value_count_int32 & Co no longer needed, now there is just value_count which works as supposed to work.

@realead
Copy link
Contributor Author

realead commented May 28, 2021

@simonjayhawkins Done. I think the failing tests are unrelated.

@realead realead force-pushed the add_IntpHashTable branch from 625fa98 to 0cba866 Compare June 8, 2021 20:47
@realead
Copy link
Contributor Author

realead commented Jun 9, 2021

@jbrockmendel @jreback

Is this PR still needed/should be pursued further?

Copy link
Member

@jbrockmendel jbrockmendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@realead realead force-pushed the add_IntpHashTable branch from 0cba866 to 66c4727 Compare June 11, 2021 15:15
@jbrockmendel
Copy link
Member

you good here @jreback?

@jreback
Copy link
Contributor

jreback commented Jun 23, 2021

will look later

any resolution on the numpy pin?

@jbrockmendel
Copy link
Member

any resolution on the numpy pin?

simon has a PR that i think will be ready soonish to handle that. this PR should be unaffected

@jreback
Copy link
Contributor

jreback commented Oct 4, 2021

@realead is this still relevant?

@realead realead force-pushed the add_IntpHashTable branch from 66c4727 to 9a44401 Compare October 6, 2021 18:39
@realead
Copy link
Contributor Author

realead commented Oct 6, 2021

@jreback it was a request by @jbrockmendel (#40635 (comment)) not sure, whether it is still needed/wanted.

@jbrockmendel
Copy link
Member

not sure, whether it is still needed/wanted

the motivation hasn't changed: the one place where we use unique_label_indices we have ndarray[intp] and have to use ensure_int64 before calling unique_label_indices. So ideally unique_label_indices would take ndarray[intp] directly.

In practice, it's not that big of a deal since on most platforms ensure_int64 is a no-op. So there's a reasonable case to be made that it isn't worth the effort or build-size increase.

@jreback jreback added this to the 1.4 milestone Oct 7, 2021
@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Review Dtype Conversions Unexpected or buggy dtype conversions labels Oct 7, 2021
@jreback
Copy link
Contributor

jreback commented Oct 7, 2021

ok this seems reasonable for completeness.

@jreback jreback merged commit cd8930b into pandas-dev:master Oct 7, 2021
@jreback
Copy link
Contributor

jreback commented Oct 7, 2021

ok this seems reasonable for completeness.
thanks @realead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants