Skip to content

ENH: mask support for hastable functions for indexing #48396

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Oct 20, 2022

Conversation

phofl
Copy link
Member

@phofl phofl commented Sep 5, 2022

  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This is a first step towards creating a MaskedEngine to optimise performance in these cases. This supports masks in the HashTable.
Did not impact performance of non masked cases.

cc @jorisvandenbossche

As a precursor to a MaskedEngine we have to adjust the HashTable methods to support masked na values. They keep track of seen na values separately. I think this makes more sense than keeping track of the NAs on the Engine level

map_locations and lookup are normally called when the values are unique or used to check if the values are unique. Nevertheless, if they have duplicate values, they always return the index of the last occurrence of a specific value. This is the same in get_item and set_item. We have to adjust the length of the HashTable, because the length is used to check if the values were unique.

@phofl phofl marked this pull request as draft September 5, 2022 14:30
@phofl phofl marked this pull request as ready for review September 5, 2022 15:49
@phofl phofl added Indexing Related to indexing on series/frames, not to indexes themselves NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Sep 5, 2022
@@ -413,6 +416,10 @@ cdef class {{name}}HashTable(HashTable):
cdef:
khiter_t k
{{c_type}} ckey

if self.uses_mask and checknull(key):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it matter what type of null we're looking at?

Copy link
Member Author

@phofl phofl Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends if we want to support multiple nulls in masked arrays. Currently it does not:

idx = Index([1, 2, pd.NA, pd.NA], dtype="Int64")
idx.get_loc(None)
idx.get_loc(np.nan)

Both match pd.NA. Same for Float64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None and np.nan we usually treat as interchangeable with pd.NA. here im thinking more of pd.NaT

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah got you. This is interesting currently:

idx.get_loc(pd.NaT)

matches both NAs, but

Index([1, 2, pd.NA, pd.NA, pd.NaT], dtype="Int64")

raises. Any suggestions here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_valid_na_for_dtype exists for this purpose. For other dtypes we do this in FooIndex.get_loc. Might require some gymnastics in this case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I would be ok with pushing responsibility to the caller. Have to do this for set_item and get_item anyway

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely reasonable, pls be sure to document this assumption

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment. I'll probably add some documentation in general when we are through with a MaskedEngine

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is generally ready? (just added a few tiny comments)
And then actually using it is for a follow-up PR?

@@ -396,23 +396,32 @@ dtypes = [('Complex128', 'complex128', 'khcomplex128_t', 'to_khcomplex128_t'),

cdef class {{name}}HashTable(HashTable):

def __cinit__(self, int64_t size_hint=1):
def __cinit__(self, int64_t size_hint=1, bint uses_mask=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be reflected in the pyi file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, added

@@ -434,30 +443,49 @@ cdef class {{name}}HashTable(HashTable):
'upper_bound' : self.table.upper_bound,
}

cpdef get_item(self, {{dtype}}_t val):
cpdef get_item(self, {{dtype}}_t val, bint na_value = False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would something like val_is_na be a more descriptive keyword name? (typically in other places where we have na_value, it is the value itself, not a bool flag)

(while you are at it, a brief docstring could also help to describe the keywords)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo val_is_na is a bit confusing, because val is actually a placeholder and therefore garbage at this place. It is more like get_item of na_value

Copy link
Member

@jorisvandenbossche jorisvandenbossche Oct 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is_na_value or get_na_value?
(I mostly would like to avoid using the same keyword name that is used elsewhere for a different meaning)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although with get_na_value, it is giving some duplication (get_item(.., get_na_value=True)

But this is also pointing to that it's not a very clean API, with basically two separate behaviours (value being garbarge if the other keyword is set). It's not too important given this is deep into the internals, but maybe a clearer interface is to just have two separate methods instead of the keyword in the single method: get_item(value) and get_na().
Since the caller needs to handle this specifically in the level up anyway to check for NA and set the keyword, you might as well just call a different method instead of passing a keyword.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored into own methods, should be significantly easier to understand now


def test_get_set_contains_len_mask(self, table_type, dtype):
if table_type == ht.PyObjectHashTable:
pytest.skip("Mask not supporter for object")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pytest.skip("Mask not supporter for object")
pytest.skip("Mask not supported for object")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

@phofl
Copy link
Member Author

phofl commented Sep 29, 2022

Yes this is ready. Would add the actual Engine as a follow up

@mroeschke mroeschke added this to the 2.0 milestone Oct 18, 2022
@mroeschke mroeschke merged commit f271445 into pandas-dev:main Oct 20, 2022
@mroeschke
Copy link
Member

Thanks @phofl (can add follow up PRs if needed)

phofl added a commit to phofl/pandas that referenced this pull request Oct 21, 2022
* ENH: mask support for hastable functions for indexing

* Fix mypy

* Adjust test

* Add comment

* Add docstring

* Refactor into own functions

* Fix typing
@phofl phofl deleted the masked_engine_hashtables branch October 23, 2022 17:39
@@ -435,30 +444,73 @@ cdef class {{name}}HashTable(HashTable):
}

cpdef get_item(self, {{dtype}}_t val):
"""Extracts the position of val from the hashtable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the return be typed as int Py_ssize_t or something? same question for get_na below

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* ENH: mask support for hastable functions for indexing

* Fix mypy

* Adjust test

* Add comment

* Add docstring

* Refactor into own functions

* Fix typing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants