-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: mask support for hastable functions for indexing #48396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -413,6 +416,10 @@ cdef class {{name}}HashTable(HashTable): | |||
cdef: | |||
khiter_t k | |||
{{c_type}} ckey | |||
|
|||
if self.uses_mask and checknull(key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it matter what type of null we're looking at?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends if we want to support multiple nulls in masked arrays. Currently it does not:
idx = Index([1, 2, pd.NA, pd.NA], dtype="Int64")
idx.get_loc(None)
idx.get_loc(np.nan)
Both match pd.NA. Same for Float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None and np.nan we usually treat as interchangeable with pd.NA. here im thinking more of pd.NaT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah got you. This is interesting currently:
idx.get_loc(pd.NaT)
matches both NAs, but
Index([1, 2, pd.NA, pd.NA, pd.NaT], dtype="Int64")
raises. Any suggestions here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_valid_na_for_dtype exists for this purpose. For other dtypes we do this in FooIndex.get_loc. Might require some gymnastics in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I would be ok with pushing responsibility to the caller. Have to do this for set_item
and get_item
anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
definitely reasonable, pls be sure to document this assumption
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment. I'll probably add some documentation in general when we are through with a MaskedEngine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is generally ready? (just added a few tiny comments)
And then actually using it is for a follow-up PR?
@@ -396,23 +396,32 @@ dtypes = [('Complex128', 'complex128', 'khcomplex128_t', 'to_khcomplex128_t'), | |||
|
|||
cdef class {{name}}HashTable(HashTable): | |||
|
|||
def __cinit__(self, int64_t size_hint=1): | |||
def __cinit__(self, int64_t size_hint=1, bint uses_mask=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this also be reflected in the pyi file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, added
@@ -434,30 +443,49 @@ cdef class {{name}}HashTable(HashTable): | |||
'upper_bound' : self.table.upper_bound, | |||
} | |||
|
|||
cpdef get_item(self, {{dtype}}_t val): | |||
cpdef get_item(self, {{dtype}}_t val, bint na_value = False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would something like val_is_na
be a more descriptive keyword name? (typically in other places where we have na_value
, it is the value itself, not a bool flag)
(while you are at it, a brief docstring could also help to describe the keywords)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo val_is_na is a bit confusing, because val
is actually a placeholder and therefore garbage at this place. It is more like get_item of na_value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is_na_value
or get_na_value
?
(I mostly would like to avoid using the same keyword name that is used elsewhere for a different meaning)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although with get_na_value
, it is giving some duplication (get_item(.., get_na_value=True)
But this is also pointing to that it's not a very clean API, with basically two separate behaviours (value
being garbarge if the other keyword is set). It's not too important given this is deep into the internals, but maybe a clearer interface is to just have two separate methods instead of the keyword in the single method: get_item(value)
and get_na()
.
Since the caller needs to handle this specifically in the level up anyway to check for NA and set the keyword, you might as well just call a different method instead of passing a keyword.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored into own methods, should be significantly easier to understand now
pandas/tests/libs/test_hashtable.py
Outdated
|
||
def test_get_set_contains_len_mask(self, table_type, dtype): | ||
if table_type == ht.PyObjectHashTable: | ||
pytest.skip("Mask not supporter for object") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pytest.skip("Mask not supporter for object") | |
pytest.skip("Mask not supported for object") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
Yes this is ready. Would add the actual Engine as a follow up |
Thanks @phofl (can add follow up PRs if needed) |
* ENH: mask support for hastable functions for indexing * Fix mypy * Adjust test * Add comment * Add docstring * Refactor into own functions * Fix typing
@@ -435,30 +444,73 @@ cdef class {{name}}HashTable(HashTable): | |||
} | |||
|
|||
cpdef get_item(self, {{dtype}}_t val): | |||
"""Extracts the position of val from the hashtable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the return be typed as int Py_ssize_t or something? same question for get_na below
* ENH: mask support for hastable functions for indexing * Fix mypy * Adjust test * Add comment * Add docstring * Refactor into own functions * Fix typing
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.This is a first step towards creating a MaskedEngine to optimise performance in these cases. This supports masks in the HashTable.
Did not impact performance of non masked cases.
cc @jorisvandenbossche
As a precursor to a MaskedEngine we have to adjust the HashTable methods to support masked na values. They keep track of seen na values separately. I think this makes more sense than keeping track of the NAs on the Engine level
map_locations
andlookup
are normally called when the values are unique or used to check if the values are unique. Nevertheless, if they have duplicate values, they always return the index of the last occurrence of a specific value. This is the same inget_item
andset_item
. We have to adjust the length of the HashTable, because the length is used to check if the values were unique.