ENH: mask support for hastable functions for indexing #48396

phofl · 2022-09-05T14:30:30Z

closes #xxxx (Replace xxxx with the Github issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This is a first step towards creating a MaskedEngine to optimise performance in these cases. It adds mask support to the HashTable.
It did not impact the performance of non-masked cases.

cc @jorisvandenbossche

As a precursor to a MaskedEngine, we have to adjust the HashTable methods to support masked NA values. They keep track of seen NA values separately. I think this makes more sense than keeping track of the NAs on the Engine level.

map_locations and lookup are normally called when the values are unique or used to check if the values are unique. Nevertheless, if they have duplicate values, they always return the index of the last occurrence of a specific value. This is the same in get_item and set_item. We have to adjust the length of the HashTable, because the length is used to check if the values were unique.
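The behaviour described above can be sketched in plain Python. This is only an illustration, not the Cython implementation in this PR: the class name `MaskedTableSketch` and the attribute `_na_position` are hypothetical, and a `dict` stands in for the khash table. It shows NA being tracked separately from the hashed values, the last occurrence winning for duplicates, and the length counting NA as one extra entry so that a length comparison can still signal uniqueness.

```python
from typing import Dict, Hashable, Optional, Sequence


class MaskedTableSketch:
    """Toy stand-in for a mask-aware hash table (hypothetical, not pandas code)."""

    def __init__(self) -> None:
        self._table: Dict[Hashable, int] = {}
        # Position of the last masked (NA) entry, or None if no NA was seen.
        self._na_position: Optional[int] = None

    def map_locations(self, values: Sequence[Hashable], mask: Sequence[bool]) -> None:
        for position, (value, is_na) in enumerate(zip(values, mask)):
            if is_na:
                # The value under a masked slot is only a placeholder; ignore it.
                self._na_position = position
            else:
                # Later duplicates overwrite earlier ones: last occurrence wins.
                self._table[value] = position

    def __len__(self) -> int:
        # Count NA as one extra distinct entry, so that comparing
        # len(table) with len(values) still detects duplicates.
        return len(self._table) + (self._na_position is not None)


values = [1, 2, 0, 2, 0]          # zeros at masked positions are placeholders
mask = [False, False, True, False, True]

table = MaskedTableSketch()
table.map_locations(values, mask)
is_unique = len(table) == len(values)
```

Here the table ends up with three distinct entries (1, 2 and NA) for five input values, so `is_unique` is `False`.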

jbrockmendel · 2022-09-08T20:55:29Z

pandas/_libs/hashtable_class_helper.pxi.in

@@ -413,6 +416,10 @@ cdef class {{name}}HashTable(HashTable):
        cdef:
            khiter_t k
            {{c_type}} ckey
+
+        if self.uses_mask and checknull(key):


does it matter what type of null we're looking at?

Depends if we want to support multiple nulls in masked arrays. Currently it does not:

idx = Index([1, 2, pd.NA, pd.NA], dtype="Int64")
idx.get_loc(None)
idx.get_loc(np.nan)

Both match pd.NA. Same for Float64

None and np.nan we usually treat as interchangeable with pd.NA. Here I'm thinking more of pd.NaT.

Ah got you. This is interesting currently:

idx.get_loc(pd.NaT)

matches both NAs, but

Index([1, 2, pd.NA, pd.NA, pd.NaT], dtype="Int64")

raises. Any suggestions here?

is_valid_na_for_dtype exists for this purpose. For other dtypes we do this in FooIndex.get_loc. Might require some gymnastics in this case

In general I would be ok with pushing responsibility to the caller. Have to do this for set_item and get_item anyway

definitely reasonable, pls be sure to document this assumption

Added a comment. I'll probably add some documentation in general when we are through with a MaskedEngine
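The assumption agreed on in this thread (the caller decides which null-like keys count as NA before the table is consulted) might look roughly like the sketch below. Everything here is hypothetical: `_NaTLike` is a stand-in for a datetime-like null such as pd.NaT, and `is_valid_na_for_masked_int` is an illustrative helper, not pandas' `is_valid_na_for_dtype`.

```python
import math
from typing import Dict, Hashable, Optional


class _NaTLike:
    """Hypothetical stand-in for a datetime-like null such as pd.NaT."""


NAT_LIKE = _NaTLike()


def is_valid_na_for_masked_int(key: object) -> bool:
    """Return True for nulls this sketch treats as interchangeable with NA."""
    if key is None:
        return True
    return isinstance(key, float) and math.isnan(key)


def lookup(positions: Dict[Hashable, int], na_position: Optional[int], key: object) -> int:
    """Caller-side lookup: validate the null type, then pick the right path."""
    if key is NAT_LIKE:
        # A datetime-like null is not a valid NA for an integer dtype here.
        raise KeyError(key)
    if is_valid_na_for_masked_int(key):
        if na_position is None:
            raise KeyError(key)
        return na_position
    return positions[key]


positions = {1: 0, 2: 1}
na_position = 3
loc_none = lookup(positions, na_position, None)
loc_nan = lookup(positions, na_position, float("nan"))
```

With this split, the table itself never inspects the kind of null it is given; it only answers "where is NA stored?" once the caller has decided the key is an acceptable NA.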

jorisvandenbossche

This is generally ready? (just added a few tiny comments)
And then actually using it is for a follow-up PR?

jorisvandenbossche · 2022-09-26T11:56:12Z

pandas/_libs/hashtable_class_helper.pxi.in

@@ -396,23 +396,32 @@ dtypes = [('Complex128', 'complex128', 'khcomplex128_t', 'to_khcomplex128_t'),

 cdef class {{name}}HashTable(HashTable):

-    def __cinit__(self, int64_t size_hint=1):
+    def __cinit__(self, int64_t size_hint=1, bint uses_mask=False):


Should this also be reflected in the pyi file?

jorisvandenbossche · 2022-09-26T12:05:39Z

pandas/_libs/hashtable_class_helper.pxi.in

@@ -434,30 +443,49 @@ cdef class {{name}}HashTable(HashTable):
            'upper_bound' : self.table.upper_bound,
        }

-    cpdef get_item(self, {{dtype}}_t val):
+    cpdef get_item(self, {{dtype}}_t val, bint na_value = False):


Would something like val_is_na be a more descriptive keyword name? (typically in other places where we have na_value, it is the value itself, not a bool flag)

(while you are at it, a brief docstring could also help to describe the keywords)

imo val_is_na is a bit confusing, because val is actually a placeholder and therefore garbage at this place. It is more like get_item of na_value

Or is_na_value or get_na_value?
(I mostly would like to avoid using the same keyword name that is used elsewhere for a different meaning)

Although with get_na_value, it is giving some duplication (get_item(.., get_na_value=True)).

But this also points to it not being a very clean API, with basically two separate behaviours (the value being garbage if the other keyword is set). It's not too important given this is deep in the internals, but maybe a clearer interface is to just have two separate methods instead of the keyword in the single method: get_item(value) and get_na().
Since the caller needs to handle this specifically one level up anyway to check for NA and set the keyword, you might as well just call a different method instead of passing a keyword.

Refactored into own methods, should be significantly easier to understand now
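A rough Python sketch of the two-method interface discussed above follows. It is an illustration only: `TwoMethodTableSketch` and `set_na` are names assumed for this example, a `dict` replaces the real hash table, and `-1` is an assumed sentinel for "no NA stored".

```python
from typing import Dict, Hashable


class TwoMethodTableSketch:
    """Hypothetical table with separate value and NA accessors."""

    def __init__(self) -> None:
        self._table: Dict[Hashable, int] = {}
        self._na_position: int = -1  # -1 means "no NA stored" in this sketch

    def set_item(self, key: Hashable, position: int) -> None:
        self._table[key] = position

    def set_na(self, position: int) -> None:
        self._na_position = position

    def get_item(self, key: Hashable) -> int:
        # Raises KeyError if the key is absent.
        return self._table[key]

    def get_na(self) -> int:
        # No placeholder value is needed: the method name carries the intent.
        if self._na_position == -1:
            raise KeyError("NA")
        return self._na_position


table = TwoMethodTableSketch()
table.set_item(10, 0)
table.set_na(1)
found_value = table.get_item(10)
found_na = table.get_na()
```

Compared with a boolean keyword on `get_item`, the caller never has to pass a meaningless value just to ask for the NA position.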

jorisvandenbossche · 2022-09-26T12:07:10Z

pandas/tests/libs/test_hashtable.py

+
+    def test_get_set_contains_len_mask(self, table_type, dtype):
+        if table_type == ht.PyObjectHashTable:
+            pytest.skip("Mask not supporter for object")


Suggested change

-            pytest.skip("Mask not supporter for object")
+            pytest.skip("Mask not supported for object")

phofl · 2022-09-29T08:27:04Z

Yes this is ready. Would add the actual Engine as a follow up

mroeschke · 2022-10-20T16:33:56Z

Thanks @phofl (can add follow up PRs if needed)

* ENH: mask support for hastable functions for indexing
* Fix mypy
* Adjust test
* Add comment
* Add docstring
* Refactor into own functions
* Fix typing

jbrockmendel · 2022-10-25T16:25:48Z

pandas/_libs/hashtable_class_helper.pxi.in

@@ -435,30 +444,73 @@ cdef class {{name}}HashTable(HashTable):
        }

    cpdef get_item(self, {{dtype}}_t val):
+        """Extracts the position of val from the hashtable.


Can the return be typed as int, Py_ssize_t, or something similar? Same question for get_na below.

phofl added 3 commits September 5, 2022 16:23

ENH: mask support for hastable functions for indexing

a7f1abf

Fix mypy

b216f6c

Adjust test

f276f6f

phofl marked this pull request as draft September 5, 2022 14:30

phofl marked this pull request as ready for review September 5, 2022 15:49

Merge remote-tracking branch 'upstream/main' into masked_engine_hashtables

82f8a40

phofl added the Indexing and NA - MaskedArrays labels Sep 5, 2022

jbrockmendel reviewed Sep 8, 2022

phofl added 3 commits September 9, 2022 21:45

Add comment

f2c271b

Merge remote-tracking branch 'upstream/main' into masked_engine_hashtables

b4b263e

Merge remote-tracking branch 'upstream/main' into masked_engine_hashtables

69c6407

jorisvandenbossche reviewed Sep 26, 2022

Add docstring

d08a7ae

phofl added 5 commits September 29, 2022 10:27

Merge remote-tracking branch 'upstream/main' into masked_engine_hashtables

44d0797

Merge remote-tracking branch 'upstream/main' into masked_engine_hashtables

7bb6914

Refactor into own functions

4670624

Merge remote-tracking branch 'upstream/main' into masked_engine_hashtables

3861add

Fix typing

4642852

mroeschke added this to the 2.0 milestone Oct 18, 2022

mroeschke approved these changes Oct 18, 2022

Merge branch 'main' into masked_engine_hashtables

0a0ad44

mroeschke merged commit f271445 into pandas-dev:main Oct 20, 2022

phofl deleted the masked_engine_hashtables branch October 23, 2022 17:39

jbrockmendel reviewed Oct 25, 2022
