Skip to content

Parameterize NA sentinel for hashtables #20328

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue Mar 13, 2018 · 9 comments · Fixed by #20473
Closed

Parameterize NA sentinel for hashtables #20328

TomAugspurger opened this issue Mar 13, 2018 · 9 comments · Fixed by #20473
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@TomAugspurger
Copy link
Contributor

xref #19938. There's I'm abusing the fact that the Int64HashTable class treats iNaT as a missing value. It'd be cleaner to just pass the Categorical's codes with a flag saying that -1 is the missing value marker.

IIUC, the missing value condition is baked into the class definition.

{{py:

# name, dtype, null_condition, float_group
dtypes = [('Float64', 'float64', 'val != val', True),
          ('UInt64', 'uint64', 'False', False),
          ('Int64', 'int64', 'val == iNaT', False)]

I'm not familiar enough with Cython to know how that could be made a parameter. Possibly an optional scalar value. If that value is not None check if val == na_sentinel. Else, use the condition from the class definition?

@TomAugspurger TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Mar 13, 2018
@WillAyd
Copy link
Member

WillAyd commented Mar 14, 2018

Not 100% certain but I believe you can define an optional argument for your sentinel in the __cinit__

def __cinit__(self, size_hint=1):

Then adjust the existing comparison to do something like if check_null and (val == self.sentinel or {{null_condition}}):

if check_null and {{null_condition}}:

@jreback
Copy link
Contributor

jreback commented Mar 14, 2018

yeah maybe we could just make a parameter for this, rather than check_null, and default it in the template (so allow it to be overriden)

@TomAugspurger
Copy link
Contributor Author

Just doing if val == self.na_value doesn't quite work, since na_value is a (potentially?) a Python object, and this is all in a no_gil block.

@WillAyd
Copy link
Member

WillAyd commented Mar 23, 2018

I don't think an object is possible to hit that branch because it's not explicitly listed as one of the dtypes - do you have an example of where that is happening?

dtypes = [('Float64', 'float64', 'val != val', True),

@WillAyd
Copy link
Member

WillAyd commented Mar 23, 2018

If you need to add support for objects in your template you could replace the nogil block with if True: - I've done this a few times in groupby_helper.pxi.in:

if True: # make templating happy

@TomAugspurger
Copy link
Contributor Author

I don't think an object is possible to hit that branch because it's not explicitly listed as one of the dtypes

It'd be a parameter passed by the user when the hashtable is created.

I'm probably just not familiar enough with cython to describe what's going on. I want something roughly like this

diff --git a/pandas/_libs/hashtable_class_helper.pxi.in b/pandas/_libs/hashtable_class_helper.pxi.in
index bca4e388f..58df8f8b7 100644
--- a/pandas/_libs/hashtable_class_helper.pxi.in
+++ b/pandas/_libs/hashtable_class_helper.pxi.in
@@ -308,8 +308,9 @@ def get_dispatch(dtypes):
 
 cdef class {{name}}HashTable(HashTable):
 
-    def __cinit__(self, size_hint=1):
+    def __cinit__(self, size_hint=1, {{dtype}}_t na_value=None):
         self.table = kh_init_{{dtype}}()
+        self.na_value = na_value
         if size_hint is not None:
             kh_resize_{{dtype}}(self.table, size_hint)
 
@@ -414,18 +415,25 @@ cdef class {{name}}HashTable(HashTable):
             int64_t[:] labels
             Py_ssize_t idx, count = count_prior
             int ret = 0
-            {{dtype}}_t val
+            {{dtype}}_t val, na_value
             khiter_t k
             {{name}}VectorData *ud
+            bint use_na_value
 
         labels = np.empty(n, dtype=np.int64)
         ud = uniques.data
+        na_value = self.na_value
+        use_na_value = self.na_value != None
 
         with nogil:
             for i in range(n):
                 val = values[i]
 
-                if check_null and {{null_condition}}:
+                if use_na_value and val == na_value:
+                    labels[i] = na_sentinel
+                    continue
+
+                elif check_null and {{null_condition}}:
                     labels[i] = na_sentinel
                     continue
 
diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py
index de2e63826..825d17b2c 100644
--- a/pandas/core/algorithms.py
+++ b/pandas/core/algorithms.py
@@ -435,7 +435,7 @@ def isin(comps, values):
     return f(comps, values)
 
 
-def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
+def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None, na_value=None):
     """Factorize an array-like to labels and uniques.
 
     This doesn't do any coercion of types or unboxing before factorization.
@@ -455,7 +455,7 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
     """
     (hash_klass, vec_klass), values = _get_data_algo(values, _hashtables)
 
-    table = hash_klass(size_hint or len(values))
+    table = hash_klass(size_hint or len(values), na_value=na_value)
     uniques = vec_klass()
     labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
 
@@ -465,7 +465,7 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
 
 
 @deprecate_kwarg(old_arg_name='order', new_arg_name=None)
-def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
+def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None, na_value=None):
     """
     Encode input values as an enumerated type or categorical variable
 
@@ -511,7 +511,7 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
         check_nulls = not is_integer_dtype(original)
         labels, uniques = _factorize_array(values, check_nulls,
                                            na_sentinel=na_sentinel,
-                                           size_hint=size_hint)
+                                           size_hint=size_hint, na_value=na_value)
 
     if sort and len(uniques) > 0:
         from pandas.core.sorting import safe_sort

This compiles, but doesn't run as float64_t na_value=None isn't valid, since None isn't a float. Is there a way around that with Cython? Otherwise, we'll need type-specific NA values (NaN, iNaT, something). But then we won't have a way to disable that.

Unless... at the python level, the default is None, we check if the user specifies it, and pass through an additional flag telling the hashtable, use_na_value. That should work

@WillAyd
Copy link
Member

WillAyd commented Mar 23, 2018

Just loosely I think if you changed the templating around a bit to only list the nan_value in the dtypes list instead of the full equality comparison (check the groupby_helper for ref) you could change your constructor signature to be:

def __cinit__(self, size_hint=1, {{dtype}}_t na_value={{nan_val}}):
    ...
    self.na_value = na_value

I think you could then just change your conditional to look something like this:

if val == val and val != na_value:

Bypassing the need for trying to track the state of the comparison you need to perform in use_na_value

@TomAugspurger
Copy link
Contributor Author

I'll take a look, thanks.

Just looking at it though (haven't tested) would that work for the current Int64 of iNaT (min int)? With that we'd get iNaT == iNaT and iNaT != iNaT, which will always be False...

@WillAyd
Copy link
Member

WillAyd commented Mar 23, 2018

Sorry that condition was to check for non-null, so you'd want the inverse of that to find null. There may also be a more compact way to go about it

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Mar 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants