Parameterize NA sentinel for hashtables #20328

TomAugspurger · 2018-03-13T12:25:00Z

xref #19938. There's I'm abusing the fact that the Int64HashTable class treats iNaT as a missing value. It'd be cleaner to just pass the Categorical's codes with a flag saying that -1 is the missing value marker.

IIUC, the missing value condition is baked into the class definition.

{{py:

# name, dtype, null_condition, float_group
dtypes = [('Float64', 'float64', 'val != val', True),
          ('UInt64', 'uint64', 'False', False),
          ('Int64', 'int64', 'val == iNaT', False)]

I'm not familiar enough with Cython to know how that could be made a parameter. Possibly an optional scalar value. If that value is not None check if val == na_sentinel. Else, use the condition from the class definition?

The text was updated successfully, but these errors were encountered:

WillAyd · 2018-03-14T07:26:35Z

Not 100% certain but I believe you can define an optional argument for your sentinel in the __cinit__

pandas/pandas/_libs/hashtable_class_helper.pxi.in

Line 311 in 3783ccc

def __cinit__(self, size_hint=1):

Then adjust the existing comparison to do something like if check_null and (val == self.sentinel or {{null_condition}}):

pandas/pandas/_libs/hashtable_class_helper.pxi.in

Line 428 in 3783ccc

if check_null and {{null_condition}}:

jreback · 2018-03-14T10:20:38Z

yeah maybe we could just make a parameter for this, rather than check_null, and default it in the template (so allow it to be overriden)

TomAugspurger · 2018-03-23T14:33:40Z

Just doing if val == self.na_value doesn't quite work, since na_value is a (potentially?) a Python object, and this is all in a no_gil block.

WillAyd · 2018-03-23T15:10:24Z

I don't think an object is possible to hit that branch because it's not explicitly listed as one of the dtypes - do you have an example of where that is happening?

pandas/pandas/_libs/hashtable_class_helper.pxi.in

Line 254 in 3783ccc

dtypes = [('Float64', 'float64', 'val != val', True),

WillAyd · 2018-03-23T15:18:02Z

If you need to add support for objects in your template you could replace the nogil block with if True: - I've done this a few times in groupby_helper.pxi.in:

pandas/pandas/_libs/groupby_helper.pxi.in

Line 329 in 3783ccc

if True: # make templating happy

TomAugspurger · 2018-03-23T15:48:11Z

I don't think an object is possible to hit that branch because it's not explicitly listed as one of the dtypes

It'd be a parameter passed by the user when the hashtable is created.

I'm probably just not familiar enough with cython to describe what's going on. I want something roughly like this

diff --git a/pandas/_libs/hashtable_class_helper.pxi.in b/pandas/_libs/hashtable_class_helper.pxi.in
index bca4e388f..58df8f8b7 100644
--- a/pandas/_libs/hashtable_class_helper.pxi.in
+++ b/pandas/_libs/hashtable_class_helper.pxi.in
@@ -308,8 +308,9 @@ def get_dispatch(dtypes):
 
 cdef class {{name}}HashTable(HashTable):
 
-    def __cinit__(self, size_hint=1):
+    def __cinit__(self, size_hint=1, {{dtype}}_t na_value=None):
         self.table = kh_init_{{dtype}}()
+        self.na_value = na_value
         if size_hint is not None:
             kh_resize_{{dtype}}(self.table, size_hint)
 
@@ -414,18 +415,25 @@ cdef class {{name}}HashTable(HashTable):
             int64_t[:] labels
             Py_ssize_t idx, count = count_prior
             int ret = 0
-            {{dtype}}_t val
+            {{dtype}}_t val, na_value
             khiter_t k
             {{name}}VectorData *ud
+            bint use_na_value
 
         labels = np.empty(n, dtype=np.int64)
         ud = uniques.data
+        na_value = self.na_value
+        use_na_value = self.na_value != None
 
         with nogil:
             for i in range(n):
                 val = values[i]
 
-                if check_null and {{null_condition}}:
+                if use_na_value and val == na_value:
+                    labels[i] = na_sentinel
+                    continue
+
+                elif check_null and {{null_condition}}:
                     labels[i] = na_sentinel
                     continue
 
diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py
index de2e63826..825d17b2c 100644
--- a/pandas/core/algorithms.py
+++ b/pandas/core/algorithms.py
@@ -435,7 +435,7 @@ def isin(comps, values):
     return f(comps, values)
 
 
-def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
+def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None, na_value=None):
     """Factorize an array-like to labels and uniques.
 
     This doesn't do any coercion of types or unboxing before factorization.
@@ -455,7 +455,7 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
     """
     (hash_klass, vec_klass), values = _get_data_algo(values, _hashtables)
 
-    table = hash_klass(size_hint or len(values))
+    table = hash_klass(size_hint or len(values), na_value=na_value)
     uniques = vec_klass()
     labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
 
@@ -465,7 +465,7 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
 
 
 @deprecate_kwarg(old_arg_name='order', new_arg_name=None)
-def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
+def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None, na_value=None):
     """
     Encode input values as an enumerated type or categorical variable
 
@@ -511,7 +511,7 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
         check_nulls = not is_integer_dtype(original)
         labels, uniques = _factorize_array(values, check_nulls,
                                            na_sentinel=na_sentinel,
-                                           size_hint=size_hint)
+                                           size_hint=size_hint, na_value=na_value)
 
     if sort and len(uniques) > 0:
         from pandas.core.sorting import safe_sort

This compiles, but doesn't run as float64_t na_value=None isn't valid, since None isn't a float. Is there a way around that with Cython? Otherwise, we'll need type-specific NA values (NaN, iNaT, something). But then we won't have a way to disable that.

Unless... at the python level, the default is None, we check if the user specifies it, and pass through an additional flag telling the hashtable, use_na_value. That should work

WillAyd · 2018-03-23T16:03:21Z

Just loosely I think if you changed the templating around a bit to only list the nan_value in the dtypes list instead of the full equality comparison (check the groupby_helper for ref) you could change your constructor signature to be:

def __cinit__(self, size_hint=1, {{dtype}}_t na_value={{nan_val}}):
    ...
    self.na_value = na_value

I think you could then just change your conditional to look something like this:

if val == val and val != na_value:

Bypassing the need for trying to track the state of the comparison you need to perform in use_na_value

TomAugspurger · 2018-03-23T16:08:29Z

I'll take a look, thanks.

Just looking at it though (haven't tested) would that work for the current Int64 of iNaT (min int)? With that we'd get iNaT == iNaT and iNaT != iNaT, which will always be False...

WillAyd · 2018-03-23T16:27:10Z

Sorry that condition was to check for non-null, so you'd want the inverse of that to find null. There may also be a more compact way to go about it

TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Mar 13, 2018

TomAugspurger mentioned this issue Mar 13, 2018

REF/BUG/API: factorizing categorical data #19938

Merged

jreback added Difficulty Intermediate and removed Difficulty Intermediate labels Mar 14, 2018

jreback added this to the Next Major Release milestone Mar 14, 2018

TomAugspurger mentioned this issue Mar 23, 2018

ENH/API: ExtensionArray.factorize #20361

Merged

TomAugspurger mentioned this issue Mar 23, 2018

Parametrized NA sentinel for factorize #20473

Merged

jreback modified the milestones: Next Major Release, 0.23.0 Mar 27, 2018

jreback closed this as completed in #20473 Mar 27, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parameterize NA sentinel for hashtables #20328

Parameterize NA sentinel for hashtables #20328

TomAugspurger commented Mar 13, 2018

WillAyd commented Mar 14, 2018 •

edited

Loading

jreback commented Mar 14, 2018

TomAugspurger commented Mar 23, 2018

WillAyd commented Mar 23, 2018

WillAyd commented Mar 23, 2018

TomAugspurger commented Mar 23, 2018

WillAyd commented Mar 23, 2018 •

edited

Loading

TomAugspurger commented Mar 23, 2018

WillAyd commented Mar 23, 2018

Parameterize NA sentinel for hashtables #20328

Parameterize NA sentinel for hashtables #20328

Comments

TomAugspurger commented Mar 13, 2018

WillAyd commented Mar 14, 2018 • edited Loading

jreback commented Mar 14, 2018

TomAugspurger commented Mar 23, 2018

WillAyd commented Mar 23, 2018

WillAyd commented Mar 23, 2018

TomAugspurger commented Mar 23, 2018

WillAyd commented Mar 23, 2018 • edited Loading

TomAugspurger commented Mar 23, 2018

WillAyd commented Mar 23, 2018

WillAyd commented Mar 14, 2018 •

edited

Loading

WillAyd commented Mar 23, 2018 •

edited

Loading