-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Parameterize NA sentinel for hashtables #20328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Not 100% certain but I believe you can define an optional argument for your sentinel in the
Then adjust the existing comparison to do something like
|
yeah maybe we could just make a parameter for this, rather than |
Just doing |
I don't think an object is possible to hit that branch because it's not explicitly listed as one of the dtypes - do you have an example of where that is happening?
|
If you need to add support for objects in your template you could replace the pandas/pandas/_libs/groupby_helper.pxi.in Line 329 in 3783ccc
|
It'd be a parameter passed by the user when the hashtable is created. I'm probably just not familiar enough with cython to describe what's going on. I want something roughly like this diff --git a/pandas/_libs/hashtable_class_helper.pxi.in b/pandas/_libs/hashtable_class_helper.pxi.in
index bca4e388f..58df8f8b7 100644
--- a/pandas/_libs/hashtable_class_helper.pxi.in
+++ b/pandas/_libs/hashtable_class_helper.pxi.in
@@ -308,8 +308,9 @@ def get_dispatch(dtypes):
cdef class {{name}}HashTable(HashTable):
- def __cinit__(self, size_hint=1):
+ def __cinit__(self, size_hint=1, {{dtype}}_t na_value=None):
self.table = kh_init_{{dtype}}()
+ self.na_value = na_value
if size_hint is not None:
kh_resize_{{dtype}}(self.table, size_hint)
@@ -414,18 +415,25 @@ cdef class {{name}}HashTable(HashTable):
int64_t[:] labels
Py_ssize_t idx, count = count_prior
int ret = 0
- {{dtype}}_t val
+ {{dtype}}_t val, na_value
khiter_t k
{{name}}VectorData *ud
+ bint use_na_value
labels = np.empty(n, dtype=np.int64)
ud = uniques.data
+ na_value = self.na_value
+ use_na_value = self.na_value != None
with nogil:
for i in range(n):
val = values[i]
- if check_null and {{null_condition}}:
+ if use_na_value and val == na_value:
+ labels[i] = na_sentinel
+ continue
+
+ elif check_null and {{null_condition}}:
labels[i] = na_sentinel
continue
diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py
index de2e63826..825d17b2c 100644
--- a/pandas/core/algorithms.py
+++ b/pandas/core/algorithms.py
@@ -435,7 +435,7 @@ def isin(comps, values):
return f(comps, values)
-def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
+def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None, na_value=None):
"""Factorize an array-like to labels and uniques.
This doesn't do any coercion of types or unboxing before factorization.
@@ -455,7 +455,7 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
"""
(hash_klass, vec_klass), values = _get_data_algo(values, _hashtables)
- table = hash_klass(size_hint or len(values))
+ table = hash_klass(size_hint or len(values), na_value=na_value)
uniques = vec_klass()
labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
@@ -465,7 +465,7 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
@deprecate_kwarg(old_arg_name='order', new_arg_name=None)
-def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
+def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None, na_value=None):
"""
Encode input values as an enumerated type or categorical variable
@@ -511,7 +511,7 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
check_nulls = not is_integer_dtype(original)
labels, uniques = _factorize_array(values, check_nulls,
na_sentinel=na_sentinel,
- size_hint=size_hint)
+ size_hint=size_hint, na_value=na_value)
if sort and len(uniques) > 0:
from pandas.core.sorting import safe_sort This compiles, but doesn't run as Unless... at the python level, the default is None, we check if the user specifies it, and pass through an additional flag telling the hashtable, |
Just loosely I think if you changed the templating around a bit to only list the
I think you could then just change your conditional to look something like this:
Bypassing the need for trying to track the state of the comparison you need to perform in |
I'll take a look, thanks. Just looking at it though (haven't tested) would that work for the current Int64 of iNaT (min int)? With that we'd get |
Sorry that condition was to check for non-null, so you'd want the inverse of that to find null. There may also be a more compact way to go about it |
xref #19938. There's I'm abusing the fact that the Int64HashTable class treats iNaT as a missing value. It'd be cleaner to just pass the Categorical's codes with a flag saying that
-1
is the missing value marker.IIUC, the missing value condition is baked into the class definition.
I'm not familiar enough with Cython to know how that could be made a parameter. Possibly an optional scalar value. If that value is not None check if
val == na_sentinel
. Else, use the condition from the class definition?The text was updated successfully, but these errors were encountered: