BUG: hashing datetime64 objects #50960
Conversation
pandas/_libs/src/klib/khash_python.h
Outdated
Py_hash_t PANDAS_INLINE np_datetime64_object_hash(PyObject* key) {
    // GH#50690 numpy's hash implementation does not preserve comparability
    // either across resolutions or with standard library objects.
    NPY_DATETIMEUNIT unit = (NPY_DATETIMEUNIT)((PyDatetimeScalarObject*)key)->obmeta.base;
I think we should tighten up the error handling instead of just doing a blind cast. From a quick glance at the numpy source I think they have functions like PyArray_IsScalar(key, Datetime) and PyArray_IsScalar(key, Timedelta).
this is copy-pasted from our get_datetime64_unit and get_datetime64_value. i haven't found a way to do a c/cython declaration for a np.datetime64 object other than PyObject*/object. open to suggestion.
Yea there's nothing wrong with the PyObject * type, but we are missing the runtime type introspection here and where you copied it from. Does the suggestion above not compile?
so start out with something like?

    if (!PyArray_IsScalar(key, Datetime)) {
        PyErr_Format(PyExc_ValueError, "Expected np.datetime64 object");
        return -1;
    }
Yea that's definitely headed in the right direction
You return -1 here, and that would work if you want to have runtime checking here (which I consider up to you; you could add an assert instead if you don't want it).
Yea still want to make sure we have a runtime check; internally our error handling is a mess with extensions so refactoring is a huge effort. Hoping to have a more consistent standard going forward
The PyDatetimeScalarObject cast here shouldn't be needed any more since you refactored the function signature. Does numpy provide a macro to access the unit? It would be even better if we could avoid any casting on our end.
On review of some numpy internals I don't even think you need a macro. Can blow away the casts altogether - if that doesn't work let me know and can take a closer look
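For reference, a minimal sketch of the cast-free access being described (the function name and fallback logic are illustrative, not the PR's actual code; field names come from numpy/arrayscalars.h):

    #include <Python.h>
    #include <numpy/arrayobject.h>
    #include <numpy/arrayscalars.h>

    /* sketch only: with a PyDatetimeScalarObject* signature, the unit and
     * value are plain member reads, no PyObject* casting needed */
    static Py_hash_t np_datetime64_unit_sketch(PyDatetimeScalarObject* key) {
        NPY_DATETIMEUNIT unit = key->obmeta.base;  /* already the right type */
        npy_datetime value = key->obval;           /* the raw int64 payload */
        (void)unit;
        /* tp_hash must not return -1 unless an error is set */
        return value == -1 ? -2 : (Py_hash_t)value;
    }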
pandas/_libs/src/klib/khash_python.h
Outdated
if ((dts.year > 0) && (dts.year <= 9999) && (dts.ps == 0) && (dts.as == 0)) {
    // we CAN cast to pydatetime, so use that hash to ensure we compare
    // as matching standard library datetimes (and pd.Timestamps)
    PyDateTime_IMPORT;
Is there a way to move this out of the function body? Maybe it can go into the global hashtable code instead?
that causes a complaint at build time. there is another use of PyDateTime_IMPORT in the json code that also happens at runtime
That does not seem clean to be honest. You are effectively doing a lookup of datetime.datetime_CAPI and unpacking the capsule.

This being in a header makes things a bit tricky maybe. The import can be at the top level of the Cython file, but not of a C file, and thus even less of a header. If this is only used in Cython, then you could just put it there (if it gets used without the import you get a segfault, which should be relatively clean to read in gdb/lldb, but of course not super easy; maybe add a comment on the PyDateTime_FromDateAndTime call so someone who finds a crash there knows why?).

If this was in a C file, you would need to add a datetime_init() function and make sure it gets called on import.

Long story short: this is used once, so probably move PyDateTime_IMPORT to the Cython top level.
So does this not work to move to the global hashtable code in the cython file? Also agree with @seberg; would like to avoid using the macro here.
I'll defer to you guys on this, but I'm uncomfortable with having this C file have an implicit dependency on the higher-level Cython file.
note there's also a comment in np_datetime.c saying it would be helpful to get PyDateTime_IMPORT to work in that file.
Yes, the implicit dependency is not nice, but I think that is just something to live with. You need some initialization, which needs to be a function call and that cannot happen in a header file. I would just say: If you forget, you get a clean segfault pointing at the first use of Datetime code in gdb/lldb. We put a comment there explaining things.
Just some (hopefully helpful) context - the header file never gets "executed" in any sense, so just putting the macro in a header file doesn't by itself just run the code you've inserted.
The entry point for C extensions is a module initialization function like you see in the Python documentation - this gets executed when the Python interpreter loads the extension:
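A minimal sketch of such an entry point (the module name "_example" is hypothetical; pandas' real init functions are generated by Cython):

    #define PY_SSIZE_T_CLEAN
    #include <Python.h>
    #include <datetime.h>

    static struct PyModuleDef example_module = {
        PyModuleDef_HEAD_INIT,
        "_example",  /* hypothetical module name */
        NULL,        /* no docstring */
        -1,          /* no per-module state */
        NULL         /* no method table */
    };

    PyMODINIT_FUNC PyInit__example(void) {
        /* runs once when the interpreter loads the extension; one-time
         * setup like PyDateTime_IMPORT naturally belongs here */
        PyDateTime_IMPORT;
        if (PyDateTimeAPI == NULL) {
            return NULL;  /* capsule import failed, exception already set */
        }
        return PyModule_Create(&example_module);
    }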
I am assuming that the global namespace of the Cython file is akin to the module initialization; if that is true then it would be natural for the import to go there rather than the header.
With that said the workaround suggested here is fine, but I don't think the dependency problem you were worried about is too big of a deal
Nice job @jbrockmendel - love to see you stepping into the C world

Updated to handle cases outside pydatetime bounds. The newly-handled cases are not yet tested.

@seberg thoughts on the approach here?
Some comments. I would like to suggest another scheme for hashing, but I don't have a clean thought and I suppose it can be improved...
If this is actually working on the object, then this is really patching over NumPy's hash?
Also curious, if you are working on a datetime column itself (no interaction between units or Python objects) then I guess you would continue to use the integer value directly?
pandas/_libs/src/klib/khash_python.h
Outdated
dt = PyDateTime_FromDateAndTime(
    dts.year, dts.month, dts.day, dts.hour, dts.min, dts.sec, dts.us
);
return PyObject_Hash(dt);
Missing the Py_DECREF(dt) here; probably best to write it as:

    hash = PyObject_Hash(dt);
    Py_DECREF(dt);
    return hash;

I would not add explicit error handling in this case as we are at the end of the function. The error we return from the function is -1, which is also the value you would guard against.
Actually, I now doubt that -1 is even reachable as a value here, since anything that might turn out as -1 would be handled by the Python hasher. So if you want to be pedantic and find out if this should be there, you'd have to return -1 explicitly, I guess.
pandas/_libs/src/klib/khash_python.h
Outdated
    return PyObject_Hash(dt);
}

return hash_datetime_value_and_reso(value, unit, &dts);
tp_hash is documented as "should not return -1"; I think this is imprecise and it must not. So if the value is -1 here, you should return -2.
Why not -1 with PyErr being set? AFAICT that is what the cpython docs say to do:
https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_hash
I'm a bit wary of adding -2 to our pattern for error handling
Yes, of course -1 must be returned with an error set. The point is that for hashing you must not return -1 if there wasn't an error. And the docs are a bit fuzzy, but Python assumes tp_hash ensures this.
Note:

    class A:
        def __hash__(self):
            return -1

    hash(A())
    # prints -2

Nothing is added to error handling; the point is you do not have to check PyErr_Occurred() on -1 when using this function.
(Of course you could possibly make a different choice if you do not want to use this ever for tp_hash, but -1 hashes are not really a thing in the Python world.)
pandas/_libs/src/klib/khash_python.h
Outdated
Py_hash_t PANDAS_INLINE np_datetime64_object_hash(PyObject* key) {
    // GH#50690 numpy's hash implementation does not preserve comparability
    // either across resolutions or with standard library objects.
    NPY_DATETIMEUNIT unit = (NPY_DATETIMEUNIT)((PyDatetimeScalarObject*)key)->obmeta.base;
You return -1 here, and that would work if you want to have runtime checking here (which I consider up to you; you could add an assert instead if you don't want it).
This is called by kh_python_hash_func in this same file. kh_python_hash_func is itself called from cython (but not other C files i think)

We aren't setting

This is only for object-dtype.
@seberg any thoughts on how to move forward here? this is a blocker for some other non-nano work, so we're likely to do something and I'd much prefer it if you're on board.
Oh, didn't realize it was hanging on that. I would prefer a scheme that doesn't iterate (and avoids modulo). Maybe even using the struct's values and a typical hashing algorithm, like what a tuple does.

If you like that idea, I can give you a function that would get a hash from that (should be about 30-50 lines of C/cython code, stolen from Python; I have done that before internally in NumPy). From what it sounds, it shouldn't matter too much if we find a better scheme in the future?

For the scalars (also in NumPy) it also seems like it might be nice if we would just cache things like the hash and even the full struct (if they exist), since memory bloat is likely not a huge issue for scalars?

Besides that, I suspect my only comment left is that moving the IMPORT out would be good. I can try to do that for you, or you just decide to keep it until a time someone fixes it up...
That sounds great, thanks.
OK, here is a larger diff that should do the trick I think. I also included moving the datetime import. I do think it would probably make sense to munch together e.g.:

diff --git a/pandas/_libs/khash_for_primitive_helper.pxi.in b/pandas/_libs/khash_for_primitive_helper.pxi.in
index d0934b3e0e..56743530e9 100644
--- a/pandas/_libs/khash_for_primitive_helper.pxi.in
+++ b/pandas/_libs/khash_for_primitive_helper.pxi.in
@@ -4,6 +4,11 @@ Template for wrapping khash-tables for each primitive `dtype`
WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in
"""
+# Hashing datetimes requires the datetime API be imported, this ensures it:
+from cpython.datetime cimport import_datetime
+
+import_datetime()
+
{{py:
# name, c_type
diff --git a/pandas/_libs/src/klib/khash_python.h b/pandas/_libs/src/klib/khash_python.h
index bd3011b9b1..e6f519b7a0 100644
--- a/pandas/_libs/src/klib/khash_python.h
+++ b/pandas/_libs/src/klib/khash_python.h
@@ -308,6 +308,15 @@ khuint32_t PANDAS_INLINE kh_python_hash_func(PyObject* key);
#define _PandasHASH_XXROTATE(x) ((x << 13) | (x >> 19)) /* Rotate left 13 bits */
#endif
+Py_uhash_t PANDAS_INLINE tuple_update_uhash(Py_uhash_t acc, Py_uhash_t lane)
+{
+ acc += lane * _PandasHASH_XXPRIME_2;
+ acc = _PandasHASH_XXROTATE(acc);
+ acc *= _PandasHASH_XXPRIME_1;
+ return acc;
+}
+
+
Py_hash_t PANDAS_INLINE tupleobject_hash(PyTupleObject* key) {
Py_ssize_t i, len = Py_SIZE(key);
PyObject **item = key->ob_item;
@@ -318,9 +327,7 @@ Py_hash_t PANDAS_INLINE tupleobject_hash(PyTupleObject* key) {
if (lane == (Py_uhash_t)-1) {
return -1;
}
- acc += lane * _PandasHASH_XXPRIME_2;
- acc = _PandasHASH_XXROTATE(acc);
- acc *= _PandasHASH_XXPRIME_1;
+ acc = tuple_update_uhash(acc, lane);
}
/* Add input length, mangled to keep the historical value of hash(()). */
@@ -333,98 +340,32 @@ Py_hash_t PANDAS_INLINE tupleobject_hash(PyTupleObject* key) {
}
-// TODO: if nanos is most common/important, might be most performant to
-// make that canonical and others cast to that?
-Py_hash_t PANDAS_INLINE hash_datetime_value_and_reso(npy_datetime value, NPY_DATETIMEUNIT unit, npy_datetimestruct* dts) {
- // If we cannot cast to pydatetime, then the question is if there are
- // other-resolution datetime64 objects that we might be equal to whose
- // hashes we need to match. We let year-reso objects return value, and make
- // higher-resolution cases responsible for checking of they match.
- if (unit == NPY_FR_Y) {
- if (value == -1) {
- // https://github.com/pandas-dev/pandas/pull/50960#discussion_r1088695136
- return -2;
- }
- return value;
- }
- else if (unit == NPY_FR_M) {
- if ((value % 12) == 0) {
- return hash_datetime_value_and_reso(value / 12, NPY_FR_Y, dts);
- }
- return value;
- }
- else if (unit == NPY_FR_W) {
- if (dts->day == 1) {
- value = (dts->year - 1970) * 12 + dts->month;
- return hash_datetime_value_and_reso(value, NPY_FR_M, dts);
- }
- return value;
- }
- else if (unit == NPY_FR_D) {
- if ((value % 7) == 0) {
- return hash_datetime_value_and_reso(value / 7, NPY_FR_W, dts);
- }
- return value;
- }
- else if (unit == NPY_FR_h) {
- if ((value % 24) == 0) {
- return hash_datetime_value_and_reso(value / 24, NPY_FR_D, dts);
- }
- return value;
- }
- else if (unit == NPY_FR_m) {
- if ((value % 60) == 0) {
- return hash_datetime_value_and_reso(value / 60, NPY_FR_h, dts);
- }
- return value;
- }
- else if (unit == NPY_FR_s) {
- if ((value % 60) == 0) {
- return hash_datetime_value_and_reso(value / 60, NPY_FR_m, dts);
- }
- return value;
- }
- else if (unit == NPY_FR_ms) {
- if ((value % 1000) == 0) {
- return hash_datetime_value_and_reso(value / 1000, NPY_FR_s, dts);
- }
- return value;
- }
- else if (unit == NPY_FR_us) {
- if ((value % 1000) == 0) {
- return hash_datetime_value_and_reso(value / 1000, NPY_FR_ns, dts);
- }
- return value;
- }
- else if (unit == NPY_FR_ns) {
- if ((value % 1000) == 0) {
- return hash_datetime_value_and_reso(value / 1000, NPY_FR_us, dts);
- }
- return value;
- }
- else if (unit == NPY_FR_ps) {
- if ((value % 1000) == 0) {
- return hash_datetime_value_and_reso(value / 1000, NPY_FR_ns, dts);
- }
- return value;
- }
- else if (unit == NPY_FR_fs) {
- if ((value % 1000) == 0) {
- return hash_datetime_value_and_reso(value / 1000, NPY_FR_ps, dts);
- }
- return value;
- }
- else if (unit == NPY_FR_as) {
- if ((value % 1000) == 0) {
- return hash_datetime_value_and_reso(value / 1000, NPY_FR_fs, dts);
- }
- return value;
- }
- else {
- // i.e. NPY_FR_GENERIC
- // we default to treating these like nanos
- return hash_datetime_value_and_reso(value, NPY_FR_ns, dts);
+Py_hash_t PANDAS_INLINE hash_datetime_from_struct(npy_datetimestruct* dts) {
+ /*
+ * If we cannot cast to datetime, use the datetime struct values directly
+ * and mix them similar to a tuple.
+ */
+
+ Py_uhash_t acc = _PandasHASH_XXPRIME_5;
+#if 64 <= SIZEOF_PY_UHASH_T
+ acc = tuple_update_uhash(acc, (Py_uhash_t)dts->year);
+#else
+ /* Mix lower and upper bits of the year if int64 is larger */
+ acc = tuple_update_uhash(acc, (Py_uhash_t)dts->year);
+ acc = tuple_update_uhash(acc, (Py_uhash_t)(dts->year >> SIZEOF_PY_UHASH_T));
+#endif
+ acc = tuple_update_uhash(acc, (Py_uhash_t)dts->month);
+ acc = tuple_update_uhash(acc, (Py_uhash_t)dts->day);
+ acc = tuple_update_uhash(acc, (Py_uhash_t)dts->min);
+ acc = tuple_update_uhash(acc, (Py_uhash_t)dts->sec);
+ acc = tuple_update_uhash(acc, (Py_uhash_t)dts->us);
+ acc = tuple_update_uhash(acc, (Py_uhash_t)dts->ps);
+ acc = tuple_update_uhash(acc, (Py_uhash_t)dts->as);
+ /* should be no need to mix length, as it is fixed anyway? */
+ if (acc == (Py_uhash_t)-1) {
+ acc = (Py_uhash_t)-2;
}
+ return acc;
}
@@ -448,11 +389,13 @@ Py_hash_t np_datetime64_object_hash(PyObject* key) {
if ((dts.year > 0) && (dts.year <= 9999) && (dts.ps == 0) && (dts.as == 0)) {
// we CAN cast to pydatetime, so use that hash to ensure we compare
// as matching standard library datetimes (and pd.Timestamps)
- PyDateTime_IMPORT;
-
PyObject* dt;
Py_hash_t hash;
+ /*
+ * Datetime API is needed: User must ensure `PyDateTime_IMPORT;` or
+ * `from cpython.datetime cimport import_datetime; import_datetime()`
+ */
dt = PyDateTime_FromDateAndTime(
dts.year, dts.month, dts.day, dts.hour, dts.min, dts.sec, dts.us
);
@@ -464,7 +407,7 @@ Py_hash_t np_datetime64_object_hash(PyObject* key) {
return hash;
}
- return hash_datetime_value_and_reso(value, unit, &dts);
+ return hash_datetime_from_struct(&dts);
}
pandas/_libs/src/klib/khash_python.h
Outdated
@@ -330,6 +332,52 @@ Py_hash_t PANDAS_INLINE tupleobject_hash(PyTupleObject* key) {
}

// TODO: same thing for timedelta64 objects
Py_hash_t np_datetime64_object_hash(PyDatetimeScalarObject* key) {
I think you can move this to the capsule. It would fit much more naturally there than the khash header
The capsule also already does the PyDateTime_IMPORT, which I think will help your remaining issue
},
"_libs.indexing": {"pyxfile": "_libs/indexing"},
"_libs.internals": {"pyxfile": "_libs/internals"},
"_libs.interval": {
    "pyxfile": "_libs/interval",
    "include": klib_include,
-   "depends": _pxi_dep["interval"],
+   "depends": _pxi_dep["interval"] + tseries_depends,
+   "sources": ["pandas/_libs/tslibs/src/datetime/np_datetime.c"],
You shouldn't add any new sources arguments to setup.py - this is what causes undefined symbols during parallel compilation with setuptools
reverted that and now I'm getting a different failure
Yea I think you also need to move the implementation out of the header file into the capsule to resolve that
I think your final issue is going to be the lack of a PandasDateTime_IMPORT
timestamps.pyx has an import_pandas_datetime(). I thought that was equivalent to PandasDateTime_IMPORT?
Ah OK sorry I misread that. So the problem is that you are updating khash_python.h and introducing some new symbol names. To hack it I think you need to update the remaining files that don't have access to those symbols to do the pandas_datetime_import (ex: algos.pyx, hashtable.pyx, etc...). Basically anywhere in the logs where you see this:

    In file included from pandas/_libs/src/klib/khash_python.h:13,
                     from pandas/_libs/algos.c:802:
    /home/runner/micromamba/envs/test/include/python3.8/datetime.h:189:25: error: 'PyDateTimeAPI' defined but not used [-Werror=unused-variable]
      189 | static PyDateTime_CAPI *PyDateTimeAPI = NULL;

You can try to hack around that, but I think it would also be better if we figure out a way to restructure things so that the headers and modules aren't mixing up symbol requirements like they are now. Is khash_python.h vendored?
I'm pretty sure khash.h was vendored, not sure about khash_python.h. I'm increasingly making "grumpy old man" mutterings about this capsule thing.
It has less to do with the capsule and more to do with project structure. Putting a lot of the datetime stuff into the khash headers intertwines link-time dependencies in a non-obvious way. Before, you would have to update the sources argument in the setup script. The analogy now is invoking the capsule import.
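(A rough sketch of what that consumer-side import looks like; the struct layout and capsule path below are assumptions for illustration, not pandas' exact definitions.)

    #include <Python.h>

    /* illustrative layout; the real struct lives in pandas' pd_datetime.h */
    typedef struct {
        Py_hash_t (*np_datetime64_object_hash)(PyObject *);
    } PandasDateTime_CAPI;

    static PandasDateTime_CAPI *PandasDateTimeAPI = NULL;

    /* each consuming module runs this once, analogous to PyDateTime_IMPORT;
     * the capsule attribute path is an assumed name */
    #define PandasDateTime_IMPORT                                       \
        (PandasDateTimeAPI = (PandasDateTime_CAPI *)PyCapsule_Import(   \
            "pandas._libs.pandas_datetime._pandas_datetime_CAPI", 0))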
    PandasDateTimeAPI->hash_datetime_from_struct((dts))
#define np_datetime64_object_hash(dts) \
    PandasDateTimeAPI->np_datetime64_object_hash((key))
#define tuple_update_uhash(acc, lane) \
OK, need to be a bit careful here. Remember that C has no concept of namespaces: if you define tuple_update_uhash here and also in khash.h, then whatever code gets inserted by the pre-processor will depend upon whichever one of those files was last included, which can be very confusing.

With the capsule, the function that you ultimately include as part of the PandasDateTime_CAPI struct should be defined as static within the pd_datetime.c module, and the macro which resolves to that should have a unique name.
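(A hedged sketch of that structure; every name here is illustrative rather than pandas' actual API. The worker stays static inside the defining module, the capsule struct carries the pointer, and the macro gets a unique prefix so it cannot collide with anything khash.h defines. The constants are CPython's 64-bit xxHash tuple primes.)

    #include <Python.h>

    typedef struct {
        Py_uhash_t (*tuple_update_uhash)(Py_uhash_t, Py_uhash_t);
    } PandasHash_CAPI;  /* illustrative struct name */

    /* pd_datetime.c: the single real definition, invisible to other modules */
    static Py_uhash_t capi_tuple_update_uhash(Py_uhash_t acc, Py_uhash_t lane) {
        acc += lane * (Py_uhash_t)14029467366897019727ULL;  /* xxHash prime 2 */
        acc = (acc << 31) | (acc >> 33);  /* rotate left 31; assumes 64-bit */
        return acc * (Py_uhash_t)11400714785074694791ULL;   /* xxHash prime 1 */
    }

    static PandasHash_CAPI pandas_hash_capi = { capi_tuple_update_uhash };

    /* exported once at module init, e.g.
     *   PyCapsule_New(&pandas_hash_capi, "pandas._libs.example_CAPI", NULL);
     * consumers hold a pointer and reach the worker only through a
     * uniquely named macro, so no other header's names are shadowed */
    static PandasHash_CAPI *PandasHashAPI = NULL;
    #define pandas_tuple_update_uhash(acc, lane) \
        (PandasHashAPI->tuple_update_uhash((acc), (lane)))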
tuple_update_uhash is accessed in khash_python.h, but is only defined in np_datetime.c AFAICT
Header files are loosely just copied into whatever they are included into. So if you are doing #include "khash_python.h" with one declaration of tuple_update_uhash and then doing #include "pd_datetime.h" which has a different notion of what tuple_update_uhash is, your target ultimately gets two different ideas of what tuple_update_uhash means. That's basically what the current build failures are telling you.
How do they have a different notion of what tuple_update_uhash is? It's only defined in one place.
Opened #51951 to revert the capsuling.
You have a declaration for tuple_update_uhash in khash_python.h, then try to define it via a macro in pd_datetime.h.

I understand it is frustrating, but the implementation here is not correct. Have you checked out any books on C programming? K&R is a nice book for a quick intro; might be worth reading through that or another book before coming back to this.
Ah sorry, not khash_python; looks like np_datetime
OK, so if I remove it from np_datetime.h I get a build failure locally: error: implicit declaration of function 'tuple_update_uhash' is invalid in C99
That's right. The way the includes happen and the way you have things structured now, you are either providing conflicting declarations of tuple_update_uhash for files that aren't pd_datetime.c OR you are providing no declaration for pd_datetime.c. You would probably be much better served if you had your hashing functions defined as static functions within the capsule rather than your current structure.

I really think you'd be better off getting an intro to C book and learning a bit more about what headers/modules do with respect to the different compilation phases. Of course it's always good to get hands-on experience, but doing some of the basics first before trying to dive in would probably make the process easier.
While I appreciate your encouragement, I have a long reading list and dayjob responsibilities. Can you just tell me how to fix the [expletive deleted] thing that was working a week ago?
Yes. Move your hashing functions to the capsule as static functions
@seberg think we can short-circuit the stuck-in-the-mud-ness here by upstreaming?
I have started here for NumPy; it's really mostly copying the code from here: https://github.com/seberg/numpy/pull/new/datetime-hash (admittedly, I also updated the void hash too, since the Python tuple hash changed over time). Would it be easy to figure out the timedelta hash as well?
Yes. The approach taken by pd.Timedelta is basically the same:
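(For reference, a hedged sketch of that rule, not pandas' literal __hash__ code; it assumes a nanosecond-resolution value and that PyDateTime_IMPORT has already run.)

    #include <Python.h>
    #include <datetime.h>

    /* if the value round-trips to datetime.timedelta (whole microseconds),
     * reuse that object's hash so equal objects hash equal; otherwise fall
     * back to the integer value itself */
    static Py_hash_t timedelta64_ns_hash_sketch(long long value_ns) {
        if (value_ns % 1000 == 0) {
            /* overflow handling for very large values omitted for brevity */
            PyObject *td = PyDelta_FromDSU(0, 0, (int)(value_ns / 1000));
            if (td == NULL) {
                return -1;
            }
            Py_hash_t h = PyObject_Hash(td);
            Py_DECREF(td);
            return h;
        }
        /* tp_hash must not return -1 unless an error is set */
        return value_ns == -1 ? -2 : (Py_hash_t)value_ns;
    }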
@seberg is the upstream analogue of this PR likely to go in? i.e. is it worth the effort to try to get this back on track?

I don't really see a showstopper, since we had more or less agreement on the hashing scheme and I also think it can be changed in principle. (There was an old PR with a different scheme, but I am not sure if it petered out on review or was never quite polished up.)

Closing in favor of numpy/numpy#24408
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

xref #50690

this handles only cases where the relevant dt64 objects are within the pydatetime bounds. to handle the remaining cases we need to port astype_overflowsafe or something like it.

update: Now handles all cases, but only has tests for a couple within-pydatetime cases.

cc @seberg ideally we'll port this to numpy and make this datetime64.__hash__

cc @WillAyd