BUG: hashing datetime64 objects #50960


Status: Closed
64 commits (changes shown from 28 commits)
7761ecd
BUG: hashing datetime64 objects
jbrockmendel Jan 24, 2023
610b0c6
handle cases out of pydatetime bounds
jbrockmendel Jan 24, 2023
0e140ae
Merge branch 'main' into bug-hash-dt64
jbrockmendel Jan 24, 2023
919383c
Merge branch 'main' into bug-hash-dt64
jbrockmendel Jan 24, 2023
ae8c0bb
Merge branch 'main' into bug-hash-dt64
jbrockmendel Jan 25, 2023
92a39eb
troubleshoot CI builds
jbrockmendel Jan 25, 2023
2f67805
troubleshoot CI builds
jbrockmendel Jan 25, 2023
0635f86
troubleshoot CI builds
jbrockmendel Jan 25, 2023
229ab72
troubleshoot CI builds
jbrockmendel Jan 25, 2023
6e96805
troubleshoot CI builds
jbrockmendel Jan 25, 2023
24fda82
Merge branch 'main' into bug-hash-dt64
jbrockmendel Jan 26, 2023
058b666
suggested edits
jbrockmendel Jan 27, 2023
7398991
Merge branch 'main' into bug-hash-dt64
jbrockmendel Jan 31, 2023
3fdf564
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 8, 2023
6e4836e
use sebergs suggestion
jbrockmendel Feb 8, 2023
f55337a
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 9, 2023
a97dfc9
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 9, 2023
1338ca2
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 9, 2023
9fb1987
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 10, 2023
818682c
suggested edits
jbrockmendel Feb 10, 2023
74ab540
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 10, 2023
037ba05
remove unnecessary casts
jbrockmendel Feb 11, 2023
7a4b1ab
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 11, 2023
47e5247
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 11, 2023
dcd09dd
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 11, 2023
32d479b
Merge branch 'bug-hash-dt64' of github.com:jbrockmendel/pandas into b…
jbrockmendel Feb 11, 2023
d47cfd8
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 12, 2023
c091317
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 12, 2023
1ce791e
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 13, 2023
704fb69
suggested edit for PyDateTime_IMPORT
jbrockmendel Feb 13, 2023
6d962b0
Merge branch 'bug-hash-dt64' of github.com:jbrockmendel/pandas into b…
jbrockmendel Feb 13, 2023
f838953
revert delay
jbrockmendel Feb 13, 2023
0d8500a
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 13, 2023
58a29b6
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 14, 2023
95069e0
restore check
jbrockmendel Feb 14, 2023
7362f3e
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 14, 2023
dd08670
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 15, 2023
998a4cc
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 15, 2023
b75730b
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 17, 2023
5c57a5e
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 20, 2023
6b4460f
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 21, 2023
c94609b
add test
jbrockmendel Feb 22, 2023
afe9493
Merge branch 'main' into bug-hash-dt64
jbrockmendel Feb 26, 2023
4fecc97
Merge branch 'main' into bug-hash-dt64
jbrockmendel Mar 3, 2023
b4cc41e
Merge branch 'main' into bug-hash-dt64
jbrockmendel Mar 8, 2023
c620339
Merge branch 'main' into bug-hash-dt64
jbrockmendel Mar 9, 2023
3633653
Merge branch 'main' into bug-hash-dt64
jbrockmendel Mar 11, 2023
c55f182
shot in the dark
jbrockmendel Mar 11, 2023
6e2bbf0
Merge branch 'main' into bug-hash-dt64
jbrockmendel Mar 12, 2023
23c2826
capsule stuff
jbrockmendel Mar 12, 2023
143b3a3
guessing
jbrockmendel Mar 13, 2023
ffb8365
still tryin
jbrockmendel Mar 13, 2023
1fdfd64
Merge branch 'main' into bug-hash-dt64
jbrockmendel Mar 13, 2023
5513721
macro
jbrockmendel Mar 13, 2023
875d6af
revert sources
jbrockmendel Mar 13, 2023
a29a56a
Merge branch 'main' into bug-hash-dt64
jbrockmendel Mar 13, 2023
40e6e17
Move np_datetime64_object_hash to np_datetime.c
jbrockmendel Mar 13, 2023
15a701c
import_pandas_datetime more
jbrockmendel Mar 13, 2023
af25f40
troubleshoot
jbrockmendel Mar 13, 2023
9d5cb46
Merge branch 'main' into bug-hash-dt64
jbrockmendel Mar 14, 2023
d5a031d
Merge branch 'main' into bug-hash-dt64
jbrockmendel Mar 14, 2023
bd7d432
post-merge merges
jbrockmendel Mar 14, 2023
394d86e
frickin guess
jbrockmendel Mar 14, 2023
1766bc3
Merge branch 'main' into bug-hash-dt64
jbrockmendel Mar 20, 2023
52 changes: 49 additions & 3 deletions pandas/_libs/src/klib/khash_python.h
@@ -7,7 +7,10 @@
typedef npy_complex64 khcomplex64_t;
typedef npy_complex128 khcomplex128_t;

// get pandas_datetime_to_datetimestruct
#include <../../tslibs/src/datetime/np_datetime.h>

#include "datetime.h"

// khash should report usage to tracemalloc
#if PY_VERSION_HEX >= 0x03060000
@@ -305,6 +308,7 @@ khuint32_t PANDAS_INLINE kh_python_hash_func(PyObject* key);
#define _PandasHASH_XXROTATE(x) ((x << 13) | (x >> 19)) /* Rotate left 13 bits */
#endif


Py_hash_t PANDAS_INLINE tupleobject_hash(PyTupleObject* key) {
Py_ssize_t i, len = Py_SIZE(key);
PyObject **item = key->ob_item;
@@ -315,9 +319,7 @@ Py_hash_t PANDAS_INLINE tupleobject_hash(PyTupleObject* key) {
if (lane == (Py_uhash_t)-1) {
return -1;
}
acc += lane * _PandasHASH_XXPRIME_2;
acc = _PandasHASH_XXROTATE(acc);
acc *= _PandasHASH_XXPRIME_1;
acc = tuple_update_uhash(acc, lane);
Contributor:
Ah, missed that you removed this here. If we duplicate the code, maybe just don't change it here?

}

/* Add input length, mangled to keep the historical value of hash(()). */
@@ -330,6 +332,46 @@ Py_hash_t PANDAS_INLINE tupleobject_hash(PyTupleObject* key) {
}


// TODO: same thing for timedelta64 objects
Py_hash_t np_datetime64_object_hash(PyDatetimeScalarObject* key) {
Member:
I think you can move this to the capsule. It would fit much more naturally there than the khash header

Member:
The capsule also already does the PyDateTime_IMPORT, which I think will help your remaining issue

// GH#50690 numpy's hash implementation does not preserve comparability
// either across resolutions or with standard library objects.
// See also Timestamp.__hash__

NPY_DATETIMEUNIT unit = (NPY_DATETIMEUNIT)key->obmeta.base;
npy_datetime value = key->obval;
npy_datetimestruct dts;

if (value == NPY_DATETIME_NAT) {
// np.datetime64("NaT") in any reso
return NPY_DATETIME_NAT;
}

pandas_datetime_to_datetimestruct(value, unit, &dts);

if ((dts.year > 0) && (dts.year <= 9999) && (dts.ps == 0) && (dts.as == 0)) {
// we CAN cast to pydatetime, so use that hash to ensure we compare
// as matching standard library datetimes (and pd.Timestamps)
PyDateTime_IMPORT;
Member:
Is there a way to move this out of the function body? Maybe it can go into the global hashtable code instead?

Member Author:
That causes a complaint at build time. There is another use of PyDateTime_IMPORT in the JSON code that is also done at runtime.

Contributor:
That does not seem clean, to be honest. You are effectively doing a lookup of datetime.datetime_CAPI and unpacking the capsule.

This being in a header maybe makes things a bit tricky. The import can go at the top level of the Cython file, but not of a C file, and thus even less of a header. If this is only used from Cython, then you could just put it there (if it gets used without the import you get a segfault, which should be relatively clean to read in gdb/lldb, though of course not super easy; maybe add a comment on the PyDateTime_FromDateAndTime call so someone who finds a crash there knows why?).

If this was in a C file, you would need to add a datetime_init() function and make sure it gets called on import.

Long story short: This is used once, so probably move PyDateTime_IMPORT to cython top-level.

Member:
So does this not work when moved to the global hashtable code in the Cython file? Also agree with @seberg; I'd like to avoid using the macro here.

Member Author:
I'll defer to you guys on this, but I'm uncomfortable with having this C file have an implicit dependency on the higher-level Cython file.

Member Author:
Note there's also a comment in np_datetime.c saying it would be helpful to get PyDateTime_IMPORT to work in that file.

Contributor:
Yes, the implicit dependency is not nice, but I think that is just something to live with. You need some initialization, which needs to be a function call, and that cannot happen in a header file. I would just say: if you forget, you get a clean segfault pointing at the first use of datetime code in gdb/lldb, and we put a comment there explaining things.

Member:
Just some (hopefully helpful) context: the header file never gets "executed" in any sense, so putting the macro in a header file doesn't by itself run the code you've inserted.

The entry point for C extensions is a module initialization function like you see in the Python documentation - this gets executed when the Python interpreter loads the extension:

https://docs.python.org/3/extending/extending.html#the-module-s-method-table-and-initialization-function

I am assuming that the global namespace of the Cython file is akin to the module initialization; if that is true then it would be natural for the import to go there rather than the header.

With that said, the workaround suggested here is fine, but I don't think the dependency problem you were worried about is too big of a deal.

Contributor (@seberg, Feb 13, 2023):
Suggested change
- PyDateTime_IMPORT;
+ if (PyDateTimeAPI == NULL) {
+     /* delayed import, may be nice to move to import time */
+     PyDateTime_IMPORT;
+     if (PyDateTimeAPI == NULL) {
+         return -1;
+     }
+ }

I think this is decent.

Contributor:
Oops, added an error check, although PyDateTime_IMPORT probably can't fail here anyway (I am sure datetime has been imported somewhere by this point).

Member Author:
Pushed a commit with this (but not the NULL check) and it failed on the CI, so reverted it for now.

Contributor:
I had misspelled PyDateTimeAPI, with a lowercase t instead of a capital one.


PyObject* dt;
Py_hash_t hash;

dt = PyDateTime_FromDateAndTime(
dts.year, dts.month, dts.day, dts.hour, dts.min, dts.sec, dts.us
);
if (dt == NULL) {
return -1;
}
hash = PyObject_Hash(dt);
Py_DECREF(dt);
return hash;
}

return hash_datetime_from_struct(&dts);
}


khuint32_t PANDAS_INLINE kh_python_hash_func(PyObject* key) {
Py_hash_t hash;
// For PyObject_Hash holds:
@@ -351,6 +393,10 @@ khuint32_t PANDAS_INLINE kh_python_hash_func(PyObject* key) {
else if (PyTuple_CheckExact(key)) {
hash = tupleobject_hash((PyTupleObject*)key);
}
else if (PyObject_TypeCheck(key, &PyDatetimeArrType_Type)) {
// GH#50690
hash = np_datetime64_object_hash((PyDatetimeScalarObject *)key);
}
else {
hash = PyObject_Hash(key);
}
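The dispatch added to kh_python_hash_func above routes np.datetime64 keys through the new hash so that objects which compare equal also hash equal. A minimal Python sketch of that contract (the motivating bug in GH#50690), assuming only that numpy is installed:

```python
import numpy as np

# Two encodings of the same instant (1970-01-01T00:00:00.001) at
# different resolutions: they compare equal under numpy's rules.
ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")
assert ms == us

# Python's hashing contract: objects that compare equal must hash equal,
# or dict/set lookups silently miss. GH#50690 is about restoring this for
# datetime64 keys inside pandas' khash tables; whether the interpreter-level
# hash() values below agree depends on the numpy version installed, so we
# print them rather than assert equality.
print(hash(ms), hash(us))
```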
2 changes: 2 additions & 0 deletions pandas/_libs/tslibs/np_datetime.pxd
@@ -71,6 +71,8 @@ cdef extern from "src/datetime/np_datetime.h":
pandas_timedeltastruct *result
) nogil

Py_hash_t hash_datetime_from_struct(npy_datetimestruct* dts) except? -1

cdef bint cmp_scalar(int64_t lhs, int64_t rhs, int op) except -1

cdef check_dts_bounds(npy_datetimestruct *dts, NPY_DATETIMEUNIT unit=?)
53 changes: 53 additions & 0 deletions pandas/_libs/tslibs/src/datetime/np_datetime.c
@@ -1091,3 +1091,56 @@ PyArray_DatetimeMetaData
get_datetime_metadata_from_dtype(PyArray_Descr *dtype) {
return (((PyArray_DatetimeDTypeMetaData *)dtype->c_metadata)->meta);
}


// we could use any hashing algorithm, this is the original CPython's for tuples
Member:
We have these same defines in khash_python.h already. Not sure about duplicating here versus refactoring

Member Author:
Yeah, I moved some of it rather than duplicating; likely could go further down that path. I'd like to get this in soonish, so I'd prefer to do bigger refactors separately.

Member (@WillAyd, Feb 10, 2023):
I think with C it's generally really hard to track down the effects of copy/pasting defines across headers and implementations, though. Maybe we just create a cpython_hash.h file that khash and np can include?

Contributor:
Moving it into its own file probably seems nice. I wouldn't care too much, since this is only a problem if you were to include the other header here (and even then it might not be, since it matches).

On a general note, it might be good to add include guards to your headers:

#ifndef PD_LIBS_..._CPYTHON_HASH_H
#define PD_LIBS_..._CPYTHON_HASH_H

<header content>

#endif  /* PD_LIBS_..._CPYTHON_HASH_H */

(some such pattern; Google style suggests full paths, and at least a partial path would make sense, I think)


#if SIZEOF_PY_UHASH_T > 4
#define _PandasHASH_XXPRIME_1 ((Py_uhash_t)11400714785074694791ULL)
#define _PandasHASH_XXPRIME_2 ((Py_uhash_t)14029467366897019727ULL)
#define _PandasHASH_XXPRIME_5 ((Py_uhash_t)2870177450012600261ULL)
#define _PandasHASH_XXROTATE(x) ((x << 31) | (x >> 33)) /* Rotate left 31 bits */
#else
#define _PandasHASH_XXPRIME_1 ((Py_uhash_t)2654435761UL)
#define _PandasHASH_XXPRIME_2 ((Py_uhash_t)2246822519UL)
#define _PandasHASH_XXPRIME_5 ((Py_uhash_t)374761393UL)
#define _PandasHASH_XXROTATE(x) ((x << 13) | (x >> 19)) /* Rotate left 13 bits */
#endif


Py_uhash_t tuple_update_uhash(Py_uhash_t acc, Py_uhash_t lane) {
Contributor:
Suggested change
- Py_uhash_t tuple_update_uhash(Py_uhash_t acc, Py_uhash_t lane) {
+ static inline Py_uhash_t tuple_update_uhash(Py_uhash_t acc, Py_uhash_t lane) {

static inline should be the same as PANDAS_INLINE. inline used to not exist, but you can now rely on it, so PANDAS_INLINE can basically be blanket-replaced with static inline.

However, you can of course also just use static or PANDAS_INLINE here. You should add one of these (i.e. static should be there) to ensure that this is private to this file.

Member:
In other PRs I've asked us to move away from specifying inline explicitly, instead allowing the compiler to choose for us. Is that the wrong approach?

Contributor:
IIRC it's mainly a compiler hint, but I am not sure it matters in many places.

Member (@WillAyd, Feb 10, 2023):
Yeah, on second read this might be a place where it actually could inline. I thought it was using the Python runtime at first, but I see it doesn't now, so it might have a chance.

gcc has -Winline to tell you when the hint is ignored; it would be interesting to see that on this.

acc += lane * _PandasHASH_XXPRIME_2;
acc = _PandasHASH_XXROTATE(acc);
acc *= _PandasHASH_XXPRIME_1;
return acc;
}

// https://github.com/pandas-dev/pandas/pull/50960
Py_hash_t
hash_datetime_from_struct(npy_datetimestruct* dts) {
/*
* If we cannot cast to datetime, use the datetime struct values directly
* and mix them similar to a tuple.
*/

Py_uhash_t acc = _PandasHASH_XXPRIME_5;
#if 64 <= SIZEOF_PY_UHASH_T
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->year);
#else
/* Mix lower and upper bits of the year if int64 is larger */
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->year);
acc = tuple_update_uhash(acc, (Py_uhash_t)(dts->year >> SIZEOF_PY_UHASH_T));
#endif
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->month);
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->day);
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->min);
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->sec);
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->us);
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->ps);
acc = tuple_update_uhash(acc, (Py_uhash_t)dts->as);
Comment on lines +1080 to +1083
Contributor:
We could bunch everything below seconds (or even minutes or more?) into a single 64-bit number (then unfortunately needing the same trick to split it up where necessary on the platform).

Not sure that is worthwhile, probably a tiny bit faster, but I am mainly wondering if it might generalize a bit nicer if we use a simpler scheme that doesn't require the full datetime struct in principle.

/* should be no need to mix in the length, as it is fixed anyway */
if (acc == (Py_uhash_t)-1) {
acc = (Py_uhash_t)-2;
}
return acc;
}
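For readers following the C, here is a pure-Python mirror of the mixing scheme above, as an illustrative sketch only (not pandas API; the function names merely echo the C ones, and the field order follows this revision of the diff):

```python
# xxhash-style constants from CPython's tuple hash (64-bit variant),
# matching the _PandasHASH_XXPRIME_* defines above.
MASK = (1 << 64) - 1
PRIME_1 = 11400714785074694791
PRIME_2 = 14029467366897019727
PRIME_5 = 2870177450012600261

def _rotl(x: int, bits: int = 31) -> int:
    # Rotate left by 31 bits within 64-bit unsigned arithmetic.
    return ((x << bits) | (x >> (64 - bits))) & MASK

def tuple_update_uhash(acc: int, lane: int) -> int:
    # Mirrors the C: acc += lane * P2; acc = rotate(acc); acc *= P1.
    acc = (acc + lane * PRIME_2) & MASK
    acc = _rotl(acc)
    return (acc * PRIME_1) & MASK

def hash_datetime_from_struct(year, month, day, minute, sec, us, ps, as_):
    acc = PRIME_5
    # Lanes in the same order as the C revision above (which, as written,
    # does not mix in the hour field).
    for lane in (year, month, day, minute, sec, us, ps, as_):
        acc = tuple_update_uhash(acc, lane & MASK)
    # Py_hash_t reserves -1 as an error sentinel; remap it like the C does.
    if acc == MASK:  # unsigned view of (Py_uhash_t)-1
        acc = MASK - 1
    return acc
```

Because each update is a bijection of acc for a fixed lane (the primes are odd, so multiplication is invertible mod 2^64, and the rotation is invertible), distinct year values yield distinct hashes before the sentinel remap.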
3 changes: 3 additions & 0 deletions pandas/_libs/tslibs/src/datetime/np_datetime.h
@@ -99,4 +99,7 @@ PyArray_DatetimeMetaData get_datetime_metadata_from_dtype(
PyArray_Descr *dtype);


Py_hash_t hash_datetime_from_struct(npy_datetimestruct* dts);
Py_uhash_t tuple_update_uhash(Py_uhash_t acc, Py_uhash_t lane);
Contributor:
Suggested change
- Py_uhash_t tuple_update_uhash(Py_uhash_t acc, Py_uhash_t lane);

No need to make this public.


#endif // PANDAS__LIBS_TSLIBS_SRC_DATETIME_NP_DATETIME_H_
10 changes: 6 additions & 4 deletions pandas/_libs/tslibs/timestamps.pyx
@@ -88,6 +88,7 @@ from pandas._libs.tslibs.np_datetime cimport (
get_datetime64_unit,
get_datetime64_value,
get_unit_from_dtype,
hash_datetime_from_struct,
npy_datetimestruct,
npy_datetimestruct_to_datetime,
pandas_datetime_to_datetimestruct,
@@ -308,11 +309,12 @@ cdef class _Timestamp(ABCTimestamp):
# -----------------------------------------------------------------

def __hash__(_Timestamp self):
if self.nanosecond:
return hash(self._value)
if not (1 <= self.year <= 9999):
cdef:
npy_datetimestruct dts
if not (1 <= self.year <= 9999) or self.nanosecond:
# out of bounds for pydatetime
return hash(self._value)
pydatetime_to_dtstruct(self, &dts)
return hash_datetime_from_struct(&dts)
if self.fold:
return datetime.__hash__(self.replace(fold=0))
return datetime.__hash__(self)
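The __hash__ change above keeps the long-standing contract that an in-bounds, nanosecond-free Timestamp hashes like the equal stdlib datetime (Timestamp subclasses datetime), while nanosecond or out-of-bounds values fall back to hashing the integer epoch value. A quick sketch, assuming pandas is installed:

```python
from datetime import datetime

import pandas as pd

# In pydatetime bounds with no nanoseconds: defers to datetime.__hash__,
# so a Timestamp and an equal datetime are interchangeable dict keys.
ts = pd.Timestamp("2016-01-01 12:30:45")
dt = datetime(2016, 1, 1, 12, 30, 45)
assert ts == dt
assert hash(ts) == hash(dt)
d = {dt: "stdlib"}
assert d[ts] == "stdlib"  # lookup succeeds via the shared hash

# With a nanosecond component there is no equal pydatetime, so the hash
# falls back to the underlying integer value instead.
ts_ns = pd.Timestamp("2016-01-01 12:30:45.000000001")
assert ts_ns.nanosecond == 1
```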
26 changes: 26 additions & 0 deletions pandas/tests/indexes/object/test_indexing.py
@@ -187,3 +187,29 @@ def test_slice_locs_dup(self):
assert index2.slice_locs(end="a") == (0, 6)
assert index2.slice_locs("d", "b") == (0, 4)
assert index2.slice_locs("c", "a") == (2, 6)


def test_np_datetime64_objects():
# GH#50690
ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")

left = Index([ms], dtype=object)
right = Index([us], dtype=object)

assert left[0] in right
assert right[0] in left

assert left.get_loc(right[0]) == 0
assert right.get_loc(left[0]) == 0

# non-monotonic cases go through different paths in cython code
sec = np.datetime64("9999-01-01", "s")
day = np.datetime64("2016-01-01", "D")
left2 = Index([ms, sec, day], dtype=object)

expected = np.array([0], dtype=np.intp)
res = left2[:1].get_indexer(right)
tm.assert_numpy_array_equal(res, expected)
res = left2.get_indexer(right)
tm.assert_numpy_array_equal(res, expected)
22 changes: 17 additions & 5 deletions setup.py
@@ -442,7 +442,8 @@ def srcpath(name=None, suffix=".pyx", subdir="src"):
"_libs.algos": {
"pyxfile": "_libs/algos",
"include": klib_include,
"depends": _pxi_dep["algos"],
"depends": _pxi_dep["algos"] + tseries_depends,
"sources": ["pandas/_libs/tslibs/src/datetime/np_datetime.c"],
Member:
So the point of the capsule is that you don't specify sources any more. Since you are adding a new function here, you'll want to add it to the datetime capsule itself.

Member Author (@jbrockmendel, Mar 13, 2023):
I'm still struggling here with build failures I can't reproduce. Any other thoughts?

},
"_libs.arrays": {"pyxfile": "_libs/arrays"},
"_libs.groupby": {"pyxfile": "_libs/groupby"},
@@ -453,21 +454,30 @@ def srcpath(name=None, suffix=".pyx", subdir="src"):
"depends": (
["pandas/_libs/src/klib/khash_python.h", "pandas/_libs/src/klib/khash.h"]
+ _pxi_dep["hashtable"]
+ tseries_depends
),
"sources": ["pandas/_libs/tslibs/src/datetime/np_datetime.c"],
},
"_libs.index": {
"pyxfile": "_libs/index",
"include": klib_include,
"depends": _pxi_dep["index"],
"depends": _pxi_dep["index"] + tseries_depends,
"sources": ["pandas/_libs/tslibs/src/datetime/np_datetime.c"],
},
"_libs.indexing": {"pyxfile": "_libs/indexing"},
"_libs.internals": {"pyxfile": "_libs/internals"},
"_libs.interval": {
"pyxfile": "_libs/interval",
"include": klib_include,
"depends": _pxi_dep["interval"],
"depends": _pxi_dep["interval"] + tseries_depends,
"sources": ["pandas/_libs/tslibs/src/datetime/np_datetime.c"],
Member:
You shouldn't add any new sources arguments to setup.py; this is what causes undefined symbols during parallel compilation with setuptools.

Member Author:
Reverted that, and now I'm getting a different failure.

Member:
Yeah, I think you also need to move the implementation out of the header file into the capsule to resolve that.

},
"_libs.join": {
"pyxfile": "_libs/join",
"depends": tseries_depends,
"include": klib_include,
"sources": ["pandas/_libs/tslibs/src/datetime/np_datetime.c"],
},
"_libs.join": {"pyxfile": "_libs/join", "include": klib_include},
"_libs.lib": {
"pyxfile": "_libs/lib",
"depends": lib_depends + tseries_depends,
@@ -481,10 +491,12 @@ def srcpath(name=None, suffix=".pyx", subdir="src"):
"depends": [
"pandas/_libs/src/parser/tokenizer.h",
"pandas/_libs/src/parser/io.h",
],
]
+ tseries_depends,
"sources": [
"pandas/_libs/src/parser/tokenizer.c",
"pandas/_libs/src/parser/io.c",
"pandas/_libs/tslibs/src/datetime/np_datetime.c",
],
},
"_libs.reduction": {"pyxfile": "_libs/reduction"},