API: Timestamp(pydatetime) microsecond reso #49034


Merged: 41 commits merged into pandas-dev:main on Nov 30, 2022

Conversation

jbrockmendel
Member

  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

I've got one JSON test still failing locally; cc @WillAyd, see the comment in the ujson file. Suggestions?

@WillAyd
Member

WillAyd commented Oct 11, 2022

I think the get_long_attr function currently in the JSON code implicitly expects nanosecond resolution, which would be incorrect and would mangle the subsequent functions that convert to the unit of the encoder. I think the easiest approach would be to update get_long_attr to read the unit of the object and convert to nanoseconds as required.

@WillAyd
Member

WillAyd commented Oct 11, 2022

Also wanted to add... this is awesome. Very impressive work.

@mroeschke added the Non-Nano (datetime64/timedelta64 with non-nanosecond resolution) and Timestamp (pd.Timestamp and associated methods) labels on Oct 11, 2022
@jbrockmendel
Member Author

update the get_long_attr function

I tried adding

PyObject *nano_obj = PyObject_CallMethod(o, "_as_unit", "(s)", "ns");

and am getting a segfault. Is there a better way to do this?

@WillAyd
Member

WillAyd commented Oct 11, 2022

I think you want

PyObject *nano_obj = PyObject_CallMethod(o, "_as_unit", "s", "ns");

The (s) would mean you are building a tuple.

If you get stuck from there, I would suggest running Python with gdb attached to it via gdb python, then doing something like run -m pytest --lf (or running whatever script gives you the segfault).

Assuming you built the extensions with debugging symbols enabled, your debugger will drop you right at the point where the segfault occurs and let you inspect the state of things. If the above doesn't do it alone, my guess is that the function deals with other types of objects than just Timestamps. Ideally we would do a type check... but we aren't doing one at the moment, so calling that method on anything but a Timestamp could also cause a segfault.
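
A minimal sketch of that kind of guard (illustrative only; the error message and the NULL-return convention here are assumptions, not code from this PR):

// Sketch: refuse non-Timestamp-like inputs before calling _as_unit,
// so a bad object raises a Python error instead of segfaulting
if (!PyObject_HasAttrString(o, "_as_unit")) {
    PyErr_SetString(PyExc_TypeError, "expected a Timestamp-like object");
    return NULL;
}
PyObject *nano_obj = PyObject_CallMethod(o, "_as_unit", "s", "ns");
if (nano_obj == NULL) {
    return NULL;  // propagate the exception raised by the call
}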

Lastly you might find this useful:

https://pandas.pydata.org/docs/dev/development/debugging_extensions.html#improve-debugger-printing

Good luck. If you are still having trouble, ping me again and I can look deeper.

@jbrockmendel
Member Author

I ended up effectively re-implementing _as_unit. No more segfault, but I'm just getting wrong answers.

static npy_int64 get_long_attr(PyObject *o, const char *attr) {
    // NB we are implicitly assuming that o is a Timedelta or Timestamp, or NaT

    npy_int64 long_val;
    PyObject *value = PyObject_GetAttrString(o, attr);
    long_val =
        (PyLong_Check(value) ? PyLong_AsLongLong(value) : PyLong_AsLong(value));

    Py_DECREF(value);

    if (long_val == NPY_MIN_INT64) {
        // i.e. o is NaT
        return long_val;
    }

    // ensure we are in nanoseconds, similar to Timestamp._as_reso or _as_unit
    NPY_DATETIMEUNIT reso = (NPY_DATETIMEUNIT)PyObject_GetAttrString(o, "_reso");

    if (reso == NPY_FR_us) {
        long_val = long_val * 1000L;
    } else if (reso == NPY_FR_ms) {
        long_val = long_val * 1000000L;
    } else if (reso == NPY_FR_s) {
        long_val = long_val * 1000000000L;
    }

    return long_val;
}

The test case that's failing should go through the NPY_FR_us branch, but apparently none of these conditions is evaluating as truthy.

@WillAyd
Member

WillAyd commented Oct 11, 2022

Ah, I gotcha. The problem there is that you are essentially getting back a PyObject; casting it to an NPY_DATETIMEUNIT doesn't change anything about the object itself or the bytes that comprise it. In this case the cast just errantly suppresses the compiler warnings.

If you want the primitive integer value for comparison against the NPY_DATETIMEUNIT enum you'd likely want to do something like:

if (!PyLong_Check(reso)) {
  // TODO: we should have error handling here, but one step at a time...
}

long cReso = PyLong_AsLong(reso);
if (cReso == -1 && PyErr_Occurred()) {
  // TODO: we should have error handling here, but one step at a time...
}

And then use cReso for your comparisons

Also be sure to Py_DECREF(reso) when you are done with it
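
Putting those pieces together, the corrected function might look roughly like this (a sketch only: it mirrors your snippet above, keeps the _reso attribute name from this thread, and still elides real error handling):

static npy_int64 get_long_attr(PyObject *o, const char *attr) {
    // NB we are implicitly assuming that o is a Timedelta or Timestamp, or NaT

    PyObject *value = PyObject_GetAttrString(o, attr);
    npy_int64 long_val =
        (PyLong_Check(value) ? PyLong_AsLongLong(value) : PyLong_AsLong(value));
    Py_DECREF(value);

    if (long_val == NPY_MIN_INT64) {
        // i.e. o is NaT
        return long_val;
    }

    // unbox the Python int behind _reso into a C long before comparing
    PyObject *reso = PyObject_GetAttrString(o, "_reso");
    if (!PyLong_Check(reso)) {
        // TODO: error handling, one step at a time...
    }
    long cReso = PyLong_AsLong(reso);
    Py_DECREF(reso);
    if (cReso == -1 && PyErr_Occurred()) {
        // TODO: error handling, one step at a time...
    }

    // scale to nanoseconds, similar to Timestamp._as_unit
    if (cReso == NPY_FR_us) {
        long_val = long_val * 1000L;
    } else if (cReso == NPY_FR_ms) {
        long_val = long_val * 1000000L;
    } else if (cReso == NPY_FR_s) {
        long_val = long_val * 1000000000L;
    }

    return long_val;
}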

@WillAyd
Member

WillAyd commented Oct 12, 2022

@jbrockmendel just to drive home the issue with the cast, I think it is helpful to think about the bytes that are getting moved around. When comparing against the NPY_DATETIMEUNIT enum to see if something is in milliseconds, you are essentially comparing against the integer 8. For a 32-bit integer, the binary representation of that value is 00000000 00000000 00000000 00001000.

By contrast, what PyObject_GetAttrString hands you is a PyObject *, i.e. a pointer to a struct somewhere on the heap. Casting that pointer to NPY_DATETIMEUNIT does not read the integer stored inside the object; it reinterprets the pointer value itself (a memory address) as an integer, truncated to however many bits the enum occupies. A heap address is for all practical purposes arbitrary, so the comparison against the 8 you are looking for (or any other enum value) essentially never comes out true, which is why none of your branches were taken.
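
In code terms (illustrative only, reusing the _reso attribute from the snippet above):

PyObject *reso = PyObject_GetAttrString(o, "_reso");

// WRONG: reinterprets the pointer itself (a heap address) as an integer,
// so a comparison against NPY_FR_ms (the integer 8) essentially never succeeds
NPY_DATETIMEUNIT wrong = (NPY_DATETIMEUNIT)reso;

// RIGHT: unbox the Python int stored in the object; yields 8 for milliseconds
long cReso = PyLong_AsLong(reso);

Py_DECREF(reso);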

@mroeschke added this to the 2.0 milestone on Oct 18, 2022
 elif PyDate_Check(ts):
     # Keep the converter same as PyDateTime's
     ts = datetime.combine(ts, time())
-    return convert_datetime_to_tsobject(ts, tz)
+    return convert_datetime_to_tsobject(ts, tz, nanos=0, reso=NPY_FR_us)  # TODO: or lower?
Member

IMO I would cast dates to the lowest resolution (and document it)

Member

@WillAyd left a comment

Can you add a whatsnew for this too?

if isinstance(ts, ABCTimestamp):
    reso = abbrev_to_npy_unit(ts.unit)  # TODO: faster way to do this?
else:
    # TODO: what if user explicitly passes nanos=0?
Member

Is it possible to hit this? Maybe we should raise instead?

Member Author

I think it could happen with pd.Timestamp(pydatetime_obj, nanosecond=0)

Member

@WillAyd left a comment

I think this generally LGTM. A few comments that aren't blockers.

@@ -508,7 +508,10 @@ def _unbox_scalar(self, value) -> np.datetime64:
     if not isinstance(value, self._scalar_type) and value is not NaT:
         raise ValueError("'value' should be a Timestamp.")
     self._check_compatible_with(value)
-    return value.asm8
+    if value is NaT:
Member

Possibly I've missed this conversation, but do we need to give consideration to a generic NaT type that can hold different precisions? Or are we always going to use numpy's value?

Member Author

do we need to give consideration to a generic NaT type that can hold different precisions

The closest I've seen to this has been a discussion of having a separate NaT-like for timedelta. I'm not aware of any discussion of a resolution-specific NaT.

@@ -360,7 +360,14 @@ def wrapper(
     if out_dtype is not None:
         out = out.view(out_dtype)
     if fill_wrap is not None:
+        # FIXME: if we get here with dt64/td64 we need to be sure we have
+        #  matching resos
+        if fill_value.dtype.kind == "m":
Member

Might be worth making a reusable function for this? I could see it being useful in other areas. Something like reso_for_type

Member Author

I'd like to hold off on that as out of scope.

elif PyDate_Check(ts):
    # Keep the converter same as PyDateTime's
    # For date objects we give the lowest supported resolution, i.e. "s"
Member

Do we have a test where we construct a Timestamp from a datetime.date?

Member Author

dedicated test added + green

Member

@mroeschke left a comment

LGTM can merge on green

@mroeschke merged commit 41db572 into pandas-dev:main on Nov 30, 2022
@mroeschke
Member

Thanks @jbrockmendel

@jbrockmendel deleted the nano-tstamp-pydatetime branch on November 30, 2022 at 23:37
@jbrockmendel
Member Author

Thanks for the reviews @mroeschke and @WillAyd. We're getting close to having full non-nano support!
