Skip to content

Vectorized ISO Format #28863

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Oct 9, 2019

Needs work but could certainly use some guidance from @jbrockmendel

This uses the dt -> ISO string formatting that is deeply nested in the JSON code. It doesn't handle time zones properly (see #12997), doesn't match what you would get by default from Timestamp.isoformat (different precision) and doesn't support Timedeltas yet. When Timedeltas are supported this could ease some of the performance issues @cbertinato is seeing in #28595

In any case looking for guidance and thoughts on how to properly implement this, if this is even in the right direction

Here's a rough benchmark on performance:

@WillAyd WillAyd added Datetime Datetime data dtype Performance Memory or execution speed performance labels Oct 9, 2019
@@ -43,6 +48,36 @@ def get_time_micros(ndarray[int64_t] dtindex):
return micros


@cython.wraparound(False)
@cython.boundscheck(False)
def get_datetime_isoformats(ndarray[int64_t] dtindex) -> ndarray:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC cython is not great about mixing and matching annotation formats. i.e. the ndarray[int64_t] is cython-style and the -> ndarray is python style and id be (pleasantly) surprised if cython understands that


Parameters
----------
dtindex : DatetimeArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DatetimeArray is inaccurate here. So is "datetimelike" above

dt64_to_dtstruct(dtindex[i], &dts);
# TODO: handle bad return
convert_status = make_iso_8601_datetime(&dts, buf, 34, NPY_FR_ns)
out[i] = buf.decode("UTF-8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is returning this as bytes essential?

side-thought: could tweak perf by making out = np.empty(count, dtype="S34")?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_iso_8601_datetime operates on C strings so this converts the C string to a unicode object. I find this a little surprising that this is how it works but was pulled directly from Cython docs

http://docs.cython.org/en/latest/src/tutorial/strings.html#decoding-bytes-to-text

from pandas._libs.tslibs.nattype cimport NPY_NAT


cdef extern from "./src/datetime/np_datetime_strings.h":
int make_iso_8601_datetime(npy_datetimestruct *dts, char *outstr, int outlen,
NPY_DATETIMEUNIT base)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All else equal I'd rather do the "extern" in np_datetime and cimport from there. Trying to minimize the surface area of "src". That said, the ship may have sailed on that one, in which case never mind.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool I'll give that a try. Definitely sounds logical

result = fields.get_datetime_isoformats(values)
else:
result = fields.get_date_name_field(values, field)
result = self._maybe_mask_results(result, fill_value=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is maybe_mask_results necessary for the isoformat case? i.e. does it handle NaT?

"2000-01-03T00:00:00.000000000Z",
],
),
# "US/Eastern",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls mangle to ensure this fails linting. want to avoid accidentally merging commented-out code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea my hope is to get this working; just a stub for now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a while I've been wanting to make a PR at numpy to get them to expose these functions so we could #include them and not re-implement them. Every time I've trird, my C-fu has not been up to the task. You might have better luck, if you're interested.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool I’ll check out their repo. Would be nice instead of having to vendor


@pytest.mark.skip
def test_dti_isoformat_period_raises(self):
...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is idiomatic for typing overriding, but pass is still preferred for this kind of thing, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here

Copy link
Member Author

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. Is this the appropriate way to be registering the accessor method? Wasn't sure of the best way to do that since this should I think only be applicable to datetimes and timedeltas but not periods

"2000-01-03T00:00:00.000000000Z",
],
),
# "US/Eastern",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea my hope is to get this working; just a stub for now


@pytest.mark.skip
def test_dti_isoformat_period_raises(self):
...
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here

@jbrockmendel
Copy link
Member

Is this the appropriate way to be registering the accessor method?

I think so.

@jbrockmendel
Copy link
Member

There are a handful of np_datetime/np_datetime_strings funcs that we only use in the ujson code. Would this replace that usage? i.e. as a follow-up, could we get this func out of the C code and into cython?

@WillAyd
Copy link
Member Author

WillAyd commented Oct 9, 2019

This wouldn't replace it because this would only handle datetimelike objects when they appear in a DTA; ujson still handles these as one-off appearances in object arrays

Not sure about conversion to Cython. I think your idea to have this exposed in numpy would be ideal atm, so I'll check that route first and then maybe see if cythonizing makes sense thereafter

@WillAyd
Copy link
Member Author

WillAyd commented Oct 11, 2019

Might revisit this after #28912

@WillAyd WillAyd closed this Oct 11, 2019
@WillAyd WillAyd deleted the vector-isoformat branch January 16, 2020 00:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_iso methods for DatetimeLikeArray
2 participants