-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Vectorized ISO Format #28863
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorized ISO Format #28863
Conversation
@@ -43,6 +48,36 @@ def get_time_micros(ndarray[int64_t] dtindex): | |||
return micros | |||
|
|||
|
|||
@cython.wraparound(False) | |||
@cython.boundscheck(False) | |||
def get_datetime_isoformats(ndarray[int64_t] dtindex) -> ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC cython is not great about mixing and matching annotation formats. i.e. the ndarray[int64_t]
is cython-style and the -> ndarray
is python style and id be (pleasantly) surprised if cython understands that
|
||
Parameters | ||
---------- | ||
dtindex : DatetimeArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DatetimeArray is inaccurate here. So is "datetimelike" above
dt64_to_dtstruct(dtindex[i], &dts); | ||
# TODO: handle bad return | ||
convert_status = make_iso_8601_datetime(&dts, buf, 34, NPY_FR_ns) | ||
out[i] = buf.decode("UTF-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is returning this as bytes essential?
side-thought: could tweak perf by making out = np.empty(count, dtype="S34")
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make_iso_8601_datetime
operates on C strings so this converts the C string to a unicode object. I find this a little surprising that this is how it works but was pulled directly from Cython docs
http://docs.cython.org/en/latest/src/tutorial/strings.html#decoding-bytes-to-text
from pandas._libs.tslibs.nattype cimport NPY_NAT | ||
|
||
|
||
cdef extern from "./src/datetime/np_datetime_strings.h": | ||
int make_iso_8601_datetime(npy_datetimestruct *dts, char *outstr, int outlen, | ||
NPY_DATETIMEUNIT base) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All else equal I'd rather do the "extern" in np_datetime and cimport from there. Trying to minimize the surface area of "src". That said, the ship may have sailed on that one, in which case never mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool I'll give that a try. Definitely sounds logical
result = fields.get_datetime_isoformats(values) | ||
else: | ||
result = fields.get_date_name_field(values, field) | ||
result = self._maybe_mask_results(result, fill_value=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is maybe_mask_results necessary for the isoformat case? i.e. does it handle NaT?
"2000-01-03T00:00:00.000000000Z", | ||
], | ||
), | ||
# "US/Eastern", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls mangle to ensure this fails linting. want to avoid accidentally merging commented-out code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea my hope is to get this working; just a stub for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I think should be possible to support timezones if we vendor updates from Numpy:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a while I've been wanting to make a PR at numpy to get them to expose these functions so we could #include
them and not re-implement them. Every time I've trird, my C-fu has not been up to the task. You might have better luck, if you're interested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool I’ll check out their repo. Would be nice instead of having to vendor
|
||
@pytest.mark.skip | ||
def test_dti_isoformat_period_raises(self): | ||
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this is idiomatic for typing overriding, but pass
is still preferred for this kind of thing, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. Is this the appropriate way to be registering the accessor method? Wasn't sure of the best way to do that since this should I think only be applicable to datetimes and timedeltas but not periods
"2000-01-03T00:00:00.000000000Z", | ||
], | ||
), | ||
# "US/Eastern", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea my hope is to get this working; just a stub for now
|
||
@pytest.mark.skip | ||
def test_dti_isoformat_period_raises(self): | ||
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here
I think so. |
There are a handful of np_datetime/np_datetime_strings funcs that we only use in the ujson code. Would this replace that usage? i.e. as a follow-up, could we get this func out of the C code and into cython? |
This wouldn't replace it because this would only handle datetimelike objects when they appear in a DTA; ujson still handles these as one-off appearances in object arrays Not sure about conversion to Cython. I think your idea to have this exposed in numpy would be ideal atm, so I'll check that route first and then maybe see if cythonizing makes sense thereafter |
Might revisit this after #28912 |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Needs work but could certainly use some guidance from @jbrockmendel
This uses the dt -> ISO string formatting that is deeply nested in the JSON code. It doesn't handle time zones properly (see #12997), doesn't match what you would get by default from
Timestamp.isoformat
(different precision) and doesn't support Timedeltas yet. When Timedeltas are supported this could ease some of the performance issues @cbertinato is seeing in #28595In any case looking for guidance and thoughts on how to properly implement this, if this is even in the right direction
Here's a rough benchmark on performance: