-
Notifications
You must be signed in to change notification settings - Fork 21
Add some more dtypes: Date, Datetime, Duration, String #197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
aa732ff
to
07bb3d8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These names and attributes look good to me - readable/understandable. My main question right now: did you choose the available units as the minimum common set that all relevant libraries support today? Arrow has some other ones, like seconds for Datetime
and days/milliseconds(?) for Date
, and months for `Duration: https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings
Yes, exactly - technically even 's' resolution in polars would be quite tricky as We could just add |
Thanks for the context. I'd leave |
I've removed OK to get this in? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM. There may be more to add later, but the names and attributes seem good/future-proof. If a categorical dtype is a for a later PR, is this ready to go in @MarcoGorelli?
thanks - shall we keep this out for now so we can tag a first version? |
Ah okay, that's why you marked it as draft. Sure, sounds good. |
|
||
Attributes | ||
---------- | ||
time_unit : Literal['ms', 'us', 'ns'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having gone through some work in this area recently, nanosecond support isn't common in relational databases where for example DuckDB and Postgres do datetimes as number of microseconds (DuckDB Postgres). Going from microseconds upwards to milliseconds and seconds is generally more easily handled and straightforward, but going down to nanoseconds is not.
Maybe we should consider punting on nanoseconds for the time being?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mypy is unhappy with the use of |
instead of Union
- syntax too recent.
Other than that, LGTM now and the last concern about nanoseconds was addressed by removing 'ns'
. Touched upon again in the last call, and seems good to move ahead. Can be merged once CI is green.
cool, thanks, merging then |
class Date: | ||
""" | ||
Date type. | ||
|
||
There is no guarantee about the range of dates available. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this date supposed to be represented? (Keith asked the same question at the time of review (#197 (comment)), but resolved the comment without that there was any answer).
I certainly understand that we can say there is no guarantee about the range of dates that is supported by each library, but we still need to specify how to interpret the data. This is an interchange protocol where users access the buffers, not a standard API where only behaviour matters. So what does the buffer of (supposingly) integers mean here?
Is it like a numpy datetime64[D]
?
For example Arrow supports integers representing both days or milliseconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, sorry ignore my comment, I thought I was commenting on a PR for the interchange protocol, but this is for the standard API ;) So here of course only behaviour matters, and the underlying buffers are an implementation detail of the library.
Was confused because we linked to this from a discussion about supporting "date" in the interchange protocol in the pyarrow implementation.
I suggest we leave categorical for a separate PR (will probably require discussion)
EDIT: marking as draft for now, let's leave this out from the first version