Skip to content

Add some more dtypes: Date, Datetime, Duration, String #197

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Sep 30, 2023

Conversation

MarcoGorelli
Copy link
Contributor

@MarcoGorelli MarcoGorelli commented Jul 12, 2023

I suggest we leave categorical for a separate PR (will probably require discussion)


EDIT: marking as draft for now, let's leave this out from the first version

@MarcoGorelli MarcoGorelli force-pushed the more-dtypes branch 2 times, most recently from aa732ff to 07bb3d8 Compare July 12, 2023 10:54
@MarcoGorelli MarcoGorelli changed the title Add some more dtypes Add some more dtypes: Date, Datetime, Duration, String Jul 12, 2023
Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These names and attributes look good to me - readable/understandable. My main question right now: did you choose the available units as the minimum common set that all relevant libraries support today? Arrow has some other ones, like seconds for Datetime and days/milliseconds(?) for Date, and months for `Duration: https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings

@MarcoGorelli
Copy link
Contributor Author

Yes, exactly - technically even Date would be a stretch at the moment, as it's not in pandas, but I think we can work around that (by just always validating that anything sub-hourly is zero)

's' resolution in polars would be quite tricky as chrono is currently limited to dates in the -280,000, 280,000 range https://docs.rs/chrono/latest/chrono/#limitations

We could just add 's' to the standard and note that the full range of dates for that unit may not necessarily be reached for every implementation?

@rgommers
Copy link
Member

We could just add 's' to the standard and note that the full range of dates for that unit may not necessarily be reached for every implementation?

Thanks for the context. I'd leave 's' out if it's tricky to implement - if we end up having a bunch of things that have spotty support, that seems worse than those features not being standardized.

@MarcoGorelli MarcoGorelli marked this pull request as draft July 26, 2023 10:08
@MarcoGorelli MarcoGorelli marked this pull request as ready for review July 26, 2023 10:20
@MarcoGorelli
Copy link
Contributor Author

I've removed Date (as pandas doesn't support it yet), and noted the lack of guarantees for Datetime

OK to get this in?

@MarcoGorelli MarcoGorelli marked this pull request as draft August 1, 2023 12:50
Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. There may be more to add later, but the names and attributes seem good/future-proof. If a categorical dtype is a for a later PR, is this ready to go in @MarcoGorelli?

@MarcoGorelli
Copy link
Contributor Author

thanks - shall we keep this out for now so we can tag a first version?

@rgommers
Copy link
Member

rgommers commented Aug 2, 2023

Ah okay, that's why you marked it as draft. Sure, sounds good.


Attributes
----------
time_unit : Literal['ms', 'us', 'ns']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having gone through some work in this area recently, nanosecond support isn't common in relational databases where for example DuckDB and Postgres do datetimes as number of microseconds (DuckDB Postgres). Going from microseconds upwards to milliseconds and seconds is generally more easily handled and straightforward, but going down to nanoseconds is not.

Maybe we should consider punting on nanoseconds for the time being?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@MarcoGorelli MarcoGorelli marked this pull request as ready for review September 27, 2023 11:29
Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mypy is unhappy with the use of | instead of Union - syntax too recent.

Other than that, LGTM now and the last concern about nanoseconds was addressed by removing 'ns'. Touched upon again in the last call, and seems good to move ahead. Can be merged once CI is green.

@MarcoGorelli
Copy link
Contributor Author

from __future__ import annotations is all we need

cool, thanks, merging then

@MarcoGorelli MarcoGorelli merged commit f821dbe into data-apis:main Sep 30, 2023
Comment on lines +39 to +44
class Date:
"""
Date type.

There is no guarantee about the range of dates available.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this date supposed to be represented? (Keith asked the same question at the time of review (#197 (comment)), but resolved the comment without that there was any answer).

I certainly understand that we can say there is no guarantee about the range of dates that is supported by each library, but we still need to specify how to interpret the data. This is an interchange protocol where users access the buffers, not a standard API where only behaviour matters. So what does the buffer of (supposingly) integers mean here?

Is it like a numpy datetime64[D]?
For example Arrow supports integers representing both days or milliseconds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, sorry ignore my comment, I thought I was commenting on a PR for the interchange protocol, but this is for the standard API ;) So here of course only behaviour matters, and the underlying buffers are an implementation detail of the library.

Was confused because we linked to this from a discussion about supporting "date" in the interchange protocol in the pyarrow implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants