Skip to content

Add namespace.date #289

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Oct 25, 2023
Merged

Conversation

MarcoGorelli
Copy link
Contributor

No description provided.

@@ -241,3 +242,19 @@ def is_dtype(dtype: DType, kind: str | tuple[str, ...]) -> bool:
-------
bool
"""

def date(year: int, month: int, day: int) -> Scalar:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to define rules for this:

  • Standard library datetime.date doesn't allow for negative years. Pandas, PyArrow, cudf, duckdb, etc. all support negative dates. Unsure if Polars supports negative dates, my attempts at constructing / casting into a negative date Series / Scalar have failed thus far.
  • Standard library datetime.date has a max year of 9999 which doesn't support the full range of years that can be supported with an int32 number of days since epoch, which is what PyArrow, cudf, duckdb, and others support. Again, unsure what Polars supports here.

Do we wish to follow the constraints of the standard library datetime.date or do something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be supported in polars, could you show what you tried that failed please?

In [15]: pl.select(pl.date(-2000, 1, 1))
Out[15]:
shape: (1, 1)
┌─────────────┐
│ date        │
│ ---         │
│ date        │
╞═════════════╡
│ -2000-01-01 │
└─────────────┘

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these throw, in either __repr__ or in trying to take a scalar element via __getitem__(0) likely from trying to use Python datetime.datetime or datetime.date objects:

import polars as pl
import numpy as np

pl.Series([np.iinfo('int64').min + 1]).cast(pl.Datetime('ms'))
pl.Series([np.iinfo('int32').min]).cast(pl.Date())
pl.Series([np.iinfo('int32').max]).cast(pl.Date())

Additionally, trying a few datetime operations seems to silently yield incorrect results:

import polars as pl
import numpy as np

bad = pl.Series([np.iinfo('int32').max, np.iinfo('int32').min]).cast(pl.Date())
good = pl.Series([10000000, -10000000]).cast(pl.Date())

print(good.dt.year())
print(bad.dt.year())

PyArrow similarly fails in __repr__ from trying to use datetime.date objects:

import pyarrow as pa
import numpy as np

pa.scalar(np.iinfo('int32').min, pa.date32())

But when calculating things, pyarrow seems to be correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those throw when you try to bring unsupported things into Python land

but they're valid so long as you stay within Polars:

In [18]: df
Out[18]:
shape: (3, 2)
┌─────────────┬───────┐
│ tsvalue │
│ ------   │
│ datei64   │
╞═════════════╪═══════╡
│ -3000-01-014     │
│ -2000-01-011     │
│ 1000-01-012     │
└─────────────┴───────┘

In [19]: df.filter(pl.col.ts > pl.date(-2500, 1, 1))
Out[19]:
shape: (2, 2)
┌─────────────┬───────┐
│ tsvalue │
│ ------   │
│ datei64   │
╞═════════════╪═══════╡
│ -2000-01-011     │
│ 1000-01-012     │
└─────────────┴───────┘

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example with using .dt.year() returns incorrect results:

import polars as pl
import numpy as np
import pyarrow as pa
import pyarrow.compute

bad = pl.Series([np.iinfo('int32').max, np.iinfo('int32').min]).cast(pl.Date())
print(bad.dt.year())  # incorrect
print(pa.compute.year(bad.to_arrow()))  # correct
shape: (2,)
Series: '' [i32]
[
        2147483647
        -2147483648
]
[
  20599,
  20599
]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, that looks like a bug, thanks! pola-rs/polars#11991

Copy link
Contributor Author

@MarcoGorelli MarcoGorelli Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that should throw an "out-bounds" error, as it's outside the bounds supported by the underlying Chrono library in Rust (which are approx. -280,000 to 280,000)

But negative dates are supported - one reason pl.date exists (other than to support expressions) is that you pass things to it which the stdlib date doesn't support, like negative dates and nanoseconds

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, well either way it looks like there's different restrictions across different libraries and lots of bugs across the board with regards to __repr__ and __getitem__.

It looks like Polars limits dates to greater than or equal to -262145-01-01 and less than or equal to 262143-12-31 based on its Rust implementation. Other libraries generally use a 32-bit signed integer of days since epoch which yields greater than or equal to -5877641-06-23 and less than or equal to 5881580-07-11.

How do you propose we move forward here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion, but I'd be OK with just saying that the full 32-bit signed integer range of days is supported and just noting that Polars isn't 100% compliant in this regard

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. I'll start raising some issues upstream across libraries regarding the __repr__ issues I uncovered here.

Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved pending one docstring suggestion for clarification

@MarcoGorelli
Copy link
Contributor Author

thanks for your review and discussion!

@MarcoGorelli MarcoGorelli merged commit b54bb26 into data-apis:main Oct 25, 2023
MarcoGorelli added a commit to MarcoGorelli/dataframe-api that referenced this pull request Oct 26, 2023
* drive by: put __all__ in alphabetical order

* add to namespace, type return as scalar

* remove outdated type ignore

* note range of support

* Update spec/API_specification/dataframe_api/__init__.py

Co-authored-by: Keith Kraus <[email protected]>

---------

Co-authored-by: Keith Kraus <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants