Add namespace.date #289

Merged
8 commits merged on Oct 25, 2023
61 changes: 40 additions & 21 deletions spec/API_specification/dataframe_api/__init__.py
@@ -11,35 +11,36 @@
from .dtypes import *

if TYPE_CHECKING:
from .typing import DType
from .typing import DType, Scalar

__all__ = [
"__dataframe_api_version__",
"DataFrame",
"Column",
"column_from_sequence",
"column_from_1d_array",
"concat",
"dataframe_from_columns",
"dataframe_from_2d_array",
"is_null",
"null",
"Int64",
"Int32",
"Int16",
"Int8",
"UInt64",
"UInt32",
"UInt16",
"UInt8",
"Float64",
"Float32",
"Bool",
"Column",
"DataFrame",
"Date",
"Datetime",
"Duration",
"Float32",
"Float64",
"Int16",
"Int32",
"Int64",
"Int8",
"String",
"UInt16",
"UInt32",
"UInt64",
"UInt8",
"__dataframe_api_version__",
"column_from_1d_array",
"column_from_sequence",
"concat",
"dataframe_from_2d_array",
"dataframe_from_columns",
"date",
"is_dtype",
"is_null",
"null",
]


@@ -234,3 +235,21 @@ def is_dtype(dtype: DType, kind: str | tuple[str, ...]) -> bool:
-------
bool
"""

def date(year: int, month: int, day: int) -> Scalar:
Collaborator

I think we need to define rules for this:

  • Standard library datetime.date doesn't allow negative years. Pandas, PyArrow, cudf, duckdb, etc. all support negative dates. I'm unsure whether Polars supports negative dates; my attempts at constructing or casting into a negative date Series / Scalar have failed thus far.
  • Standard library datetime.date has a maximum year of 9999, which does not cover the full range of years representable with an int32 number of days since epoch. That int32 representation is what PyArrow, cudf, duckdb, and others support. Again, I'm unsure what Polars supports here.

Do we wish to follow the constraints of the standard library datetime.date or do something else?
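The standard-library limits mentioned above can be checked directly. This is a minimal sketch using only the Python stdlib; `fits_in_stdlib_date` is a hypothetical helper name introduced here for illustration and is not part of the proposed API.

```python
import datetime


def fits_in_stdlib_date(year: int, month: int, day: int) -> bool:
    """Return True if datetime.date can represent the given date."""
    try:
        datetime.date(year, month, day)
    except ValueError:
        # Raised for out-of-range years as well as invalid month/day values.
        return False
    return True


print(datetime.MINYEAR, datetime.MAXYEAR)  # 1 9999
print(fits_in_stdlib_date(2020, 1, 1))     # True
print(fits_in_stdlib_date(-2000, 1, 1))    # False
print(fits_in_stdlib_date(10000, 1, 1))    # False
```

Any implementation that routes namespace.date through datetime.date would inherit these limits, which is the core of the question raised here.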

Contributor Author

It should be supported in Polars. Could you show what you tried that failed, please?

In [15]: pl.select(pl.date(-2000, 1, 1))
Out[15]:
shape: (1, 1)
┌─────────────┐
│ date        │
│ ---         │
│ date        │
╞═════════════╡
│ -2000-01-01 │
└─────────────┘

Collaborator

All of these throw, either in __repr__ or when taking a scalar element via __getitem__(0), likely because they try to use Python datetime.datetime or datetime.date objects:

import polars as pl
import numpy as np

pl.Series([np.iinfo('int64').min + 1]).cast(pl.Datetime('ms'))
pl.Series([np.iinfo('int32').min]).cast(pl.Date())
pl.Series([np.iinfo('int32').max]).cast(pl.Date())

Additionally, trying a few datetime operations seems to silently yield incorrect results:

import polars as pl
import numpy as np

bad = pl.Series([np.iinfo('int32').max, np.iinfo('int32').min]).cast(pl.Date())
good = pl.Series([10000000, -10000000]).cast(pl.Date())

print(good.dt.year())
print(bad.dt.year())

PyArrow similarly fails in __repr__ from trying to use datetime.date objects:

import pyarrow as pa
import numpy as np

pa.scalar(np.iinfo('int32').min, pa.date32())

But when calculating things, pyarrow seems to be correct.

Contributor Author

Those throw when you try to bring unsupported values into Python land, but they're valid so long as you stay within Polars:

In [18]: df
Out[18]:
shape: (3, 2)
┌─────────────┬───────┐
│ ts          ┆ value │
│ ---         ┆ ---   │
│ date        ┆ i64   │
╞═════════════╪═══════╡
│ -3000-01-01 ┆ 4     │
│ -2000-01-01 ┆ 1     │
│ 1000-01-01  ┆ 2     │
└─────────────┴───────┘

In [19]: df.filter(pl.col.ts > pl.date(-2500, 1, 1))
Out[19]:
shape: (2, 2)
┌─────────────┬───────┐
│ ts          ┆ value │
│ ---         ┆ ---   │
│ date        ┆ i64   │
╞═════════════╪═══════╡
│ -2000-01-01 ┆ 1     │
│ 1000-01-01  ┆ 2     │
└─────────────┴───────┘

Collaborator

The example with using .dt.year() returns incorrect results:

import polars as pl
import numpy as np
import pyarrow as pa
import pyarrow.compute

bad = pl.Series([np.iinfo('int32').max, np.iinfo('int32').min]).cast(pl.Date())
print(bad.dt.year())  # incorrect
print(pa.compute.year(bad.to_arrow()))  # correct
shape: (2,)
Series: '' [i32]
[
        2147483647
        -2147483648
]
[
  20599,
  20599
]

Contributor Author

yup, that looks like a bug, thanks! pola-rs/polars#11991

Contributor Author

@MarcoGorelli MarcoGorelli Oct 24, 2023

That should throw an out-of-bounds error, as it's outside the bounds supported by the underlying Chrono library in Rust (which are approx. -280,000 to 280,000).

But negative dates are supported. One reason pl.date exists (other than to support expressions) is that you can pass it things which the stdlib date doesn't support, like negative dates and nanoseconds.

Collaborator

Okay, well either way it looks like there are different restrictions across different libraries and lots of bugs across the board with regard to __repr__ and __getitem__.

It looks like Polars limits dates to the range -262145-01-01 through 262143-12-31 (inclusive), based on its Rust implementation. Other libraries generally use a 32-bit signed integer of days since epoch, which yields the range -5877641-06-23 through 5881580-07-11 (inclusive).

How do you propose we move forward here?
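The int32 bounds quoted above can be reproduced without any dataframe library. The sketch below assumes a proleptic Gregorian calendar with astronomical year numbering (so year 0 exists) and an epoch of 1970-01-01. It uses a well-known days-to-civil conversion usually attributed to Howard Hinnant; `civil_from_days` is a name chosen here for illustration and is not an API of any library discussed in this thread.

```python
def civil_from_days(days: int) -> tuple[int, int, int]:
    """Convert days since 1970-01-01 to (year, month, day).

    Assumes a proleptic Gregorian calendar with astronomical year numbering.
    """
    z = days + 719468                 # shift epoch to 0000-03-01
    era = z // 146097                 # 400-year cycles (floor division)
    doe = z - era * 146097            # day of era, in [0, 146096]
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)  # day of March-based year
    mp = (5 * doy + 2) // 153         # March-based month, in [0, 11]
    day = doy - (153 * mp + 2) // 5 + 1
    month = mp + 3 if mp < 10 else mp - 9
    year = yoe + era * 400 + (1 if month <= 2 else 0)
    return year, month, day


INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

print(civil_from_days(0))          # (1970, 1, 1)
print(civil_from_days(INT32_MIN))  # (-5877641, 6, 23)
print(civil_from_days(INT32_MAX))  # (5881580, 7, 11)
```

Both endpoints lie far outside what datetime.date can hold, which is consistent with the __repr__ and __getitem__ failures reported earlier in the thread.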

Contributor Author

No strong opinion, but I'd be OK with just saying that the full 32-bit signed integer range of days is supported and just noting that Polars isn't 100% compliant in this regard

Collaborator

Sounds good to me. I'll start raising some issues upstream across libraries regarding the __repr__ issues I uncovered here.

"""
Create date object which can be used for filtering.

The full 32-bit signed integer range of days since epoch should be supported (between -5877641-06-23 and 5881580-07-11 inclusive).

Examples
--------
>>> df: DataFrame
>>> namespace = df.__dataframe_namespace__()
>>> mask = (
... (df.get_column_by_name('date') >= namespace.date(2020, 1, 1))
... & (df.get_column_by_name('date') < namespace.date(2021, 1, 1))
... )
>>> df.filter(mask)
"""
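The docstring example builds a half-open interval mask (start inclusive, end exclusive). As a rough illustration of those semantics only, the sketch below uses plain Python lists and datetime.date as stand-ins; `rows`, `start`, and `end` are hypothetical names, and none of this is the standard's API or any library's implementation.

```python
import datetime

# Stand-in data: each dict plays the role of one dataframe row.
rows = [
    {"date": datetime.date(2019, 12, 31), "value": 1},
    {"date": datetime.date(2020, 1, 1), "value": 2},
    {"date": datetime.date(2020, 12, 31), "value": 3},
    {"date": datetime.date(2021, 1, 1), "value": 4},
]

start = datetime.date(2020, 1, 1)  # inclusive lower bound
end = datetime.date(2021, 1, 1)    # exclusive upper bound

# Element-wise equivalent of (col >= start) & (col < end).
mask = [(row["date"] >= start) and (row["date"] < end) for row in rows]
filtered = [row for row, keep in zip(rows, mask) if keep]

print([row["value"] for row in filtered])  # [2, 3]
```

The half-open form selects exactly the calendar year 2020 without having to name its last day.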

3 changes: 3 additions & 0 deletions spec/API_specification/dataframe_api/typing.py
@@ -144,6 +144,9 @@ def is_null(value: object, /) -> bool:
def is_dtype(dtype: Any, kind: str | tuple[str, ...]) -> bool:
...

@staticmethod
def date(year: int, month: int, day: int) -> Scalar:
...

class SupportsDataFrameAPI(Protocol):
def __dataframe_consortium_standard__(
4 changes: 2 additions & 2 deletions spec/API_specification/examples/tpch/q5.py
@@ -58,8 +58,8 @@ def query(
== result.get_column_by_name("s_nationkey")
)
& (result.get_column_by_name("r_name") == "ASIA")
& (result.get_column_by_name("o_orderdate") >= namespace.date(1994, 1, 1)) # type: ignore
& (result.get_column_by_name("o_orderdate") < namespace.date(1995, 1, 1)) # type: ignore
& (result.get_column_by_name("o_orderdate") >= namespace.date(1994, 1, 1))
& (result.get_column_by_name("o_orderdate") < namespace.date(1995, 1, 1))
)
result = result.filter(mask)
