-
Notifications
You must be signed in to change notification settings - Fork 21
Add namespace.date #289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add namespace.date #289
Conversation
a5e9218
to
686a89a
Compare
@@ -241,3 +242,19 @@ def is_dtype(dtype: DType, kind: str | tuple[str, ...]) -> bool: | |||
------- | |||
bool | |||
""" | |||
|
|||
def date(year: int, month: int, day: int) -> Scalar: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to define rules for this:
- Standard library
datetime.date
doesn't allow for negative years. Pandas, PyArrow, cudf, duckdb, etc. all support negative dates. Unsure if Polars supports negative dates, my attempts at constructing / casting into a negative date Series / Scalar have failed thus far. - Standard library
datetime.date
has a max year of 9999 which doesn't support the full range of years that can be supported with anint32
number of days since epoch, which is what PyArrow, cudf, duckdb, and others support. Again, unsure what Polars supports here.
Do we wish to follow the constraints of the standard library datetime.date
or do something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be supported in polars, could you show what you tried that failed please?
In [15]: pl.select(pl.date(-2000, 1, 1))
Out[15]:
shape: (1, 1)
┌─────────────┐
│ date │
│ --- │
│ date │
╞═════════════╡
│ -2000-01-01 │
└─────────────┘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these throw, in either __repr__
or in trying to take a scalar element via __getitem__(0)
likely from trying to use Python datetime.datetime
or datetime.date
objects:
import polars as pl
import numpy as np
pl.Series([np.iinfo('int64').min + 1]).cast(pl.Datetime('ms'))
pl.Series([np.iinfo('int32').min]).cast(pl.Date())
pl.Series([np.iinfo('int32').max]).cast(pl.Date())
Additionally, trying a few datetime operations seems to silently yield incorrect results:
import polars as pl
import numpy as np
bad = pl.Series([np.iinfo('int32').max, np.iinfo('int32').min]).cast(pl.Date())
good = pl.Series([10000000, -10000000]).cast(pl.Date())
print(good.dt.year())
print(bad.dt.year())
PyArrow similarly fails in __repr__
from trying to use datetime.date
objects:
import pyarrow as pa
import numpy as np
pa.scalar(np.iinfo('int32').min, pa.date32())
But when calculating things, pyarrow seems to be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those throw when you try to bring unsupported things into Python land
but they're valid so long as you stay within Polars:
In [18]: df
Out[18]:
shape: (3, 2)
┌─────────────┬───────┐
│ ts ┆ value │
│ --- ┆ --- │
│ date ┆ i64 │
╞═════════════╪═══════╡
│ -3000-01-01 ┆ 4 │
│ -2000-01-01 ┆ 1 │
│ 1000-01-01 ┆ 2 │
└─────────────┴───────┘
In [19]: df.filter(pl.col.ts > pl.date(-2500, 1, 1))
Out[19]:
shape: (2, 2)
┌─────────────┬───────┐
│ ts ┆ value │
│ --- ┆ --- │
│ date ┆ i64 │
╞═════════════╪═══════╡
│ -2000-01-01 ┆ 1 │
│ 1000-01-01 ┆ 2 │
└─────────────┴───────┘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example with using .dt.year()
returns incorrect results:
import polars as pl
import numpy as np
import pyarrow as pa
import pyarrow.compute
bad = pl.Series([np.iinfo('int32').max, np.iinfo('int32').min]).cast(pl.Date())
print(bad.dt.year()) # incorrect
print(pa.compute.year(bad.to_arrow())) # correct
shape: (2,)
Series: '' [i32]
[
2147483647
-2147483648
]
[
20599,
20599
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, that looks like a bug, thanks! pola-rs/polars#11991
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that should throw an "out-bounds" error, as it's outside the bounds supported by the underlying Chrono
library in Rust (which are approx. -280,000 to 280,000)
But negative dates are supported - one reason pl.date
exists (other than to support expressions) is that you pass things to it which the stdlib date
doesn't support, like negative dates and nanoseconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, well either way it looks like there's different restrictions across different libraries and lots of bugs across the board with regards to __repr__
and __getitem__
.
It looks like Polars limits dates to greater than or equal to -262145-01-01
and less than or equal to 262143-12-31
based on its Rust implementation. Other libraries generally use a 32-bit signed integer of days since epoch which yields greater than or equal to -5877641-06-23
and less than or equal to 5881580-07-11
.
How do you propose we move forward here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong opinion, but I'd be OK with just saying that the full 32-bit signed integer range of days is supported and just noting that Polars isn't 100% compliant in this regard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. I'll start raising some issues upstream across libraries regarding the __repr__
issues I uncovered here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved pending one docstring suggestion for clarification
Co-authored-by: Keith Kraus <[email protected]>
thanks for your review and discussion! |
* drive by: put __all__ in alphabetical order * add to namespace, type return as scalar * remove outdated type ignore * note range of support * Update spec/API_specification/dataframe_api/__init__.py Co-authored-by: Keith Kraus <[email protected]> --------- Co-authored-by: Keith Kraus <[email protected]>
No description provided.