Add details to expectations for scalars #308
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
from __future__ import annotations | ||
|
||
from typing import Any, Protocol | ||
|
||
__all__ = ["Scalar"] | ||
|
||
|
||
class Scalar(Protocol): | ||
Is there a clean way from a typing perspective to allow people to pass either We should also have some way to go from Python scalars to these

Yes, I've added an example ( Python scalars do in fact implement the Scalar protocol, so they can be passed without any issue I think this is an argument against adding

at least, Python floats do trying to do I'm starting to see the writing on the wall for

I imagine Scalars would be typed following the Column dtypes, which then allows us to define a consistent set of type handling and type promotion rules. Otherwise, if Scalars are not typed similarly to Columns, you may need to introspect the value of a Scalar for example in calling
||
"""Scalar object. | ||
|
||
Not meant to be instantiated directly, but rather created via | ||
:meth:`Column.get_value` or one of the column reductions such | ||
as :meth:`Column.sum`. | ||
""" | ||
|
||
def __lt__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __le__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __eq__(self, other: object) -> Scalar: # type: ignore[override] | ||
... | ||
|
||
def __ne__(self, other: object) -> Scalar: # type: ignore[override] | ||
... | ||
|
||
def __gt__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __ge__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __add__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __radd__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __sub__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __rsub__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __mul__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __rmul__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __mod__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __rmod__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __pow__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __rpow__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __floordiv__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __rfloordiv__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __truediv__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __rtruediv__(self, other: Any) -> Scalar: | ||
... | ||
|
||
def __neg__(self) -> Scalar: | ||
... | ||
|
||
def __abs__(self) -> Scalar: | ||
... | ||
|
||
def __bool__(self) -> bool: | ||
"""Note that this returns a Python scalar. | ||
|
||
Depending on the implementation, this may raise or trigger computation. | ||
""" | ||
... |
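The review thread on this file observes that Python floats already provide the methods listed in the protocol. A minimal sketch of how one might check that by name is shown below. The helper `missing_scalar_methods` and the `SCALAR_METHODS` tuple are hypothetical names introduced for illustration; they are not part of the standard or of this PR.

```python
from __future__ import annotations

# Method names copied from the Scalar protocol in this diff.
SCALAR_METHODS = (
    "__lt__", "__le__", "__eq__", "__ne__", "__gt__", "__ge__",
    "__add__", "__radd__", "__sub__", "__rsub__",
    "__mul__", "__rmul__", "__mod__", "__rmod__",
    "__pow__", "__rpow__", "__floordiv__", "__rfloordiv__",
    "__truediv__", "__rtruediv__", "__neg__", "__abs__", "__bool__",
)


def missing_scalar_methods(tp: type) -> list[str]:
    """Return the protocol method names that `tp` does not define.

    Hypothetical helper: a name-based check only, not a full
    structural type check.
    """
    return [name for name in SCALAR_METHODS if not hasattr(tp, name)]


# float and int define every listed method; str lacks several of them.
print(missing_scalar_methods(float))
print(missing_scalar_methods(int))
print(missing_scalar_methods(str))
```

This only compares attribute names. A static type checker such as mypy also compares signatures and return types, so it may be stricter than this sketch.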
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,6 +14,8 @@ | |
from dataframe_api.groupby_object import Aggregation as AggregationT | ||
from dataframe_api.groupby_object import GroupBy | ||
|
||
from .scalar_object import Scalar | ||
|
||
if TYPE_CHECKING: | ||
from collections.abc import Sequence | ||
|
||
|
@@ -53,9 +55,6 @@ | |
Duration, | ||
] | ||
|
||
# Type alias: Mypy needs Any, but for readability we need to make clear this | ||
# is a Python scalar (i.e., an instance of `bool`, `int`, `float`, `str`, etc.) | ||
Scalar = Any | ||
# null is a special object which represents a missing value. | ||
# It is not valid as a type. | ||
NullType = Any | ||
I wonder if we can just have the special

yes, brilliant point, thanks I guess it could just return

from #308 (comment), I think we may need to end up with
||
|
@@ -183,5 +182,4 @@ def __column_consortium_standard__( | |
"Scalar", | ||
"SupportsColumnAPI", | ||
"SupportsDataFrameAPI", | ||
"Scalar", | ||
] |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,7 +18,7 @@ class DataFrame: | |
... | ||
|
||
class Column: | ||
def mean(self, skip_nulls: bool = True) -> float | NullType: | ||
def mean(self, skip_nulls: bool = True) -> Scalar | NullType: | ||
I think it's weird that this can return either a

thanks, you're right - it should just be

if indeed we did go with #308 (comment), then

I don't know if we would want

Could you clarify please? You could do

Apologies, we're aligned here. We have different classes per type as opposed to just a top level
||
... | ||
|
||
larger = df2 > df1.col('foo').mean() | ||
|
@@ -27,15 +27,37 @@ larger = df2 > df1.col('foo').mean() | |
For a GPU dataframe library, it is desirable for all data to reside on the GPU, | ||
and not incur a performance penalty from synchronizing instances of Python | ||
builtin types to CPU. In the above example, the `.mean()` call returns a | ||
`float`. It is likely beneficial though to implement this as a library-specific | ||
scalar object which duck types with `float`. This means that it should (a) have | ||
the same semantics as a builtin `float` when used within a library, and (b) | ||
support usage as a `float` outside of the library (i.e., implement | ||
`__float__`). Duck typing is usually not perfect, for example `isinstance` | ||
usage on the float-like duck type will behave differently. Such explicit "type | ||
of object" checks don't have to be supported. | ||
|
||
The following design rule applies everywhere builtin Python types are used | ||
within this API standard: _where a Python builtin type is specified, an | ||
implementation may always replace it by an equivalent library-specific type | ||
that duck types with the Python builtin type._ | ||
`Scalar`. It is likely beneficial though to implement this as a library-specific | ||
scalar object which (partially) duck types with `float`. The required methods it | ||
must implement are listed in the spec for class `Scalar`. | ||
|
||
## Example | ||
|
||
For example, if a library implements `FancyFloat` and `FancyBool` scalars, | ||
then the following should all be supported: | ||
```python | ||
df: DataFrame | ||
column_1: Column = df.col('a') | ||
column_2: Column = df.col('b') | ||
|
||
scalar: FancyFloat = column_1.std() | ||
result_1: Column = column_2 - column_1.std() | ||
result_2: FancyBool = column_2.std() > column_1.std() | ||
``` | ||
|
||
Note that the scalars above are library-specific ones - they may be used to keep | ||
data on GPU, or to keep data lazy. | ||
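As a rough illustration of how library-specific scalars could keep data lazy, the sketch below defers every operation until an explicit `collect()` call. The `collect` method, the `_unwrap` helper, and the constructor signatures are assumptions made for this example; the standard does not define them, and a real library would likely hold an expression graph or a device buffer rather than a Python closure.

```python
from __future__ import annotations

from typing import Any, Callable


class FancyBool:
    """Hypothetical lazy boolean scalar: stores a thunk, not a value."""

    def __init__(self, compute: Callable[[], bool]) -> None:
        self._compute = compute

    def collect(self) -> bool:
        # Assumed materialisation hook, not part of the standard.
        return self._compute()


class FancyFloat:
    """Hypothetical lazy float scalar: stores a thunk, not a value."""

    def __init__(self, compute: Callable[[], float]) -> None:
        self._compute = compute

    def __sub__(self, other: Any) -> FancyFloat:
        return FancyFloat(lambda: self._compute() - _unwrap(other))

    def __rsub__(self, other: Any) -> FancyFloat:
        return FancyFloat(lambda: _unwrap(other) - self._compute())

    def __gt__(self, other: Any) -> FancyBool:
        return FancyBool(lambda: self._compute() > _unwrap(other))

    def collect(self) -> float:
        # Assumed materialisation hook, not part of the standard.
        return self._compute()


def _unwrap(value: Any) -> Any:
    """Materialise library scalars; pass Python scalars through."""
    if isinstance(value, (FancyFloat, FancyBool)):
        return value.collect()
    return value


# Stand-ins for column_1.std() and column_2.std().
std_a = FancyFloat(lambda: 1.5)
std_b = FancyFloat(lambda: 2.0)

is_larger = std_b > std_a      # a FancyBool; nothing is computed yet
difference = 3.0 - std_a       # Python float on the left uses __rsub__
```

Nothing is evaluated until `is_larger.collect()` or `difference.collect()` is called, which is the property a lazy or GPU-backed implementation would want to preserve.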
|
||
The following, however, may raise, depending on the | ||
implementation: | ||
```python | ||
df: DataFrame | ||
column = df.col('a') | ||
|
||
if column.std() > 0: # this line may raise! | ||
print('std is positive') | ||
``` | ||
This is because `if column.std() > 0` will call `(column.std() > 0).__bool__()`, | ||
which Python requires to return a Python `bool`. | ||
Therefore, a purely lazy dataframe library may choose to raise here, whereas | ||
one which allows for eager execution may return a Python bool. |
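The two behaviours described for `__bool__` can be sketched side by side. `LazyBool`, `EagerBool`, and `describe` are hypothetical names used only for this illustration, and raising `TypeError` is an assumption: the text above says an implementation may raise but does not say which exception.

```python
class LazyBool:
    """Hypothetical scalar from a purely lazy library."""

    def __bool__(self) -> bool:
        # Assumed exception type; the standard does not name one here.
        raise TypeError("cannot convert a lazy scalar to a Python bool")


class EagerBool:
    """Hypothetical scalar from a library that allows eager execution."""

    def __init__(self, value: bool) -> None:
        self._value = value

    def __bool__(self) -> bool:
        return self._value


def describe(is_positive: object) -> str:
    """Branch on a scalar, reporting when the implementation refuses."""
    try:
        # `if` calls is_positive.__bool__() implicitly.
        if is_positive:
            return "std is positive"
        return "std is not positive"
    except TypeError:
        return "cannot branch on a lazy scalar"


print(describe(EagerBool(True)))
print(describe(LazyBool()))
```

Code that must run against both kinds of implementation would therefore avoid `if` on a scalar result, or guard it as `describe` does.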
I think we need a `dtype` property similar to what we have with Columns?

to be honest I'd be fine with departing completely from the idea of ducktyped Python scalars and just adding extra things (like `dtype` or `persist`) if they're useful

let's discuss this part in the next call

this would mean that Python scalars would no longer implement the Scalar Protocol