Add details to expectations for scalars #308


Merged: 31 commits, Nov 17, 2023
Changes shown are from 9 commits.

Commits:
751a131
note what may raise
MarcoGorelli Oct 31, 2023
7cb90ea
Merge remote-tracking branch 'upstream/main' into expand-on-scalars
MarcoGorelli Nov 7, 2023
fc65648
list required methods
MarcoGorelli Nov 8, 2023
7c24afd
add scalar class
MarcoGorelli Nov 8, 2023
2714c13
reword
MarcoGorelli Nov 8, 2023
99b91a5
fixup
MarcoGorelli Nov 8, 2023
a867f00
fixup
MarcoGorelli Nov 8, 2023
9e13924
Merge remote-tracking branch 'upstream/main' into expand-on-scalars
MarcoGorelli Nov 8, 2023
f197672
fixup
MarcoGorelli Nov 8, 2023
a417b1c
Merge remote-tracking branch 'upstream/main' into expand-on-scalars
MarcoGorelli Nov 14, 2023
409d8f3
replace Scalar|NullType with Scalar
MarcoGorelli Nov 14, 2023
0db9871
type null as Scalar
MarcoGorelli Nov 14, 2023
6520ac4
add example of working with scalars
MarcoGorelli Nov 14, 2023
46bc08c
use AnyScalar;
MarcoGorelli Nov 15, 2023
b879f31
add Scalar.dtype and Scalar.persist
MarcoGorelli Nov 15, 2023
456c152
Merge remote-tracking branch 'upstream/main' into expand-on-scalars
MarcoGorelli Nov 15, 2023
b8011c7
update shift arg
MarcoGorelli Nov 15, 2023
97d8f9a
use BoolScalar
MarcoGorelli Nov 15, 2023
3b7bcb6
use float scalar in some parts
MarcoGorelli Nov 15, 2023
d598a8d
use float scalar in some parts
MarcoGorelli Nov 15, 2023
a12585b
string scalar for rename
MarcoGorelli Nov 15, 2023
35cd4ed
intscalar for shift
MarcoGorelli Nov 15, 2023
fade164
numeric scalar for correction
MarcoGorelli Nov 15, 2023
29ceed2
simplify
MarcoGorelli Nov 15, 2023
bee402f
update python builtin types desc
MarcoGorelli Nov 15, 2023
d1f4daf
Merge remote-tracking branch 'upstream/main' into expand-on-scalars
MarcoGorelli Nov 15, 2023
15090ac
fixup
MarcoGorelli Nov 15, 2023
24d2ad8
enable extra ruff rule, note AnyScalar
MarcoGorelli Nov 15, 2023
8360d96
remove some unnecessary nitpick ignores
MarcoGorelli Nov 15, 2023
f69679a
return Self from Scalar.persist, add column.persist
MarcoGorelli Nov 17, 2023
216b5e6
fixup
MarcoGorelli Nov 17, 2023
3 changes: 2 additions & 1 deletion spec/API_specification/dataframe_api/column_object.py
@@ -5,7 +5,8 @@
if TYPE_CHECKING:
from typing_extensions import Self

from .typing import DType, Namespace, NullType, Scalar
from .scalar_object import Scalar
from .typing import DType, Namespace, NullType


__all__ = ["Column"]
87 changes: 87 additions & 0 deletions spec/API_specification/dataframe_api/scalar_object.py
@@ -0,0 +1,87 @@
from __future__ import annotations

from typing import Any, Protocol

__all__ = ["Scalar"]


class Scalar(Protocol):
Collaborator:

I think we need a dtype property similar to what we have with Columns?

MarcoGorelli (Contributor Author), Nov 14, 2023:

to be honest I'd be fine with departing completely from the idea of ducktyped Python scalars and just adding extra things (like dtype or persist) if they're useful

let's discuss this part in the next call

this would mean that Python scalars would no longer implement the Scalar Protocol

Collaborator:

Is there a clean way from a typing perspective to allow people to pass either Scalar objects or typical Python scalars that we can handle constructing Scalars from without issue? Without that I think the API will be a bit of a pain with people having to sprinkle Scalar(...) or Scalar.from_py(...) all over.

We should also have some way to go from Python scalars to these Scalar objects.

Contributor Author:

Yes, I've added an example (spec/API_specification/examples/05_scalars_example.py)

Python scalars do in fact implement the Scalar protocol, so they can be passed without any issue

I think this is an argument against adding dtype and other methods to our scalars which aren't supported by Python ones

Contributor Author:

Python scalars do in fact implement the Scalar protocol, so they can be passed without any issue

at least, Python floats do

trying to do fill_null('foo') wouldn't be fine

I'm starting to see the writing on the wall for FloatScalar / IntScalar / StringScalar ...
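The worry above can be made concrete with a structural-typing sketch. The protocol below is an illustrative subset, not the spec's `Scalar` definition: builtin `float` satisfies an arithmetic protocol structurally, while `str` does not, which is what motivates separate `FloatScalar` / `StringScalar` classes.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ArithmeticScalar(Protocol):
    """Illustrative subset of an arithmetic scalar protocol."""

    def __add__(self, other: Any) -> Any: ...
    def __neg__(self) -> Any: ...


# Builtin floats structurally satisfy the protocol...
assert isinstance(1.5, ArithmeticScalar)
# ...but strings do not: str defines __add__ but has no __neg__,
# which is why fill_null('foo') on a numeric column is suspect.
assert not isinstance("foo", ArithmeticScalar)
```

Note that `runtime_checkable` protocols only check for the presence of methods, not their signatures, so this is a coarse check.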

Collaborator:

I imagine Scalars would be typed following the Column dtypes, which then allows us to define a consistent set of type handling and type promotion rules. Otherwise, if Scalars are not typed similarly to Columns, you may need to introspect the value of a Scalar for example in calling fill_null against an Int8 dtype column with an IntScalar where the value is above what is supported by Int8.
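The promotion concern can be sketched as follows (the helper below is hypothetical, not part of the spec): if scalars carry no dtype, a library filling nulls in an `Int8` column has to introspect the raw value at runtime.

```python
# Hypothetical runtime check a library would need if scalars carried no
# dtype: verify a fill value actually fits in the target Int8 column.
INT8_MIN, INT8_MAX = -128, 127


def check_int8_fill_value(value: int) -> int:
    """Raise if `value` cannot be represented in an Int8 column."""
    if not INT8_MIN <= value <= INT8_MAX:
        msg = f"{value} is out of range for an Int8 column"
        raise OverflowError(msg)
    return value


check_int8_fill_value(5)  # fine
# check_int8_fill_value(300) would raise OverflowError
```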

"""Scalar object.

Not meant to be instantiated directly, but rather created via
`:meth:Column.get_value` or one of the column reductions such
as `:meth:`Column.sum`.
"""

def __lt__(self, other: Any) -> Scalar:
...

def __le__(self, other: Any) -> Scalar:
...

def __eq__(self, other: object) -> Scalar: # type: ignore[override]
...

def __ne__(self, other: object) -> Scalar: # type: ignore[override]
...

def __gt__(self, other: Any) -> Scalar:
...

def __ge__(self, other: Any) -> Scalar:
...

def __add__(self, other: Any) -> Scalar:
...

def __radd__(self, other: Any) -> Scalar:
...

def __sub__(self, other: Any) -> Scalar:
...

def __rsub__(self, other: Any) -> Scalar:
...

def __mul__(self, other: Any) -> Scalar:
...

def __rmul__(self, other: Any) -> Scalar:
...

def __mod__(self, other: Any) -> Scalar:
...

def __rmod__(self, other: Any) -> Scalar:
...

def __pow__(self, other: Any) -> Scalar:
...

def __rpow__(self, other: Any) -> Scalar:
...

def __floordiv__(self, other: Any) -> Scalar:
...

def __rfloordiv__(self, other: Any) -> Scalar:
...

def __truediv__(self, other: Any) -> Scalar:
...

def __rtruediv__(self, other: Any) -> Scalar:
...

def __neg__(self) -> Scalar:
...

def __abs__(self) -> Scalar:
...

def __bool__(self) -> bool:
"""Note that this return a Python scalar.

Depending on the implementation, this may raise or trigger computation.
"""
...
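As a sketch of how an eager library might satisfy this protocol (the `EagerScalar` class below is hypothetical, not part of the spec): comparisons and arithmetic return another scalar of the same class, and only `__bool__` materialises a builtin `bool`.

```python
from __future__ import annotations

from typing import Any


class EagerScalar:
    """Hypothetical eager implementation: wraps a plain Python value."""

    def __init__(self, value: Any) -> None:
        self._value = value

    def _unwrap(self, other: Any) -> Any:
        return other._value if isinstance(other, EagerScalar) else other

    def __gt__(self, other: Any) -> EagerScalar:
        # Comparison returns another scalar, not a builtin bool.
        return EagerScalar(self._value > self._unwrap(other))

    def __add__(self, other: Any) -> EagerScalar:
        return EagerScalar(self._value + self._unwrap(other))

    def __bool__(self) -> bool:
        # Eager libraries can materialise immediately; lazy ones may raise.
        return bool(self._value)


total = EagerScalar(3.0) + EagerScalar(1.5)  # still an EagerScalar
assert bool(total > 4.0)
```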
6 changes: 2 additions & 4 deletions spec/API_specification/dataframe_api/typing.py
@@ -14,6 +14,8 @@
from dataframe_api.groupby_object import Aggregation as AggregationT
from dataframe_api.groupby_object import GroupBy

from .scalar_object import Scalar

if TYPE_CHECKING:
from collections.abc import Sequence

@@ -53,9 +55,6 @@
Duration,
]

# Type alias: Mypy needs Any, but for readability we need to make clear this
# is a Python scalar (i.e., an instance of `bool`, `int`, `float`, `str`, etc.)
Scalar = Any
# null is a special object which represents a missing value.
# It is not valid as a type.
NullType = Any
Collaborator:

I wonder if we can just have the special NULL scalar singleton be of type Scalar instead of a special type here? Not sure what NULL.dtype would yield though since we don't have a null or empty dtype

Contributor Author:

yes, brilliant point, thanks

I guess it could just return None?

Contributor Author:

from #308 (comment), I think we may need to end up with NullScalar

@@ -183,5 +182,4 @@ def __column_consortium_standard__(
"Scalar",
"SupportsColumnAPI",
"SupportsDataFrameAPI",
"Scalar",
]
48 changes: 35 additions & 13 deletions spec/design_topics/python_builtin_types.md
@@ -18,7 +18,7 @@ class DataFrame:
...

class Column:
def mean(self, skip_nulls: bool = True) -> float | NullType:
def mean(self, skip_nulls: bool = True) -> Scalar | NullType:
Collaborator:

I think it's weird that this can return either a Scalar which is a protocol or a NullType which doesn't guarantee the same interface. It makes it hard to guarantee the ability to nicely method chain here.

Contributor Author:

thanks, you're right - it should just be -> Scalar, correct? Because Scalar could be backed by a null value

Contributor Author:

if indeed we did go with #308 (comment), then Scalar could be a union of FloatScalar, IntScalar, NullScalar, ...

Collaborator:

I don't know if we would want Scalar to be type erased similar to Column

Contributor Author:

Could you clarify please? You could do isinstance(value, namespace.FloatScalar) to check the dtype

Collaborator:

Apologies, we're aligned here. We have different classes per type as opposed to just a top level Scalar class where type specific functionality is hidden underneath the class (similar to our Column class).

...

larger = df2 > df1.col('foo').mean()
@@ -27,15 +27,37 @@
For a GPU dataframe library, it is desirable for all data to reside on the GPU,
and not incur a performance penalty from synchronizing instances of Python
builtin types to CPU. In the above example, the `.mean()` call returns a
`float`. It is likely beneficial though to implement this as a library-specific
scalar object which duck types with `float`. This means that it should (a) have
the same semantics as a builtin `float` when used within a library, and (b)
support usage as a `float` outside of the library (i.e., implement
`__float__`). Duck typing is usually not perfect, for example `isinstance`
usage on the float-like duck type will behave differently. Such explicit "type
of object" checks don't have to be supported.

The following design rule applies everywhere builtin Python types are used
within this API standard: _where a Python builtin type is specified, an
implementation may always replace it by an equivalent library-specific type
that duck types with the Python builtin type._
`Scalar`. It is likely beneficial though to implement this as a library-specific
scalar object which (partially) duck types with `float`. The required methods it
must implement are listed in the spec for class `Scalar`.

## Example

For example, if a library implements `FancyFloat` and `FancyBool` scalars,
then the following should all be supported:
```python
df: DataFrame
column_1: Column = df.col('a')
column_2: Column = df.col('b')

scalar: FancyFloat = column_1.std()
result_1: Column = column_2 - column_1.std()
result_2: FancyBool = column_2.std() > column_1.std()
```

Note that the scalars above are library-specific ones - they may be used to keep
data on GPU, or to keep data lazy.

The following, however, may raise, depending on the
implementation:
```python
df: DataFrame
column = df.col('a')

if column.std() > 0: # this line may raise!
print('std is positive')
```
This is because `if column.std() > 0` will call `(column.std() > 0).__bool__()`,
which Python requires to return a Python `bool`.
Therefore, a purely lazy dataframe library may choose to raise here, whereas
one which allows for eager execution may return a Python `bool`.
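A minimal sketch of the lazy side of this trade-off (the `LazyScalar` class and its expression strings are hypothetical, not part of the spec): comparisons stay lazy, and `__bool__` raises rather than silently triggering computation.

```python
class LazyScalar:
    """Hypothetical lazy scalar: records an expression instead of computing."""

    def __init__(self, expr: str) -> None:
        self._expr = expr

    def __gt__(self, other: object) -> "LazyScalar":
        # Build up the expression lazily; nothing is computed here.
        return LazyScalar(f"({self._expr} > {other!r})")

    def __bool__(self) -> bool:
        # Python insists that __bool__ return a real bool, so a purely lazy
        # library must either raise or implicitly trigger computation.
        msg = "cannot convert a lazy scalar to bool; materialise it first"
        raise TypeError(msg)


condition = LazyScalar("col('a').std()") > 0  # fine: stays lazy
try:
    if condition:  # `if` calls __bool__, which raises
        pass
except TypeError as exc:
    print(exc)
```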