Add design topic page on use of Python builtin types #153

Merged
merged 2 commits into from
Apr 27, 2023
1 change: 1 addition & 0 deletions spec/design_topics/index.rst
@@ -7,3 +7,4 @@ Design topics & constraints

backwards_compatibility
data_interchange
python_builtin_types
41 changes: 41 additions & 0 deletions spec/design_topics/python_builtin_types.md
@@ -0,0 +1,41 @@
# Python builtin types and duck typing

Use of Python's builtin types - `bool`, `int`, `float`, `str`, `dict`, `list`,
`tuple`, `datetime.datetime`, etc. - is often natural and convenient. However,
it is also potentially problematic when trying to write performant dataframe
library code or to support devices other than CPU.

This standard specifies the use of Python types in quite a few places, and uses
them as type annotations. As a concrete example, consider the `mean` method and
the `float` it is documented to return, in combination with the `__gt__` method
(i.e., the `>` operator) on the dataframe:

```python
class DataFrame:
    def __gt__(self, other: DataFrame | Scalar) -> DataFrame:
        ...
    def get_column_by_name(self, name: str, /) -> Column:
        ...

class Column:
    def mean(self, skipna: bool = True) -> float:
        ...

larger = df2 > df1.get_column_by_name('foo').mean()
```

For a GPU dataframe library, it is desirable for all data to reside on the GPU,
and not incur a performance penalty from synchronizing instances of Python
builtin types to CPU. In the above example, the `.mean()` call returns a
`float`. It is likely beneficial though to implement this as a library-specific
scalar object which duck types with `float`. This means that it should (a) have
Comment on lines +29 to +31
Member

Out of curiosity, cudf doesn't actually do this, right now, is that correct? (for example it does return numpy scalars for numeric types, i.e. what pandas does)

But cudf would like to do this? (I seem to remember discussions in the past about Scalar objects)

Member Author

I thought it did. If not, maybe @kkraus14 or @shwina can suggest a better example for where cuDF uses scalars.

That said, numpy scalars are also an example of special objects that duck type, they're not Python builtin float, int, etc.

the same semantics as a builtin `float` when used within a library, and (b)
support usage as a `float` outside of the library (i.e., implement
`__float__`). Duck typing is usually not perfect; for example, `isinstance`
checks on the float-like duck type will behave differently from checks on a
builtin `float`. Such explicit "type of object" checks don't have to be
supported.
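
As a rough illustration of points (a) and (b), here is a minimal sketch of a
float-like scalar. The `DeviceScalar` class and its `compute` callable are
hypothetical and are not part of this standard or of any particular library;
the callable only simulates a value that stays on a device until it is needed
on the host.

```python
import math


class DeviceScalar:
    """Hypothetical float-like scalar (illustration only)."""

    def __init__(self, compute):
        # `compute` is a zero-argument callable that stands in for a value
        # living on a device; nothing is materialized at construction time.
        self._compute = compute

    def __float__(self) -> float:
        # (b) Usage outside the library: convert to a builtin float on demand.
        return float(self._compute())

    def __add__(self, other):
        # (a) Float-like arithmetic that stays lazy by returning another
        # DeviceScalar instead of a builtin float.
        return DeviceScalar(lambda: float(self) + float(other))

    __radd__ = __add__

    def __gt__(self, other) -> bool:
        # Simplification: this forces a conversion; a real library could
        # return its own boolean scalar instead.
        return float(self) > float(other)


mean = DeviceScalar(lambda: 2.5)
total = mean + 1.5                           # still a DeviceScalar
as_float = float(total)                      # explicit conversion via __float__
root = math.sqrt(DeviceScalar(lambda: 9.0))  # math functions accept __float__
is_builtin = isinstance(mean, float)         # the imperfect part of duck typing
```

In this sketch, arithmetic and `float()` behave as they would for a builtin
`float`, while the `isinstance` check is the kind of "type of object" test
that gives a different answer.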

The following design rule applies everywhere builtin Python types are used
within this API standard: _where a Python builtin type is specified, an
implementation may always replace it by an equivalent library-specific type
that duck types with the Python builtin type._
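
One possible consequence for code that consumes such values is to convert
explicitly rather than check for the builtin type. The sketch below assumes a
hypothetical helper, `exceeds`, and uses the standard library's
`fractions.Fraction` purely as a stand-in for a library-specific float-like
type.

```python
from fractions import Fraction
from typing import SupportsFloat


def exceeds(value: SupportsFloat, threshold: float) -> bool:
    # Hypothetical helper: convert with float() instead of testing
    # `isinstance(value, float)`, so float-like duck types are accepted.
    return float(value) > threshold


with_builtin = exceeds(2.5, 2.0)                # builtin float
with_duck_type = exceeds(Fraction(5, 2), 2.0)   # float-like stand-in
```

Both calls take the same code path, because the helper relies only on
`__float__` and not on the concrete type of its argument.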