Add design topic page on use of Python builtin types #153
Changes from 1 commit
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,3 +7,4 @@ Design topics & constraints | |
|
||
backwards_compatibility | ||
data_interchange | ||
python_builtin_types |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
# Python builtin types and duck typing | ||
|
||
Use of Python's builtin types - `bool`, `int`, `float`, `str`, `dict`, `list`, | ||
`tuple`, `datetime.datetime`, etc. - is often natural and convenient. However, | ||
it can also be problematic when writing performant dataframe library code or | |
when supporting devices other than the CPU. | |
|
||
This standard specifies the use of Python types in quite a few places, and uses | ||
them as type annotations. As a concrete example, consider the `mean` method and | ||
the `float` it is documented to return, in combination with the `__gt__` method | ||
(i.e., the `>` operator) on the dataframe: | ||
|
||
```python | ||
class DataFrame: | |
    def __gt__(self, other: DataFrame | Scalar) -> DataFrame: | |
        ... | |
    def get_column_by_name(self, name: str, /) -> Column: | |
        ... | |
|
||
class Column: | |
    def mean(self, skipna: bool = True) -> float: | |
        ... | |
|
||
larger = df2 > df1.get_column_by_name('foo').mean() | |
``` | ||
|
||
For a GPU dataframe library, it is desirable for all data to reside on the GPU, | ||
and not incur a performance penalty from synchronizing instances of Python | ||
builtin types to CPU. In the above example, the `.mean()` call returns a | ||
`float`. It is likely beneficial though to implement this as a library-specific | ||
scalar object which duck types with `float`. This means that it should (a) have | ||
Comment on lines +29 to +31
Out of curiosity, cudf doesn't actually do this right now, is that correct? (For example, it returns NumPy scalars for numeric types, which is what pandas does.) But cudf would like to do this? (I seem to remember discussions in the past about Scalar objects.)
||
the same semantics as a builtin `float` when used within a library, and (b) | ||
support usage as a `float` outside of the library (i.e., implement | ||
`__float__`). Duck typing is usually not perfect; for example, `isinstance` | |
checks on the float-like duck type will behave differently. Such explicit "type | |
of object" checks don't have to be supported. | |
|
||
The following design rule applies everywhere builtin Python types are used | ||
within this API standard: _where a Python builtin type is specified, an | ||
implementation may always replace it by an equivalent library-specific type | ||
that duck types with the Python builtin type._ |
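
As a rough illustration of the design rule above, here is a minimal sketch of a float-like scalar. The class name `DeviceScalar` and its internals are hypothetical and not taken from cudf or any other library; a plain Python `float` stands in for a value that a GPU library would presumably keep in device memory and copy to the host only inside `__float__`.

```python
from __future__ import annotations

import math
from typing import SupportsFloat


class DeviceScalar:
    """Hypothetical float-like scalar (illustrative only, not a real API)."""

    def __init__(self, value: SupportsFloat) -> None:
        # Stand-in for a device-resident value; a real implementation
        # would avoid materializing a host float here.
        self._value = float(value)

    # (b) Usable as a float outside the library.
    def __float__(self) -> float:
        return self._value

    # (a) Float-like semantics inside the library: results stay wrapped.
    def __add__(self, other: SupportsFloat) -> DeviceScalar:
        return DeviceScalar(self._value + float(other))

    __radd__ = __add__  # addition is commutative, so reuse __add__

    def __gt__(self, other: SupportsFloat) -> bool:
        return self._value > float(other)

    def __lt__(self, other: SupportsFloat) -> bool:
        return self._value < float(other)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, (int, float, DeviceScalar)):
            return NotImplemented
        return self._value == float(other)

    def __hash__(self) -> int:
        # Keep hashing consistent with __eq__ against builtin floats.
        return hash(self._value)

    def __repr__(self) -> str:
        return f"DeviceScalar({self._value!r})"


mean = DeviceScalar(4.0)
print(float(mean))              # explicit conversion goes through __float__
print(math.sqrt(mean))          # math functions accept objects with __float__
print(mean > 2.0)               # comparison behaves like a float comparison
print(isinstance(mean, float))  # False: the duck type is not a float subclass
```

Code that consumes such a value can stay agnostic by calling `float(x)` at the boundary, or by annotating parameters with `typing.SupportsFloat`, rather than checking `isinstance(x, float)`.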