Add design topic page on use of Python builtin types #153
Merged
@@ -7,3 +7,4 @@ Design topics & constraints

backwards_compatibility
data_interchange
python_builtin_types
@@ -0,0 +1,41 @@
# Python builtin types and duck typing

Use of Python's builtin types - `bool`, `int`, `float`, `str`, `dict`, `list`,
`tuple`, `datetime.datetime`, etc. - is often natural and convenient. However,
it is also potentially problematic when trying to write performant dataframe
library code or to support devices other than the CPU.

This standard specifies the use of Python types in quite a few places, and uses
them as type annotations. As a concrete example, consider the `mean` method and
the `float` it is documented to return, in combination with the `__gt__` method
(i.e., the `>` operator) on the dataframe:

```python
class DataFrame:
    def __gt__(self, other: DataFrame | Scalar) -> DataFrame:
        ...
    def get_column_by_name(self, name: str, /) -> Column:
        ...

class Column:
    def mean(self, skip_nulls: bool = True) -> float:
        ...

larger = df2 > df1.get_column_by_name('foo').mean()
```

For a GPU dataframe library, it is desirable for all data to reside on the GPU,
and to avoid the performance penalty of synchronizing with the CPU just to
produce instances of Python builtin types. In the above example, the `.mean()`
call is documented to return a `float`. It is likely beneficial, though, to
implement this as a library-specific scalar object which duck types with
`float`. This means that it should (a) have the same semantics as a builtin
`float` when used within a library, and (b) support usage as a `float` outside
of the library (i.e., implement `__float__`). Duck typing is usually not
perfect; for example, `isinstance` checks against `float` will behave
differently for the float-like duck type. Such explicit "type of object" checks
don't have to be supported.
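
As an illustration of points (a) and (b), here is a minimal sketch of what such a float-like scalar could look like. The class name `DeviceScalar` and the helper `_unwrap` are hypothetical and not part of this standard or of any library; the value is kept in a plain Python attribute only so that the sketch runs without a GPU.

```python
class DeviceScalar:
    """Hypothetical float-like scalar; a real library would keep the value on the device."""

    def __init__(self, value: float) -> None:
        # Stand-in for a device buffer (assumption: stored on the host to keep this runnable).
        self._value = value

    def __float__(self) -> float:
        # (b) Explicit conversion for use outside the library. In a GPU library
        # this is where a device-to-host transfer would take place.
        return float(self._value)

    def __gt__(self, other: "DeviceScalar | float") -> bool:
        # (a) Same comparison semantics as a builtin float.
        return self._value > _unwrap(other)

    def __lt__(self, other: "DeviceScalar | float") -> bool:
        return self._value < _unwrap(other)

    def __add__(self, other: "DeviceScalar | float") -> "DeviceScalar":
        # Arithmetic stays inside the library-specific type.
        return DeviceScalar(self._value + _unwrap(other))

    __radd__ = __add__


def _unwrap(x):
    # Hypothetical helper: accept either a DeviceScalar or a plain number.
    return x._value if isinstance(x, DeviceScalar) else x


mean = DeviceScalar(2.5)
print(mean > 2.0)               # True
print(float(mean + 1.0))        # 3.5
print(isinstance(mean, float))  # False
```

The last line shows the imperfection mentioned above: the object behaves like a `float` in comparisons, arithmetic and `float()` conversion, but it is not an instance of `float`.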

The following design rule applies everywhere builtin Python types are used
within this API standard: _where a Python builtin type is specified, an
implementation may always replace it by an equivalent library-specific type
that duck types with the Python builtin type._
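
One consequence of this rule for code that consumes such values is that it should rely on the `float` protocol rather than on the concrete type. The sketch below shows one possible way to do that using `typing.SupportsFloat` from the standard library; the function `exceeds` and the class `FloatLike` are hypothetical names used only for illustration.

```python
from typing import SupportsFloat


def exceeds(value: SupportsFloat, threshold: float) -> bool:
    """Hypothetical consumer that accepts any float-like scalar."""
    # Check for the __float__ protocol instead of isinstance(value, float),
    # so that library-specific scalar types are accepted as well.
    if not isinstance(value, SupportsFloat):
        raise TypeError(f"expected a float-like object, got {type(value).__name__}")
    return float(value) > threshold


class FloatLike:
    """Hypothetical stand-in for a library-specific scalar."""

    def __init__(self, value: float) -> None:
        self._value = value

    def __float__(self) -> float:
        return float(self._value)


print(exceeds(1.5, 1.0))             # True
print(exceeds(FloatLike(0.5), 1.0))  # False
```

Note that `float(value)` is the point at which a GPU library would have to synchronize with the CPU, so a consumer following this pattern would ideally convert as late as possible.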
Out of curiosity, cudf doesn't actually do this, right now, is that correct? (for example it does return numpy scalars for numeric types, i.e. what pandas does)
But cudf would like to do this? (I seem to remember discussions in the past about Scalar objects)
I thought it did. If not, maybe @kkraus14 or @shwina can suggest a better example for where cuDF uses scalars.
That said, numpy scalars are also an example of special objects that duck type; they're not Python builtin `float`, `int`, etc.