Add DataFrame.persist, and notes on execution model #307
Merged
Commits (12 in total; the diff below shows changes from 11 of them):
7a8dcf0 wip: add notes on execution model (MarcoGorelli)
0f4188b reword (MarcoGorelli)
1dd4678 remove column mentions for now (MarcoGorelli)
7c72dd2 Merge remote-tracking branch 'upstream/main' into may-execute (MarcoGorelli)
e4f47c7 remove to_array (MarcoGorelli)
b6b648b use persist instead (MarcoGorelli)
6d5a599 remove note on propagation (MarcoGorelli)
6cef569 update purpose and scope (MarcoGorelli)
3704a4b Merge remote-tracking branch 'upstream/main' into may-execute (MarcoGorelli)
4bf81c2 reduce execution_model (MarcoGorelli)
305a44b Update spec/API_specification/dataframe_api/dataframe_object.py (MarcoGorelli)
e0b7458 Update spec/purpose_and_scope.md (MarcoGorelli)
@@ -0,0 +1,49 @@
# Execution model

## Scope

The vast majority of the Dataframe API is designed to be agnostic of the
underlying execution model.

However, depending on the implementation, some methods may not be supported
in all cases.

For example, consider the following:
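```python
# `select_features` is an illustrative name; the original snippet was a bare fragment.
def select_features(df: DataFrame) -> list[str]:
    """Return the names of the columns whose standard deviation is positive."""
    features = []
    for column_name in df.column_names:
        # The `if` statement needs a real Python boolean here.
        if df.col(column_name).std() > 0:
            features.append(column_name)
    return features
```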
If `df` is a lazy dataframe, then the expression `df.col(column_name).std() > 0` returns
a (ducktyped) Python boolean scalar. No issues so far. The problem is what happens
when the statement `if df.col(column_name).std() > 0` is evaluated.

Under the hood, Python calls `(df.col(column_name).std() > 0).__bool__()` in
order to extract a Python boolean. This is a problem for "lazy" implementations,
because the laziness must be broken in order to evaluate it.

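The implicit `__bool__` call described above can be observed with plain Python objects. The following is a minimal sketch, not taken from any dataframe library: `Traced` and `LazyFlag` are hypothetical classes invented for illustration, the first counting truth-value requests and the second refusing them, as a lazy scalar might.

```python
class Traced:
    """Hypothetical wrapper that records each request for its truth value."""

    def __init__(self, value):
        self.value = value
        self.bool_calls = 0

    def __bool__(self):
        # Invoked implicitly by `if`, `while`, `not`, and `bool()`.
        self.bool_calls += 1
        return bool(self.value)


class LazyFlag:
    """Hypothetical stand-in for a lazy boolean scalar that refuses implicit evaluation."""

    def __bool__(self):
        raise TypeError("lazy scalar: materialize explicitly before branching")


traced = Traced(3.5 > 0)
if traced:
    pass  # reaching this branch required exactly one __bool__ call

try:
    if LazyFlag():
        pass
    raised = False
except TypeError:
    # The `if` statement itself triggered the error, not any explicit call.
    raised = True
```

In this sketch nothing calls `__bool__` by name; the `if` statement alone is enough to trigger it, which is why a lazy implementation has to decide what to do at that point.
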
Dask and Polars require that `.compute` and `.collect`, respectively, be called
before such an operation can be executed:
```python
In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: pandas_df = pd.DataFrame({"x": [1, 2, 3], "y": 1})

In [4]: df = dd.from_pandas(pandas_df, npartitions=2)

In [5]: scalar = df.x.std() > 0

In [6]: if scalar:
   ...:     print('scalar is positive')
   ...:
---------------------------------------------------------------------------
[...]

TypeError: Trying to convert dd.Scalar<gt-bbc3..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.
```

Whether such a computation succeeds or raises is currently not defined by the Standard and may vary across
implementations.
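To picture how the same loop can behave differently across implementations, here is a small sketch using only the standard library. Everything in it is an assumption made for illustration: `DeferredScalar`, its `compute` method, and the `is_positive` helper are hypothetical and are not part of the Standard or of any library mentioned above.

```python
import statistics
from typing import Callable, Union


class DeferredScalar:
    """Toy lazy scalar (hypothetical, not part of the Standard)."""

    def __init__(self, thunk: Callable[[], float]):
        self._thunk = thunk

    def __gt__(self, other: float) -> "DeferredScalar":
        # The comparison stays lazy: wrap it in a new deferred node.
        return DeferredScalar(lambda: self._thunk() > other)

    def __bool__(self) -> bool:
        raise TypeError("call .compute() before using this value in a condition")

    def compute(self):
        # Explicit materialization point (hypothetical name).
        return self._thunk()


def is_positive(flag: Union[bool, DeferredScalar]) -> bool:
    """Materialize explicitly when needed, then convert to a Python bool."""
    if isinstance(flag, DeferredScalar):
        flag = flag.compute()
    return bool(flag)


columns = {"x": [1, 2, 3], "y": [1, 1, 1]}

# Eager flavour: the standard deviation is a plain float, so `> 0` is a Python bool.
eager = [name for name, vals in columns.items() if statistics.stdev(vals) > 0]

# Lazy flavour: the condition is deferred and must be materialized before branching.
lazy = [
    name
    for name, vals in columns.items()
    if is_positive(DeferredScalar(lambda v=vals: statistics.stdev(v)) > 0)
]
```

Both comprehensions select the same columns, but only because the lazy variant materializes the condition explicitly; using the deferred comparison directly in an `if` would raise in this toy model.
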
@@ -8,3 +8,4 @@ Design topics & constraints
backwards_compatibility
data_interchange
python_builtin_types
execution_model
minor: not entirely accurate, since it's only a hint, so there is still no "when" prescribed.
How about saying instead: "(see Execution model for some caveats)" in order to keep things in one place?