
Commit 7be00b6

Add DataFrame.persist, and notes on execution model (#307)
* wip: add notes on execution model * reword * remove column mentions for now * remove to_array * use persist instead * remove note on propagation * update purpose and scope * reduce execution_model * Update spec/API_specification/dataframe_api/dataframe_object.py Co-authored-by: Ralf Gommers <[email protected]> * Update spec/purpose_and_scope.md --------- Co-authored-by: Ralf Gommers <[email protected]>
1 parent e310573 commit 7be00b6

4 files changed: +87 −1 lines changed

spec/API_specification/dataframe_api/dataframe_object.py

Lines changed: 35 additions & 0 deletions

```python
@@ -929,3 +929,38 @@ def join(
            present in both `self` and `other`.
        """
        ...

    def persist(self) -> Self:
        """Hint that computation prior to this point should not be repeated.

        This is intended as a hint, rather than as a directive. Implementations
        which do not separate lazy vs eager execution may ignore this method and
        treat it as a no-op.

        .. note::
            This method may trigger execution. If necessary, it should be called
            at most once per dataframe, and as late as possible in the pipeline.

        For example, do this:

        .. code-block:: python

            df: DataFrame
            df = df.persist()
            features = []
            for column_name in df.column_names:
                if df.col(column_name).std() > 0:
                    features.append(column_name)

        instead of this:

        .. code-block:: python

            df: DataFrame
            features = []
            for column_name in df.column_names:
                # Do NOT do this!
                if df.persist().col(column_name).std() > 0:
                    features.append(column_name)
        """
        ...
```
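As an illustration of why `persist` should be called once, as late as possible, here is a minimal sketch of a hypothetical lazy implementation. All names (`MockLazyFrame`, `MockColumn`) are invented for this example and are not part of the Standard; the point is only that an unpersisted frame may re-run its pipeline on every scalar extraction, while a persisted one computes it once.

```python
# Hypothetical mini lazy dataframe (names invented for illustration only).
# An unpersisted frame recomputes its pipeline each time a column statistic
# is materialized; persist() executes the pipeline once and caches it.

class MockLazyFrame:
    executions = 0  # counts how many times the pipeline is (re)computed

    def __init__(self, data, persisted=False):
        self._data = data            # {column_name: list of values}
        self._persisted = persisted

    @property
    def column_names(self):
        return list(self._data)

    def persist(self):
        # Execute the pending pipeline once and cache the result.
        MockLazyFrame.executions += 1
        return MockLazyFrame(self._data, persisted=True)

    def col(self, name):
        return MockColumn(self, self._data[name])


class MockColumn:
    def __init__(self, frame, values):
        self._frame = frame
        self._values = values

    def std(self):
        if not self._frame._persisted:
            # Materializing a scalar from an unpersisted frame
            # forces the whole pipeline to run again.
            MockLazyFrame.executions += 1
        mean = sum(self._values) / len(self._values)
        var = sum((v - mean) ** 2 for v in self._values) / (len(self._values) - 1)
        return var ** 0.5


df = MockLazyFrame({"x": [1, 2, 3], "y": [1, 1, 1]})

# Recommended pattern: persist once, then inspect columns.
df = df.persist()
features = [name for name in df.column_names if df.col(name).std() > 0]
print(features)                  # ['x']
print(MockLazyFrame.executions)  # 1: the pipeline executed exactly once
```

Had `persist` been called inside the loop instead, the counter would increase once per column, which is exactly the anti-pattern the docstring warns against.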

spec/design_topics/execution_model.md

Lines changed: 49 additions & 0 deletions (new file)

# Execution model

## Scope

The vast majority of the Dataframe API is designed to be agnostic of the
underlying execution model.

However, there are some methods which, depending on the implementation, may
not be supported in some cases.

For example, consider the following:
```python
df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features
```
If `df` is a lazy dataframe, then `df.col(column_name).std() > 0` returns
a (ducktyped) boolean scalar. No issues so far. The problem is:
what happens when `if df.col(column_name).std() > 0` is evaluated?

Under the hood, Python calls `(df.col(column_name).std() > 0).__bool__()` in
order to extract a Python boolean. This is a problem for "lazy" implementations,
as the laziness needs breaking in order to evaluate the above.

Dask and Polars both require that `.compute` (resp. `.collect`) be called beforehand
for such an operation to be executed:
```python
In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: pandas_df = pd.DataFrame({"x": [1, 2, 3], "y": 1})

In [4]: df = dd.from_pandas(pandas_df, npartitions=2)

In [5]: scalar = df.x.std() > 0

In [6]: if scalar:
   ...:     print('scalar is positive')
   ...:
---------------------------------------------------------------------------
[...]

TypeError: Trying to convert dd.Scalar<gt-bbc3..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.
```

Whether such computation succeeds or raises is currently not defined by the Standard and may vary across
implementations.
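The failure mode above can be reproduced without installing Dask, using a minimal stand-in for a lazy scalar. `MockLazyScalar` is invented for illustration and is not part of any real library; it only mimics the pattern of raising from `__bool__` and requiring an explicit `.compute()`.

```python
# Minimal stand-in for a lazy boolean scalar (invented for illustration).
# `if scalar:` fails because Python calls __bool__, which a lazy
# implementation cannot answer without triggering execution.

class MockLazyScalar:
    def __init__(self, compute):
        self._compute = compute  # deferred computation

    def __bool__(self):
        raise TypeError(
            "Trying to convert a lazy scalar to a boolean value. "
            "Call .compute() to force computation first."
        )

    def compute(self):
        return self._compute()


scalar = MockLazyScalar(lambda: 1.0 > 0)

try:
    if scalar:  # implicitly calls scalar.__bool__()
        print("positive")
except TypeError as exc:
    print(f"TypeError: {exc}")

if scalar.compute():  # explicit execution works
    print("scalar is positive")
```

This mirrors the Dask transcript above: the implicit boolean conversion raises, while forcing computation first succeeds.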

spec/design_topics/index.rst

Lines changed: 1 addition & 0 deletions

```diff
@@ -8,3 +8,4 @@ Design topics & constraints
    backwards_compatibility
    data_interchange
    python_builtin_types
+   execution_model
```

spec/purpose_and_scope.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -125,9 +125,10 @@ See the [use cases](use_cases.md) section for details on the exact use cases con
 Implementation details of the dataframes and execution of operations. This includes:

 - How data is represented and stored (whether the data is in memory, disk, distributed)
-- Expectations on when the execution is happening (in an eager or lazy way)
+- Expectations on when the execution is happening (in an eager or lazy way) (see `execution model` for some caveats)
 - Other execution details

 **Rationale:** The API defined in this document needs to be used by libraries as diverse as Ibis,
 Dask, Vaex or cuDF. The data can live in databases, distributed systems, disk or GPU memory.
 Any decision that involves assumptions on where the data is stored, or where execution happens
```
