
Commit 7a8dcf0

wip: add notes on execution model
1 parent c4ab5b4 commit 7a8dcf0

4 files changed: +126 −0 lines changed


spec/API_specification/dataframe_api/column_object.py

Lines changed: 5 additions & 0 deletions
```diff
@@ -802,6 +802,11 @@ def to_array(self) -> Any:
         may choose to return a numpy array (for numpy prior to 2.0), with the
         understanding that consuming libraries would then use the
         ``array-api-compat`` package to convert it to a Standard-compliant array.
+
+        Notes
+        -----
+        To be guaranteed to run across all implementations, :meth:`may_execute` should
+        be executed at some point before calling this method.
         """
         ...
```
spec/API_specification/dataframe_api/dataframe_object.py

Lines changed: 23 additions & 0 deletions
```diff
@@ -64,6 +64,11 @@ def dataframe(self) -> SupportsDataFrameAPI:
     def shape(self) -> tuple[int, int]:
         """
         Return number of rows and number of columns.
+
+        Notes
+        -----
+        To be guaranteed to run across all implementations, :meth:`may_execute` should
+        be executed at some point before calling this method.
         """
         ...

@@ -928,6 +933,9 @@ def to_array(self, dtype: DType) -> Any:
         may choose to return a numpy array (for numpy prior to 2.0), with the
         understanding that consuming libraries would then use the
         ``array-api-compat`` package to convert it to a Standard-compliant array.
+
+        To be guaranteed to run across all implementations, :meth:`may_execute` should
+        be executed at some point before calling this method.
         """

     def join(

@@ -972,3 +980,18 @@ def join(
         present in both `self` and `other`.
         """
         ...
+
+    def may_execute(self) -> Self:
+        """
+        Hint that execution may be triggered, depending on the implementation.
+
+        This is intended as a hint, rather than as a directive. Implementations
+        which do not separate lazy vs eager execution may ignore this method and
+        treat it as a no-op. Likewise for implementations which support automated
+        execution.
+
+        .. note::
+            This method may force execution. If necessary, it should be called
+            at most once per dataframe, and as late as possible in the pipeline.
+        """
+        ...
```
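The docstring above deliberately leaves the behaviour implementation-defined. As a hedged sketch (the class names `EagerFrame` and `LazyFrame` are illustrative only, not part of the Standard), an eager implementation might satisfy the contract with a no-op, while a lazy one might take the hint as permission to force computation:

```python
class EagerFrame:
    """Hypothetical eager implementation: nothing is deferred."""

    def __init__(self, data: dict[str, list[float]]):
        self._data = data

    def may_execute(self) -> "EagerFrame":
        # Nothing is deferred, so the hint can safely be ignored.
        return self


class LazyFrame:
    """Hypothetical lazy implementation: may_execute triggers computation."""

    def __init__(self, thunk):
        # ``thunk`` is a zero-argument callable producing the data.
        self._thunk = thunk
        self._materialized = None

    def may_execute(self) -> "LazyFrame":
        # The hint is taken as the point at which to force execution.
        if self._materialized is None:
            self._materialized = self._thunk()
        return self


eager = EagerFrame({"x": [1.0, 2.0]})
assert eager.may_execute() is eager  # no-op, same object returned

lazy = LazyFrame(lambda: {"x": [1.0, 2.0]})
lazy.may_execute()
assert lazy._materialized == {"x": [1.0, 2.0]}  # computation forced
```

Both shapes return `self`, so downstream code can be written identically against either implementation.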

spec/design_topics/execution_model.md

Lines changed: 97 additions & 0 deletions
# Execution model

The vast majority of the Dataframe API is designed to be agnostic of the
underlying execution model.

However, there are some methods which, depending on the implementation, may
not be supported in some cases.

For example, consider the following:
```python
def get_features(df: DataFrame) -> list[str]:
    features = []
    for column_name in df.column_names:
        if df.col(column_name).std() > 0:
            features.append(column_name)
    return features
```
If `df` is a lazy dataframe, then `df.col(column_name).std() > 0` returns
a (ducktyped) boolean scalar. No issues so far. The problem is:
what happens when `if df.col(column_name).std() > 0` is evaluated?

Under the hood, Python will call `(df.col(column_name).std() > 0).__bool__()` in
order to extract a Python boolean. This is a problem for "lazy" implementations,
as the laziness needs breaking in order to evaluate the expression.

Dask and Polars both require that `.compute` (resp. `.collect`) be called beforehand
for such an operation to be executed:
```python
In [1]: import pandas as pd; import dask.dataframe as dd

In [2]: pandas_df = pd.DataFrame({"x": [1, 2, 3], "y": 1})

In [3]: df = dd.from_pandas(pandas_df, npartitions=2)

In [4]: scalar = df.x.std() > 0

In [5]: if scalar:
   ...:     print('scalar is positive')
   ...:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 1
----> 1 if scalar:
      2     print('scalar is positive')

File ~/tmp/.venv/lib/python3.10/site-packages/dask/dataframe/core.py:312, in Scalar.__bool__(self)
    311 def __bool__(self):
--> 312     raise TypeError(
    313         f"Trying to convert {self} to a boolean value. Because Dask objects are "
    314         "lazily evaluated, they cannot be converted to a boolean value or used "
    315         "in boolean conditions like if statements. Try calling .compute() to "
    316         "force computation prior to converting to a boolean value or using in "
    317         "a conditional statement."
    318     )

TypeError: Trying to convert dd.Scalar<gt-bbc3..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.
```

Exactly which methods require computation may vary across implementations. Some may
trigger it implicitly under the hood for certain methods, whereas others require
the user to trigger it explicitly.

Therefore, the Dataframe API has a `DataFrame.may_execute` method. This is to be
interpreted as a hint, rather than as a directive: the implementation itself may decide
whether to force execution at this step, or whether to defer it to later.

Operations which require `DataFrame.may_execute` to have been called at some prior
point are:
- `DataFrame.to_array`
- `DataFrame.shape`
- `Column.to_array`
- calling `bool`, `int`, or `float` on a scalar

Therefore, the Standard-compliant way to write the code above is:
```python
def get_features(df: DataFrame) -> list[str]:
    df = df.may_execute()
    features = []
    for column_name in df.column_names:
        if df.col(column_name).std() > 0:
            features.append(column_name)
    return features
```

Note how `DataFrame.may_execute` is called only once, and as late as possible.
Conversely, the "wrong" way to write the above would be:

```python
def get_features(df: DataFrame) -> list[str]:
    features = []
    for column_name in df.column_names:
        # Do NOT do this!
        if df.may_execute().col(column_name).std() > 0:
            features.append(column_name)
    return features
```
as that may re-trigger the same execution multiple times.
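To make that cost concrete, here is a toy sketch (all names, including `CountingLazyFrame` and the `col_std_positive` helper, are hypothetical and not part of the Standard) in which every `may_execute` call forces execution; a counter shows the difference between hinting once up front and hinting inside the loop:

```python
class CountingLazyFrame:
    """Toy lazy frame where each may_execute call forces execution."""

    def __init__(self, columns: dict[str, list[float]]):
        self._columns = columns
        self.executions = 0  # counts how many times execution was forced

    @property
    def column_names(self) -> list[str]:
        return list(self._columns)

    def may_execute(self) -> "CountingLazyFrame":
        # This toy implementation forces execution on every call.
        self.executions += 1
        return self

    def col_std_positive(self, name: str) -> bool:
        # Stand-in for ``df.col(name).std() > 0`` in the Standard API.
        values = self._columns[name]
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) > 0


# Recommended: hint once, up front.
df = CountingLazyFrame({"a": [1.0, 2.0], "b": [3.0, 3.0]})
df.may_execute()
good = [n for n in df.column_names if df.col_std_positive(n)]

# Discouraged: hinting inside the loop forces execution once per column.
df2 = CountingLazyFrame({"a": [1.0, 2.0], "b": [3.0, 3.0]})
bad = [n for n in df2.column_names if df2.may_execute().col_std_positive(n)]

assert good == bad == ["a"]            # same result either way...
assert df.executions == 1              # ...but one execution here
assert df2.executions == len(df2.column_names)  # versus one per column
```

Real lazy implementations may cache intermediate results, so the actual penalty varies; the point is that the Standard only guarantees efficiency for the call-once-late pattern.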

spec/design_topics/index.rst

Lines changed: 1 addition & 0 deletions
```diff
@@ -8,3 +8,4 @@ Design topics & constraints
    backwards_compatibility
    data_interchange
    python_builtin_types
+   execution_model
```
