
Commit e287002

wip: add notes on execution model
1 parent c4ab5b4 commit e287002

4 files changed: 90 additions, 0 deletions

spec/API_specification/dataframe_api/column_object.py

Lines changed: 5 additions & 0 deletions
@@ -802,6 +802,11 @@ def to_array(self) -> Any:
802802
may choose to return a numpy array (for numpy prior to 2.0), with the
803803
understanding that consuming libraries would then use the
804804
``array-api-compat`` package to convert it to a Standard-compliant array.
805+
806+
Notes
807+
-----
808+
To be guaranteed to run across all implementations, :meth:`may_execute` should
809+
be executed at some point before calling this method.
805810
"""
806811
...
807812

spec/API_specification/dataframe_api/dataframe_object.py

Lines changed: 23 additions & 0 deletions
@@ -64,6 +64,11 @@ def dataframe(self) -> SupportsDataFrameAPI:
6464
def shape(self) -> tuple[int, int]:
6565
"""
6666
Return number of rows and number of columns.
67+
68+
Notes
69+
-----
70+
To be guaranteed to run across all implementations, :meth:`may_execute` should
71+
be executed at some point before calling this method.
6772
"""
6873
...
6974

@@ -928,6 +933,9 @@ def to_array(self, dtype: DType) -> Any:
928933
may choose to return a numpy array (for numpy prior to 2.0), with the
929934
understanding that consuming libraries would then use the
930935
``array-api-compat`` package to convert it to a Standard-compliant array.
936+
937+
To be guaranteed to run across all implementations, :meth:`may_execute` should
938+
be executed at some point before calling this method.
931939
"""
932940

933941
def join(
@@ -972,3 +980,18 @@ def join(
972980
present in both `self` and `other`.
973981
"""
974982
...
983+
984+
def may_execute(self) -> Self:
985+
"""
986+
Hint that execution may be triggered, depending on the implementation.
987+
988+
This is intended as a hint, rather than as a directive. Implementations
989+
which do not separate lazy vs eager execution may ignore this method and
990+
treat it as a no-op. Likewise for implementations which support automated
991+
execution.
992+
993+
.. note::
994+
This method may force execution. If necessary, it should be called
995+
at most once per dataframe, and as late as possible in the pipeline.
996+
"""
997+
...
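
As a rough illustration of the "no-op" case described in the docstring above, an eager implementation could simply return itself. The class name `ToyEagerFrame` and its constructor are hypothetical and not part of the specification; this is only a sketch of one way to satisfy the hint.

```python
from __future__ import annotations


class ToyEagerFrame:
    """Hypothetical eager dataframe: its data is always materialised."""

    def __init__(self, data: dict[str, list[float]]) -> None:
        self._data = data

    def may_execute(self) -> ToyEagerFrame:
        # Nothing is deferred in an eager backend, so the hint is ignored
        # and the same object is returned unchanged.
        return self
```

With such a backend, calling `may_execute` costs nothing, so code written for lazy backends stays portable.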

spec/design_topics/execution_model.md

Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,61 @@
1+
# Execution model
2+
3+
The vast majority of the Dataframe API is designed to be agnostic of the
4+
underlying execution model.
5+
6+
However, there are some methods which may be problematic, depending on the
7+
implementation and its execution backend.
8+
9+
For example, let's consider the following:
10+
```python
11+
df: DataFrame
12+
features = []
13+
for column_name in df.column_names:
14+
if df.col(column_name).std() > 0:
15+
features.append(column_name)
16+
return features
17+
```
18+
The call `df.col(column_name).std()` returns a (ducktyped) Python scalar, which
19+
may stay lazy. The problem is: what is `df.col(column_name).std() > 0` meant to
20+
evaluate to?
21+
22+
The way to trigger execution varies across dataframe implementations. Therefore,
23+
the Dataframe API has a DataFrame method `may_execute`, which serves as a hint
24+
that triggering execution will be required at some later point.
25+
26+
Operations which require `DataFrame.may_execute` to have been called at some prior
27+
point are:
28+
- `DataFrame.to_array`
29+
- `Column.to_array`
30+
- calling `bool`, `int`, `float` on a scalar
31+
32+
Returning to the example above, the line
33+
```python
34+
if df.col(column_name).std() > 0:
35+
```
36+
will implicitly call `__bool__` on the return value of `df.col(column_name).std()`.
37+
Therefore, the way to guarantee that such a call will run without errors across
38+
all standard-compliant implementations is:
39+
```python
40+
df: DataFrame
41+
df = df.may_execute()
42+
features = []
43+
for column_name in df.column_names:
44+
if df.col(column_name).std() > 0:
45+
features.append(column_name)
46+
return features
47+
```
48+
49+
Note how `DataFrame.may_execute` is called only once, and as late as possible.
50+
Conversely, the "wrong" way to execute the above would be:
51+
52+
```python
53+
df: DataFrame
54+
features = []
55+
for column_name in df.column_names:
56+
# Do NOT do this!
57+
if df.may_execute().col(column_name).std() > 0:
58+
features.append(column_name)
59+
return features
60+
```
61+
as that will potentially re-trigger the same execution multiple times.
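
To make the cost difference concrete, here is a minimal sketch of a toy lazy backend. All names (`ToyLazyFrame`, `ToyLazyColumn`, `ToyLazyScalar`, `select_features`) are hypothetical. The sketch assumes that `may_execute` performs one simulated execution per call and that `__bool__` raises if the hint was never given; a real implementation may behave differently.

```python
from __future__ import annotations

import statistics
from typing import Any, Callable


class ToyLazyScalar:
    """Hypothetical scalar that defers its computation."""

    def __init__(self, compute: Callable[[], Any], executed: bool) -> None:
        self._compute = compute
        self._executed = executed

    def __gt__(self, other: float) -> ToyLazyScalar:
        # The comparison stays lazy: it only builds another deferred scalar.
        return ToyLazyScalar(lambda: self._compute() > other, self._executed)

    def __bool__(self) -> bool:
        # Assumption: this toy backend refuses to materialise without the hint.
        if not self._executed:
            raise RuntimeError("call may_execute() before converting to bool")
        return bool(self._compute())


class ToyLazyColumn:
    def __init__(self, values: list[float], executed: bool) -> None:
        self._values = values
        self._executed = executed

    def std(self) -> ToyLazyScalar:
        return ToyLazyScalar(lambda: statistics.stdev(self._values), self._executed)


class ToyLazyFrame:
    executions = 0  # counts simulated (potentially expensive) executions

    def __init__(self, data: dict[str, list[float]], executed: bool = False) -> None:
        self._data = data
        self._executed = executed

    @property
    def column_names(self) -> list[str]:
        return list(self._data)

    def col(self, name: str) -> ToyLazyColumn:
        return ToyLazyColumn(self._data[name], self._executed)

    def may_execute(self) -> ToyLazyFrame:
        ToyLazyFrame.executions += 1
        return ToyLazyFrame(self._data, executed=True)


def select_features(df: ToyLazyFrame) -> list[str]:
    df = df.may_execute()  # called once, as late as possible
    return [name for name in df.column_names if df.col(name).std() > 0]
```

In this toy model, `select_features` triggers a single simulated execution regardless of the number of columns, whereas calling `may_execute` inside the loop would trigger one per column.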

spec/design_topics/index.rst

Lines changed: 1 addition & 0 deletions
@@ -8,3 +8,4 @@ Design topics & constraints
88
backwards_compatibility
99
data_interchange
1010
python_builtin_types
11+
execution_model
