
Add DataFrame.persist, and notes on execution model #307


Merged · 12 commits · Nov 10, 2023
23 changes: 23 additions & 0 deletions spec/API_specification/dataframe_api/dataframe_object.py
@@ -64,6 +64,11 @@ def dataframe(self) -> SupportsDataFrameAPI:
def shape(self) -> tuple[int, int]:
"""
Return number of rows and number of columns.

Notes
-----
To be guaranteed to run across all implementations, :meth:`maybe_execute` should
be called at some point before calling this method.
Contributor:

Does this mean that all operations are potentially (e.g. in a polars-based implementation) eager after a call to maybe_execute?

Contributor Author:

That's right, though `Column` would still be backed by an expression (which is lazy); the parent dataframe would be eager. You can try this out with https://github.com/data-apis/dataframe-api-compat

"""
...

@@ -928,6 +933,9 @@ def to_array(self, dtype: DType) -> Any:
may choose to return a numpy array (for numpy prior to 2.0), with the
understanding that consuming libraries would then use the
``array-api-compat`` package to convert it to a Standard-compliant array.

To be guaranteed to run across all implementations, :meth:`maybe_execute` should
be called at some point before calling this method.
"""

def join(
@@ -972,3 +980,18 @@ def join(
present in both `self` and `other`.
"""
...

def maybe_execute(self) -> Self:
"""
Hint that execution may be triggered, depending on the implementation.

This is intended as a hint, rather than as a directive. Implementations
which do not separate lazy vs eager execution may ignore this method and
treat it as a no-op. Likewise for implementations which support automated
execution.

.. note::
This method may force execution. If necessary, it should be called
at most once per dataframe, and as late as possible in the pipeline.
Contributor:

Why "at most once" rather than "as few times as possible"?

Contributor Author:

if you're using it multiple times, then you're potentially re-executing things

Contributor:

Sure, but there are reasonable cases where that would be what you want, are there not? For example, you may want to collect a dataframe, filter it further in a lazy manner, and then collect it again.

Contributor Author:

sure but why would you collect it before filtering?

Contributor:

It is a bit of a constructed example, but maybe you want to do computations on the entire dataframe and also on some subset of it. It would make sense to collect just prior to the first computation on the entire frame so that whatever came before it doesn't have to be recomputed when doing the computation on the subset.

Contributor Author:

something like

```python
df: DataFrame
df = df.persist()
sub_df_1 = df.filter(df.col('a') > 0)
sub_df_2 = df.filter(df.col('a') <= 0)
features_1 = []
for column_name in sub_df_1.column_names:
    if sub_df_1.col(column_name).std() > 0:
        features_1.append(column_name)
features_2 = []
for column_name in sub_df_2.column_names:
    if sub_df_2.col(column_name).std() > 0:
        features_2.append(column_name)
```

?

You'd still just be calling it once per dataframe - could you show an example of where you'd want to call it twice for the same dataframe?

"""
...
83 changes: 83 additions & 0 deletions spec/design_topics/execution_model.md
@@ -0,0 +1,83 @@
# Execution model

The vast majority of the Dataframe API is designed to be agnostic of the
underlying execution model.

However, there are some methods which, depending on the implementation, may
not be supported.

For example, let's consider the following:
```python
df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features
```
If `df` is a lazy dataframe, then `df.col(column_name).std() > 0` returns
a (ducktyped) boolean scalar. No issues so far. The problem is:
what happens when `if df.col(column_name).std() > 0` is evaluated?

Under the hood, Python calls `(df.col(column_name).std() > 0).__bool__()` in
order to extract a Python boolean. This is a problem for lazy implementations,
as laziness must be broken in order to evaluate the expression.
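To make the failure mode concrete, here is a minimal, hypothetical sketch (the `LazyScalar` class is invented purely for illustration and is not part of any real implementation) of why a lazy scalar cannot support `__bool__` without either executing or raising:

```python
class LazyScalar:
    """Toy stand-in for a lazy implementation's scalar."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable producing the value

    def __bool__(self):
        # A strict lazy implementation raises here, mirroring Dask's error.
        raise TypeError(
            "Cannot convert a lazy scalar to bool; call .compute() first."
        )

    def compute(self):
        # Materialize the deferred computation.
        return self._compute()


scalar = LazyScalar(lambda: 2.5 > 0)
try:
    if scalar:  # implicitly calls __bool__
        pass
except TypeError as exc:
    print(exc)
print(scalar.compute())  # True
```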

Dask and Polars both require that `.compute` (resp. `.collect`) be called beforehand
for such an operation to be executed:
```python
In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: pandas_df = pd.DataFrame({"x": [1, 2, 3], "y": 1})

In [4]: df = dd.from_pandas(pandas_df, npartitions=2)

In [5]: scalar = df.x.std() > 0

In [6]: if scalar:
   ...:     print('scalar is positive')
   ...:
---------------------------------------------------------------------------
[...]

TypeError: Trying to convert dd.Scalar<gt-bbc3..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.
```

The Dataframe API has a `DataFrame.maybe_execute` method for addressing the above. We can
use it to rewrite the code above as follows:
```python
df: DataFrame
df = df.maybe_execute()
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features
```

Note that `maybe_execute` is to be interpreted as a hint, rather than as a directive -
the implementation itself may decide
whether to force execution at this step, or whether to defer it until later.
For example, a dataframe which can convert to a lazy array could decide to ignore
`maybe_execute` when evaluating `DataFrame.to_array` but to respect it when evaluating
`float(Column.std())`.
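As a rough sketch of the two extremes the hint permits (both classes below are hypothetical, invented only for illustration): an eager backend may treat `maybe_execute` as a no-op, while a lazy backend may use it as a materialization point:

```python
class EagerFrame:
    """Toy eager backend: data is already materialized."""

    def __init__(self, data):
        self.data = data

    def maybe_execute(self):
        return self  # nothing to do: the hint is a no-op


class ToyLazyFrame:
    """Toy lazy backend: wraps a deferred computation."""

    def __init__(self, thunk):
        self._thunk = thunk  # zero-argument callable producing the data

    def maybe_execute(self):
        # This backend chooses to honour the hint and execute now.
        return EagerFrame(self._thunk())


lazy = ToyLazyFrame(lambda: {"a": [1, 2, 3]})
materialized = lazy.maybe_execute()
print(materialized.data)  # {'a': [1, 2, 3]}
```

Either behaviour (and anything in between, such as marking the frame for later collection) is Standard-compliant, which is exactly why callers must not rely on execution happening at this point.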
Contributor:

So, maybe_evaluate may do:

  • Nothing at all
  • Nothing at this point but allows later collections when the backend thinks it is expedient
  • An immediate collection

What happens to subsequent calls to df assuming that a collection did take place? Are they eager or lazy?

```python
df: DataFrame
column_name: str
df = df.maybe_execute()
col = df.col(column_name)
filtered_col = col.filter(col > 42)  # is this computation eager now?
filtered_col.std()
```

Contributor Author (@MarcoGorelli, Nov 6, 2023):

is this computation eager now?

It's implementation-dependent. It can stay lazy

What really matters is when you do

bool(filtered_col.std())

(which you might trigger via `if filtered_col.std() > 0`) - at that point:

  • if maybe_execute wasn't called previously, this is unsupported by the standard and may vary across implementations
  • if maybe_execute was called, then libraries supporting eager evaluation should return a result

Contributor:

When would you want the first option rather than an implicit default for the latter behavior? It seems rather obvious that bool(filtered_col.std()) needs to materialize something so it is hardly a surprise to the user at this point. Sure, the user may want to strategically place a maybe_execute earlier for performance reasons, but why introduce undefined behavior if they don't?

Contributor Author:

you don't need to introduce undefined behaviour, I just mean that it's undefined by the Dataframe API - the Standard makes no guarantee of what will happen there


Operations which require `DataFrame.maybe_execute` to have been called at some prior
point are:
- `DataFrame.to_array`
- `DataFrame.shape`
- calling `bool`, `int`, or `float` on a scalar

Note how `DataFrame.maybe_execute` is called only once, and as late as possible.
Conversely, the "wrong" way to execute the above would be:

```python
df: DataFrame
features = []
for column_name in df.column_names:
    # Do NOT do this!
    if df.maybe_execute().col(column_name).std() > 0:
        features.append(column_name)
return features
```
as that will potentially re-trigger the same execution multiple times.
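A small hypothetical sketch of that cost (the counting class is invented for illustration): if a backend honours the hint by executing, a per-iteration `maybe_execute` re-runs the pipeline once per iteration, whereas a single up-front call would run it once:

```python
class CountingLazyFrame:
    """Toy lazy backend that counts how often it runs its pipeline."""

    executions = 0

    def __init__(self, column_names):
        self.column_names = column_names

    def maybe_execute(self):
        # This backend honours the hint by (re-)running the whole pipeline.
        CountingLazyFrame.executions += 1
        return self


df = CountingLazyFrame(["a", "b", "c"])

# "Wrong" pattern: the hint inside the loop re-triggers execution each time.
for _ in df.column_names:
    df.maybe_execute()

print(CountingLazyFrame.executions)  # 3 executions instead of 1
```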
Contributor:

But there are no guarantees here, are there? Given that maybe_execute still allows for deferred execution the backend may still re-trigger the same execution multiple times.

Contributor Author:

yes you're right, that's an issue with the suggestion I put in #307 (comment)

not sure what to suggest, will think about this

Contributor Author:

Looks like there's two cases that really need addressing:

  • bool(scalar) requires computation in all cases
  • to_array only requires computation in some cases

1 change: 1 addition & 0 deletions spec/design_topics/index.rst
@@ -8,3 +8,4 @@ Design topics & constraints
backwards_compatibility
data_interchange
python_builtin_types
execution_model