wip: add notes on execution model
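
For illustration only, here is a minimal sketch of how a lazy implementation
might honour the hint described in the new design topic. `ToyFrame`,
`ToyColumn` and `LazyScalar` are hypothetical stand-ins invented for this
example and are not part of the Standard; only the method names
`column_names`, `col`, `std` and `may_execute` are taken from the spec.

```python
from __future__ import annotations

import statistics
from typing import Callable


class LazyScalar:
    """Hypothetical scalar that only materialises once execution is allowed."""

    def __init__(self, compute: Callable[[], float], can_execute: bool) -> None:
        self._compute = compute
        self._can_execute = can_execute

    def __gt__(self, other: float) -> LazyScalar:
        # The comparison stays lazy: nothing is computed here.
        return LazyScalar(lambda: self._compute() > other, self._can_execute)

    def __bool__(self) -> bool:
        # `if scalar:` lands here and needs a concrete value.
        if not self._can_execute:
            raise RuntimeError("call may_execute() on the dataframe first")
        return bool(self._compute())


class ToyColumn:
    def __init__(self, values: list[float], can_execute: bool) -> None:
        self._values = values
        self._can_execute = can_execute

    def std(self) -> LazyScalar:
        return LazyScalar(lambda: statistics.stdev(self._values), self._can_execute)


class ToyFrame:
    def __init__(self, data: dict[str, list[float]], can_execute: bool = False) -> None:
        self._data = data
        self._can_execute = can_execute

    @property
    def column_names(self) -> list[str]:
        return list(self._data)

    def col(self, name: str) -> ToyColumn:
        return ToyColumn(self._data[name], self._can_execute)

    def may_execute(self) -> ToyFrame:
        # In this toy model the hint simply unlocks materialisation.
        return ToyFrame(self._data, can_execute=True)


def select_features(df: ToyFrame) -> list[str]:
    df = df.may_execute()  # called once, as late as possible
    return [name for name in df.column_names if df.col(name).std() > 0]
```

In this toy model, `bool(df.col("a").std() > 0)` raises unless `may_execute`
was called on the dataframe beforehand. An eager implementation could instead
make `may_execute` a no-op and always allow materialisation.
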

MarcoGorelli · MarcoGorelli · commit e287002c1100 · 2023-10-31T11:36:11.000Z
diff --git a/spec/API_specification/dataframe_api/column_object.py b/spec/API_specification/dataframe_api/column_object.py
@@ -802,6 +802,11 @@ def to_array(self) -> Any:
         may choose to return a numpy array (for numpy prior to 2.0), with the
         understanding that consuming libraries would then use the
         ``array-api-compat`` package to convert it to a Standard-compliant array.
+
+        Notes
+        -----
+        To be guaranteed to run across all implementations,
+        :meth:`DataFrame.may_execute` should be called at some point before this method.
         """
         ...
 
diff --git a/spec/API_specification/dataframe_api/dataframe_object.py b/spec/API_specification/dataframe_api/dataframe_object.py
@@ -64,6 +64,11 @@ def dataframe(self) -> SupportsDataFrameAPI:
     def shape(self) -> tuple[int, int]:
         """
         Return number of rows and number of columns.
+
+        Notes
+        -----
+        To be guaranteed to run across all implementations, :meth:`may_execute` should
+        be called at some point before calling this method.
         """
         ...
 
@@ -928,6 +933,9 @@ def to_array(self, dtype: DType) -> Any:
         may choose to return a numpy array (for numpy prior to 2.0), with the
         understanding that consuming libraries would then use the
         ``array-api-compat`` package to convert it to a Standard-compliant array.
+
+        To be guaranteed to run across all implementations, :meth:`may_execute` should
+        be called at some point before calling this method.
         """
     
     def join(
@@ -972,3 +980,18 @@ def join(
             present in both `self` and `other`.
         """
         ...
+    
+    def may_execute(self) -> Self:
+        """
+        Hint that execution may be triggered, depending on the implementation.
+
+        This is intended as a hint, rather than as a directive. Implementations
+        which do not separate lazy vs eager execution may ignore this method and
+        treat it as a no-op. Likewise for implementations which support automated
+        execution.
+
+        .. note::
+            This method may force execution. If necessary, it should be called
+            at most once per dataframe, and as late as possible in the pipeline.
+        """
+        ...
diff --git a/spec/design_topics/execution_model.md b/spec/design_topics/execution_model.md
@@ -0,0 +1,61 @@
+# Execution model
+
+The vast majority of the Dataframe API is designed to be agnostic of the
+underlying execution model.
+
+However, some methods may be problematic, depending on the execution model of
+the underlying implementation.
+
+For example, let's consider the following:
+```python
+def select_features(df: DataFrame) -> list[str]:
+    features = []
+    for column_name in df.column_names:
+        if df.col(column_name).std() > 0:
+            features.append(column_name)
+    return features
+```
+The call `df.col(column_name).std()` returns a (ducktyped) Python scalar, which
+may stay lazy. The problem is that the `if` statement needs a concrete boolean:
+what is `df.col(column_name).std() > 0` meant to evaluate to?
+
+The way to trigger execution varies across dataframe implementations. Therefore,
+the Dataframe API has a DataFrame method `may_execute`, which serves as a hint
+that triggering execution will be required at some later point.
+
+Operations which require `DataFrame.may_execute` to have been called beforehand are:
+- `DataFrame.shape`
+- `DataFrame.to_array`
+- `Column.to_array`
+- calling `bool`, `int`, or `float` on a scalar
+
+Returning to the example above, the line
+```python
+if df.col(column_name).std() > 0:
+```
+will implicitly call `__bool__` on the result of `df.col(column_name).std() > 0`.
+Therefore, the way to guarantee that such a call will run without errors across
+all standard-compliant implementations is:
+```python
+def select_features(df: DataFrame) -> list[str]:
+    df = df.may_execute()
+    features = []
+    for column_name in df.column_names:
+        if df.col(column_name).std() > 0:
+            features.append(column_name)
+    return features
+```
+
+Note how `DataFrame.may_execute` is called only once, and as late as possible.
+Conversely, the "wrong" way to execute the above would be:
+
+```python
+def select_features(df: DataFrame) -> list[str]:
+    features = []
+    for column_name in df.column_names:
+        # Do NOT do this!
+        if df.may_execute().col(column_name).std() > 0:
+            features.append(column_name)
+    return features
+```
+as that will potentially re-trigger the same execution multiple times.
diff --git a/spec/design_topics/index.rst b/spec/design_topics/index.rst
@@ -8,3 +8,4 @@ Design topics & constraints
    backwards_compatibility
    data_interchange
    python_builtin_types
+   execution_model