wip: add notes on execution model

MarcoGorelli · MarcoGorelli · commit 7a8dcf0d5680 · 2023-10-31T15:43:00.000Z
diff --git a/spec/API_specification/dataframe_api/column_object.py b/spec/API_specification/dataframe_api/column_object.py
@@ -802,6 +802,11 @@ def to_array(self) -> Any:
         may choose to return a numpy array (for numpy prior to 2.0), with the
         understanding that consuming libraries would then use the
         ``array-api-compat`` package to convert it to a Standard-compliant array.
+
+        Notes
+        -----
+        To be guaranteed to run across all implementations, call
+        :meth:`DataFrame.may_execute` at some point before calling this method.
         """
         ...
 
diff --git a/spec/API_specification/dataframe_api/dataframe_object.py b/spec/API_specification/dataframe_api/dataframe_object.py
@@ -64,6 +64,11 @@ def dataframe(self) -> SupportsDataFrameAPI:
     def shape(self) -> tuple[int, int]:
         """
         Return number of rows and number of columns.
+
+        Notes
+        -----
+        To be guaranteed to run across all implementations, :meth:`may_execute` should
+        be executed at some point before calling this method.
         """
         ...
 
@@ -928,6 +933,9 @@ def to_array(self, dtype: DType) -> Any:
         may choose to return a numpy array (for numpy prior to 2.0), with the
         understanding that consuming libraries would then use the
         ``array-api-compat`` package to convert it to a Standard-compliant array.
+
+        To be guaranteed to run across all implementations, :meth:`may_execute` should
+        be executed at some point before calling this method.
         """
     
     def join(
@@ -972,3 +980,18 @@ def join(
             present in both `self` and `other`.
         """
         ...
+    
+    def may_execute(self) -> Self:
+        """
+        Hint that execution may be triggered, depending on the implementation.
+
+        This is intended as a hint, rather than as a directive. Implementations
+        which do not separate lazy vs eager execution may ignore this method and
+        treat it as a no-op. Likewise for implementations which support automated
+        execution.
+
+        .. note::
+            This method may force execution. If necessary, it should be called
+            at most once per dataframe, and as late as possible in the pipeline.
+        """
+        ...
diff --git a/spec/design_topics/execution_model.md b/spec/design_topics/execution_model.md
@@ -0,0 +1,97 @@
+# Execution model
+
+The vast majority of the Dataframe API is designed to be agnostic of the
+underlying execution model.
+
+However, some operations need concrete values, so whether they are supported
+may depend on the implementation and on whether execution has been triggered.
+
+For example, let's consider the following:
+```python
+def select_features(df: DataFrame) -> list[str]:
+    features = []
+    for column_name in df.column_names:
+        if df.col(column_name).std() > 0:
+            features.append(column_name)
+    return features
+```
+If `df` is a lazy dataframe, then the call `df.col(column_name).std() > 0` returns
+a (ducktyped) Python boolean scalar. No issues so far. The problem arises
+when `if df.col(column_name).std() > 0` is evaluated.
+
+Under the hood, Python will call `(df.col(column_name).std() > 0).__bool__()` in
+order to extract a Python boolean. This is a problem for "lazy" implementations,
+as the pending computation must be executed in order to evaluate the above.
+
+Dask and Polars both require that `.compute` (resp. `.collect`) be called beforehand
+for such an operation to be executed:
+  ```python
+  In [1]: import dask.dataframe as dd; import pandas as pd
+  
+  In [2]: pandas_df = pd.DataFrame({"x": [1, 2, 3], "y": 1})
+  
+  In [3]: df = dd.from_pandas(pandas_df, npartitions=2)
+  
+  In [4]: scalar = df.x.std() > 0
+  
+  In [5]: if scalar:
+     ...:     print('scalar is positive')
+     ...:
+  ---------------------------------------------------------------------------
+  TypeError                                 Traceback (most recent call last)
+  Cell In[5], line 1
+  ----> 1 if scalar:
+        2     print('scalar is positive')
+  
+  File ~/tmp/.venv/lib/python3.10/site-packages/dask/dataframe/core.py:312, in Scalar.__bool__(self)
+      311 def __bool__(self):
+  --> 312     raise TypeError(
+      313         f"Trying to convert {self} to a boolean value. Because Dask objects are "
+      314         "lazily evaluated, they cannot be converted to a boolean value or used "
+      315         "in boolean conditions like if statements. Try calling .compute() to "
+      316         "force computation prior to converting to a boolean value or using in "
+      317         "a conditional statement."
+      318     )
+  
+  TypeError: Trying to convert dd.Scalar<gt-bbc3..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.
+  ```
+
+Exactly which methods require computation may vary across implementations. Some may
+trigger it implicitly under the hood for certain methods, whereas others require
+the user to trigger it explicitly.
+
+Therefore, the Dataframe API has a `DataFrame.may_execute` method. This is to be
+interpreted as a hint, rather than as a directive: the implementation itself may decide
+whether to force execution at this step, or whether to defer it to later.
+
+Operations which require `DataFrame.may_execute` to have been called at some prior
+point are:
+- `DataFrame.to_array`
+- `DataFrame.shape`
+- `Column.to_array`
+- calling `bool`, `int`, or `float` on a scalar
+
+Therefore, the Standard-compliant way to write the code above is:
+```python
+def select_features(df: DataFrame) -> list[str]:
+    df = df.may_execute()
+    features = []
+    for column_name in df.column_names:
+        if df.col(column_name).std() > 0:
+            features.append(column_name)
+    return features
+```
+
+Note how `DataFrame.may_execute` is called only once, and as late as possible.
+Conversely, the "wrong" way to write the above would be:
+
+```python
+def select_features(df: DataFrame) -> list[str]:
+    features = []
+    for column_name in df.column_names:
+        # Do NOT do this!
+        if df.may_execute().col(column_name).std() > 0:
+            features.append(column_name)
+    return features
+```
+as that will potentially re-trigger the same execution multiple times.
diff --git a/spec/design_topics/index.rst b/spec/design_topics/index.rst
@@ -8,3 +8,4 @@ Design topics & constraints
    backwards_compatibility
    data_interchange
    python_builtin_types
+   execution_model
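
The rule stated in `spec/design_topics/execution_model.md` above (call `may_execute` once, and as late as possible) can be illustrated with a toy model. This is only a minimal sketch: `ToyLazyFrame`, `ToyColumn`, `ToyScalar`, `select_features`, and the `log` list are hypothetical names invented for this example and are not part of the Standard or of any library. The toy's `may_execute` always executes, whereas a real implementation may treat it as a no-op.

```python
from __future__ import annotations

import statistics
from dataclasses import dataclass, field
from typing import Callable


class ToyScalar:
    """Hypothetical lazy scalar: refuses bool() until the frame was executed."""

    def __init__(self, thunk: Callable[[], float], executed: bool) -> None:
        self._thunk = thunk
        self._executed = executed

    def __gt__(self, other: float) -> ToyScalar:
        # Comparison stays lazy: it only builds another deferred scalar.
        return ToyScalar(lambda: self._thunk() > other, self._executed)

    def __bool__(self) -> bool:
        if not self._executed:
            raise TypeError("lazy scalar: call may_execute() on the dataframe first")
        return bool(self._thunk())


@dataclass
class ToyColumn:
    values: list[float]
    executed: bool

    def std(self) -> ToyScalar:
        return ToyScalar(lambda: statistics.stdev(self.values), self.executed)


@dataclass
class ToyLazyFrame:
    """Hypothetical lazy dataframe, for illustration only."""

    data: dict[str, list[float]]
    executed: bool = False
    log: list[str] = field(default_factory=list)  # one entry per execution

    @property
    def column_names(self) -> list[str]:
        return list(self.data)

    def may_execute(self) -> ToyLazyFrame:
        # Assumption of this toy: the hint always triggers execution.
        self.log.append("execute")
        return ToyLazyFrame(self.data, executed=True, log=self.log)

    def col(self, name: str) -> ToyColumn:
        return ToyColumn(self.data[name], self.executed)


def select_features(df: ToyLazyFrame) -> list[str]:
    df = df.may_execute()  # called once, right before values are needed
    return [name for name in df.column_names if df.col(name).std() > 0]
```

In this toy, `select_features` records a single execution in `log` however many columns there are, whereas calling `may_execute()` inside the loop would record one execution per column.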