|
| 1 | +# Execution model |
| 2 | + |
| 3 | +The vast majority of the Dataframe API is designed to be agnostic of the |
| 4 | +underlying execution model. |
| 5 | + |
| 6 | +However, some methods need concrete values and, depending on the implementation, |
| 7 | +may not be supported until execution has been triggered. |
| 8 | + |
| 9 | +For example, let's consider the following: |
| 10 | +```python |
| 11 | +def select_features(df: DataFrame) -> list[str]: |
| 12 | +    features = [] |
| 13 | +    for column_name in df.column_names: |
| 14 | +        if df.col(column_name).std() > 0: |
| 15 | +            features.append(column_name) |
| 16 | +    return features |
| 17 | +``` |
| 18 | +If `df` is a lazy dataframe, then the call `df.col(column_name).std() > 0` returns |
| 19 | +a (ducktyped) Python boolean scalar. So far, there are no issues. The problem arises |
| 20 | +when the statement `if df.col(column_name).std() > 0` is evaluated. |
| 21 | + |
| 22 | +Under the hood, Python will call `(df.col(column_name).std() > 0).__bool__()` in |
| 23 | +order to extract a Python boolean. This is a problem for "lazy" implementations, |
| 24 | +because the deferred computation must actually be run in order to evaluate the above. |
| 25 | + |
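The mechanism can be reproduced without any dataframe library. The sketch below uses a hypothetical `LazyScalar` class (an assumption for illustration, not part of the Standard or of any real library) that defers its computation and refuses implicit conversion to `bool`:

```python
from typing import Any, Callable


class LazyScalar:
    """Hypothetical stand-in for a lazy scalar; not a real library class."""

    def __init__(self, thunk: Callable[[], Any]) -> None:
        # Store the deferred computation without running it.
        self._thunk = thunk

    def __gt__(self, other: Any) -> "LazyScalar":
        # Comparisons stay lazy: wrap the comparison in a new deferred computation.
        return LazyScalar(lambda: self._thunk() > other)

    def __bool__(self) -> bool:
        # `if scalar:` ends up here, and a lazy object has no value to hand back yet.
        raise TypeError("lazy scalar cannot be converted to bool; compute it first")

    def compute(self) -> Any:
        # Explicitly break laziness and return the concrete value.
        return self._thunk()


std = LazyScalar(lambda: 0.5)  # pretend this is a column's standard deviation
condition = std > 0            # still lazy, nothing has been computed

try:
    if condition:              # Python calls condition.__bool__() here
        outcome = "positive"
except TypeError:
    outcome = "raised TypeError"

computed = bool(condition.compute())  # explicit computation succeeds
```

Building the comparison is harmless; only the `if` statement forces a concrete value, which is why the failure appears at that point rather than earlier.
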
| 26 | +Dask and Polars both require that `.compute` (resp. `.collect`) be called beforehand |
| 27 | +for such an operation to be executed: |
| 28 | + ```python |
| 29 | + In [1]: import pandas as pd; import dask.dataframe as dd |
| 30 | + |
| 31 | + In [2]: pandas_df = pd.DataFrame({"x": [1, 2, 3], "y": 1}) |
| 32 | + |
| 33 | + In [3]: df = dd.from_pandas(pandas_df, npartitions=2) |
| 34 | + |
| 35 | + In [4]: scalar = df.x.std() > 0 |
| 36 | + |
| 37 | + In [5]: if scalar: |
| 38 | +    ...:     print('scalar is positive') |
| 39 | +    ...: |
| 40 | + --------------------------------------------------------------------------- |
| 41 | + TypeError                                 Traceback (most recent call last) |
| 42 | + Cell In[5], line 1 |
| 43 | + ----> 1 if scalar: |
| 44 | +       2     print('scalar is positive') |
| 45 | + |
| 46 | + File ~/tmp/.venv/lib/python3.10/site-packages/dask/dataframe/core.py:312, in Scalar.__bool__(self) |
| 47 | +     311 def __bool__(self): |
| 48 | + --> 312     raise TypeError( |
| 49 | +     313         f"Trying to convert {self} to a boolean value. Because Dask objects are " |
| 50 | +     314         "lazily evaluated, they cannot be converted to a boolean value or used " |
| 51 | +     315         "in boolean conditions like if statements. Try calling .compute() to " |
| 52 | +     316         "force computation prior to converting to a boolean value or using in " |
| 53 | +     317         "a conditional statement." |
| 54 | +     318     ) |
| 55 | + |
| 56 | + TypeError: Trying to convert dd.Scalar<gt-bbc3..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement. |
| 57 | + ``` |
| 58 | + |
| 59 | +Exactly which methods require computation may vary across implementations. Some may |
| 60 | +implicitly do it for users under the hood for certain methods, whereas others require |
| 61 | +the user to explicitly trigger it. |
| 62 | + |
| 63 | +Therefore, the Dataframe API has a `DataFrame.may_execute` method. This is to be |
| 64 | +interpreted as a hint, rather than as a directive: the implementation itself may decide |
| 65 | +whether to force execution at this step, or whether to defer it to later. |
| 66 | + |
| 67 | +Operations which require `DataFrame.may_execute` to have been called at some prior |
| 68 | +point are: |
| 69 | +- `DataFrame.to_array` |
| 70 | +- `DataFrame.shape` |
| 71 | +- `Column.to_array` |
| 72 | +- calling `bool`, `int`, or `float` on a scalar |
| 73 | + |
| 74 | +Therefore, the Standard-compliant way to write the code above is: |
| 75 | +```python |
| 76 | +def select_features(df: DataFrame) -> list[str]: |
| 77 | +    df = df.may_execute() |
| 78 | +    features = [] |
| 79 | +    for column_name in df.column_names: |
| 80 | +        if df.col(column_name).std() > 0: |
| 81 | +            features.append(column_name) |
| 82 | +    return features |
| 83 | +``` |
| 84 | + |
| 85 | +Note how `DataFrame.may_execute` is called only once, and as late as possible. |
| 86 | +Conversely, the "wrong" way to execute the above would be: |
| 87 | + |
| 88 | +```python |
| 89 | +def select_features(df: DataFrame) -> list[str]: |
| 90 | +    features = [] |
| 91 | +    for column_name in df.column_names: |
| 92 | +        # Do NOT do this! |
| 93 | +        if df.may_execute().col(column_name).std() > 0: |
| 94 | +            features.append(column_name) |
| 95 | +    return features |
| 96 | +``` |
| 97 | +as that will potentially re-trigger the same execution multiple times. |
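
The cost difference between the two patterns can be illustrated with a toy model. The `ToyLazyFrame` class and its `has_spread` helper below are hypothetical and exist only to count how often data is materialised. This toy always executes inside `may_execute`, whereas a real implementation treats the call as a hint and may defer the work.

```python
from __future__ import annotations


class ToyLazyFrame:
    """Hypothetical lazy frame that counts how often it is materialised."""

    executions = 0  # class-level counter, for illustration only

    def __init__(self, data: dict[str, list[float]], evaluated: bool = False) -> None:
        self._data = data
        self._evaluated = evaluated

    @property
    def column_names(self) -> list[str]:
        return list(self._data)

    def may_execute(self) -> ToyLazyFrame:
        # This toy always executes; a real implementation may defer instead.
        ToyLazyFrame.executions += 1
        return ToyLazyFrame(self._data, evaluated=True)

    def has_spread(self, name: str) -> bool:
        # Stand-in for `df.col(name).std() > 0`; requires materialised data.
        if not self._evaluated:
            raise TypeError("call may_execute() first")
        values = self._data[name]
        return max(values) > min(values)


data = {"x": [1.0, 2.0, 3.0], "y": [1.0, 1.0, 1.0]}

# Recommended: execute once, before the loop.
ToyLazyFrame.executions = 0
df = ToyLazyFrame(data).may_execute()
features = [name for name in df.column_names if df.has_spread(name)]
once = ToyLazyFrame.executions

# Discouraged: execute inside the loop, once per column.
ToyLazyFrame.executions = 0
lazy = ToyLazyFrame(data)
features_again = [
    name for name in lazy.column_names if lazy.may_execute().has_spread(name)
]
per_column = ToyLazyFrame.executions
```

Both variants select the same columns, but the second runs the (potentially expensive) execution step once per column instead of once overall.
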