It was discussed that the API should be agnostic of execution, including eager/lazy evaluation. I think this is easy when operations return data frames (or columns). For example:
df['value'] + 1
Whether df['value'] is an in-memory representation or a lazy expression, the result will likely be the same, and no assumptions need to be made.
But if instead, the result is a scalar:
df['value'].sum()
The output type defined in the API can force certain execution strategies and prevent others, for example if the return type defined in the API is a Python int or float. Consider this example:
df['value'] + df['value'].sum()
While an implementation might want to keep the result of df['value'].sum() in its C representation for the next operation (the addition), making sum() return a Python object would force a conversion from C to Python and then back to C.
Another example is Ibis or other SQL-backed implementations. Returning a Python object would force them to execute a first query for df['value'].sum() and then use the result in a second query. In this example, a single SQL query would likely be enough if the computation were delayed until the end.
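To make the round trip concrete, here is a small sketch that uses NumPy as a stand-in for an arbitrary implementation (an assumption; the discussion above is not specific to NumPy). The variable names are illustrative only.

```python
import numpy as np

# Stand-in for df['value'] (integer data, so the NumPy scalar is not a Python int).
values = np.array([1, 2, 3])

# The reduction result stays a NumPy scalar, i.e. the library's native type.
native_total = values.sum()

# Forcing a Python object requires an explicit conversion...
python_total = native_total.item()

# ...and the next operation has to convert it back before adding.
result_native = values + native_total
result_python = values + python_total
```

Both results are equal; the difference is only in where the conversions happen.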
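As a rough illustration of the SQL case, the sketch below uses two hypothetical classes, LazyColumn and LazyScalar (neither is part of Ibis or any proposed API), that record SQL text instead of executing it. Because sum() returns a lazy scalar rather than a Python number, the whole expression compiles to one query.

```python
class LazyScalar:
    """Hypothetical lazy scalar: holds a scalar subquery, not a value."""

    def __init__(self, sql):
        self.sql = sql


class LazyColumn:
    """Hypothetical lazy column that records SQL instead of running it."""

    def __init__(self, sql, table):
        self.sql = sql
        self.table = table

    def sum(self):
        # Nothing is executed here; the reduction stays part of the expression.
        return LazyScalar(f"(SELECT SUM({self.sql}) FROM {self.table})")

    def __add__(self, other):
        if isinstance(other, (LazyColumn, LazyScalar)):
            rhs = other.sql
        else:
            rhs = repr(other)  # plain Python literal, e.g. 1
        return LazyColumn(f"{self.sql} + {rhs}", self.table)

    def to_sql(self):
        return f"SELECT {self.sql} FROM {self.table}"


value = LazyColumn("value", "df")
query = (value + value.sum()).to_sql()
print(query)
```

If sum() had to return a Python int or float, the subquery would need to run first and its result be pasted into a second query.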
For the array API it was discussed to use a 0-dimensional array to prevent a similar problem. Assuming we want to do the same for data frames (and not return a Python object directly), I see two main options:
1. Using a 1x1 data frame. I think this could make sense in the df['value'] example, but I'm not so sure about other cases like df.count_rows() (see Get number of rows and columns #20), where we could possibly be interested in applying
2. Creating a scalar type/class that wraps a scalar and can be used by implementations to decide how the data is represented, when it is converted to Python objects, and so on. For example, a toy implementation storing the data as a NumPy object could look like:
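What follows is a minimal sketch of such a wrapper, not the snippet from the original post. The class name Scalar, its methods, and the choice of a 0-dimensional NumPy array for storage are all assumptions made for illustration.

```python
import numpy as np


class Scalar:
    """Hypothetical scalar wrapper; name and methods are illustrative only."""

    def __init__(self, value):
        # Keep the value as a 0-dimensional NumPy array so it stays in its
        # native representation until a Python object is explicitly requested.
        self._value = np.asarray(value)
        if self._value.ndim != 0:
            raise ValueError("Scalar expects a single value")

    def __add__(self, other):
        # Unwrap another Scalar; plain Python numbers are used as they are.
        other_value = other._value if isinstance(other, Scalar) else other
        return Scalar(self._value + other_value)

    __radd__ = __add__

    def __float__(self):
        # Conversion to a Python float happens only when asked for.
        return float(self._value)

    def __int__(self):
        return int(self._value)

    def __repr__(self):
        return f"Scalar({self._value!r})"


total = Scalar(np.array([1.0, 2.0, 3.0]).sum())
shifted = total + 1      # still a Scalar, no Python float created here
print(float(shifted))    # explicit conversion at the very end
```

An implementation backed by a GPU or a SQL engine could keep the same interface and store a device buffer or an unevaluated expression instead of a NumPy array.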
+1 to the scalar class. It allows us to return scalar values from operations like sum() without forcing device synchronization, and without having to make a 1x1 dataframe look and feel like a scalar, which sounds very cumbersome.
xref #20 (comment)
CC: @markusweimer @kkraus14