Scalar representation #28

Closed
datapythonista opened this issue Aug 26, 2020 · 2 comments

@datapythonista
Member

xref #20 (comment)

It was discussed that the API should be agnostic of execution, including eager/lazy evaluation. I think this is easy when operations return data frames (or columns). For example:

df['value'] + 1

If df['value'] is an in-memory representation, or a lazy expression, the result will likely be the same, and no assumptions need to be made.
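This point can be sketched with two toy column classes (purely illustrative names, not part of any real API): an eager one that computes immediately and a lazy one that only records the expression. Calling code writes the same `col + 1` either way.

```python
# Hypothetical sketch: the same expression works over eager or lazy columns.

class EagerColumn:
    def __init__(self, values):
        self.values = values

    def __add__(self, other):
        # Computes immediately and returns a new in-memory column.
        return EagerColumn([v + other for v in self.values])


class LazyColumn:
    def __init__(self, expr):
        self.expr = expr  # a description of the computation, not data

    def __add__(self, other):
        # Only records the operation; nothing is computed yet.
        return LazyColumn(("add", self.expr, other))


eager = EagerColumn([1, 2, 3]) + 1  # values computed now: [2, 3, 4]
lazy = LazyColumn("value") + 1      # expression recorded for later
```

Because the result is again a column in both cases, the API contract does not need to say anything about when evaluation happens.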

But if instead, the result is a scalar:

df['value'].sum()

The output type defined in the API can force certain execution strategies and prevent others, for example if the return type defined in the API is a Python int or float. Consider this example:

df['value'] + df['value'].sum()

An implementation might want to keep the result of df['value'].sum() in its C representation for the next operation (the addition), but making sum() return a Python object would force a conversion from C to Python and then back to C.
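NumPy itself illustrates the concern: its reductions return a NumPy scalar type rather than a Python int, so a follow-up operation can stay in native arithmetic.

```python
import numpy as np

values = np.array([1, 2, 3], dtype="int64")

# NumPy's sum() returns a NumPy scalar (numpy.int64 here), not a Python
# int, so the follow-up addition stays in native int64 arithmetic
# instead of round-tripping through a Python object.
total = values.sum()
result = values + total
```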

Another example could be Ibis or other SQL-backed implementations. Returning a Python object would cause them to execute a first query for df['value'].sum() and use the result in a second query, while in this example a single SQL query would likely be enough if the computation is delayed until the end.
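A toy sketch of how a deferred SQL backend could fold everything into one query (the classes and query-building logic here are entirely hypothetical, not Ibis's actual API): if sum() returns a lazy scalar expression rather than a Python number, the addition can compile to a single subquery.

```python
# Hypothetical deferred-SQL backend: sum() builds an expression, it does
# not run a query.

class SqlScalar:
    def __init__(self, sql):
        self.sql = sql


class SqlColumn:
    def __init__(self, table, name):
        self.table, self.name = table, name

    def sum(self):
        # No query is executed here; we only record a scalar subquery.
        return SqlScalar(f"(SELECT SUM({self.name}) FROM {self.table})")

    def __add__(self, other):
        if isinstance(other, SqlScalar):
            # Everything compiles to a single query with a subquery.
            return f"SELECT {self.name} + {other.sql} FROM {self.table}"
        return f"SELECT {self.name} + {other} FROM {self.table}"


col = SqlColumn("t", "value")
query = col + col.sum()
# → "SELECT value + (SELECT SUM(value) FROM t) FROM t"
```

Had sum() returned a Python number, the backend would have been forced to execute the inner query eagerly and splice the result into a second one.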

For the array API it was discussed to use a 0-dimensional array to prevent a similar problem. Assuming we want to do the same for data frames (and not return a Python object directly), I see two main options:

  • Using a 1x1 data frame. I think it could make sense in the df['value'] example, but I'm less sure about other cases like df.count_rows() (see Get number of rows and columns #20), where a data frame feels less natural as the return type
  • Creating a scalar type/class that wraps a scalar and can be used by implementations to decide how the data is represented, when it is converted to Python objects... For example, a toy implementation storing the data as a numpy object could look like:
>>> import numpy

>>> class scalar:
...     def __init__(self, value, dtype):
...         self.value = numpy.array(value, dtype=dtype)
...
...     def __repr__(self):
...         return str(self.value)
...
...     def __add__(self, other):
...         return self.value + other

>>> result = scalar(12, dtype='int64')
>>> result
12
>>> result + 3
15

CC: @markusweimer @kkraus14

@kkraus14
Collaborator

+1 to the scalar class, it allows us to return scalar values from operations like sum() without forcing device synchronization and without forcing us to try to make a 1x1 dataframe look / feel like a scalar, which sounds very cumbersome.

@MarcoGorelli
Contributor

we now have a scalar class - closing then
