Scalar representation #28

Closed
datapythonista opened this issue Aug 26, 2020 · 2 comments

@datapythonista
Member

xref #20 (comment)

It was discussed that the API should be agnostic of execution, including eager/lazy evaluation. I think this is easy when operations return data frames (or columns). For example:

df['value'] + 1

If df['value'] is an in-memory representation, or a lazy expression, the result will likely be the same, and no assumptions need to be made.
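This point can be sketched with two toy column classes (purely illustrative names, not part of any real API): an eager one that computes immediately and a lazy one that only records the expression. Calling code writes the same `col + 1` either way.

```python
# Hypothetical sketch: the same expression works over eager or lazy columns.

class EagerColumn:
    def __init__(self, values):
        self.values = values

    def __add__(self, other):
        # Computes immediately and returns a new in-memory column.
        return EagerColumn([v + other for v in self.values])


class LazyColumn:
    def __init__(self, expr):
        self.expr = expr  # a description of the computation, not data

    def __add__(self, other):
        # Only records the operation; nothing is computed yet.
        return LazyColumn(("add", self.expr, other))


eager = EagerColumn([1, 2, 3]) + 1  # values computed now: [2, 3, 4]
lazy = LazyColumn("value") + 1      # expression recorded for later
```

Because the result is again a column in both cases, the API contract does not need to say anything about when evaluation happens.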

But if instead, the result is a scalar:

df['value'].sum()

The output type defined in the API can force certain execution strategies and prevent others, for example if the return type defined in the API is a Python int or float. Consider this example:

df['value'] + df['value'].sum()

An implementation might want to keep the result of df['value'].sum() in its C representation for the next operation (the addition), but making sum() return a Python object would force a conversion from C to Python and then back to C.
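NumPy itself illustrates the concern: its reductions return a NumPy scalar type rather than a Python int, so a follow-up operation can stay in native arithmetic.

```python
import numpy as np

values = np.array([1, 2, 3], dtype="int64")

# NumPy's sum() returns a NumPy scalar (numpy.int64 here), not a Python
# int, so the follow-up addition stays in native int64 arithmetic
# instead of round-tripping through a Python object.
total = values.sum()
result = values + total
```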

Another example could be Ibis or other SQL-backed implementations. Returning a Python object would cause them to execute a first query for df['value'].sum() and use the result in a second query, while in this example a single SQL query would likely be enough if the computation is delayed until the end.
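A toy sketch of how a deferred SQL backend could fold everything into one query (the classes and query-building logic here are entirely hypothetical, not Ibis's actual API): if sum() returns a lazy scalar expression rather than a Python number, the addition can compile to a single subquery.

```python
# Hypothetical deferred-SQL backend: sum() builds an expression, it does
# not run a query.

class SqlScalar:
    def __init__(self, sql):
        self.sql = sql


class SqlColumn:
    def __init__(self, table, name):
        self.table, self.name = table, name

    def sum(self):
        # No query is executed here; we only record a scalar subquery.
        return SqlScalar(f"(SELECT SUM({self.name}) FROM {self.table})")

    def __add__(self, other):
        if isinstance(other, SqlScalar):
            # Everything compiles to a single query with a subquery.
            return f"SELECT {self.name} + {other.sql} FROM {self.table}"
        return f"SELECT {self.name} + {other} FROM {self.table}"


col = SqlColumn("t", "value")
query = col + col.sum()
# → "SELECT value + (SELECT SUM(value) FROM t) FROM t"
```

Had sum() returned a Python number, the backend would have been forced to execute the inner query eagerly and splice the result into a second one.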

For the array API it was discussed to use a 0-dimensional array to prevent a similar problem. Assuming we want to do the same for data frames (and not return a Python object directly), I see two main options:

  • Using a 1x1 data frame. I think it could make sense in the df['value'] example, but I'm less sure about other cases like df.count_rows() (see Get number of rows and columns #20), where a data frame feels less natural as the return type
  • Creating a scalar type/class that wraps a scalar and can be used by implementations to decide how the data is represented, when it is converted to Python objects... For example, a toy implementation storing the data as a numpy object could look like:
>>> import numpy

>>> class scalar:
...     def __init__(self, value, dtype):
...         self.value = numpy.array(value, dtype=dtype)
...
...     def __repr__(self):
...         return str(self.value)
...
...     def __add__(self, other):
...         return self.value + other

>>> result = scalar(12, dtype='int64')
>>> result
12
>>> result + 3
15

CC: @markusweimer @kkraus14

@kkraus14
Collaborator

+1 to the scalar class, it allows us to return scalar values from operations like sum() without forcing device synchronization and without forcing us to try to make a 1x1 dataframe look / feel like a scalar, which sounds very cumbersome.

@MarcoGorelli
Contributor

we now have a scalar class - closing then
