Skip to content

Add a to_array_obj method to Column #163

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 48 additions & 0 deletions spec/API_specification/dataframe_api/column_object.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,54 @@ class Column:
constructor functions or an already-created dataframe object retrieved via

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to the PR bu the sentence seems to be missing something?


"""
def to_array_obj(self) -> object:
"""
Obtain an object that can be used as input to ``asarray`` or ``from_dlpack``
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are very different where I don't think using or here is really accurate. Based on the rest of the code this seems to be geared towards asarray?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly, yes. The main pain point there is that the array API standard doesn't mandate that asarray has to support DLPack. There was a desire to keep that orthogonal (long discussion, and I believe you were involved in that). So now if we have a column with a dtype which can support DLPack, that will work with numpy.asarray but not necessarily with other libraries.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we can just drop from_dlpack from the description and leave everything else unchanged.


The returned object must only be used for one thing: calling the ``asarray``
or ``from_dlpack`` functions of a library implementing the array API
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for only asarray here?

standard, or equivalent ``asarray``/``from_dlpack`` functions to the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

one in that standard. In practice that means that the returned object
may either already *be* an array type of a specific array library, or
it may implement one or more of the following methods or behaviors:

- ``__dlpack__``
- the Python buffer protocol
- ``__array__``
- ``__array_interface__``
- ``__cuda_array_interface__``
- the Python ``Sequence`` interface
- any other method that is known to work with an ``asarray`` function
in a library of interest
Comment on lines +37 to +38
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the goal of this function to work generally with multiple libraries asarray implementations or just a specific library's implementation? If the former I would push for us to remove this point since it's asking for library specific things to be introduced.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the goal of this function to work generally with multiple libraries asarray implementations or just a specific library's implementation?

I'd say neither is 100% accurate. It's more the former, however it does make sense to add library-specific methods - and in fact, __array__ is one already. There is no single interchange method that is supported by all libraries' asarray function anyway, nor is that a good goal because it would eliminate support for dtypes like string and datetime which numpy supports and some other array library could support, but most libraries never will.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my fear is that say some dataframe library fancy_dataframe could expect to work downstream with fancy_array and return something that supports __fancy_array__ which fancy_array.asarray handles, but a developer calls np.asarray against the output of this and it understandably doesn't work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that fear is justified, because:

  • this method already allows returning an instance of fancy_array directly anyway,
  • the returned object should have other, more standard methods too if possible, so np.asarray can still work
  • if there's something really specific in the column such that it's not possible to add anything but __fancy_array__ anway, then nothing is lost by doing so


Importantly, the returned object must only implement/expose those
methods that work. An ``asarray`` function may try to use any of the
above methods, and the order in which it does so is not guaranteed.
Hence it is expected that if a method is present, it works correctly
and does not raise an exception (e.g., because of an unsupported dtype
or device).

.. admonition:: Tip

One way to expose methods only if they will work for the dtype,
device, and other characteristics of the column is to hide access to
the methods dynamically, so ``hasattr`` does the right thing. For
example::

def __dir__(self):
methods = dir(self.__class__)
attrs = list(self.__dict__.keys())
keys = methods + attrs
if not self.dtype in _dlpack_supported_dtypes:
keys.remove("__dlpack__")

return keys

def __dlpack__(self):
...

"""

@classmethod
def from_sequence(cls, sequence: Sequence[object], dtype: dtype) -> Column:
"""
Expand Down