Add a to_array_obj method to Column #163

Closed
wants to merge 3 commits into from
34 changes: 33 additions & 1 deletion spec/API_specification/dataframe_api/column_object.py
@@ -17,7 +17,7 @@ class Column:
constructor functions or an already-created dataframe object retrieved via

Not related to the PR, but the sentence seems to be missing something?


"""
- def to_array_obj(self) -> object:
+ def to_array_obj(self, *, null_handling: str | None = None) -> object:
"""
Obtain an object that can be used as input to ``asarray`` or ``from_dlpack``

Collaborator

These are very different, so I don't think using "or" here is really accurate. Based on the rest of the code, this seems to be geared towards ``asarray``?

Member Author

Mostly, yes. The main pain point there is that the array API standard doesn't mandate that asarray has to support DLPack. There was a desire to keep that orthogonal (long discussion, and I believe you were involved in that). So now if we have a column with a dtype which can support DLPack, that will work with numpy.asarray but not necessarily with other libraries.

Member Author

So we can just drop from_dlpack from the description and leave everything else unchanged.
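
To illustrate why "or" is misleading here, the following is a minimal sketch (not part of the PR) of how the two entry points look for different hooks on the object they receive. `from_dlpack` relies on `__dlpack__` and `__dlpack_device__`, while `numpy.asarray` can also consume the Python buffer protocol or NumPy's own `__array__` hook. The helper `exposed_protocols` and the class `DLPackOnly` are hypothetical names used only for illustration.

```python
def exposed_protocols(obj):
    """Report which array interchange hooks `obj` exposes (illustrative only)."""
    found = []
    if hasattr(obj, "__dlpack__") and hasattr(obj, "__dlpack_device__"):
        found.append("dlpack")      # what from_dlpack looks for
    if hasattr(obj, "__array__"):
        found.append("__array__")   # NumPy-specific hook used by numpy.asarray
    try:
        memoryview(obj)             # succeeds only for buffer-protocol objects
        found.append("buffer")
    except TypeError:
        pass
    return found


class DLPackOnly:
    """Hypothetical object; the methods are stubs, only their presence matters."""

    def __dlpack__(self, **kwargs):
        raise NotImplementedError

    def __dlpack_device__(self):
        return (1, 0)  # (kDLCPU, device id 0)


print(exposed_protocols(DLPackOnly()))
print(exposed_protocols(bytearray(b"abc")))
```

An object like `DLPackOnly` is only guaranteed to be consumable by a library's `from_dlpack`; whether a given `asarray` accepts it depends on the library, which is the pain point described above.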


@@ -63,6 +63,38 @@ def __dir__(self):
def __dlpack__(self):
...

Parameters
----------
null_handling : str or None
Determine how to treat ``null`` values that may be present in the
column. Valid options are:

- ``None`` (default): no special handling. This assumes that either
no missing values are present, or the column can be converted to an
array type with native support for missing values.
*Note: there is currently no such library in wide use;
NumPy's masked arrays are not recommended, and other array
libraries do not support missing values at all.*
- ``raise``: always raise a ``ValueError`` if nulls are present.
- ``to-nan``: for floating-point dtypes, convert any nulls to ``nan``.

Member

This could be generalized to apply to any numeric type, instead of only floating points (i.e. allowing to convert ints with nulls to float with nans)

Member Author

True. I'm not sure how desirable that is, it may be pragmatic in the absence of a good alternative. Let's see what others think about this.

Member Author

This option could also be removed completely, assuming we add fill_null(scalar_value), in favor of a separate col.fill_null(float('nan')).to_array_obj().

Member Author

Syntax-wise, keeping 'to-nan' is probably nicer though.

For other dtypes, do the same as ``None``.

Note that if it is desired to convert nulls to a dtype-specific
sentinel value, the user should do this before calling
``to_array_obj``, by using ``.isnull()`` and replacing the values
directly.

Raises
------
TypeError
In case it is not possible to convert the column to any (known) array
library type, or use any of the possible interchange methods.
This can be due to the dtype (e.g., no array library supports datetime
dtypes with a time zone), device, or other reasons.
ValueError
If the column contains ``null`` values which prevent returning an
array object.

"""

@classmethod
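
The ``null_handling`` options documented in the diff above can be sketched with a toy model. This is only an illustration under stated assumptions, not the specification's implementation: `ToyColumn` is a hypothetical stand-in for `Column`, `None` marks a null, a plain list stands in for the returned array object, and `fill_null` is the method proposed in the review thread rather than something this diff adds. Rejecting unknown option strings with a `ValueError` is also an assumption, since the docstring does not cover that case.

```python
import math


class ToyColumn:
    """Hypothetical stand-in for ``Column``; ``None`` marks a null value."""

    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype  # dtype name as a string, e.g. "float64" or "int64"

    def fill_null(self, value):
        # Proposed in the review thread; not part of the diff under review.
        filled = [value if v is None else v for v in self.values]
        return ToyColumn(filled, self.dtype)

    def to_array_obj(self, *, null_handling=None):
        if null_handling not in (None, "raise", "to-nan"):
            # Assumption: the docstring does not specify this case.
            raise ValueError(f"unknown null_handling: {null_handling!r}")
        if null_handling == "to-nan" and self.dtype.startswith("float"):
            return [math.nan if v is None else v for v in self.values]
        if any(v is None for v in self.values):
            # Covers "raise" and, in this toy model, None and non-float
            # "to-nan": the list standing in for an array is treated as
            # having no native support for missing values.
            raise ValueError("column contains nulls")
        return list(self.values)


col = ToyColumn([1.0, None, 3.0], "float64")
via_keyword = col.to_array_obj(null_handling="to-nan")
via_fill = col.fill_null(math.nan).to_array_obj()
```

In this sketch both spellings discussed in the thread, the ``to-nan`` keyword and the `fill_null(...)` chain, produce the same result for a floating-point column, while an integer column with nulls raises under every option.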