-
Notifications
You must be signed in to change notification settings - Fork 21
Add a to_array_obj
method to Column
#163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,6 +17,95 @@ class Column: | |
constructor functions or an already-created dataframe object retrieved via | ||
|
||
""" | ||
def to_array_obj(self, *, null_handling: str | None = None) -> object: | ||
""" | ||
Obtain an object that can be used as input to ``asarray`` or ``from_dlpack`` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These are very different where I don't think using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Mostly, yes. The main pain point there is that the array API standard doesn't mandate that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So we can just drop |
||
|
||
The returned object must only be used for one thing: calling the ``asarray`` | ||
or ``from_dlpack`` functions of a library implementing the array API | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 for only |
||
standard, or equivalent ``asarray``/``from_dlpack`` functions to the | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 |
||
one in that standard. In practice that means that the returned object | ||
may either already *be* an array type of a specific array library, or | ||
it may implement one or more of the following methods or behaviors: | ||
|
||
- ``__dlpack__`` | ||
- the Python buffer protocol | ||
- ``__array__`` | ||
- ``__array_interface__`` | ||
- ``__cuda_array_interface__`` | ||
- the Python ``Sequence`` interface | ||
- any other method that is known to work with an ``asarray`` function | ||
in a library of interest | ||
Comment on lines
+37
to
+38
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is the goal of this function to work generally with multiple libraries There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I'd say neither is 100% accurate. It's more the former, however it does make sense to add library-specific methods - and in fact, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess my fear is that say some dataframe library There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think that fear is justified, because:
|
||
|
||
Importantly, the returned object must only implement/expose those | ||
methods that work. An ``asarray`` function may try to use any of the | ||
above methods, and the order in which it does so is not guaranteed. | ||
Hence it is expected that if a method is present, it works correctly | ||
and does not raise an exception (e.g., because of an unsupported dtype | ||
or device). | ||
rgommers marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
.. admonition:: Tip | ||
|
||
One way to expose methods only if they will work for the dtype, | ||
device, and other characteristics of the column is to hide access to | ||
the methods dynamically, so ``hasattr`` does the right thing. For | ||
example:: | ||
|
||
def __dir__(self): | ||
methods = dir(self.__class__) | ||
attrs = list(self.__dict__.keys()) | ||
keys = methods + attrs | ||
if not self.dtype in _dlpack_supported_dtypes: | ||
keys.remove("__dlpack__") | ||
|
||
return keys | ||
|
||
def __dlpack__(self): | ||
... | ||
|
||
Parameters | ||
---------- | ||
null_handling : str or None | ||
Determine how to treat ``null`` values that may be present in the | ||
column. Valid options are: | ||
|
||
- ``None`` (default): no special handling. This assumes that either | ||
no missing values are present, or there is an array type with | ||
native support for missing values that is/can be converted to. | ||
*Note: there is currently no such library that is in wide use; | ||
NumPy's masked arrays are non-recommended, and other array | ||
libraries do not support missing values at all.* | ||
- ``raise``: always raise a ``ValueError`` if nulls are present. | ||
- ``to-nan``: for floating-point dtypes, convert any nulls to ``nan``. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This could be generalized to apply to any numeric type, instead of only floating points (i.e. allowing to convert ints with nulls to float with nans) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. True. I'm not sure how desirable that is, it may be pragmatic in the absence of a good alternative. Let's see what others think about this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This option could also be removed completely, assuming we add There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Syntax-wise, keeping |
||
For other dtypes, do the same as ``None``. | ||
|
||
Note that if it is desired to convert nulls to a dtype-specific | ||
sentinel value, the user should do this before calling | ||
``is_array_obj`` with `.isnull()` and replacing the values | ||
directly. | ||
|
||
Returns | ||
------- | ||
array_obj : object | ||
Either a custom object or an instance of the array type of a | ||
specific array library. In the latter case, the array must have the | ||
closest matching dtype of the array library (e.g., a column with a | ||
floating-point dtype must produce the corresponding floating-point | ||
dtype of the same precision that the array library offers). | ||
|
||
Raises | ||
------ | ||
TypeError | ||
In case it is not possible to convert the column to any (known) array | ||
library type, or use any of the possible interchange methods. | ||
This can be due to the dtype (e.g., no array library supports datetime | ||
dtypes with a time zone), device, or other reasons. | ||
ValueError | ||
If the column contains ``null`` values which prevent returning an | ||
array object. | ||
|
||
""" | ||
|
||
@classmethod | ||
def from_sequence(cls, sequence: Sequence[object], dtype: dtype) -> Column: | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to the PR bu the sentence seems to be missing something?