Add a `to_array_obj` method to Column #163
Conversation
Should we specify the behaviour in case of nulls? (do some automatic conversion, or raise an error?) Or add a keyword that controls to which value it is converted?
Good question - I added a keyword and specified the behaviour:
> NumPy's masked arrays are not recommended, and other array
> libraries do not support missing values at all.
>
> - ``raise``: always raise a ``ValueError`` if nulls are present.
> - ``to-nan``: for floating-point dtypes, convert any nulls to ``nan``.
This could be generalized to apply to any numeric type, instead of only floating-point dtypes (i.e. allowing ints with nulls to be converted to floats with NaNs).
True. I'm not sure how desirable that is; it may be pragmatic in the absence of a good alternative. Let's see what others think about this.
This option could also be removed completely, assuming we add `fill_null(scalar_value)`, in favor of a separate `col.fill_null(float('nan')).to_array_obj()`.
Syntax-wise, keeping `'to-nan'` is probably nicer though.
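To make the trade-off concrete, here is a minimal sketch of what the two spellings discussed above could look like from a consumer's point of view. It assumes the `to_array_obj` signature from this PR plus a hypothetical `Column.fill_null(scalar_value)` method, with `col` standing for a Standard-compliant column:

```python
import numpy as np

def to_numpy_with_keyword(col):
    # Option A: the `null_handling` keyword proposed in this PR.
    # For a floating-point column, nulls are converted to NaN by the
    # dataframe implementation before the array object is produced.
    return np.asarray(col.to_array_obj(null_handling="to-nan"))

def to_numpy_with_fill_null(col):
    # Option B: drop 'to-nan' and have the user fill nulls explicitly,
    # assuming a separate `fill_null(scalar_value)` method is added.
    return np.asarray(col.fill_null(float("nan")).to_array_obj())
```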
Thanks for doing this. My only request would be to force the target dtype to be specified (to prevent implementers from stuffing everything into an object ndarray - looking at you, pandas nullable dtypes).
That would be counterproductive I think. The intended usage is this:

```python
# If you want, say, a PyTorch tensor, replace `np.asarray` with `torch.Tensor`:
np.asarray(col.to_array_obj())
```

In, e.g., Matplotlib that is supposed to work for pretty much any dtype. How would you usefully fill in a target dtype there?
That is a mis-feature in pandas, and it should be solved in pandas imho. Neither this standard nor any other array library even has the concept of an object array. If it would help to address this specific concern, I'd rather specify something like "a dataframe library must produce the closest matching dtype of the array library in case it produces an array type instance itself" (e.g. "a column with a floating-point dtype must produce the corresponding floating-point dtype of the same precision that the array library offers").
Sure, sounds good - that makes it clear that pandas' nullable dtypes shouldn't end up as object arrays.
Yes, see also what Ralf already added about nulls causing a ValueError. Most of the cases where pandas currently gives object dtype would already give an error for that reason.
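As a rough illustration of the "closest matching dtype" wording suggested above: an implementation that chooses to return NumPy arrays directly might map Standard column dtypes like this. The helper and the mapping table are hypothetical, shown only to make the rule concrete; the key point is that `object` is never a valid fallback.

```python
import numpy as np

# Hypothetical mapping from Standard column dtype names to the closest
# matching NumPy dtypes of the same kind and precision.
_CLOSEST_NUMPY_DTYPE = {
    "bool": np.bool_,
    "int32": np.int32,
    "int64": np.int64,
    "float32": np.float32,
    "float64": np.float64,
}

def closest_numpy_dtype(column_dtype: str) -> np.dtype:
    """Return the closest matching NumPy dtype, never ``object``."""
    try:
        return np.dtype(_CLOSEST_NUMPY_DTYPE[column_dtype])
    except KeyError:
        # No matching dtype (e.g. a nullable extension dtype): raise rather
        # than silently falling back to an object ndarray.
        raise ValueError(f"no matching NumPy dtype for {column_dtype!r}")
```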
This is fine actually, apologies - it already says "of a library implementing the array API standard", which excludes the object dtype anyway.
Looks good to me, thanks!
Looks good - OK to merge?
I added the language regarding the "may produce object dtype" concern, and am now happy with this PR. There is one open point on what to do with
```diff
@@ -17,6 +17,86 @@ class Column:
     constructor functions or an already-created dataframe object retrieved via

     """
     def to_array_obj(self, *, null_handling: str | None = None) -> object:
         """
         Obtain an object that can be used as input to ``asarray`` or ``from_dlpack``
```
These are very different, and I don't think using "or" here is really accurate. Based on the rest of the code this seems to be geared towards `asarray`?
Mostly, yes. The main pain point there is that the array API standard doesn't mandate that `asarray` has to support DLPack. There was a desire to keep that orthogonal (long discussion, and I believe you were involved in that). So now if we have a column with a dtype which can support DLPack, that will work with `numpy.asarray` but not necessarily with other libraries.
So we can just drop `from_dlpack` from the description and leave everything else unchanged.
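A hedged sketch of why the distinction matters in practice: `numpy.asarray` understands objects that implement `__array__` (or the buffer protocol), whereas the array API standard only guarantees DLPack consumption through `from_dlpack`, so a generic consumer may need both paths. The function names below are illustrative, not part of the spec.

```python
import numpy as np

def to_numpy(col):
    # Works if the object returned by `to_array_obj` implements `__array__`
    # (or another protocol NumPy understands), regardless of DLPack support.
    return np.asarray(col.to_array_obj())

def to_array_library(xp, col):
    # For an arbitrary array-API library `xp`, `xp.asarray` is not required
    # to consume DLPack, so the explicit `from_dlpack` entry point is the
    # portable route -- provided the returned object supports `__dlpack__`.
    return xp.from_dlpack(col.to_array_obj())
```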
> Obtain an object that can be used as input to ``asarray`` or ``from_dlpack``
>
> The returned object must only be used for one thing: calling the ``asarray``
> or ``from_dlpack`` functions of a library implementing the array API
+1 for only `asarray` here?
> The returned object must only be used for one thing: calling the ``asarray``
> or ``from_dlpack`` functions of a library implementing the array API
> standard, or equivalent ``asarray``/``from_dlpack`` functions to the
+1
> - any other method that is known to work with an ``asarray`` function
>   in a library of interest
Is the goal of this function to work generally with multiple libraries' `asarray` implementations or just a specific library's implementation? If the former, I would push for us to remove this point since it's asking for library-specific things to be introduced.
> Is the goal of this function to work generally with multiple libraries' `asarray` implementations or just a specific library's implementation?

I'd say neither is 100% accurate. It's more the former; however, it does make sense to add library-specific methods - and in fact, `__array__` is one already. There is no single interchange method that is supported by all libraries' `asarray` functions anyway, nor is that a good goal, because it would eliminate support for dtypes like string and datetime which numpy supports and some other array library could support, but most libraries never will.
I guess my fear is that, say, some dataframe library `fancy_dataframe` could expect to work downstream with `fancy_array` and return something that supports `__fancy_array__`, which `fancy_array.asarray` handles, but a developer calls `np.asarray` against the output of this and it understandably doesn't work.
I don't think that fear is justified, because:

- this method already allows returning an instance of `fancy_array` directly anyway,
- the returned object should have other, more standard methods too if possible, so `np.asarray` can still work (a sketch of this is below),
- if there's something really specific in the column such that it's not possible to add anything but `__fancy_array__` anyway, then nothing is lost by doing so
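To illustrate the second bullet, here is a hypothetical sketch of the kind of object an implementation could return. The class name and the `__fancy_array__` protocol are made up for this discussion; the point is that a library-specific method can coexist with the more standard `__array__`, so `np.asarray` keeps working.

```python
import numpy as np

class _ColumnArrayObject:
    """Hypothetical object returned by ``Column.to_array_obj()``."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None):
        # Standard NumPy protocol, so `np.asarray(obj)` works.
        return np.asarray(self._data, dtype=dtype)

    def __fancy_array__(self):
        # Library-specific protocol that only `fancy_array.asarray` would
        # know about; its presence doesn't break other consumers.
        return self._data


obj = _ColumnArrayObject([1.0, 2.0, float("nan")])
arr = np.asarray(obj)  # dispatches to __array__, ignoring __fancy_array__
```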
```diff
@@ -17,6 +17,95 @@ class Column:
     constructor functions or an already-created dataframe object retrieved via
```
Not related to the PR, but the sentence seems to be missing something?
A suggestion which came out of today's call was to not have `to_array_obj` in the end, but to instead have `Column.column`, and put a big warning in its docs telling users that if they invoke it, they are leaving Standard territory and we make no guarantees about what will or won't work with it.

I'd prefer that TBH. Any objections?

(requesting changes so this doesn't get accidentally merged before we've figured it out)
I don't see how that would solve the same concern / use case. As I understand it, the concern is that the specification of
Some other options to consider:
I think the challenge is that the API as it currently exists is too ambiguous and encourages general use, as opposed to being an "escape hatch". I think what we generally agreed on the call today was:
I noticed that plotly have this, in case it's useful. If we can find closure to this discussion, then that's probably everything that plotly would need to be DataFrame-agnostic (from my initial experimentation at least). plotly's dataframe handling is way simpler than seaborn's, so they might be a better target for now (they've also expressed interest).
The link here 404s, any chance you could update it?
Thanks @kkraus14, have fixed.
Superseded by #209 - closing then, thanks all for the discussion.
Closes gh-139