
Add a to_array_obj method to Column #163


Closed
wants to merge 3 commits

Conversation

rgommers
Member

@rgommers rgommers commented May 3, 2023

Closes gh-139

@jorisvandenbossche
Member

Should we specify the behaviour in case of nulls? (do some automatic conversion, or raise an error?) Or add a keyword that controls to which value it is converted?

@rgommers
Member Author

rgommers commented May 4, 2023

Should we specify the behaviour in case of nulls? (do some automatic conversion, or raise an error?) Or add a keyword that controls to which value it is converted?

Good question - I added a keyword and specified ValueError when nulls cause a column to not be convertible.

NumPy's masked arrays are not recommended, and other array
libraries do not support missing values at all.
- ``raise``: always raise a ``ValueError`` if nulls are present.
- ``to-nan``: for floating-point dtypes, convert any nulls to ``nan``.
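As a hedged illustration of how these two modes could behave, here is a minimal sketch; the `column_to_array` helper and the plain-list column representation are hypothetical, not part of the spec:

```python
import numpy as np

def column_to_array(values, null_handling="raise"):
    # Hypothetical sketch of the proposed `null_handling` keyword.
    # `values` is a plain list standing in for a column's data, with
    # None marking nulls; a real implementation would operate on its
    # own internal buffers.
    has_nulls = any(v is None for v in values)
    if null_handling == "raise":
        if has_nulls:
            raise ValueError("column contains nulls and is not convertible")
        return np.asarray(values)
    if null_handling == "to-nan":
        # Per the proposal above, only valid for floating-point dtypes.
        return np.asarray(
            [np.nan if v is None else float(v) for v in values],
            dtype=np.float64,
        )
    raise ValueError(f"unknown null_handling: {null_handling!r}")
```

With `null_handling='to-nan'`, `[1.0, None, 3.0]` converts to a float array with `nan` in the middle; with `'raise'` the same input triggers a `ValueError`.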
Member


This could be generalized to apply to any numeric type, instead of only floating-point types (i.e., allowing conversion of integer columns with nulls to floats with NaNs).

Member Author


True. I'm not sure how desirable that is, it may be pragmatic in the absence of a good alternative. Let's see what others think about this.

Member Author


This option could also be removed completely, assuming we add fill_null(scalar_value), in favor of a separate col.fill_null(float('nan')).to_array_obj().

Member Author


Syntax-wise, keeping 'to-nan' is probably nicer though.

Contributor

@MarcoGorelli MarcoGorelli left a comment


Thanks for doing this

My only request would be to force the target dtype to be specified (to prevent implementers from stuffing everything into an object ndarray - looking at you, pandas nullable dtypes)

@rgommers
Member Author

rgommers commented May 4, 2023

My only request would be to force the target dtype to be specified

That would be counterproductive I think. The intended usage is this:

# If you want, say, a PyTorch tensor, replace `np.asarray` with `torch.Tensor`:
np.asarray(col.to_array_obj())

In, e.g., Matplotlib that is supposed to work for pretty much any dtype. How would you usefully fill in a dtype= keyword here?

(to prevent implementers from stuffing everything into an object ndarray - looking at you, pandas nullable dtypes)

That is a mis-feature in pandas, and it should be solved in pandas IMHO. Neither this standard nor any array library even has the concept of an object array.

If it would help to address this specific concern, I'd rather specify something like: "a dataframe library must produce the closest matching dtype of the array library in case it produces an array type instance itself (e.g., a column with a floating-point dtype must produce the corresponding floating-point dtype of the same precision that the array library offers)".

@MarcoGorelli
Contributor

I'd rather specify something like: "a dataframe library must produce the closest matching dtype of the array library in case it produces an array type instance itself (e.g., a column with a floating-point dtype must produce the corresponding floating-point dtype of the same precision that the array library offers)".

sure, sounds good, that makes it clear that the default behaviour of `to_numpy` for pandas' nullable dtypes is considered non-compliant

@jorisvandenbossche
Member

Yes, see also what Ralf already added about nulls causing a ValueError. Most of the cases that pandas currently gives object dtype would already give an error for that reason.

Contributor

@MarcoGorelli MarcoGorelli left a comment


This is fine actually, apologies, it already says

of a library implementing the array API standard

which excludes the object dtype anyway

Looks good to me, thanks!

@MarcoGorelli
Contributor

Looks good - OK to merge?

@rgommers
Member Author

Looks good - OK to merge?

I added the language regarding the "may produce object dtype" concern, and am now happy with this PR. There is one open point on what to do with null_handling='to-nan' - I don't have a strong preference there. We could discuss over the next day and merge then if everyone is happy.

@@ -17,6 +17,86 @@ class Column:
constructor functions or an already-created dataframe object retrieved via

"""
def to_array_obj(self, *, null_handling: str | None = None) -> object:
"""
Obtain an object that can be used as input to ``asarray`` or ``from_dlpack``
Collaborator


These are very different; I don't think using "or" here is really accurate. Based on the rest of the code, this seems to be geared towards `asarray`?

Member Author


Mostly, yes. The main pain point there is that the array API standard doesn't mandate that asarray has to support DLPack. There was a desire to keep that orthogonal (long discussion, and I believe you were involved in that). So now if we have a column with a dtype which can support DLPack, that will work with numpy.asarray but not necessarily with other libraries.
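To make the `asarray`-vs-`from_dlpack` distinction concrete, here is a sketch of a return object exposing both protocols by delegating to an underlying NumPy array. The `ArrayObj` class is purely illustrative, and `np.from_dlpack` requires NumPy >= 1.22:

```python
import numpy as np

class ArrayObj:
    # Illustrative stand-in for the object `to_array_obj` might return.
    # It exposes both `__array__` (consumed by `np.asarray`) and the
    # DLPack protocol (consumed by `np.from_dlpack`), delegating to a
    # wrapped NumPy array.
    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        return self._data if dtype is None else self._data.astype(dtype)

    def __dlpack__(self, *args, **kwargs):
        return self._data.__dlpack__(*args, **kwargs)

    def __dlpack_device__(self):
        return self._data.__dlpack_device__()

obj = ArrayObj([1.0, 2.0, 3.0])
a = np.asarray(obj)      # works for any dtype NumPy understands
b = np.from_dlpack(obj)  # works only for dtypes/libraries with DLPack support
```

This mirrors the point above: the `np.asarray` path is broadly available, while the DLPack path depends on support that the array API standard does not mandate for `asarray`.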

Member Author


So we can just drop from_dlpack from the description and leave everything else unchanged.

Obtain an object that can be used as input to ``asarray`` or ``from_dlpack``

The returned object must only be used for one thing: calling the ``asarray``
or ``from_dlpack`` functions of a library implementing the array API
Collaborator


+1 for only asarray here?


The returned object must only be used for one thing: calling the ``asarray``
or ``from_dlpack`` functions of a library implementing the array API
standard, or equivalent ``asarray``/``from_dlpack`` functions to the
Collaborator


+1

Comment on lines +37 to +38
- any other method that is known to work with an ``asarray`` function
in a library of interest
Collaborator


Is the goal of this function to work generally with multiple libraries' `asarray` implementations, or just a specific library's implementation? If the former, I would push for us to remove this point, since it's asking for library-specific things to be introduced.

Member Author


Is the goal of this function to work generally with multiple libraries' `asarray` implementations or just a specific library's implementation?

I'd say neither is 100% accurate. It's more the former; however, it does make sense to add library-specific methods - and in fact, `__array__` is one already. There is no single interchange method supported by all libraries' `asarray` functions anyway, nor is that a good goal, because it would eliminate support for dtypes like string and datetime, which NumPy supports and some other array library could support, but most libraries never will.

Collaborator


I guess my fear is that, say, some dataframe library `fancy_dataframe` could expect to work downstream with `fancy_array` and return something that supports `__fancy_array__`, which `fancy_array.asarray` handles - but then a developer calls `np.asarray` on the output of this and it understandably doesn't work.

Member Author


I don't think that fear is justified, because:

  • this method already allows returning an instance of `fancy_array` directly anyway,
  • the returned object should have other, more standard methods too where possible, so `np.asarray` can still work,
  • if there's something so specific about the column that it's not possible to add anything but `__fancy_array__` anyway, then nothing is lost by doing so.

@@ -17,6 +17,95 @@ class Column:
constructor functions or an already-created dataframe object retrieved via


Not related to the PR, but the sentence seems to be missing something?

Contributor

@MarcoGorelli MarcoGorelli left a comment


A suggestion which came out of today's call was to not have `to_array_obj` in the end, but to instead have `Column.column`, and to put a big warning in its docs telling users that if they invoke it, they are leaving the Standard's territory and we make no guarantees about what will or won't work with it.

I'd prefer that TBH. Any objections?

(requesting changes so this doesn't get accidentally merged before we've figured it out)

@jorisvandenbossche
Member

jorisvandenbossche commented May 25, 2023

instead have Column.column.

I don't see how that would address the same concern / use case as what `to_array_obj` tried to do in this PR. I assume that `Column.column` will return the underlying dataframe-library-specific object (based on the meeting notes), for example a Series for pandas (or an ExtensionArray).
But then, if your use case as a downstream user of the standard is that, for example, you need the data as a numpy array (or a generic buffer/memoryview), then you need to know the implementation-specific details of every dataframe-standard implementation out there to know how to convert this object to an array object. For the pandas object, that could be calling `to_numpy()`, but for another dataframe library it could be some other name.

I understand the concern that the specification of `to_array_obj` is not very strictly defined and a bit specific to certain implementations (i.e. the mention of `np.asarray`). But not having such a method only makes it worse, as now you need to know the implementation details of every possible dataframe library, defeating the purpose of being able to write agnostic code.
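The "know every implementation" problem described above can be sketched as the per-library dispatch downstream code would otherwise be forced to write; `fancy_lib` and `as_ndarray` are made up for illustration:

```python
import numpy as np

def column_to_numpy(col):
    # Anti-pattern sketch: without a standardized conversion method,
    # "agnostic" code degenerates into per-library special cases.
    mod = type(col).__module__.split(".")[0]
    if mod == "pandas":
        return col.to_numpy()      # pandas' spelling
    if mod == "fancy_lib":         # hypothetical library...
        return col.as_ndarray()    # ...with its own spelling
    # Last resort: hope the object is already array-like.
    return np.asarray(col)
```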

@jorisvandenbossche
Member

Some other options to consider:

  • Give up on trying to be array-library agnostic, and just add an explicit to_numpy() so at least we have something that is dataframe-library-independent to get a numpy array.
  • Limit ourselves to numeric data types for now, so we could specify that the returned object from to_array_obj() needs to have __dlpack__, to make this a more well defined specification

@kkraus14
Collaborator

I think the challenge is that the API as it currently exists is too ambiguous and encourages use, as opposed to being an "escape hatch". I think what we generally agreed on in the call today was:

  • There's desire for an API to return an array that guarantees you can use the array API against it, but doesn't make any guarantees about memory layout
  • There's desire for an API to return something that guarantees __dlpack__ interchange compatibility
  • These two shouldn't be the same API
  • There's going to need to be an escape hatch to use library specific functionality that is not standardized and this to_array_obj is just a single instance of that. We provide an escape hatch at the DataFrame level today via the .dataframe property: https://github.com/data-apis/dataframe-api/blob/main/spec/API_specification/dataframe_api/dataframe_object.py#L83-L90. Providing an equivalent of that at the Column level serves the purpose of to_array_obj as well as serving the purpose of future cases where there's desire to break away from the standard.
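A minimal sketch of the Column-level escape hatch described in the last bullet, mirroring the existing `DataFrame.dataframe` property; the class layout and `_impl` attribute are assumptions, not the eventual spec:

```python
class Column:
    # Sketch only: `_impl` stands in for the implementation-specific
    # column object (e.g. a pandas Series internally).
    def __init__(self, impl):
        self._impl = impl

    @property
    def column(self):
        # Escape hatch: returns the underlying, library-specific
        # object. Anything done with it is outside the standard's
        # guarantees.
        return self._impl
```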

@MarcoGorelli
Contributor

MarcoGorelli commented Jun 22, 2023

I noticed that plotly have this:

https://github.com/plotly/plotly.py/blob/0e0eaa80b2d3f95376c429100b55506cf166bca5/packages/python/plotly/_plotly_utils/basevalidators.py#L169-L174

if it's False, then they iterate over the elements one by one


If we can find closure to this discussion, then that's probably everything that plotly would need to be DataFrame-agnostic (from my initial experimentation at least)

plotly's dataframe handling is way simpler than seaborn's, so they might be a better target for now (they've also expressed interest)
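The plotly pattern referenced above - use the array path when NumPy is importable, otherwise fall back to element-wise iteration - can be sketched like this (not plotly's actual code):

```python
try:
    import numpy as np
except ImportError:
    np = None  # the library supports running without NumPy installed

def coerce_values(values):
    # Fast path: hand everything to NumPy at once.
    if np is not None:
        return np.asarray(values)
    # Fallback: iterate over the elements one by one.
    return [v for v in values]
```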

@kkraus14
Collaborator

I noticed that plotly have this:

https://github.com/plotly/plotly.py/blob/4ab63db64b09d7e8ba28ea2eb67fbefe7c18ffd4/packages/python/plotly/_plotly_utils/basevalidators.py#L169-L174

if it's False, then they iterate over the elements one by one

The link here 404s, any chance you could update it?

@MarcoGorelli
Contributor

thanks @kkraus14 , have fixed

@MarcoGorelli
Contributor

superseded by #209 - closing then, thanks all for the discussion

Successfully merging this pull request may close these issues.

Need a method to convert a column to an array instance (numpy or other)
5 participants