
Add a to_array_obj method to Column #163


Closed
wants to merge 3 commits

Conversation

rgommers
Member

@rgommers rgommers commented May 3, 2023

Closes gh-139

@jorisvandenbossche
Member

Should we specify the behaviour in case of nulls? (do some automatic conversion, or raise an error?) Or add a keyword that controls to which value it is converted?

@rgommers
Member Author

rgommers commented May 4, 2023

Should we specify the behaviour in case of nulls? (do some automatic conversion, or raise an error?) Or add a keyword that controls to which value it is converted?

Good question - I added a keyword and specified ValueError when nulls cause a column to not be convertible.

NumPy's masked arrays are not recommended, and other array
libraries do not support missing values at all.
- ``raise``: always raise a ``ValueError`` if nulls are present.
- ``to-nan``: for floating-point dtypes, convert any nulls to ``nan``.
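As a hedged illustration of how these two modes could behave, here is a minimal sketch; the `column_to_array` helper and the plain-list column representation are hypothetical, not part of the spec:

```python
import numpy as np

def column_to_array(values, null_handling="raise"):
    # Hypothetical sketch of the proposed `null_handling` keyword.
    # `values` is a plain list standing in for a column's data, with
    # None marking nulls; a real implementation would operate on its
    # own internal buffers.
    has_nulls = any(v is None for v in values)
    if null_handling == "raise":
        if has_nulls:
            raise ValueError("column contains nulls and is not convertible")
        return np.asarray(values)
    if null_handling == "to-nan":
        # Per the proposal above, only valid for floating-point dtypes.
        return np.asarray(
            [np.nan if v is None else float(v) for v in values],
            dtype=np.float64,
        )
    raise ValueError(f"unknown null_handling: {null_handling!r}")
```

With `null_handling='to-nan'`, `[1.0, None, 3.0]` converts to a float array with `nan` in the middle; with `'raise'` the same input triggers a `ValueError`.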
Member


This could be generalized to apply to any numeric type, instead of only floating-point types (i.e., allowing conversion of integer columns with nulls to floats with NaNs).

Member Author


True. I'm not sure how desirable that is, it may be pragmatic in the absence of a good alternative. Let's see what others think about this.

Member Author


This option could also be removed completely, assuming we add fill_null(scalar_value), in favor of a separate col.fill_null(float('nan')).to_array_obj().

Member Author


Syntax-wise, keeping 'to-nan' is probably nicer though.

Contributor

@MarcoGorelli MarcoGorelli left a comment


Thanks for doing this

My only request would be to force the target dtype to be specified (to prevent implementers from stuffing everything into an object ndarray - looking at you, pandas nullable dtypes)

@rgommers
Member Author

rgommers commented May 4, 2023

My only request would be to force the target dtype to be specified

That would be counterproductive I think. The intended usage is this:

# If you want, say, a PyTorch tensor, replace `np.asarray` with `torch.Tensor`:
np.asarray(col.to_array_obj())

In, e.g., Matplotlib that is supposed to work for pretty much any dtype. How would you usefully fill in a dtype= keyword here?

(to prevent implementers from stuffing everything into an object ndarray - looking at you, pandas nullable dtypes)

That is a mis-feature in pandas, and it should be solved in pandas IMHO. Neither this standard nor any array library even has the concept of an object array.

If it would help to address this specific concern, I'd rather specify something like: "a dataframe library must produce the closest matching dtype of the array library in case it produces an array type instance itself (e.g., a column with a floating-point dtype must produce the corresponding floating-point dtype of the same precision that the array library offers)".

@MarcoGorelli
Contributor

I'd rather specify something like: "a dataframe library must produce the closest matching dtype of the array library in case it produces an array type instance itself (e.g., a column with a floating-point dtype must produce the corresponding floating-point dtype of the same precision that the array library offers)".

sure, sounds good, that makes it clear that the default behaviour of `to_numpy` for pandas' nullable dtypes is considered non-compliant

@jorisvandenbossche
Member

Yes, see also what Ralf already added about nulls causing a ValueError. Most of the cases that pandas currently gives object dtype would already give an error for that reason.

Contributor

@MarcoGorelli MarcoGorelli left a comment


This is fine actually, apologies, it already says

of a library implementing the array API standard

which excludes the object dtype anyway

Looks good to me, thanks!

@MarcoGorelli
Contributor

Looks good - OK to merge?

@rgommers
Member Author

Looks good - OK to merge?

I added the language regarding the "may produce object dtype" concern, and am now happy with this PR. There is one open point on what to do with null_handling='to-nan' - I don't have a strong preference there. We could discuss over the next day and merge then if everyone is happy.

@@ -17,6 +17,86 @@ class Column:
constructor functions or an already-created dataframe object retrieved via

"""
def to_array_obj(self, *, null_handling: str | None = None) -> object:
"""
Obtain an object that can be used as input to ``asarray`` or ``from_dlpack``
Collaborator


These are very different; I don't think using "or" here is really accurate. Based on the rest of the code, this seems to be geared towards `asarray`?

Member Author


Mostly, yes. The main pain point there is that the array API standard doesn't mandate that asarray has to support DLPack. There was a desire to keep that orthogonal (long discussion, and I believe you were involved in that). So now if we have a column with a dtype which can support DLPack, that will work with numpy.asarray but not necessarily with other libraries.
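To make the `asarray`-vs-`from_dlpack` distinction concrete, here is a sketch of a return object exposing both protocols by delegating to an underlying NumPy array. The `ArrayObj` class is purely illustrative, and `np.from_dlpack` requires NumPy >= 1.22:

```python
import numpy as np

class ArrayObj:
    # Illustrative stand-in for the object `to_array_obj` might return.
    # It exposes both `__array__` (consumed by `np.asarray`) and the
    # DLPack protocol (consumed by `np.from_dlpack`), delegating to a
    # wrapped NumPy array.
    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        return self._data if dtype is None else self._data.astype(dtype)

    def __dlpack__(self, *args, **kwargs):
        return self._data.__dlpack__(*args, **kwargs)

    def __dlpack_device__(self):
        return self._data.__dlpack_device__()

obj = ArrayObj([1.0, 2.0, 3.0])
a = np.asarray(obj)      # works for any dtype NumPy understands
b = np.from_dlpack(obj)  # works only for dtypes/libraries with DLPack support
```

This mirrors the point above: the `np.asarray` path is broadly available, while the DLPack path depends on support that the array API standard does not mandate for `asarray`.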

Member Author


So we can just drop from_dlpack from the description and leave everything else unchanged.

Obtain an object that can be used as input to ``asarray`` or ``from_dlpack``

The returned object must only be used for one thing: calling the ``asarray``
or ``from_dlpack`` functions of a library implementing the array API
Collaborator


+1 for only asarray here?


The returned object must only be used for one thing: calling the ``asarray``
or ``from_dlpack`` functions of a library implementing the array API
standard, or equivalent ``asarray``/``from_dlpack`` functions to the
Collaborator


+1

Comment on lines +37 to +38
- any other method that is known to work with an ``asarray`` function
in a library of interest
Collaborator


Is the goal of this function to work generally with multiple libraries' `asarray` implementations, or just a specific library's implementation? If the former, I would push for us to remove this point, since it's asking for library-specific things to be introduced.

Member Author


Is the goal of this function to work generally with multiple libraries' `asarray` implementations or just a specific library's implementation?

I'd say neither is 100% accurate. It's more the former; however, it does make sense to add library-specific methods - and in fact, `__array__` is one already. There is no single interchange method supported by all libraries' `asarray` functions anyway, nor is that a good goal, because it would eliminate support for dtypes like string and datetime, which NumPy supports and some other array library could support, but most libraries never will.

Collaborator


I guess my fear is that, say, some dataframe library `fancy_dataframe` could expect to work downstream with `fancy_array` and return something that supports `__fancy_array__`, which `fancy_array.asarray` handles - but then a developer calls `np.asarray` on the output of this and it understandably doesn't work.

Member Author


I don't think that fear is justified, because:

  • this method already allows returning an instance of `fancy_array` directly anyway,
  • the returned object should have other, more standard methods too where possible, so `np.asarray` can still work,
  • if there's something so specific about the column that it's not possible to add anything but `__fancy_array__` anyway, then nothing is lost by doing so.

@@ -17,6 +17,95 @@ class Column:
constructor functions or an already-created dataframe object retrieved via


Not related to the PR, but the sentence seems to be missing something?

Contributor

@MarcoGorelli MarcoGorelli left a comment


A suggestion which came out of today's call was to not have `to_array_obj` in the end, but to instead have `Column.column`, and to put a big warning in its docs telling users that if they invoke it, they are leaving the Standard's territory and we make no guarantees about what will or won't work with it.

I'd prefer that TBH. Any objections?

(requesting changes so this doesn't get accidentally merged before we've figured it out)

@jorisvandenbossche
Member

jorisvandenbossche commented May 25, 2023

instead have Column.column.

I don't see how that would address the same concern / use case as what `to_array_obj` tried to do in this PR. I assume that `Column.column` will return the underlying dataframe-library-specific object (based on the meeting notes), for example a Series for pandas (or an ExtensionArray).
But then, if your use case as a downstream user of the standard is that, for example, you need the data as a numpy array (or a generic buffer/memoryview), then you need to know the implementation-specific details of every dataframe-standard implementation out there to know how to convert this object to an array object. For the pandas object, that could be calling `to_numpy()`, but for another dataframe library it could be some other name.

I understand the concern that the specification of `to_array_obj` is not very strictly defined and a bit specific to certain implementations (i.e. the mention of `np.asarray`). But not having such a method only makes it worse, as now you need to know the implementation details of every possible dataframe library, defeating the purpose of being able to write agnostic code.
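The "know every implementation" problem described above can be sketched as the per-library dispatch downstream code would otherwise be forced to write; `fancy_lib` and `as_ndarray` are made up for illustration:

```python
import numpy as np

def column_to_numpy(col):
    # Anti-pattern sketch: without a standardized conversion method,
    # "agnostic" code degenerates into per-library special cases.
    mod = type(col).__module__.split(".")[0]
    if mod == "pandas":
        return col.to_numpy()      # pandas' spelling
    if mod == "fancy_lib":         # hypothetical library...
        return col.as_ndarray()    # ...with its own spelling
    # Last resort: hope the object is already array-like.
    return np.asarray(col)
```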

@jorisvandenbossche
Member

Some other options to consider:

  • Give up on trying to be array-library agnostic, and just add an explicit to_numpy() so at least we have something that is dataframe-library-independent to get a numpy array.
  • Limit ourselves to numeric data types for now, so we could specify that the returned object from to_array_obj() needs to have __dlpack__, to make this a more well defined specification

@kkraus14
Collaborator

I think the challenge is that the API as it currently exists is too ambiguous and encourages use, as opposed to being an "escape hatch". I think what we generally agreed on in the call today was:

  • There's desire for an API to return an array that guarantees you can use the array API against it, but doesn't make any guarantees about memory layout
  • There's desire for an API to return something that guarantees __dlpack__ interchange compatibility
  • These two shouldn't be the same API
  • There's going to need to be an escape hatch to use library specific functionality that is not standardized and this to_array_obj is just a single instance of that. We provide an escape hatch at the DataFrame level today via the .dataframe property: https://github.com/data-apis/dataframe-api/blob/main/spec/API_specification/dataframe_api/dataframe_object.py#L83-L90. Providing an equivalent of that at the Column level serves the purpose of to_array_obj as well as serving the purpose of future cases where there's desire to break away from the standard.
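A minimal sketch of the Column-level escape hatch described in the last bullet, mirroring the existing `DataFrame.dataframe` property; the class layout and `_impl` attribute are assumptions, not the eventual spec:

```python
class Column:
    # Sketch only: `_impl` stands in for the implementation-specific
    # column object (e.g. a pandas Series internally).
    def __init__(self, impl):
        self._impl = impl

    @property
    def column(self):
        # Escape hatch: returns the underlying, library-specific
        # object. Anything done with it is outside the standard's
        # guarantees.
        return self._impl
```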

@MarcoGorelli
Contributor

MarcoGorelli commented Jun 22, 2023

I noticed that plotly have this:

https://github.com/plotly/plotly.py/blob/0e0eaa80b2d3f95376c429100b55506cf166bca5/packages/python/plotly/_plotly_utils/basevalidators.py#L169-L174

if it's False, then they iterate over the elements one by one


If we can find closure to this discussion, then that's probably everything that plotly would need to be DataFrame-agnostic (from my initial experimentation at least)

plotly's dataframe handling is way simpler than seaborn's, so they might be a better target for now (they've also expressed interest)
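The plotly pattern referenced above - use the array path when NumPy is importable, otherwise fall back to element-wise iteration - can be sketched like this (not plotly's actual code):

```python
try:
    import numpy as np
except ImportError:
    np = None  # the library supports running without NumPy installed

def coerce_values(values):
    # Fast path: hand everything to NumPy at once.
    if np is not None:
        return np.asarray(values)
    # Fallback: iterate over the elements one by one.
    return [v for v in values]
```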

@kkraus14
Collaborator

I noticed that plotly have this:

https://github.com/plotly/plotly.py/blob/4ab63db64b09d7e8ba28ea2eb67fbefe7c18ffd4/packages/python/plotly/_plotly_utils/basevalidators.py#L169-L174

if it's False, then they iterate over the elements one by one

The link here 404s, any chance you could update it?

@MarcoGorelli
Contributor

thanks @kkraus14 , have fixed

@MarcoGorelli
Contributor

superseded by #209 - closing then, thanks all for the discussion

Successfully merging this pull request may close these issues.

Need a method to convert a column to an array instance (numpy or other)
5 participants