# Add a summary document for the dataframe interchange protocol #30

New file: `protocol/dataframe_protocol_summary.md` (193 additions)

# `__dataframe__` protocol - summary

_We've had a lot of discussion in a couple of GitHub issues and in meetings.
This description attempts to summarize that, and extract the essential design
requirements/principles and functionality it needs to support._

## Purpose of `__dataframe__`

The purpose of `__dataframe__` is to be a _data interchange_ protocol, i.e. a way to convert one type of dataframe into another (for example, a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into a Vaex dataframe).

Currently (Sep'20) there is no way to do this in an implementation-independent way.

The main use case this protocol intends to enable is to make it possible to
write code that can accept any type of dataframe instead of being tied to a
single type of dataframe. To illustrate that:

```python
def somefunc(df, ...):
    """`df` can be any dataframe supporting the protocol, rather than (say)
    only a pandas.DataFrame"""
    # could also be `cudf.from_dataframe(df)`, or `vaex.from_dataframe(df)`
    df = pd.from_dataframe(df)
    # From now on, use Pandas dataframe internally
```

### Non-goals

Providing a _complete standardized dataframe API_ is not a goal of the
`__dataframe__` protocol. Instead, this is a goal of the full dataframe API
standard, which the Consortium for Python Data API Standards aims to provide
in the future. When that full API standard is implemented by dataframe
libraries, the example above can change to:

```python
def get_df_module(df):
    """Utility function to support programming against a dataframe API"""
    if hasattr(df, '__dataframe_namespace__'):
        # Retrieve the namespace
        pdx = df.__dataframe_namespace__()
    else:
        # Here we can raise an exception if we only want to support compliant dataframes,
        # or convert to our default choice of dataframe if we want to accept (e.g.) dicts
        pdx = pd
        df = pd.DataFrame(df)

    return pdx, df


def somefunc(df, ...):
    """`df` can be any dataframe conforming to the dataframe API standard"""
    pdx, df = get_df_module(df)
    # From now on, use `df` methods and `pdx` functions/objects
```

### Constraints

An important constraint on the `__dataframe__` protocol is that it should not
make the goal of a complete standardized dataframe API more difficult to
achieve.

There is a small concern here. Say that a library adopts `__dataframe__`
first, and it goes from supporting only Pandas to officially supporting other
dataframes like `modin.pandas.DataFrame`. At that point, moving on to support
the full dataframe API standard as a next step _implies a backwards
compatibility break_ for users who by then rely on the Modin dataframe
support: `somefunc(df_modin)` would go from returning a Pandas dataframe to
returning a Modin dataframe. It must be made very clear to libraries adopting
`__dataframe__` that this is a consequence, and that it should be acceptable
to them.


### Progression / timeline

- **Current status**: most dataframe-consuming libraries work _only_ with
Pandas, and rely on many Pandas-specific functions, methods and behavior.
- **Status after `__dataframe__`**: with minor code changes (as in first
example above), libraries can start supporting all conforming dataframes,
convert them to Pandas dataframes, and still rely on the same
Pandas-specific functions, methods and behavior.
- **Status after standard dataframe API adoption**: libraries can start
supporting all conforming dataframes _without converting to Pandas or
relying on its implementation details_. At this point, it's possible to
"program to an interface" rather than to a specific library like Pandas.


## Protocol design requirements

1. Must be a standard API that is unambiguously specified, and not rely on
implementation details of any particular dataframe library.
2. Must treat dataframes as a collection of columns (which are 1-D arrays
with a dtype and missing data support).
3. Must include device support.
4. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
and provide an explicit way to force such transfers (e.g. a `force=` or
`copy=` keyword that the caller can set to `True`).
5. Must be zero-copy if possible.
6. Must be able to support "virtual columns" (e.g., a library like Vaex which
may not have data in memory because it uses lazy evaluation).
7. Must support missing values (`NA`) for all supported dtypes.
8. Must support string and categorical dtypes
(_TBD: not discussed a lot, is this a hard requirement?_)
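
To make these requirements a little more concrete, here is a rough
consumer-side sketch. All names used on the returned object
(`column_names`, `column`, `device`, ...) are hypothetical and only meant to
illustrate the requirements; they are not a proposed API:

```python
import numpy as np

def columns_to_numpy(df, allow_copy=False):
    """Consume any dataframe exposing a (hypothetical) ``__dataframe__``
    method, returning a dict of 1-D NumPy arrays, without importing the
    library that produced `df`."""
    dfx = df.__dataframe__()          # standard, library-agnostic entry point (req. 1)
    out = {}
    for name in dfx.column_names():   # a dataframe is a collection of columns (req. 2)
        col = dfx.column(name)
        if getattr(col, "device", "cpu") != "cpu" and not allow_copy:
            # device support, but no implicit device transfers (reqs. 3 and 4)
            raise TypeError(f"column {name!r} is not in CPU memory; "
                            "pass allow_copy=True to force a transfer")
        out[name] = np.asarray(col)   # ideally zero-copy for CPU data (req. 5)
    return out
```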

We'll also list some things that were discussed but are not requirements:

1. Object dtype does not need to be supported (_TBD: this is what Joris said,
but doesn't Pandas use object dtype to represent strings?_).
2. Heterogeneous/structured dtypes within a single column do not need to be
supported.
_Rationale: not used a lot, additional design complexity not justified._

**Member:**

The Arrow C data interface supports this through Arrow struct type (not exactly the same as numpy's structured dtype, though).

But we probably need to say something in general about nested types. Because you mention "heterogenous", but you could eg also have a homogeneous list type ("ragged" array like).

**Member Author:**

Would anyone have a use for that, and even if that is the case - would there be enough demand that multiple dataframe libraries would implement that in a way that it would work the same for a user?

I think we should just extend the out-of-scope part to make it more clear that nested dtypes, custom/extension dtypes, etc. are all out of scope. And in the in-scope part, be more explicit about the dtypes that are included (I made a summary of the discussions we had, so I didn't list things we all seem to agree on and that are obvious, like "integer, floating point, .... dtypes are in scope").

**Member:**

I think there would certainly be use cases for it (eg cudf already supports nested types, in pandas we are considering adding a list dtype, and I suppose eg koalas (since it is spark-based) supports nested types as well). But I also think it is certainly fine to leave it out for a first iteration.

(leveraging the Arrow type definitions / interface would basically give it for free)

**Member Author:**

I had a look at what Arrow does here. Nested types and a list dtype seem to me to be very similar to object arrays, they're basically a variable-size container that's very flexible, you can stuff pretty much anything into it.

> (leveraging the Arrow type definitions / interface would basically give it for free)

Free as in "we have a spec already", but it seems complex implementation-wise.

> But I also think it is certainly fine to leave it out for a first iteration.

I'd be inclined to do that, and just clarify the statement on this topic a bit.

**Member Author:**

Edited, and added extension dtypes here too - that's in the same bucket I'd say. Possible to add in the future, a bridge too far for the first version.

**Member:**

> Nested types and a list dtype seem to me to be very similar to object arrays

I am not sure that is an accurate description (at least, on a memory layout level, but assuming we are talking about that).
Object arrays are arrays with pointers to python objects that can live anywhere in memory (AFAIK?), while nested types in arrow consist of several contiguous arrays (eg a list array is one plain array with the actual values, and one array with the indices where each list starts).

Also eg a struct type array consists of one array per key of the struct, which is also not that complex implementation wise, I would say.

**Member Author:**

> Object arrays are arrays with pointers to python objects that can live anywhere in memory (AFAIK?)

Yes indeed.



## Frequently asked questions

### Can the Arrow C Data Interface be used for this?

What we are aiming for is quite similar to the Arrow C Data Interface (see
the [rationale for the Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html#rationale)),
except `__dataframe__` is a Python-level rather than C-level interface.

**Collaborator** (on lines +128 to +130):

One key thing is that the Arrow C Data interface relies on providing a deletion / finalization method similar to DLPack. That is something that hasn't been discussed too much, but that we should iron out for this proposal.

**Member Author:**

Hmm, that's a good one. Since we're discussing a Python-level API, a deletion / finalization method seems a bit "foreign" / C-specific. But I agree that it's important. I have to admit I haven't fully figured out what all the observable behaviour differences to a Python user are between the deleter method and refcounting - should write a set of tests for that.

**Member:**

Another question here: is "Python-level" a design requirement? (if so, probably should be added in the list above)
For example also DLPack, considered as the exchange protocol format for the array standard, is a C-level interface?

(to be clear, I am not very familiar with those aspects. It might also be you can have a Python interface to a C-level exchange format?)

**Member Author:**

Added a TODO for desired semantics (deletion/finalization vs. buffer protocol type behaviour) here.

> Another question here: is "Python-level" a design requirement? (if so, probably should be added in the list above)

I'd say yes, added to item 1 of the list of design requirements. A C-only interface would probably be asking too much from consumers here, and maximal performance doesn't seem too important compared to having this functionality available in the first place.

**Member Author:**

After looking into the semantics some more, I came to the conclusion that the Arrow spec doesn't define any particular semantics that matter to Python users. The release callback semantics matter a lot to library implementers (both producer and consumer), but in the end it doesn't say anything about whether memory is shared or not. It allows for zero-copy but doesn't mandate it - so the copy/view + mutation ambiguity is similar to the one we had the large discussion about for arrays.


The limitations seem to be:
- No device support (@kkraus14 will bring this up on the Arrow dev mailing list)
- Specific to columnar data (_at least, this is what its docs say_).
  TODO: are there any concerns for, e.g., Koalas or Ibis?

**Member:**

I am not sure what this limitation points at. Isn't columnar data the scope of this proposal? (above it says "treat dataframes as a collection of columns")

**Member Author:**

Yes true, maybe the TODO belongs more with that first statement, or it's a non-issue. I put it here because the Arrow doc is so adamant about only dealing with columnar data, so I thought about it when writing this section.

I think this all still works fine for a row-based dataframe, it's just a more expensive conversion?

**Member:**

In my head, this whole document is about columnar conversions? But maybe that is a wrong assumption? (in any case something to discuss then in the meeting tomorrow)

For example, also the "Possible direction for implementation" (the comment of @kkraus14 in the other issue) is about a columnar exchange.
(and indeed, your dataframe doesn't necessarily need to be implemented columnar to support a column-based exchange protocol, it might only be less efficient).

Other exchange types, like row-based, might also be interesting, but I think they will be sufficiently different that they warrant a separate discussion / document.

**Member Author:**

> In my head, this whole document is about columnar conversions?

I think you're right, it's just a matter of being careful about terminology. It's about columnar conversions, and treating dataframes as a collection of columns from the user's point of view, but I think we should be careful to avoid implying that the implementation must use columnar storage, or that a column is an array.


Note that categoricals are supported; Arrow uses the phrasing
"dictionary-encoded types" for categoricals.

The Arrow C Data Interface says specifically it was inspired by [Python's
buffer protocol](https://docs.python.org/3/c-api/buffer.html), which is also
a C-only and CPU-only interface. See `__array_interface__` below for a
Python-level equivalent of the buffer protocol.


### Is `__dataframe__` analogous to `__array__` or `__array_interface__`?

Yes, it is fairly analogous to `__array_interface__`. There will be some
differences though, for example `__array_interface__` doesn't know about
devices, and it's a `dict` with a pointer to memory so there's an assumption
that the data lives in CPU memory (which may not be true, e.g. in the case of
cuDF or Vaex).
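
For reference, this is roughly what `__array_interface__` looks like for a
small NumPy array (the pointer value shown is of course illustrative):

```python
import numpy as np

x = np.arange(3, dtype=np.int64)
print(x.__array_interface__)
# {'data': (94270929022032, False), 'strides': None,
#  'descr': [('', '<i8')], 'typestr': '<i8',
#  'shape': (3,), 'version': 3}
# 'data' is a raw CPU memory address - hence the CPU-only assumption.
```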

It is _not_ analogous to `__array__`, which is NumPy-specific. `__array__` is
a method attached to array/tensor-like objects, and calling it asks the object
it's attached to to turn itself into a NumPy array. Hence, the library that
implements `__array__` must depend on NumPy, and call a NumPy `ndarray`
constructor itself from within `__array__`.
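
For example, a minimal `__array__` implementation looks roughly like this;
note how the producing class itself has to import and call NumPy:

```python
import numpy as np

class MyColumn:
    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None):
        # The producer constructs the NumPy array itself, so it must
        # depend on NumPy.
        return np.array(self._values, dtype=dtype)

np.asarray(MyColumn([1, 2, 3]))  # -> array([1, 2, 3])
```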


### What is wrong with `.to_numpy()` and `.to_arrow()`?

Such methods ask the object they are attached to to turn itself into a NumPy
or Arrow array, which means each library must have at least an optional
dependency on NumPy and on Arrow if it implements those methods. This leads
to unnecessary coupling between libraries, and hence is a suboptimal choice -
we'd like to avoid it if we can.

Instead, it should be dataframe consumers that rely on NumPy or Arrow, since
they are the ones that need such a particular format. So they can call the
constructor they need. For example, `x = np.asarray(df['colname'])` (where
`df` supports `__dataframe__`).

**Member:**

This ignores an important aspect brought up in the discussion, I think. One of the arguments for having such dedicated methods is that you might need a numpy array in a specific memory layout (eg because your cython algo requires it). Numpy's type support is less rich than what is found in dataframe libraries (eg no categorical, string, decimal, ...), and numpy doesn't support missing values. So as the consumer you might want to have control over how the conversion is done.

For example in pandas, the `to_numpy` method has a `na_value` keyword to control what value (compatible with the numpy dtype) is used for missing values.

This of course doesn't necessarily require a to_numpy method, as we might be able to give this kind of control in other ways. But I think having this kind of control is an important use case (at least for compatibility with numpy-based (or general array-based) libraries).

**Member Author:**

That's a good point. At the moment `na_value` is the one thing I see in Pandas. Is there anything else that other dataframe libraries have, or that you know may be needed in the future?

**Member:**

Also the actual dtype, I think. For example, if you have a categorical column, do you want to get a "densified" array (eg a string array if the categories are strings) or the integer indices? If you have a string column, do you want to get a numpy str dtype or object dtype array? If you have a decimal column, do you want a numpy float array or an object array with decimal objects? Etc.
Basically, any data type that has no direct mapping to a numpy dtype potentially has multiple options for how to convert it to numpy.

**Comment:**

Could we avoid this discussion by solving this in the future? E.g.

    x = np.array(df['col'].fill_na(4))
    y = np.array(df['col'].as_type(pdx.types.float32))

**Member Author:**

This suggestion makes sense I think @maartenbreddels. Longer-term this does seem like the cleaner solution.

This is actually a bit of a side discussion, triggered by the presence of `to_numpy` and `to_arrow` in wesm/dataframe-protocol#1. In that prototype it's actually not connected to `__dataframe__`. And converting one dataframe into another kind of dataframe is a different goal/conversation than "convert a column into an array". We do not have a design requirement for the latter. So I'd say we clarify that in the doc so that this FAQ item has some more context, and then just leave it out.

**Member:**

> And converting one dataframe into another kind of dataframe is a different goal/conversation than "convert a column into an array"

But often, converting one dataframe into another dataframe will go through converting each column of dataframe 1 to an array, and assembling dataframe 2 from those arrays?
At least, that's how I imagine dataframe exchange to work based on the protocol we are discussing (if you want to avoid specific knowledge about the type of dataframe 1).

**Member Author:**

Yes, that's fair enough - that's completely implicit right now and should probably be made more explicit; that will both make this section make more sense and help when implementing. I'll retract my "this is a different goal" comment and will try to fix this up.



### Does an interface describing memory work for virtual columns?

Vaex is an example of a library that can have "virtual columns" (see @maartenbreddels
[comment here](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-686373569)).
If the protocol includes a description of data layout in memory, does that
work for such a virtual column?

Yes. Virtual columns need to be materialized in memory before they can be
turned into a column for a different type of dataframe - that will be true
for every discussed form of the protocol; whether there's a `to_arrow()` or
something else does not matter. Vaex can choose _how_ to materialize (e.g.,
to an Arrow array, a NumPy array, or a raw memory buffer) - as long as the
returned description of memory layout is valid, all those options can later
be turned into the desired column format without a data copy, so the
implementation choice here really doesn't matter much.

_Note: the above statement on materialization assumes that there are many
forms in which a virtual column can be implemented, that those are all
custom/different, and that at this point it makes little sense to standardize
them. For example, one could do this with a simple string DSL (`'col_C =
col_A + col_B'`), with fancier C++-style lazy evaluation, with a
computational graph approach like Dask uses, etc._
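
As a purely hypothetical sketch of the string-DSL variant (none of these
names exist in Vaex or any other library):

```python
import numpy as np

class VirtualColumn:
    """A column defined by an expression over other columns; no memory is
    allocated until materialize() is called."""
    def __init__(self, expression, columns):
        self.expression = expression   # e.g. "col_A + col_B"
        self._columns = columns        # dict of name -> NumPy array

    def materialize(self):
        # Only at this point does the column exist in memory; from here on
        # it can be described to a consumer like any other column.
        return eval(self.expression, {}, dict(self._columns))

cols = {"col_A": np.array([1.0, 2.0]), "col_B": np.array([3.0, 4.0])}
col_C = VirtualColumn("col_A + col_B", cols)
print(col_C.materialize())  # [4. 6.]
```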


## Possible direction for implementation

The `cuDFDataFrame`, `cuDFColumn` and `cuDFBuffer` sketched out by @kkraus14
[here](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-685123386)
seem to be in the right direction.

TODO: work this out after making sure we're all on the same page regarding requirements.
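
As a non-normative illustration of that layering (the class and attribute
names below are made up, and dtypes, null handling and the exact memory
description are deliberately left out):

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class Buffer:
    """A contiguous block of memory on some device."""
    ptr: int       # address of the start of the buffer
    bufsize: int   # size in bytes
    device: str    # e.g. "cpu" or "cuda:0"

@dataclass
class Column:
    """A single 1-D column: data buffer, optional validity (missing-value)
    buffer, and a dtype description."""
    data: Buffer
    validity: Optional[Buffer]
    dtype: str     # placeholder; the real protocol needs a proper dtype spec
    length: int

class DataFrame:
    """The object a producing library would return from __dataframe__()."""
    def __init__(self, columns: Dict[str, Column]):
        self._columns = columns

    def column_names(self) -> Tuple[str, ...]:
        return tuple(self._columns)

    def column(self, name: str) -> Column:
        return self._columns[name]
```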