Commit 49f3b06

Add a summary document for the dataframe interchange protocol

Summarizes the various discussions about, and the goals/non-goals and requirements for, the `__dataframe__` data interchange protocol. The intended audience for this document is Consortium members and dataframe library maintainers who may want to support this protocol. @datapythonista will add a companion document that's a more gentle introduction/tutorial, in a "from zero to a protocol" style. The aim is to keep updating this until we have captured all the requirements and answered all the FAQs, so we can then design the protocol and verify it meets all our requirements. Closes gh-29
1 parent 6af8c2a commit 49f3b06

1 file changed: +193 −0

# `__dataframe__` protocol - summary

_We've had a lot of discussion in a couple of GitHub issues and in meetings.
This description attempts to summarize that, and to extract the essential design
requirements/principles and the functionality the protocol needs to support._

## Purpose of `__dataframe__`

The purpose of `__dataframe__` is to be a _data interchange_ protocol, i.e. a way to convert one type of dataframe into another type (for example, convert a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into a Vaex dataframe).

Currently (Sep'20) there is no way to do this in an implementation-independent way.

The main use case this protocol intends to enable is to make it possible to
write code that can accept any type of dataframe instead of being tied to a
single type of dataframe. To illustrate that:

```python
import pandas as pd


def somefunc(df, *args):
    """`df` can be any dataframe supporting the protocol, rather than (say)
    only a pandas.DataFrame"""
    # could also be `cudf.from_dataframe(df)`, or `vaex.from_dataframe(df)`
    df = pd.from_dataframe(df)
    # From now on, use Pandas dataframe internally
```

### Non-goals

Providing a _complete standardized dataframe API_ is not a goal of the
`__dataframe__` protocol. Instead, this is a goal of the full dataframe API
standard, which the Consortium for Python Data API Standards aims to provide
in the future. When that full API standard is implemented by dataframe
libraries, the example above can change to:

```python
import pandas as pd


def get_df_module(df):
    """Utility function to support programming against a dataframe API"""
    if hasattr(df, '__dataframe_namespace__'):
        # Retrieve the namespace
        pdx = df.__dataframe_namespace__()
    else:
        # Here we can raise an exception if we only want to support compliant
        # dataframes, or convert to our default choice of dataframe if we
        # want to accept (e.g.) dicts
        pdx = pd
        df = pd.DataFrame(df)

    return pdx, df


def somefunc(df, *args):
    """`df` can be any dataframe conforming to the dataframe API standard"""
    pdx, df = get_df_module(df)
    # From now on, use `df` methods and `pdx` functions/objects
```

### Constraints

An important constraint on the `__dataframe__` protocol is that it should not
make achieving the goal of the complete standardized dataframe API more
difficult.

There is a small concern here. Say that a library adopts `__dataframe__` first,
and it goes from supporting only Pandas to officially supporting other
dataframes like `modin.pandas.DataFrame`. At that point, changing to
supporting the full dataframe API standard as a next step _implies a
backwards compatibility break_ for users that now start relying on Modin
dataframe support. E.g., after that second transition, `somefunc(df_modin)`
will return a Modin dataframe rather than a Pandas dataframe. It must be made
very clear to libraries accepting `__dataframe__` that this is a consequence,
and that this should be acceptable to them.

### Progression / timeline

- **Current status**: most dataframe-consuming libraries work _only_ with
  Pandas, and rely on many Pandas-specific functions, methods and behavior.
- **Status after `__dataframe__`**: with minor code changes (as in the first
  example above), libraries can start supporting all conforming dataframes,
  convert them to Pandas dataframes, and still rely on the same
  Pandas-specific functions, methods and behavior.
- **Status after standard dataframe API adoption**: libraries can start
  supporting all conforming dataframes _without converting to Pandas or
  relying on its implementation details_. At this point, it's possible to
  "program to an interface" rather than to a specific library like Pandas.

## Protocol design requirements

1. Must be a standard API that is unambiguously specified, and not rely on
   implementation details of any particular dataframe library.
2. Must treat dataframes as a collection of columns (which are 1-D arrays
   with a dtype and missing data support).
3. Must include device support.
4. Must avoid device transfers by default (e.g. copy data from GPU to CPU),
   and provide an explicit way to force such transfers (e.g. a `force=` or
   `copy=` keyword that the caller can set to `True`).
5. Must be zero-copy if possible.
6. Must be able to support "virtual columns" (e.g., a library like Vaex which
   may not have data in memory because it uses lazy evaluation).
7. Must support missing values (`NA`) for all supported dtypes.
8. Must support string and categorical dtypes
   (_TBD: not discussed a lot, is this a hard requirement?_)

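To make these requirements a bit more concrete, here is a minimal sketch of
what a single column could expose under the protocol. This is an illustration
only - `ProtocolColumn`, `device`, `null_count`, `get_buffers` and the `copy=`
keyword are hypothetical names, not a settled design:

```python
from typing import Any, Optional, Tuple


class ProtocolColumn:
    """Hypothetical sketch of one column in the interchange protocol."""

    @property
    def dtype(self) -> str:
        """Dtype string, e.g. 'int64', 'string', 'categorical'
        (requirements 2 and 8)."""
        ...

    @property
    def device(self) -> str:
        """Where the data lives, e.g. 'cpu' or 'gpu' (requirement 3)."""
        ...

    @property
    def null_count(self) -> Optional[int]:
        """Number of missing values, if known (requirement 7)."""
        ...

    def get_buffers(self, copy: bool = False) -> Tuple[Any, Optional[Any]]:
        """Return (data buffer, validity mask), zero-copy where possible
        (requirement 5); `copy=True` explicitly allows a device transfer
        (requirement 4)."""
        ...
```
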
We'll also list some things that were discussed but are not requirements:

1. Object dtype does not need to be supported (_TBD: this is what Joris said,
   but doesn't Pandas use object dtype to represent strings?_).
2. Heterogeneous/structured dtypes within a single column do not need to be
   supported.
   _Rationale: not used a lot, additional design complexity not justified._

## Frequently asked questions

### Can the Arrow C Data Interface be used for this?

What we are aiming for is quite similar to the Arrow C Data Interface (see
the [rationale for the Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html#rationale)),
except `__dataframe__` is a Python-level rather than C-level interface.

The limitations seem to be:

- No device support (@kkraus14 will bring this up on the Arrow dev mailing list)
- Specific to columnar data (_at least, this is what its docs say_).
  TODO: are there any concerns for, e.g., Koalas or Ibis?

Note that categoricals are supported; Arrow uses the phrasing
"dictionary-encoded types" for categoricals.

The Arrow C Data Interface says specifically it was inspired by [Python's
buffer protocol](https://docs.python.org/3/c-api/buffer.html), which is also
a C-only and CPU-only interface. See `__array_interface__` below for a
Python-level equivalent of the buffer protocol.

### Is `__dataframe__` analogous to `__array__` or `__array_interface__`?

Yes, it is fairly analogous to `__array_interface__`. There will be some
differences though; for example, `__array_interface__` doesn't know about
devices, and it's a `dict` with a pointer to memory, so there's an assumption
that the data lives in CPU memory (which may not be true, e.g. in the case of
cuDF or Vaex).

It is _not_ analogous to `__array__`, which is NumPy-specific. `__array__` is a
method attached to array/tensor-like objects, and calling it asks the
object it's attached to to turn itself into a NumPy array. Hence, the
library that implements `__array__` must depend on NumPy, and call a NumPy
`ndarray` constructor itself from within `__array__`.
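
As a small illustration of that difference, consider a toy CPU-only column
(a hypothetical class, not a real API): providing `__array_interface__` means
returning a plain `dict` describing memory, while providing `__array__` means
importing NumPy and constructing the array yourself:

```python
import ctypes

import numpy as np


class MyColumn:
    """Toy CPU-backed column used to contrast the two protocols."""

    def __init__(self):
        # Four float64 values, zero-initialized, owned by this object
        self._data = (ctypes.c_double * 4)()

    @property
    def __array_interface__(self):
        # A plain dict describing memory; producing it needs no NumPy import
        return {
            'shape': (4,),
            'typestr': '<f8',  # little-endian float64
            'data': (ctypes.addressof(self._data), False),
            'version': 3,
        }

    def __array__(self):
        # By contrast, this requires a hard dependency on NumPy inside
        # the implementing library
        return np.frombuffer(self._data, dtype=np.float64)


col = MyColumn()
arr = np.asarray(col)  # NumPy can consume either protocol, without a copy
```
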
### What is wrong with `.to_numpy()` and `.to_arrow()`?

Such methods ask the object they are attached to to turn itself into a NumPy
or Arrow array, which means each library must have at least an optional
dependency on NumPy and on Arrow if it implements those methods. This leads
to unnecessary coupling between libraries, and hence is a suboptimal choice -
we'd like to avoid this if we can.

Instead, it should be the dataframe consumer that relies on NumPy or Arrow,
since it is the one that needs such a particular format, and so it can call
the constructor it needs. For example, `x = np.asarray(df['colname'])` (where
`df` supports `__dataframe__`).
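
For illustration, here is a hedged sketch of such a consumer-driven
conversion. It assumes a `__dataframe__()` method returning an object with
`column_names()` and `get_column()` methods - those names are assumptions for
this sketch, not an agreed API:

```python
import numpy as np
import pandas as pd


def from_dataframe(df):
    """Hypothetical consumer-side conversion to a pandas.DataFrame."""
    if not hasattr(df, '__dataframe__'):
        raise TypeError("`df` does not support the `__dataframe__` protocol")

    protocol_df = df.__dataframe__()
    # The consumer (not the producer) depends on NumPy and pandas, and calls
    # the constructors it needs; `column_names`/`get_column` are assumed names
    data = {name: np.asarray(protocol_df.get_column(name))
            for name in protocol_df.column_names()}
    return pd.DataFrame(data)
```
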
### Does an interface describing memory work for virtual columns?

Vaex is an example of a library that can have "virtual columns" (see @maartenbreddels'
[comment here](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-686373569)).
If the protocol includes a description of data layout in memory, does that
work for such a virtual column?

Yes. Virtual columns need to be materialized in memory before they can be
turned into a column for a different type of dataframe - that will be true
for every discussed form of the protocol; whether there's a `to_arrow()` or
something else does not matter. Vaex can choose _how_ to materialize (e.g.,
to an Arrow array, a NumPy array, or a raw memory buffer) - as long as the
returned description of memory layout is valid, all those options can later
be turned into the desired column format without a data copy, so the
implementation choice here really doesn't matter much.

_Note: the above statement on materialization assumes that there are many
ways a virtual column can be implemented, that those are all
custom/different, and that at this point it makes little sense to standardize
them. For example, one could do this with a simple string DSL (`'col_C =
col_A + col_B'`), with fancier C++-style lazy evaluation, with a
computational graph approach like Dask uses, etc._
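
As a toy illustration of that materialization point, here is a sketch of a
lazy `col_A + col_B` column that only evaluates itself when its memory layout
is requested (the class and its methods are hypothetical):

```python
import numpy as np


class VirtualColumn:
    """Toy lazy column: `col_C = col_A + col_B`, evaluated on demand."""

    def __init__(self, col_a, col_b):
        self._col_a = col_a
        self._col_b = col_b
        self._materialized = None

    def materialize(self):
        # Evaluate the expression once; the result is a plain in-memory
        # array whose layout the protocol can then describe
        if self._materialized is None:
            self._materialized = np.asarray(self._col_a) + np.asarray(self._col_b)
        return self._materialized

    @property
    def __array_interface__(self):
        # Describing memory forces materialization; after that, handing
        # the data over is zero-copy
        return self.materialize().__array_interface__


col_c = VirtualColumn([1, 2, 3], [10, 20, 30])
print(np.asarray(col_c))  # materializes once, then converts without a copy
```
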
## Possible direction for implementation

The `cuDFDataFrame`, `cuDFColumn` and `cuDFBuffer` classes sketched out by @kkraus14
[here](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-685123386)
seem to be in the right direction.

TODO: work this out after making sure we're all on the same page regarding requirements.
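
For reference, a rough sketch of what such a dataframe/column/buffer split
could look like. The class and attribute names below are illustrative
assumptions loosely following that comment, not the actual cuDF classes:

```python
from typing import Dict, Iterable, Optional


class Buffer:
    """A contiguous block of memory on some device."""

    def __init__(self, ptr: int, nbytes: int, device: str = 'cpu'):
        self.ptr = ptr        # raw pointer to the start of the data
        self.nbytes = nbytes  # size of the block in bytes
        self.device = device  # e.g. 'cpu' or 'gpu'


class Column:
    """A 1-D array: a data buffer plus a dtype and an optional validity mask."""

    def __init__(self, data: Buffer, dtype: str,
                 mask: Optional[Buffer] = None):
        self.data = data
        self.dtype = dtype    # e.g. 'int64', 'float32', 'categorical'
        self.mask = mask      # missing-value bitmap, if any


class DataFrame:
    """An ordered collection of named columns."""

    def __init__(self, columns: Dict[str, Column]):
        self._columns = columns

    def column_names(self) -> Iterable[str]:
        return self._columns.keys()

    def get_column(self, name: str) -> Column:
        return self._columns[name]

    def __dataframe__(self):
        # Protocol entry point: hand back the interchange object itself
        return self
```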
