# `__dataframe__` protocol - summary

_We've had a lot of discussion in a couple of GitHub issues and in meetings.
This description attempts to summarize that discussion, and to extract the
essential design requirements/principles and the functionality the protocol
needs to support._

## Purpose of `__dataframe__`

The purpose of `__dataframe__` is to be a _data interchange_ protocol, i.e., a way to convert one type of dataframe into another type (for example, convert a Koalas dataframe into a Pandas dataframe, or a cuDF dataframe into a Vaex dataframe).

Currently (Sep'20) there is no way to do this in an implementation-independent way.

The main use case this protocol intends to enable is to make it possible to
write code that can accept any type of dataframe instead of being tied to a
single type of dataframe. To illustrate that:

```python
def somefunc(df, ...):
    """`df` can be any dataframe supporting the protocol, rather than (say)
    only a pandas.DataFrame"""
    # could also be `cudf.from_dataframe(df)`, or `vaex.from_dataframe(df)`
    df = pd.from_dataframe(df)
    # From now on, use Pandas dataframe internally
```

### Non-goals

Providing a _complete standardized dataframe API_ is not a goal of the
`__dataframe__` protocol. Instead, this is a goal of the full dataframe API
standard, which the Consortium for Python Data API Standards aims to provide
in the future. When that full API standard is implemented by dataframe
libraries, the example above can change to:

```python
def get_df_module(df):
    """Utility function to support programming against a dataframe API"""
    if hasattr(df, '__dataframe_namespace__'):
        # Retrieve the namespace
        pdx = df.__dataframe_namespace__()
    else:
        # Here we can raise an exception if we only want to support compliant dataframes,
        # or convert to our default choice of dataframe if we want to accept (e.g.) dicts
        pdx = pd
        df = pd.DataFrame(df)

    return pdx, df


def somefunc(df, ...):
    """`df` can be any dataframe conforming to the dataframe API standard"""
    pdx, df = get_df_module(df)
    # From now on, use `df` methods and `pdx` functions/objects
```

### Constraints

An important constraint on the `__dataframe__` protocol is that it should
not make the goal of a complete standardized dataframe API harder to
achieve.

There is a small concern here. Say that a library adopts `__dataframe__`
first, and thereby goes from supporting only Pandas to officially supporting
other dataframes like `modin.pandas.DataFrame`. At that point, moving on to
support the full dataframe API standard as a next step _implies a backwards
compatibility break_ for users who by then rely on Modin dataframe support:
`somefunc(df_modin)` will change from returning a Pandas dataframe to
returning a Modin dataframe. It must be made very clear to libraries
adopting `__dataframe__` that this is a consequence, and that it should be
acceptable to them.

### Progression / timeline

- **Current status**: most dataframe-consuming libraries work _only_ with
  Pandas, and rely on many Pandas-specific functions, methods and behavior.
- **Status after `__dataframe__`**: with minor code changes (as in the first
  example above), libraries can start supporting all conforming dataframes,
  convert them to Pandas dataframes, and still rely on the same
  Pandas-specific functions, methods and behavior.
- **Status after standard dataframe API adoption**: libraries can start
  supporting all conforming dataframes _without converting to Pandas or
  relying on its implementation details_. At this point, it's possible to
  "program to an interface" rather than to a specific library like Pandas.


## Protocol design requirements

1. Must be a standard API that is unambiguously specified, and not rely on
   implementation details of any particular dataframe library.
2. Must treat dataframes as a collection of columns (which are 1-D arrays
   with a dtype and missing data support).
3. Must include device support.
4. Must avoid device transfers by default (e.g. copying data from GPU to
   CPU), and provide an explicit way to force such transfers (e.g. a
   `force=` or `copy=` keyword that the caller can set to `True`).
5. Must be zero-copy wherever possible.
6. Must be able to support "virtual columns" (e.g., a library like Vaex
   may not have data in memory because it uses lazy evaluation).
7. Must support missing values (`NA`) for all supported dtypes.
8. Must support string and categorical dtypes
   (_TBD: not discussed a lot; is this a hard requirement?_)

We'll also list some things that were discussed but are not requirements:

1. Object dtype does not need to be supported (_TBD: this is what Joris said,
   but doesn't Pandas use object dtype to represent strings?_).
2. Heterogeneous/structured dtypes within a single column do not need to be
   supported.
   _Rationale: not used a lot; the additional design complexity is not justified._
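
To make the requirements above more tangible, here is a minimal sketch of the kind of object hierarchy the protocol could expose. All names here (`Buffer`, `Column`, `DataFrame`, `get_buffers`, `allow_copy`) are hypothetical illustrations, not part of any agreed-upon specification:

```python
# Illustrative sketch only: class and method names are hypothetical.

class Buffer:
    """A contiguous block of memory on some device (requirement 3)."""

    @property
    def bufsize(self):
        """Size of the buffer in bytes."""
        raise NotImplementedError

    @property
    def ptr(self):
        """Pointer to the start of the buffer, as an integer."""
        raise NotImplementedError

    @property
    def device(self):
        """Device the memory lives on, e.g. 'cpu' or 'cuda'."""
        raise NotImplementedError


class Column:
    """A 1-D array with a dtype and missing-value support (requirements 2, 7)."""

    @property
    def dtype(self):
        """Dtype description, e.g. a (kind, bitwidth) tuple like ('f', 64)."""
        raise NotImplementedError

    @property
    def null_count(self):
        """Number of missing (NA) values, if known."""
        raise NotImplementedError

    def get_buffers(self):
        """The data buffer, plus optional validity/offsets buffers."""
        raise NotImplementedError


class DataFrame:
    """A collection of named columns (requirement 2)."""

    def column_names(self):
        raise NotImplementedError

    def get_column_by_name(self, name):
        raise NotImplementedError

    def __dataframe__(self, allow_copy=True):
        # `allow_copy` models requirements 4/5: a consumer could set it to
        # False to demand zero-copy, device-local access.
        return self
```

A consumer would then call `df.__dataframe__()` and walk the columns and buffers to build its own dataframe type, without either library importing the other.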


## Frequently asked questions

### Can the Arrow C Data Interface be used for this?

What we are aiming for is quite similar to the Arrow C Data Interface (see
the [rationale for the Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html#rationale)),
except `__dataframe__` is a Python-level rather than a C-level interface.

The limitations seem to be:
- No device support (@kkraus14 will bring this up on the Arrow dev mailing list).
- Specific to columnar data (_at least, this is what its docs say_).
  TODO: are there any concerns for, e.g., Koalas or Ibis?

Note that categoricals are supported; Arrow uses the phrasing
"dictionary-encoded types" for categoricals.

The Arrow C Data Interface says specifically that it was inspired by [Python's
buffer protocol](https://docs.python.org/3/c-api/buffer.html), which is also
a C-only and CPU-only interface. See `__array_interface__` below for a
Python-level equivalent of the buffer protocol.


### Is `__dataframe__` analogous to `__array__` or `__array_interface__`?

Yes, it is fairly analogous to `__array_interface__`. There will be some
differences though; for example, `__array_interface__` doesn't know about
devices, and it's a `dict` with a pointer to memory, so there's an assumption
that the data lives in CPU memory (which may not be true, e.g. in the case of
cuDF or Vaex).

It is _not_ analogous to `__array__`, which is NumPy-specific. `__array__` is
a method attached to array/tensor-like objects, and calling it asks the
object it's attached to to turn itself into a NumPy array. Hence, the
library that implements `__array__` must depend on NumPy, and call a NumPy
`ndarray` constructor itself from within `__array__`.
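
To make that dependency direction concrete, here is a toy column type (hypothetical, for illustration only) implementing `__array__`; note that the implementing library itself has to import NumPy:

```python
import numpy as np  # implementing __array__ forces this dependency


class MyColumn:
    """Toy column type; values stored as a plain Python list."""

    def __init__(self, values):
        self._values = values

    def __array__(self, dtype=None):
        # The implementing library constructs the NumPy array itself.
        return np.asarray(self._values, dtype=dtype)


col = MyColumn([1.0, 2.0, 3.0])
arr = np.asarray(col)  # dispatches to MyColumn.__array__
```

With `__dataframe__`, the dependency would point the other way: the producer only describes its memory, and the consumer constructs whatever array type it needs.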


### What is wrong with `.to_numpy()` and `.to_arrow()`?

Such methods ask the object they are attached to to turn itself into a NumPy
or Arrow array, which means each library implementing them must have at
least an optional dependency on NumPy and on Arrow. This leads to
unnecessary coupling between libraries, and hence is a suboptimal choice -
we'd like to avoid it if we can.

Instead, it should be the dataframe consumer that relies on NumPy or Arrow,
since it is the one that needs that particular format. So it can call the
constructor it needs. For example, `x = np.asarray(df['colname'])` (where
`df` supports `__dataframe__`).
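
As a sketch of this consumer-driven pattern (the helper name and the dict-of-columns stand-in for a protocol-supporting dataframe are both hypothetical):

```python
import numpy as np


def columns_to_numpy(df):
    """Consumer-side helper: convert each column to a NumPy array.

    The consumer is the one that imports NumPy; the producing dataframe
    library does not need any NumPy (or Arrow) dependency at all.
    For this sketch, `df` is a plain mapping of column name -> column data.
    """
    return {name: np.asarray(col) for name, col in df.items()}


result = columns_to_numpy({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
```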


### Does an interface describing memory work for virtual columns?

Vaex is an example of a library that can have "virtual columns" (see @maartenbreddels'
[comment here](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-686373569)).
If the protocol includes a description of data layout in memory, does that
work for such a virtual column?

Yes. Virtual columns need to be materialized in memory before they can be
turned into a column for a different type of dataframe - that will be true
for every discussed form of the protocol; whether there's a `to_arrow()` or
something else does not matter. Vaex can choose _how_ to materialize (e.g.,
to an Arrow array, a NumPy array, or a raw memory buffer) - as long as the
returned description of memory layout is valid, all those options can later
be turned into the desired column format without a data copy, so the
implementation choice here really doesn't matter much.

_Note: the above statement on materialization assumes that there are many
ways a virtual column can be implemented, that those implementations are all
custom/different, and that at this point it makes little sense to standardize
them. For example, one could do this with a simple string DSL (`'col_C =
col_A + col_B'`), with fancier C++-style lazy evaluation, with a
computational-graph approach like Dask uses, etc._
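
A toy sketch of the materialize-on-demand idea (the `VirtualColumn` class and its `materialize` method are invented for illustration; real libraries such as Vaex use their own, more sophisticated machinery):

```python
import numpy as np


class VirtualColumn:
    """Toy lazily evaluated column: a deferred function over concrete columns."""

    def __init__(self, func, *inputs):
        self._func = func
        self._inputs = inputs

    def materialize(self):
        # Evaluation happens only here, producing a concrete in-memory
        # array whose layout the protocol could then describe.
        return self._func(*(np.asarray(c) for c in self._inputs))


col_a = np.array([1.0, 2.0])
col_b = np.array([3.0, 4.0])
col_c = VirtualColumn(np.add, col_a, col_b)  # no computation yet
data = col_c.materialize()  # now backed by real memory
```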


## Possible direction for implementation

The `cuDFDataFrame`, `cuDFColumn` and `cuDFBuffer` classes sketched out by @kkraus14
[here](https://github.com/data-apis/dataframe-api/issues/29#issuecomment-685123386)
seem to be going in the right direction.

TODO: work this out after making sure we're all on the same page regarding requirements.