How the API is expected to be used #18

datapythonista · 2020-06-25T19:05:21Z

In today's meeting we discussed what the goal of the API is, and who its target users are.

@maartenbreddels and @devin-petersohn, if I understood correctly, see the API we're defining here as something they'd like to implement internally in Vaex and Modin, but not make their public API. Not sure what pandas' point of view on that is.

I think that's perfectly fine, and it makes sense. But I wonder whether it would make sense for those public APIs to be independent wrappers, in the same way Seaborn wraps Matplotlib, or HoloViews wraps Bokeh. Let me expand on what I mean here.

From the discussions we had, I think people mentioned that they were interested in defining a more "pure" and less "magic" API than the existing one. Not sure if the previous sentence makes a lot of sense, but I guess some of the principles for the API could be:

- Explicit is better than implicit
- There should be one-- and preferably only one --obvious way to do it
- Avoid ambiguity
- In general, avoid making the library guess

Personally, I think this API should be great for software developers: developers of libraries like us, who want to build on top of it, or developers of downstream software. And I'd say also for data engineers, and people who want to write production code with dataframes.

Then, I understand that some users (e.g. data analysts) prefer more "magic" APIs that automatically fix problems they don't want to care about. As an example, let's think of the dataframe constructor.

For a data analyst, or someone without a software background, I think it is very reasonable and convenient for the following code to work:

DataFrame({'a': [1, 2], 'b': [3, 4]})
DataFrame([{'a': 1, 'b': 3}, {'a': 3, 'b': 4}])
DataFrame(json.loads(value))

But as a software engineer, I may want a more explicit and less magic syntax, for example:

DataFrame.load({'a': [1, 2], 'b': [3, 4]}, kind='dict')
DataFrame.load([{'a': 1, 'b': 3}, {'a': 3, 'b': 4}], kind='list_of_dict')
DataFrame.load(json.loads(value), kind='dict')
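To make the contrast concrete, here is a minimal sketch of what such an explicit loader could look like. The `DataFrame` class below is a toy, and `load` and its `kind` values are hypothetical names taken from the example above, not part of any agreed standard. The data is passed first and `kind` as a keyword, since Python does not allow a positional argument after a keyword argument.

```python
class DataFrame:
    """Toy column store, used only to illustrate an explicit constructor."""

    def __init__(self, columns):
        # columns: dict mapping column name -> list of values
        self.columns = columns

    @classmethod
    def load(cls, data, *, kind):
        """Build a frame from `data`; the caller states the layout via `kind`."""
        if kind == 'dict':
            if not isinstance(data, dict):
                raise TypeError("kind='dict' expects a dict of column lists")
            return cls({name: list(values) for name, values in data.items()})
        if kind == 'list_of_dict':
            if not isinstance(data, list):
                raise TypeError("kind='list_of_dict' expects a list of row dicts")
            # Toy assumption: every row has the same keys.
            columns = {}
            for row in data:
                for name, value in row.items():
                    columns.setdefault(name, []).append(value)
            return cls(columns)
        # No guessing: an unknown layout is an error, not a best-effort parse.
        raise ValueError(f"unknown kind: {kind!r}")


by_column = DataFrame.load({'a': [1, 2], 'b': [3, 4]}, kind='dict')
by_row = DataFrame.load([{'a': 1, 'b': 3}, {'a': 2, 'b': 4}], kind='list_of_dict')
```

Both calls produce the same columns, and a mismatch between `kind` and the data raises instead of being silently reinterpreted.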

Correct me if I'm wrong, but I think there is mostly agreement that what we want to focus on in the consortium API is the latter style. If Vaex, Modin, pandas... provide this API, then compatibility across the ecosystem is easy. For example, Scikit-learn or Matplotlib can take a "dataframe" as a parameter and operate on it, since they know it will follow the standard API.

But then, implementations like Modin, Vaex, or pandas may want to keep their existing APIs, or provide a different user API targeted more at specific users (e.g. data analysts, who want the library to make guesses that make their lives easier).

Then my question is: does it make sense for this alternative API to live in the implementations? For example, say I see pandas as this API on top of NumPy, Vaex on top of memory maps, and Modin on top of Ray (excuse the simplification). Then, if Modin wants to implement an SQLite-like API, could it make sense for this to be an independent project, an SQLite-like API that wraps the standard API, instead of a Modin API? I guess that could make sense.

Then, I guess there is the case of an implementation, let's say pandas, which plans to expose the API to users but is going to add some extra magic (let's say the standard for filter is df.filter(condition), but pandas wants to keep supporting df[condition] for backward compatibility). Or Vaex having some specific syntax for expressions on top of the standard API.

I see there is a whole range between these options:

- All implementations offer exactly the same API
- Implementations offer the standard API, but add some functionality to it (for their target users, or specific to the backends)
- Backends (e.g. dataframes over NumPy, over Ray, over memory maps, over Arrow...) implement the same API, but users use libraries built on top of it. For example, the existing pandas API could be a layer on top of the standard API, and work on top of Vaex, Modin...

It would be great to know other people's thoughts. I think most people have an idea of how this API is expected to be used, but I'm not sure we're all on the same page.


maartenbreddels · 2020-07-09T16:58:16Z

> But then, implementations like Modin, Vaex, or pandas, may want to keep their existing APIs.

Yes, the way I see it, in Vaex I would create a new module that exposes this standard API but calls into Vaex; the same goes for pandas and for Modin.

I don't expect the current Vaex API to change to this API (although it may take some inspiration from it).

> SQLite-like API that wraps the standard API?

Yes, I think that's the purpose: new libraries (e.g. this one, or a GraphQL API) will use the standard API, not pandas/Vaex/Modin etc. directly.

> Then, I guess there is the case of an implementation, let's say pandas, which plans to expose the API to users but is going to add some extra magic (let's say the standard for filter is df.filter(condition), but pandas wants to keep supporting df[condition] for backward compatibility).

I think it's fine for pandas to keep its own API and add a new class that exposes this new standard API, rather than putting both in the same class (although that might be possible).
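A rough sketch of that arrangement, using the filter example quoted above. Both classes are hypothetical stand-ins: `LegacyFrame` plays the role of an existing library API that filters with `df[condition]`, and `StandardFrame` is a separate class that exposes `filter(condition)` by calling into it. Neither reflects actual pandas, Vaex, or Modin code.

```python
class LegacyFrame:
    """Stand-in for an existing library API: filtering via df[mask]."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values

    def __getitem__(self, mask):
        # Keep the rows whose mask entry is true.
        return LegacyFrame({
            name: [value for value, keep in zip(values, mask) if keep]
            for name, values in self.columns.items()
        })


class StandardFrame:
    """Separate class exposing the (hypothetical) standard spelling."""

    def __init__(self, native):
        self._native = native

    def filter(self, condition):
        # Delegate to the library's own implementation.
        return StandardFrame(self._native[condition])

    @property
    def native(self):
        # Way back to the library's own object.
        return self._native


legacy = LegacyFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
standard = StandardFrame(legacy)
filtered = standard.filter([True, False, True])
```

The existing `df[condition]` spelling keeps working on `LegacyFrame`, while code written against the standard only ever touches `StandardFrame`.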

I hope what I say makes sense.

datapythonista · 2020-07-10T11:12:57Z

Trying to summarize what was discussed in the call. I'll open a PR in the RFC when there is agreement.

- We want to focus on a standard API that avoids ambiguity (magic) and follows good software development principles. We target software engineers, as opposed to data analysts/scientists, who would prefer shortcuts that speed up their work.
- One of the API goals is compatibility. I see this in two different ways:
  - Being able to provide a standard dataframe to downstream libraries (matplotlib or scikit-learn being able to receive a pandas, vaex, modin... dataframe)
  - Being able to build other APIs on top of a common API (e.g. implementing a SQL-like API that under the hood could work with pandas, vaex, modin...)
- We don't make any assumption (or we leave it for later discussion) about whether dataframe libraries will expose the standard API to users and, if they do, whether they'll extend it with richer functionality.
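The second kind of compatibility could look roughly like the sketch below: a small SQL-like helper written only against a common interface, so it works with any backend that implements it. The `CommonFrame` protocol, its `select`, `filter_rows`, and `to_rows` methods, and both toy backends are assumptions made up for illustration, not the consortium API.

```python
from typing import Protocol


class CommonFrame(Protocol):
    """Hypothetical minimal interface that every backend agrees on."""

    def select(self, names): ...
    def filter_rows(self, predicate): ...
    def to_rows(self): ...


class ColumnBackend:
    """Toy backend storing a dict of column lists."""

    def __init__(self, columns):
        self.columns = columns

    def select(self, names):
        return ColumnBackend({n: self.columns[n] for n in names})

    def filter_rows(self, predicate):
        rows = [row for row in self.to_rows() if predicate(row)]
        return ColumnBackend({n: [row[n] for row in rows] for n in self.columns})

    def to_rows(self):
        names = list(self.columns)
        return [dict(zip(names, vals)) for vals in zip(*self.columns.values())]


class RowBackend:
    """Toy backend storing a list of row dicts."""

    def __init__(self, rows):
        self.rows = rows

    def select(self, names):
        return RowBackend([{n: row[n] for n in names} for row in self.rows])

    def filter_rows(self, predicate):
        return RowBackend([row for row in self.rows if predicate(row)])

    def to_rows(self):
        return [dict(row) for row in self.rows]


def select_where(frame: CommonFrame, columns, where):
    """SQL-like helper: SELECT columns FROM frame WHERE where(row)."""
    return frame.filter_rows(where).select(columns).to_rows()


data_cols = ColumnBackend({'a': [1, 2, 3], 'b': [4, 5, 6]})
data_rows = RowBackend([{'a': 1, 'b': 4}, {'a': 2, 'b': 5}, {'a': 3, 'b': 6}])
result_cols = select_where(data_cols, ['b'], lambda row: row['a'] > 1)
result_rows = select_where(data_rows, ['b'], lambda row: row['a'] > 1)
```

`select_where` never inspects which backend it was given, which is the property a downstream library such as a plotting or modelling package would rely on as well.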

Does this represent well what was discussed in the call? Any feedback welcome.

MarcoGorelli · 2023-12-15T11:08:40Z

we now have a standardised way of opting into the standard (__dataframe_consortium_standard__); it's up to implementations where the implementation itself lives or what they do with their main API

I suspect that nobody will change their main API, but will just expose a thing wrapped around their main API to comply with the standard
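As a rough illustration of the opt-in, a consuming library could look for the `__dataframe_consortium_standard__` method mentioned above and call it to get the standard-compliant object. The `to_standard` helper and the `FakeNativeFrame` and `FakeStandardFrame` classes are made up for this sketch; real libraries may accept extra arguments or need an additional compatibility package installed.

```python
def to_standard(df):
    """Return the standard-compliant view of `df`, if the library offers one."""
    opt_in = getattr(df, '__dataframe_consortium_standard__', None)
    if opt_in is None:
        raise TypeError(
            f"{type(df).__name__} does not opt into the dataframe standard"
        )
    return opt_in()


class FakeStandardFrame:
    """Hypothetical wrapper that would implement the standard API."""

    def __init__(self, native):
        self.native = native


class FakeNativeFrame:
    """Hypothetical library dataframe that keeps its own main API."""

    def __init__(self, columns):
        self.columns = columns

    def __dataframe_consortium_standard__(self):
        # Expose a wrapper around the main API instead of changing it.
        return FakeStandardFrame(self)


native = FakeNativeFrame({'a': [1, 2]})
standard_view = to_standard(native)
```

The library's main class is untouched apart from the one opt-in method, which matches the expectation that nobody will change their main API.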

closing then, as I think this is now addressed

datapythonista mentioned this issue Jul 27, 2020

Mutability #10

Open

MarcoGorelli closed this as completed Dec 15, 2023
