Adding introduction, goals, scope and use cases to the RFC #27

Merged
datapythonista merged 8 commits into master from purpose_and_use_cases
Sep 13, 2020

datapythonista
Member

This is a first version of the introductory materials of the data frame RFC document.

So far, it only considers the part on data interchange (see #25). The scope will be changed, and further use cases will be added once this first stage in data interchange is completed.

@datapythonista
Member Author

Thanks @HyukjinKwon for the corrections.

@markusweimer markusweimer left a comment

Just did a pass and left some nits and high-level comments


In Python, the most popular data frame library is [pandas](https://pandas.pydata.org/).
pandas was initially develop at a hedge fund, with a focus on

develop -> developed

implementation details of data frame libraries. This will allow users and third-party libraries to
write code that interacts with a standard data frame, and not with specific implementations.

The defined API does not aim to be a convenient API for all users of data frames. Libraries targeting

Do we leave the standardization of the end-user API as potential future work for us, or do we not plan on doing any of that?

Member

Good question, and one that many readers will have. I think it would be good to state explicitly that this is out of scope for this version of the standard, but may be in scope for a future version. As a rationale for why it's also important: one of the longer-term goals should be (I think) to make the learning curve less steep for users switching from one library to another.

Member

The structure of:

  • Goals
  • Scope
  • Out-of-scope and non-goals
is a little inconsistent. I'd suggest making it symmetric (and adding rationales, as I just did in my array API scope PR); then this kind of thing may be easier to address.


- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)
- Task schedulers (e.g. Dask, Ray)

Should we add Database and Big Data systems?

Member Author

Good point. I don't think we are planning to engage with the developers of PostgreSQL, MySQL... For now I'm adding big data systems, and also Python libraries to access databases, which I guess we're more likely to engage with. But I'm open to further changes if there are different points of view.

Comment on lines 124 to 125
Authors of data frame libraries in Python are expected to implement the API defined
in this document in their libraries.
Collaborator

This is a very heavy-handed statement. Could we reword it to something a bit friendlier, such as:

We encourage data frame libraries in Python to implement the API defined in this document in their libraries

A non-exhaustive list of upstream categories is next:

- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)
- Task schedulers (e.g. Dask, Ray)
Collaborator

Would include Mars (https://github.com/mars-project/mars) here as well.

Comment on lines +139 to +144
So, the list above would be reduced to a single function or method in each implementation:

- `from_dataframe()`

Note that the function `from_dataframe()` is for illustration, and not proposed as part
of the standard at this point.
Collaborator

A dataframe protocol similar to wesm/dataframe-protocol#1 is a prerequisite to this being possible in my mind.

Without having a data exchange protocol defined as part of the spec / goal how can we define from_dataframe / to_dataframe APIs?

Member Author

At this point the data exchange protocol is what we're trying to define. This use case tries to illustrate why such a data exchange protocol is needed.

Do you think I should clarify this is the goal for the use cases? Or am I not understanding you?

Collaborator

I don't think that it was made clear that a dataframe data exchange protocol was in scope in this document. The only mention of a protocol is in talking about Apache Arrow as far as I can tell.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's good as it is. We are talking about use cases in this document, not the implementation, right? So we can loosely define what from_dataframe does, from a high-level point of view, to make the use case clear.

Member Author

Sorry, forgot to comment here. I edited the scope since the last comment from @kkraus14. I guess making it clear in the goal/scope that we are defining a data exchange protocol solved your concern, @kkraus14, or do you think this use case also needs editing?

Thanks both for the comments!
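
For readers following this thread, here is a minimal sketch of the use case under discussion: a single `from_dataframe()` entry point that relies on a shared exchange interface instead of one converter per library pair. Everything in it is an assumption made for illustration and is not proposed by the RFC: the `ExchangeDataFrame` interface, its `column_names()` and `get_column()` methods, and the two toy classes `ListFrame` and `TupleFrame` are all hypothetical.

```python
from typing import Any, Dict, Iterable, List, Protocol, Sequence, Tuple


class ExchangeDataFrame(Protocol):
    """Hypothetical minimal exchange interface (an assumption, not the RFC's API)."""

    def column_names(self) -> Iterable[str]: ...

    def get_column(self, name: str) -> Sequence[Any]: ...


class ListFrame:
    """Toy 'library A' data frame that stores each column as a list."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self._columns = columns

    def column_names(self) -> List[str]:
        return list(self._columns)

    def get_column(self, name: str) -> List[Any]:
        return self._columns[name]


class TupleFrame:
    """Toy 'library B' data frame that stores each column as a tuple."""

    def __init__(self, columns: Dict[str, Tuple[Any, ...]]) -> None:
        self._columns = columns

    def column_names(self) -> List[str]:
        return list(self._columns)

    def get_column(self, name: str) -> Tuple[Any, ...]:
        return self._columns[name]

    @classmethod
    def from_dataframe(cls, other: ExchangeDataFrame) -> "TupleFrame":
        # Only the exchange interface is used here, so any object that
        # provides column_names() and get_column() can be converted.
        return cls(
            {name: tuple(other.get_column(name)) for name in other.column_names()}
        )


source = ListFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
converted = TupleFrame.from_dataframe(source)
```

In this sketch `TupleFrame` never imports or inspects `ListFrame`; it depends only on the two hypothetical accessor methods, which is the role the exchange protocol would play.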

- How data is represented and stored (whether the data is in memory, disk, distributed)
- Expectations on when the execution is happening (in an eager or lazy way)
- Other execution details

Member

I'd state here that an API designed for interactive usage is out of scope.

@@ -2,20 +2,187 @@

## Introduction

This document defines a Python data frame API.
Member

I prefer dataframe as one word, like database.

I do not want to start a holy war, and I realize there are historical reasons to call it data frame, but data base was common even throughout the 90s. https://groups.google.com/g/alt.usage.english/c/jRB0g0zK85Q?pli=1

@datapythonista
Member Author

Thanks all for the feedback, I addressed all comments.

- Libraries for database access (e.g. SQLAlchemy)


### Data frame power users
Collaborator

Suggested change
### Data frame power users
### Dataframe power users


Data frame libraries in several programming language exist, such as
Collaborator

Suggested change
Data frame libraries in several programming language exist, such as
Dataframe libraries in several programming language exist, such as

This section provides the list of stakeholders considered for the definition of this API.


### Data frame library authors
Collaborator

Suggested change
### Data frame library authors
### Dataframe library authors


## Scope

It is in the scope of this document the different elements of the API. This includes signatures
Member

There is a verb missing in the first sentence ("to describe" ?)

Comment on lines 90 to 102
The goal of the API described in this document is to provide a standard interface that encapsulates
implementation details of dataframe libraries. This will allow users and third-party libraries to
write code that interacts with a standard dataframe, and not with specific implementations.

The main goals for the API defined in this document are:

- Provide a common API for dataframes so software can be developed to communicate with it
- Provide a common API for dataframes to build user interfaces on top of it, for example
libraries for interactive use or specific domains and industries
- Simplify interactions between the projects of the ecosystem, for example, software that
receives data as a dataframe
- Make conversion of data among different implementations easier
- Help user transition from one dataframe library to another
Member

If the goal is to initially limit the scope to a "data exchange protocol", the above description of the goals still sounds too general and does not make clear what the specific scope/goals are, IMO.


A non-exhaustive list of upstream categories is next:

- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)
Member

NumPy as well? It's used by dataframe libraries for their implementation.

@maartenbreddels

I like it!

Co-authored-by: Maarten Breddels <[email protected]>
@datapythonista
Member Author

Thanks all for the feedback. If there are no objections I'll be merging this in a couple of days.

I think we will continue iterating on the content of this PR for a bit, particularly the scope. From the last discussions, I think it may be worth adding things like whether we want to support heterogeneous columns, devices, missing values, or virtual columns. But I think it's better to merge this first if the people who reviewed it are happy enough, and open follow-up PRs. This way the discussion can be more focused, and reviewers don't need to read the long diff of this PR anymore.

@datapythonista datapythonista merged commit 8352aba into master Sep 13, 2020
@datapythonista datapythonista deleted the purpose_and_use_cases branch September 13, 2020 21:35
@datapythonista
Member Author

Merging this. I will keep updating the relevant sections in follow-up PRs as needed. Further feedback on what's been merged is certainly welcome.

8 participants