Adding introduction, goals, scope and use cases to the RFC #27
Co-authored-by: Hyukjin Kwon <[email protected]>
Thanks @HyukjinKwon for the corrections.
Just did a pass and left some nits and high-level comments
spec/01_purpose_and_scope.md
Outdated
In Python, the most popular data frame library is [pandas](https://pandas.pydata.org/).
pandas was initially develop at a hedge fund, with a focus on
`develop` -> `developed`
spec/01_purpose_and_scope.md
Outdated
implementation details of data frame libraries. This will allow users and third-party libraries to
write code that interacts with a standard data frame, and not with specific implementations.

The defined API does not aim to be a convenient API for all users of data frames. Libraries targeting
Do we leave the standardization of the end-user API as potential future work for us, or do we not plan on doing any of that?
Good question, and one that many readers will have. I think it would be good to state explicitly that this is out of scope for this version of the standard, but may be in scope for a future version. The rationale should say that it is also important: one of the longer-term goals should be (I think) to make the learning curve less steep for users switching from one library to another.
The structure of:
- Goals
- Scope
- Out-of-scope and non-goals
is a little inconsistent. I'd suggest making it symmetric (and adding rationales, as I just did in my array API scope PR); then this kind of thing may be easier to address.
- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)
- Task schedulers (e.g. Dask, Ray)
Should we add Database and Big Data systems?
Good point. I don't think we are planning to engage with the developers of PostgreSQL, MySQL, and so on. For now I'm adding big data systems, and also Python libraries to access databases, which I guess we're more likely to engage with. But I'm open to further changes if there are different points of view.
spec/01_purpose_and_scope.md
Outdated
Authors of data frame libraries in Python are expected to implement the API defined
in this document in their libraries.
This is a very heavy-handed statement. Could we reword it to something a bit friendlier, such as:
We encourage data frame libraries in Python to implement the API defined in this document in their libraries
spec/01_purpose_and_scope.md
Outdated
A non-exhaustive list of upstream categories is next:

- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)
- Task schedulers (e.g. Dask, Ray)
Would include Mars (https://github.com/mars-project/mars) here as well.
So, the list above would be reduced to a single function or method in each implementation:

- `from_dataframe()`

Note that the function `from_dataframe()` is for illustration, and not proposed as part
of the standard at this point.
A dataframe protocol similar to wesm/dataframe-protocol#1 is a prerequisite to this being possible in my mind.
Without having a data exchange protocol defined as part of the spec / goal, how can we define `from_dataframe` / `to_dataframe` APIs?
At this point the data exchange protocol is what we're trying to define. This use case tries to illustrate why such a data exchange protocol is needed.
Do you think I should clarify this is the goal for the use cases? Or am I not understanding you?
I don't think that it was made clear that a dataframe data exchange protocol was in scope in this document. The only mention of a protocol is in talking about Apache Arrow as far as I can tell.
I think it's good as it is. We are talking about use cases in this document, not the implementation, right? So we can loosely define what `from_dataframe` does, from a high-level point of view, to make the use case clear.
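To make the use case in this thread concrete, here is a minimal sketch of how a `from_dataframe()` could rely on an exchange interface instead of on a specific library. Everything in it is an assumption made for illustration: the `ExchangeableFrame` interface, its `column_names()` and `get_column()` methods, and the two toy classes are hypothetical and are not proposed as part of the standard.

```python
from typing import Iterable, Protocol, Sequence


class ExchangeableFrame(Protocol):
    """Hypothetical minimal exchange interface; not part of any standard."""

    def column_names(self) -> Iterable[str]: ...

    def get_column(self, name: str) -> Sequence: ...


class ToyFrame:
    """Toy column-oriented dataframe, used only for this illustration."""

    def __init__(self, columns: dict):
        self._columns = {name: list(values) for name, values in columns.items()}

    def column_names(self):
        return list(self._columns)

    def get_column(self, name):
        return self._columns[name]

    @classmethod
    def from_dataframe(cls, other: ExchangeableFrame) -> "ToyFrame":
        # Relies only on the exchange interface, never on the producer's type.
        return cls({name: other.get_column(name) for name in other.column_names()})


class OtherFrame:
    """A second, unrelated toy implementation that stores rows instead."""

    def __init__(self, names, rows):
        self._names = list(names)
        self._rows = [tuple(row) for row in rows]

    def column_names(self):
        return list(self._names)

    def get_column(self, name):
        idx = self._names.index(name)
        return [row[idx] for row in self._rows]


source = OtherFrame(["a", "b"], [(1, "x"), (2, "y")])
converted = ToyFrame.from_dataframe(source)
```

With an agreed interface of this kind, each implementation would need one conversion entry point rather than one per library it wants to accept.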
- How data is represented and stored (whether the data is in memory, disk, distributed)
- Expectations on when the execution is happening (in an eager or lazy way)
- Other execution details
I'd state here that an API designed for interactive usage is out of scope.
spec/01_purpose_and_scope.md
Outdated
@@ -2,20 +2,187 @@

## Introduction

This document defines a Python data frame API.
I prefer `dataframe` as one word, like `database`.
I do not want to start a holy war, and I realize there are historical reasons to call it `data frame`, but `data base` was common even throughout the 90s. https://groups.google.com/g/alt.usage.english/c/jRB0g0zK85Q?pli=1
Thanks all for the feedback, I addressed all comments.
spec/01_purpose_and_scope.md
Outdated
- Libraries for database access (e.g. SQLAlchemy)

### Data frame power users
Suggested change: `### Data frame power users` -> `### Dataframe power users`
spec/01_purpose_and_scope.md
Outdated
Data frame libraries in several programming language exist, such as
Suggested change: `Data frame libraries in several programming language exist, such as` -> `Dataframe libraries in several programming language exist, such as`
spec/01_purpose_and_scope.md
Outdated
This section provides the list of stakeholders considered for the definition of this API.

### Data frame library authors
Suggested change: `### Data frame library authors` -> `### Dataframe library authors`
## Scope

It is in the scope of this document the different elements of the API. This includes signatures
There is a verb missing in the first sentence ("to describe" ?)
spec/01_purpose_and_scope.md
Outdated
The goal of the API described in this document is to provide a standard interface that encapsulates
implementation details of dataframe libraries. This will allow users and third-party libraries to
write code that interacts with a standard dataframe, and not with specific implementations.

The main goals for the API defined in this document are:

- Provide a common API for dataframes so software can be developed to communicate with it
- Provide a common API for dataframes to build user interfaces on top of it, for example
libraries for interactive use or specific domains and industries
- Simplify interactions between the projects of the ecosystem, for example, software that
receives data as a dataframe
- Make conversion of data among different implementations easier
- Help user transition from one dataframe library to another
If the goal is to initially limit the scope to a "data exchange protocol", the above description of the goals still sounds too general and does not make clear what the specific scope and goals are, IMO.
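As one narrow reading of the goal "software that receives data as a dataframe", the sketch below shows a consumer function written against a single agreed method rather than against a particular library. The `get_column()` method and both toy classes are hypothetical names chosen for this example only; nothing here is taken from the spec.

```python
from statistics import mean


def column_means(frame, names):
    """Return the mean of each requested column.

    Assumes `frame` exposes a hypothetical `get_column(name)` method
    returning a sequence of numbers; no specific library is assumed.
    """
    return {name: mean(frame.get_column(name)) for name in names}


class DictFrame:
    """Toy implementation backed by a dict of lists."""

    def __init__(self, data):
        self._data = data

    def get_column(self, name):
        return self._data[name]


class RowFrame:
    """Toy implementation backed by a list of row dicts."""

    def __init__(self, rows):
        self._rows = rows

    def get_column(self, name):
        return [row[name] for row in self._rows]


# The same consumer code works with both storage layouts.
means_a = column_means(DictFrame({"x": [1, 2, 3]}), ["x"])
means_b = column_means(RowFrame([{"x": 1}, {"x": 2}, {"x": 3}]), ["x"])
```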
spec/01_purpose_and_scope.md
Outdated
A non-exhaustive list of upstream categories is next:

- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)
NumPy as well? It's used by dataframe libraries for their implementation.
I like it!
Co-authored-by: Maarten Breddels <[email protected]>
Co-authored-by: Devin Petersohn <[email protected]>
Thanks all for the feedback. If there are no objections, I'll merge this in a couple of days. I think we will continue iterating on the content of this PR for a bit, particularly the scope. From the last discussions, I think it may be worth adding things like whether we want to support heterogeneous columns, devices, missing values, or virtual columns. But I think it's better to merge this first, if the people who reviewed it are happy enough, and open follow-up PRs. This way the discussion can be more focused, and reviewers don't need to read the long diff of this PR anymore.
Merging this. I will keep updating the relevant sections in follow-up PRs as needed. Further feedback on what's been merged is of course welcome.
This is a first version of the introductory materials of the data frame RFC document.
So far, it only considers the part on data interchange (see #25). The scope will be changed, and further use cases will be added once this first stage in data interchange is completed.