Adding introduction, goals, scope and use cases to the RFC #27

datapythonista · 2020-08-18T21:10:05Z

This is a first version of the introductory materials of the data frame RFC document.

So far, it only considers the part on data interchange (see #25). The scope will be changed, and further use cases will be added once this first stage in data interchange is completed.

spec/01_purpose_and_scope.md

spec/02_use_cases.md

Co-authored-by: Hyukjin Kwon <[email protected]>

datapythonista · 2020-08-25T11:03:36Z

Thanks @HyukjinKwon for the corrections.

markusweimer

Just did a pass and left some nits and high-level comments

markusweimer · 2020-08-25T19:54:42Z

spec/01_purpose_and_scope.md


+In Python, the most popular data frame library is [pandas](https://pandas.pydata.org/).
+pandas was initially develop at a hedge fund, with a focus on


develop -> developed

markusweimer · 2020-08-25T19:57:57Z

spec/01_purpose_and_scope.md

+implementation details of data frame libraries. This will allow users and third-party libraries to
+write code that interacts with a standard data frame, and not with specific implementations.
+
+The defined API does not aim to be a convenient API for all users of data frames. Libraries targeting


Do we leave the standardization of the end-user API as potential future work for us, or do we not plan on doing any of that?

Good question, and one that many readers will have. I think it would be good to state explicitly that this is out of scope for this version of the standard, but may be in scope for a future version. The rationale should say that it's also important: one of the longer-term goals should be (I think) to make the learning curve less steep for users switching from one library to another.

The structure of:

- Goals
- Scope
- Out-of-scope and non-goals

is a little inconsistent. I'd suggest making it symmetric (and adding rationales, as I just did in my array API scope PR); then this kind of thing may be easier to address.

markusweimer · 2020-08-25T19:59:21Z

spec/01_purpose_and_scope.md

+
+- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)
+- Task schedulers (e.g. Dask, Ray)
+


Should we add Database and Big Data systems?

Good point. I don't think we are planning to engage with the developers of PostgreSQL, MySQL, and so on. For now I'm adding big data systems, and also Python libraries to access databases, which I guess we're more likely to engage with. But I'm open to further changes if there are different points of view.

kkraus14 · 2020-08-26T00:06:13Z

spec/01_purpose_and_scope.md

+Authors of data frame libraries in Python are expected to implement the API defined
+in this document in their libraries.


This is a very heavy-handed statement. Could we reword it to something a bit friendlier, such as:

We encourage data frame libraries in Python to implement the API defined in this document in their libraries

kkraus14 · 2020-08-26T00:09:57Z

spec/01_purpose_and_scope.md

+A non-exhaustive list of upstream categories is next:
+
+- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)
+- Task schedulers (e.g. Dask, Ray)


I would include Mars (https://github.com/mars-project/mars) here as well.

kkraus14 · 2020-08-26T00:20:04Z

spec/02_use_cases.md

+So, the list above would be reduced to a single function or method in each implementation:
+
+- `from_dataframe()`
+
+Note that the function `from_dataframe()` is for illustration, and not proposed as part
+of the standard at this point.


A dataframe protocol similar to wesm/dataframe-protocol#1 is a prerequisite to this being possible, in my mind.

Without a data exchange protocol defined as part of the spec / goal, how can we define `from_dataframe` / `to_dataframe` APIs?

At this point the data exchange protocol is what we're trying to define. This use case tries to illustrate why such a data exchange protocol is needed.

Do you think I should clarify this is the goal for the use cases? Or am I not understanding you?

I don't think that it was made clear that a dataframe data exchange protocol was in scope in this document. The only mention of a protocol is in talking about Apache Arrow as far as I can tell.

I think it's good as it is. We are talking about use cases in this document, not the implementation, right? So we can loosely define what `from_dataframe` does, from a high-level point of view, to make the use case clear.

Sorry, I forgot to comment here. I edited the scope after the last comment from @kkraus14. I guess making it clear in the goal/scope that we are defining a data exchange protocol addressed your concern, @kkraus14, or do you think this use case also needs editing?

Thanks both for the comments!
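
To make the use case discussed in this thread concrete, here is a minimal sketch of how a shared exchange protocol could reduce per-library converters to a single `from_dataframe()` per implementation. Everything in it is an assumption made for illustration: the `SupportsExchange` protocol, its `column_names()` / `get_column()` methods, and the toy `DictFrame` / `RowFrame` classes are hypothetical and are not proposed by this PR or taken from any existing library.

```python
from typing import Any, Dict, Iterable, List, Protocol, runtime_checkable


@runtime_checkable
class SupportsExchange(Protocol):
    """Hypothetical exchange protocol (illustration only, not part of the spec)."""

    def column_names(self) -> Iterable[str]: ...

    def get_column(self, name: str) -> List[Any]: ...


class DictFrame:
    """Toy column-oriented dataframe: stores a dict of lists."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self._columns = {name: list(values) for name, values in columns.items()}

    def column_names(self) -> Iterable[str]:
        return list(self._columns)

    def get_column(self, name: str) -> List[Any]:
        return list(self._columns[name])

    @classmethod
    def from_dataframe(cls, other: SupportsExchange) -> "DictFrame":
        # Only the protocol is used; the concrete type of `other` is irrelevant.
        if not isinstance(other, SupportsExchange):
            raise TypeError("object does not implement the exchange protocol")
        return cls({name: other.get_column(name) for name in other.column_names()})


class RowFrame:
    """Toy row-oriented dataframe: stores a list of dicts."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self._rows = [dict(row) for row in rows]

    def column_names(self) -> Iterable[str]:
        return list(self._rows[0]) if self._rows else []

    def get_column(self, name: str) -> List[Any]:
        return [row[name] for row in self._rows]

    @classmethod
    def from_dataframe(cls, other: SupportsExchange) -> "RowFrame":
        if not isinstance(other, SupportsExchange):
            raise TypeError("object does not implement the exchange protocol")
        names = list(other.column_names())
        columns = [other.get_column(name) for name in names]
        # Transpose the columns back into one dict per row.
        return cls([dict(zip(names, values)) for values in zip(*columns)])
```

With this shape, `DictFrame.from_dataframe(RowFrame([...]))` and the reverse conversion both work without either class knowing about the other, which is the point the use case is trying to make. A real protocol would also need to address dtypes, missing values and memory layout, none of which this sketch attempts.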

rgommers · 2020-08-26T17:08:52Z

spec/01_purpose_and_scope.md

+- How data is represented and stored (whether the data is in memory, disk, distributed)
+- Expectations on when the execution is happening (in an eager or lazy way)
+- Other execution details
+


I'd state here that an API designed for interactive usage is out of scope.
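
As an illustration of why the execution details quoted above can stay out of scope, the following sketch shows consumer code that behaves the same whether the producer evaluates eagerly or lazily. The `NumericColumn` interface and the `EagerColumn` / `LazyColumn` classes are hypothetical names invented for this example; they are not part of the proposed API.

```python
from typing import Callable, Iterable, List, Protocol


class NumericColumn(Protocol):
    """Hypothetical minimal column interface (illustration only)."""

    def values(self) -> List[float]: ...


class EagerColumn:
    """Holds its data in memory from the moment it is created."""

    def __init__(self, data: Iterable[float]) -> None:
        self._data = list(data)

    def values(self) -> List[float]:
        return list(self._data)


class LazyColumn:
    """Defers computing its data until `values()` is called."""

    def __init__(self, compute: Callable[[], List[float]]) -> None:
        self._compute = compute
        self.evaluations = 0  # counts how many times the data was computed

    def values(self) -> List[float]:
        self.evaluations += 1
        return list(self._compute())


def column_mean(column: NumericColumn) -> float:
    """Consumer code: relies only on `values()`, not on when data is computed."""
    data = column.values()
    if not data:
        raise ValueError("cannot take the mean of an empty column")
    return sum(data) / len(data)
```

`column_mean` can be called with either column type; the lazy one does no work until the consumer asks for values. Whether an implementation evaluates eagerly or lazily is therefore its own choice, which is consistent with leaving it unspecified.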

devin-petersohn · 2020-08-26T20:47:07Z

spec/01_purpose_and_scope.md

@@ -2,20 +2,187 @@

 ## Introduction

+This document defines a Python data frame API.


I prefer dataframe as one word, like database.

I do not want to start a holy war, and I realize there are historical reasons to call it "data frame", but "data base" was common even throughout the 90s. https://groups.google.com/g/alt.usage.english/c/jRB0g0zK85Q?pli=1

datapythonista · 2020-08-29T13:41:30Z

Thanks all for the feedback, I addressed all comments.

kkraus14 · 2020-08-31T04:05:34Z

spec/01_purpose_and_scope.md

+- Libraries for database access (e.g. SQLAlchemy)
+
+
+### Data frame power users


Suggested change

-### Data frame power users
+### Dataframe power users

kkraus14 · 2020-08-31T04:05:49Z

spec/01_purpose_and_scope.md


+Data frame libraries in several programming language exist, such as


Suggested change

-Data frame libraries in several programming language exist, such as
+Dataframe libraries in several programming language exist, such as

kkraus14 · 2020-08-31T04:05:56Z

spec/01_purpose_and_scope.md

+This section provides the list of stakeholders considered for the definition of this API.
+
+
+### Data frame library authors


Suggested change

-### Data frame library authors
+### Dataframe library authors

kkraus14 · 2020-08-31T04:06:58Z

spec/02_use_cases.md

+So, the list above would be reduced to a single function or method in each implementation:
+
+- `from_dataframe()`
+
+Note that the function `from_dataframe()` is for illustration, and not proposed as part
+of the standard at this point.


I don't think that it was made clear that a dataframe data exchange protocol was in scope in this document. The only mention of a protocol is in talking about Apache Arrow as far as I can tell.

jorisvandenbossche · 2020-09-01T15:41:52Z

spec/01_purpose_and_scope.md

+
+## Scope
+
+It is in the scope of this document the different elements of the API. This includes signatures


There is a verb missing in the first sentence ("to describe"?).

jorisvandenbossche · 2020-09-01T15:45:10Z

spec/01_purpose_and_scope.md

+The goal of the API described in this document is to provide a standard interface that encapsulates
+implementation details of dataframe libraries. This will allow users and third-party libraries to
+write code that interacts with a standard dataframe, and not with specific implementations.
+
+The main goals for the API defined in this document are:
+
+- Provide a common API for dataframes so software can be developed to communicate with it
+- Provide a common API for dataframes to build user interfaces on top of it, for example
+  libraries for interactive use or specific domains and industries
+- Simplify interactions between the projects of the ecosystem, for example, software that
+  receives data as a dataframe
+- Make conversion of data among different implementations easier
+- Help user transition from one dataframe library to another


If the goal is to initially limit the scope to a "data exchange protocol", the above description of the goals still sounds too general, and does not make clear what the specific scope and goals are, IMO.

jorisvandenbossche · 2020-09-01T15:50:49Z

spec/01_purpose_and_scope.md

+
+A non-exhaustive list of upstream categories is next:
+
+- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)


NumPy as well? Dataframe libraries use it in their implementations.

maartenbreddels · 2020-09-03T09:59:50Z

I like it!

Co-authored-by: Maarten Breddels <[email protected]>

spec/01_purpose_and_scope.md

Co-authored-by: Devin Petersohn <[email protected]>

datapythonista · 2020-09-04T17:32:55Z

Thanks all for the feedback. If there are no objections I'll be merging this in a couple of days.

I think we will continue iterating on the content of this PR for a bit, particularly the scope. From the last discussions, I think it may be worth adding things like whether we want to support heterogeneous columns, devices, missing values, or virtual columns. But I think it's better to merge this first, if the people who reviewed it are happy enough, and open follow-up PRs. This way the discussion can be more focused, and reviewers won't need to read the long diff of this PR anymore.

datapythonista · 2020-09-13T21:36:33Z

Merging this. I will keep updating the relevant sections in follow-up PRs as needed. Further feedback on what's been merged is certainly welcome.

datapythonista added 2 commits August 17, 2020 00:56

Adding purpose, goals and use cases

991ab82

Changes after first review

cc92c2c

HyukjinKwon reviewed Aug 25, 2020

Apply suggestions from code review

3d998f9

Co-authored-by: Hyukjin Kwon <[email protected]>

markusweimer reviewed Aug 25, 2020

kkraus14 reviewed Aug 26, 2020

rgommers reviewed Aug 26, 2020

devin-petersohn reviewed Aug 26, 2020

datapythonista added 2 commits August 29, 2020 14:34

Addressing comments from reviews

d047f57

Merging

1ac9a15

kkraus14 reviewed Aug 31, 2020

jorisvandenbossche reviewed Sep 1, 2020

Addressing comments from reviews

e9472c8

maartenbreddels reviewed Sep 3, 2020

maartenbreddels reviewed Sep 3, 2020

Apply suggestions from code review

837f87d

Co-authored-by: Maarten Breddels <[email protected]>

devin-petersohn reviewed Sep 4, 2020

Apply suggestions from code review

293c652

Co-authored-by: Devin Petersohn <[email protected]>

datapythonista merged commit 8352aba into master Sep 13, 2020

datapythonista deleted the purpose_and_use_cases branch September 13, 2020 21:35
