
Draft strawman data frame "__dataframe__" interchange / data export protocol for discussion #1

270 changes: 270 additions & 0 deletions dataframe.py
@@ -0,0 +1,270 @@
# MIT License
#
# Copyright (c) 2020 Wes McKinney

from abc import ABC, abstractmethod
from collections.abc import Mapping, MutableMapping
from typing import Any, Hashable, Iterable, Sequence

# ----------------------------------------------------------------------
# A simple data type class hierarchy for illustration


class DataType(ABC):
"""
A metadata object representing the logical value type of a cell in a data
frame column. This metadata does not guarantee a specific underlying data
representation.
"""
def __eq__(self, other: 'DataType'):

getting a mypy error here...

pandas\wesm\dataframe.py:19: error: Argument 1 of "__eq__" is incompatible with supertype "object"; supertype defines the 
argument type as "object"
pandas\wesm\dataframe.py:19: note: It is recommended for "__eq__" to work with arbitrary objects, for example:
pandas\wesm\dataframe.py:19: note:     def __eq__(self, other: object) -> bool:
pandas\wesm\dataframe.py:19: note:         if not isinstance(other, DataType):
pandas\wesm\dataframe.py:19: note:             return NotImplemented
pandas\wesm\dataframe.py:19: note:         return <logic to compare two DataType instances>

should we follow the mypy recommendation?

wesm (Owner Author):

> should we follow the mypy recommendation?

I see no reason not to

return self.equals(other)
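For reference, a version of __eq__ following the mypy recommendation (a sketch for discussion, not something committed in this diff) would read:

def __eq__(self, other: object) -> bool:
    # Compare only against other DataType instances, as mypy suggests;
    # defer to the abstract equals() for the actual metadata comparison.
    if not isinstance(other, DataType):
        return NotImplemented
    return self.equals(other)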

def __str__(self):
return self.to_string()

def __repr__(self):
return str(self)

@abstractmethod
def to_string(self) -> str:
"""
Return human-readable representation of the data type
"""

@abstractmethod
def equals(self, other: 'DataType') -> bool:
"""
Return true if other DataType contains the same metadata as this
DataType
"""
pass


class PrimitiveType(DataType):

def equals(self, other: DataType) -> bool:
return type(self) == type(other)


class NullType(PrimitiveType):
"""
A data type whose values are always null
"""
def to_string(self):
return "null"


class Boolean(PrimitiveType):

def to_string(self):
return "bool"


class NumberType(PrimitiveType):
pass


class IntegerType(NumberType):
pass


class SignedIntegerType(IntegerType):
pass


class Int8(SignedIntegerType):

def to_string(self):
return "int8"


class Int16(SignedIntegerType):

def to_string(self):
return "int16"


class Int32(SignedIntegerType):

def to_string(self):
return "int32"


class Int64(SignedIntegerType):

def to_string(self):
return "int64"


class Binary(PrimitiveType):
"""
A variable-size binary (bytes) value
"""
def to_string(self):
return "binary"


class String(PrimitiveType):
"""
A UTF8-encoded string value
"""
def to_string(self):
return "string"


class Object(PrimitiveType):
"""
Any PyObject value
"""
def to_string(self):
return "object"


class Categorical(DataType):
"""
A categorical value is an ordinal (integer) value that references a
sequence of category values of an arbitrary data type
"""

def __init__(self, index_type: IntegerType, category_type: DataType,
ordered: bool = False):
@datapythonista (Aug 24, 2020):

I'd like to understand better the reason to have the types of the data and the categories in the type.

With an example, I think a category column could be represented with something equivalent to:

class CategoryColumn:
    dtype = 'category'
    index = pyarrow.array([0, 1, 0, 2, 0, 1], type='int8')
    categories = pyarrow.array(['red', 'green', 'blue'], type='string')

So, when accessing the data, I can imagine that we may be interested in one of two options:

  1. The underlying data (to copy the structure from implementation to implementation for example)
  2. To access the actual data (forgetting about the internal representation)

For the first option, an example of code copying the data to a numpy array could be:

for source_column in dataframe:
    if source_column.dtype == 'category':
        copy_of_index = numpy.array(source_column.index, dtype=source_column.index.dtype)

The example is very simplistic, but it shows that we can get the type of the index (int8 in the previous example) from the underlying column (and not from the data type). It feels like if we have it again in the data type it will just be a duplicate that will need to be kept in sync. The same would apply to the data type of the categories.

What would be the main advantage of this implementation, as opposed to having just a categorical type with only the ordered parameter (or two types, categorical and ordered_categorical)? I guess it decouples the schema from the data, but I'm unsure if it's worth the extra complexity. If we don't have parameters, I guess we could just identify data types by their name as a string across implementations, instead of making them implement all these subclasses, which I guess would make things simpler.

Am I missing something? Does what I'm saying make sense?


It's pretty common for users to compare dtypes across columns. With this proposed change, a category column with int32 codes and, say, int64 categories would have the same dtype as one with int8 codes and string categories, which may lead to confusion / frustration for end users.

I'm a +1 to including the index / categories types in the dtype as it makes things more explicit for end users.


Thanks, that's useful to know. Just to understand better, do you have any example on why users compare dtypes across columns?

wesm (Owner Author):

There's lots of reasons to have access to the complete schema, from preallocating space for data (in-memory or on disk) to building query plans (including kernel selection) before data arrives. Not having access to the index type and category type would (IMHO) make a lot of things much more difficult.

That said, if you have an application where the index or category type is simply not known, you could use an Any type to indicate that the actual type must be observed in the actual data
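For illustration, such an escape hatch could be spelled as one more PrimitiveType subclass (this class is hypothetical and not part of the diff; it is named AnyType here to avoid clashing with typing.Any, which the module already imports):

class AnyType(PrimitiveType):
    """
    Indicates that the concrete type is unknown and must be observed in the
    data itself
    """
    def to_string(self):
        return "any"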


> Thanks, that's useful to know. Just to understand better, do you have any example on why users compare dtypes across columns?

Unfortunately our users don't share the "why" too often, but one use case is checking the output of dtype inference from something like a read_csv call.


> Not having access to the index type and category type would (IMHO) make a lot of things much more difficult.

Thanks @wesm, that makes sense. But I don't fully understand this part. Given a categorical column, there is access to the index and category types in my example (there was a typo in my previous code, maybe that's where I didn't make myself clear). It's just the place where it is found that is different (in the array instead of in the dtype).

In your proposal:

my_column.dtype.index_type

In my example would be something like:

my_column.underlying_categorical_data.index.dtype

I understand that in my example there wouldn't be access to the underlying types if a dataframe schema is used without the actual dataframe. Then, with your proposal we could still know the internal types of the categorical, while in my example those will be undefined, since they should be in the data, which doesn't exist. Is this what you have in mind? If it is, in what case are we considering that a schema can live without the data?


Accessing the actual data is potentially costly, I think? So that would make getting the type of the categories costly, while I assume accessing it from the dtype itself would not.


> Accessing the actual data is potentially costly, I think?

Good point. In the PR, the type (aka dtype) is a property of the Column class, so accessing the index type would be

class Column:
    @property
    def type(self):
        pass

my_column.type.index_type

The data is not specified, maybe that's the problem, but if we agree to represent the categorical data as two subcolumns, one for the index and one for the categories, then accessing the index type with an unparametrized categorical dtype would be:

my_column.index_subcolumn.type

I may be missing something, but I would say in both cases the type is metadata of the Column class that doesn't require accessing the actual data. Not sure if in certain architectures (distributed, GPU...) accessing a subcolumn could be expensive, but in pure Python both cases would be two lookups in the namespace.

I'm not sure what it would imply if we consider implementations using Arrow's dictionary type instead of the two arrays separately. Would that be costly?

self.index_type = index_type
self.category_type = category_type
self.ordered = ordered

def equals(self, other: DataType) -> bool:
return (isinstance(other, Categorical) and
self.index_type == other.index_type and
self.category_type == other.category_type and
self.ordered == other.ordered)

def to_string(self):
return ("categorical(indices={}, categories={}, ordered={})"
.format(str(self.index_type), str(self.category_type),
self.ordered))
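For illustration, a dictionary-encoded string column's type could then be described and compared like this (a usage sketch based on the classes as written above, not part of the diff):

dict_type = Categorical(Int8(), String())
assert dict_type.to_string() == "categorical(indices=int8, categories=string, ordered=False)"
assert dict_type.equals(Categorical(Int8(), String()))
assert not dict_type.equals(Categorical(Int32(), String()))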


# ----------------------------------------------------------------------
# Classes representing a column in a DataFrame


class Column(ABC):


What is the benefit of having a column-specific interface vs just a DataFrame with one column? In Modin, underneath everything is a DataFrame and Column/Series objects are views into the dataframe implemented at a higher layer. I'm interested to hear the motivation for a column-specific interface.

wesm (Owner Author):

The semantics that most existing applications / users expect when they do df[col_name] is to get something that semantically references a 1-dimensional array, or something convertible to a 1-dimensional array. This seems to hold not just in Python but in other PLs that have data frames:

  • In R, df$col_name yields a (1-D) R vector
  • In Julia's DataFrames.jl, df.col_name yields an Array of some type

In all the Apache Arrow implementations there's a logical / semantic distinction between a RecordBatch / Table and a field of a RecordBatch / Table (which yield an Array or ChunkedArray, respectively)

Secondly, the semantics and API of conversion methods at the column level vs. at the data frame level may differ. For example, suppose we wanted to select some number of columns from a data frame and then convert them to a dict of NumPy arrays, it might make sense to write

df.select_columns(names).to_numpy_dict()

or similar. I'm interested to see what scikit-learn and matplotlib developers think.
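As a sketch of how such a consumer-side convenience could be built purely from the abstract interface in this file (to_numpy_dict here is an illustration, not something the PR defines):

def to_numpy_dict(df: 'DataFrame') -> dict:
    # Convert each column of a protocol data frame to a NumPy array,
    # relying only on column_names and column_by_name from the interface.
    return {name: df.column_by_name(name).to_numpy() for name in df.column_names}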


@property
@abstractmethod
def name(self) -> Hashable:
pass

@property
@abstractmethod
def type(self) -> DataType:
"""
Return the logical type of each column cell value
"""
pass

def to_numpy(self):
"""
Access column's data as a NumPy array. Recommended to return a view if
able but not required
"""
raise NotImplementedError("Conversion to NumPy not available")

def to_arrow(self, **kwargs):
"""
Access column's data in the Apache Arrow format as pyarrow.Array or
ChunkedArray. Recommended to return a view if able but not required
"""
raise NotImplementedError("Conversion to Arrow not available")

Could these APIs be generalized more to something like to_array and to_masked_array? I imagine users will use these APIs as building blocks, and if we're for example using a GPU we wouldn't want to either return a differently typed GPU-backed object or cause an unnecessary host --> device copy.

For example, Pandas uses .values to return a numpy / Pandas array whereas in cuDF we return a cupy array to keep everything on the GPU as much as possible.

@wesm (Owner Author, Apr 8, 2020):

I don't understand your comment.

The idea of this interface is that the data representation is not known. So we will add various methods to Column (what is here is NOT exhaustive, maybe that is where there is confusion) to allow the consumer to request the data be returned to them in the representation that they prefer. I put NumPy and Arrow as two example concrete representations


I guess my point is if we have to_numpy and to_cupy, the standard will become to_numpy where anyone then looking to change to a different array implementation will have to go back and rework their code. If it was instead to_array and the object returned is expected to return something supporting some form of array protocols, it allows for writing more general code that can be run anywhere.

Maybe I was unclear as to whether to_numpy and to_arrow were placeholders for generic conversion functions for somewhat generic objects or if they were specific to returning numpy arrays and arrow arrays.

@wesm (Owner Author, Apr 9, 2020):

I see. There is no "standard" here, and this effort is not trying to establish a standard representation. The idea is that some projects are implemented against a certain data representation (e.g. NumPy arrays) -- so indeed these methods are here so that a producer can return their data to the consumer in the representation that they are looking for. matplotlib is an example -- it deals almost exclusively in NumPy arrays so rather than have matplotlib take responsibility for the details of converting from some foreign data frame object to a NumPy array, this conversion is the responsibility of the producer of this interface.

wesm (Owner Author):

> I guess my point is if we have to_numpy and to_cupy, the standard will become to_numpy where anyone then looking to change to a different array implementation will have to go back and rework their code.

Also for clarity, efforts to achieve duck-typing between array libraries are pretty orthogonal to this project -- if such a duck-typed array standard is developed and adopted / normalized, then this interface can add it. The use case here is strictly abstract data interchange that does not force producers to convert to a particular intermediate format like pandas.


Understood, that makes sense. My fear is that if protocols can't be relied on entirely, the default path consumers will take is to use .to_numpy(), where ideally they could use something more generic like .to_array() which produces something array-like that supports calling numpy functions against it. Without something like this, a GPU library like cuDF that wants to support this protocol faces an awful choice: either return something like a cupy array from the to_numpy() call to keep data on the GPU and do things efficiently, but possibly break for some people / libraries integrating downstream who were expecting host arrays specifically, or have to_numpy() explicitly copy the data to host, which then possibly throws away the performance we may have gained from using the GPU.

wesm (Owner Author):

I see. I don't know that it makes sense to be opinionated in this way at this level. We can provide people with choices and they can decide what makes sense for them. I don't see a problem with having a to_$something method as long as that $something is documented and something that people can understand how to use. So you can provide both the "general array like" path and the "NumPy" path and people can choose. That you have a preference about which one you want people to use seems like it shouldn't factor into the interface


The problem is that if the "NumPy" path is included in the interface whereas the "general array like" path isn't, most producers will likely only implement the "NumPy" path, which will make almost all consumers consume the "NumPy" path. This then becomes a problem for us when trying to integrate efficiently: we would have to convince everyone to change to the "general array like" path, as opposed to that being the typical path.

wesm (Owner Author):

Would you like to propose an API to add? In the PR description I wrote "The APIs proposed here are not intended to be to the exclusion of others -- more APIs can be proposed and added later." What is written here is not exhaustive.


Apologies, I missed that. Will add a suggestion to kick off discussions on adding the generic API I was talking about.
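For concreteness, the kind of generic method being discussed might look something like the following on the Column class (the name to_array and its semantics are assumptions for discussion, not part of the PR):

def to_array(self, **kwargs):
    """
    Access the column's data as an object implementing a generic array
    protocol (for example __array_interface__ or __cuda_array_interface__),
    leaving the choice of concrete array library to the producer
    """
    raise NotImplementedError("Generic array conversion not available")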



# ----------------------------------------------------------------------
# DataFrame: the main public API


class DataFrame(ABC, Mapping):


Other things I would add to the DataFrame interface:

  • dtypes equivalent: some mapping from column name to type
  • some way to slice/subset the rows. See the overall comment for more on this.
  • to_numpy and to_arrow at the DataFrame level


Is the intent that pandas be able to implement this interface? If so, the .values attribute may cause issues, since it's a property returning an ndarray.

We're set up to deprecate DataFrame.values, with the alternative .to_numpy() which has been available for a few releases now. But we haven't made the deprecation yet since it'll be a bit disruptive.

This type of protocol would I think give enough justification for a gradual deprecation though.

@wesm (Owner Author, Mar 17, 2020):

@TomAugspurger -- not "implement" in the way you're describing. Instead:

class PandasWrapper(dataframe_protocol.DataFrame):
    pass

# in pandas
class DataFrame:

    def __dataframe__(self):
        return PandasWrapper(self)

"""
An abstract data frame base class.

A "data frame" represents an ordered collection of named columns. A
column's "name" is permitted to be any hashable Python value, but strings
are common. Names are not required to be unique. Columns may be accessed by
name (when the name is unique) or by position.
"""

def __dataframe__(self):
"""
Idempotence of data frame protocol
"""
return self

def __iter__(self):
# TBD: Decide what iterating should return
return iter(self.column_names)

def __len__(self):
return self.num_rows

@property
@abstractmethod
def num_columns(self):


Personally I would rather use len(df.column_names) than have this method; this API is going to be huge, and I don't think this adds much value.


I disagree strongly here. df.column_names is expected to return a Python Sequence and getting the length of that Python object may not be free. I may want to call into an explicit function in C++ for this for example.

wesm (Owner Author):

FWIW, I still have df.column_names as Sequence. Maybe it should be Iterator instead


@kkraus14 do you mind expanding on the use cases you have in mind?

I don't have a strong opinion on not having the number of columns as a property or method. It's more a preference for not making users remember one more method name in an API, if it's not needed. So if this is useful for some cases, happy to have it.

What I've got in mind is that dataframes would have a known and somewhat reduced number of columns (let's say in the largest examples I can think of we're talking about a few thousand). If that was the case, I think column_names being simply a Python list would probably be good enough, and IMO would make things easier. Accessing the column names is not something I imagine you would want to do inside a loop (materializing the list inside a loop if it's not saved internally as a Python list, that is; of course looping over the list itself is expected). And computing the length of it would be trivial. Even if you prefer a C++ implementation, I guess you could create a list-like class which supports length, indexing, and iterating, and you could avoid having the whole list as Python objects.

But from your comment, I think I'm wrong in my assumptions. So it would be great to know better what you have in mind, and which use cases you have seen where having the column names as a Python list is a bad idea. If you have some code examples of how the protocol dataframe would be used, that's even better.

Thanks!

"""
Return the number of columns in the DataFrame
"""
pass

@property
@abstractmethod
def num_rows(self):


Same; I don't think we need an alias for len(df). I think using len() is the Pythonic way, and it's enough.

wesm (Owner Author):

There's a certain ambiguity to len(df), though. Because the length of a dict/mapping is the number of keys that it has (which here is the number of columns). We take for granted from pandas that len(df) returns the number of rows

In R

> df <- data.frame(a=c(1,2,3,4), b=c(5,6,7,8))
> length(df)
[1] 2
> nrow(df)
[1] 4
> ncol(df)
[1] 2


I agree on that.

To me it makes more sense that len(df) is the number of rows, and not of columns. And I'd avoid making a dataframe behave like a dict of its columns. But that's just my preference, and of course that's a reasonable choice.

But my point here was more to have just one way of doing things, and not repeat num_rows or num_columns for what can be done with len. I think we should follow the recommendation from the Zen of Python: "There should be one-- and preferably only one --obvious way to do it." Makes things simpler for users IMO.


I agree with @wesm here that len(df) is ambiguous and prefer the explicit num_rows property.


IMHO len should denote the length of list(df) so you have to decide if it's an iterable of rows or columns.

...or just punt on it, let len(df) and iter(df) raise and make the properties rows(self) -> Iterable[Row] and columns(self) -> Iterable[Column] return iterables (which themselves implement __len__)


Coming back to this, a couple of reasons why len(df) may not be appropriate (besides some people finding it ambiguous which dimension it should return).

CPython limits the result of the dunder method __len__ to an index-sized integer (Py_ssize_t):

Python 3.7.6 | packaged by conda-forge | (default, Jun  1 2020, 18:57:50) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> class LenTooBig:
...     def __len__(self):
...         return 2 ** 64 + 1
... 
>>> len(LenTooBig())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: cannot fit 'int' into an index-sized integer

That is many trillions of data points, which I'm not sure anyone will ever reach (surely not with in-memory representations), but it's still worth mentioning that the limitation doesn't exist for regular methods.

Another thing, with distributed systems in mind, does it make sense that this is a property (or __len__)? Personally, if I see:

df.num_rows  # also applies to `len(df)`

I'm mostly assuming that num_rows is known / easily available. But for distributed systems that can be an expensive operation. Would something like df.count_rows() make more sense? Also, I personally find it more obvious that df.count_rows() can fail (and raise an exception) than df.num_rows, which I'd expect to always work, be fast, and not consume significant resources.

wesm (Owner Author):

Making it a function makes sense to me for the reasons you listed. It's probably best in this utilitarian interface to not have __len__
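A hypothetical spelling of that on the DataFrame ABC (count_rows is not in the current diff, just an illustration of the idea):

@abstractmethod
def count_rows(self) -> int:
    """
    Count the number of rows, which may trigger computation or communication
    in lazy or distributed implementations
    """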


This returns the number of rows in the DataFrame (if known)

Suggested change:
- def num_rows(self):
+ def num_rows(self) -> Optional[int]:

should the API use Optional types or raise (maybe a custom exception) if not known?

"""
Return the number of rows in the DataFrame (if known)
"""
pass

@abstractmethod
def iter_column_names(self) -> Iterable[Any]:


Why would we need both this and column_names? Can we leave it to the application to materialize and just have column_names return an Iterable?


"""
Return the column names as an iterable
"""
pass

# TODO: Should this be a method or property?


A property IMHO

@property
@abstractmethod
def column_names(self) -> Sequence[Any]:


How about a columns property that returns a ColumnCollection that itself has a names property, and similarly for rows

wesm (Owner Author):

See comment above. I am not sure this API should be in the business of syntactic sugar


As a user I'd like to write generic code which would work regardless of the concrete type of DataFrame passed. For many applications I could probably get by with a restricted API, but I'd still want to code a nice interface which used idiomatic Python constructs. The proposed interface here strikes me as being extremely cumbersome to use in practice. Just one viewpoint though, there may well be other considerations to take into account other than aesthetics.

wesm (Owner Author):

> As a user I'd like to write generic code which would work regardless of the concrete type of DataFrame passed.

This is not a requirement / goal of this project (see the original discussion thread).


This is exactly the motivating use case at the start of that thread:

> Bigger picture: programming to an interface

If the discussion there has veered off from that course then it is probably a good idea to write a more formal proposal to capture the current state.

wesm (Owner Author):

Take it up with people on Discourse. If it turns out that people don't want interchange / export APIs I'm going to walk away from this project. I've already spent a lot of time on it.

"""
Return the column names as a materialized sequence
"""
pass

# TODO: Should this be a method or property?
@property
def row_names(self) -> Sequence[Any]:


I'm -1 on row names / indices. IMO that shouldn't be part of the standard (of course pandas and any other project can have this as a feature, but shouldn't be here).

Of course we could have this optional, but I think it's more a pandas thing, and we shouldn't add complexity to the standard.


R's DataFrame also has row labels, so it's not specific to pandas. Are there any dataframe implementations that don't have row labels?

wesm (Owner Author):

The way I look at it, if some implementations return None while others return a Sequence, it is not very harmful.

One can of worms: row_names may need to return an object with the same API as columns. So this might better return a Column.


I think most dataframe implementations have row labels. But in the Python world I think that is because everybody is trying to replicate the pandas API for compatibility.

Having row labels is surely a good idea for many dataframe implementations. My point is that not having them is also a good idea to me. So, I would leave them out of this proposal, at least initially.

IIRC @maartenbreddels mentioned that he'd make Vaex behave more like SQL regarding indexing. If I had to implement a dataframe library from scratch I'd also have indices as regular columns that can be indexed. There would surely be some drawbacks compared to pandas, but I think the API for users would be much simpler, and I prefer that.

In any case, not trying to say that projects shouldn't have row labels. Just that my preference is to not include them here. It narrows the discussion, and it avoids biasing towards the R/pandas approach.


Do we want any kind of invariant around np.asarray(dataframe)? To me the most natural case is something like np.concat([col for col in df.values()], axis=1)

But if you're storing your row labels "in band" (as part of the columns in the dataframe) then your row labels will be mixed with the values.


Both R and pandas data frames have row names. So I don't see a big downside of having an API that allows them to make their row names available to consumers. It does not mean that other libraries need to implement row names

While it isn't required from the protocol perspective, people will end up writing code using these optional features unnecessarily, i.e. in Pandas using df.reset_index()[...] versus df.iloc[...]. Instead of these features being used specifically when necessary, they end up getting used everywhere, which almost becomes a forcing hand for libraries to have to support the feature.

wesm (Owner Author):

If a consumer writes code that assumes there are always row names, even though the interface says that they are optional, isn't that the consumer's problem? The abstract API is a contract for producer and consumer, so it wouldn't be appropriate for consumers to de facto violate the contract by making an optional feature required.


It's the consumer's problem until there's enough people with the same problem (not wanting to rewrite code that assumes row names) that it becomes a problem for producers.

For what it's worth, I fought against having row indices in cuDF and ultimately lost the battle from the overwhelming amount of libraries we wanted to integrate with and user code assuming row indices purely because they were built against Pandas originally and it became too large of a battle to continue fighting.

I'm still against row indices, but would rather them be required than optional.

wesm (Owner Author):

Sorry, I guess that one's mostly my fault.

In R, row names are optional and it doesn't seem to cause problems. In pandas the .index has a functional purpose beyond purely communicating data, that may be part of why people are asking for it. In this interface, there is no computational functionality, purely one-way data transfer, so that seems less problematic.


> In pandas the .index has a functional purpose beyond purely communicating data, that may be part of why people are asking for it.

In our case an Index is just a Column around which we provide the syntactic sugar expected from an Index. If someone wants to grab rows by index value, we currently build a hash table and then probe that hash table each time (we have plans to memoize it in the future). No one has pointed this out or complained about the performance of this thus far (outside of CUDA developers writing the libcudf C++ library 😆), so I don't think it has to do with computational functionality; it is purely about the data transfer aspects.

"""
Return the row names (if any) as a materialized sequence. It is not
necessary to implement this method
"""
raise NotImplementedError("row_names")

def __getitem__(self, key: Hashable) -> Column:


My preference would be to force column names to be str. I think the advantages in simplifying this outweigh the limitation it imposes. Then __getitem__ can unambiguously perform all reasonable operations: get one item (by name or position), get multiple, and slicing.

wesm (Owner Author):

I think we should be careful about going down the route of having a heavily-overloaded __getitem__ in this interface. It may be better to be as explicit and unambiguous as possible. For example, I am not even sure that df[start:end] should be available


This interface should perhaps not even have a __getitem__ because it's really ambiguous and specific to implementations. It should be explicitly get_columns or get_rows to avoid ambiguity between implementations.


Agree with you. I'd surely avoid too much overloading, but how much is too much is difficult to say. Maybe some examples can help us see what our preference is in practice:

df.column_by_name('col2')
df.column_by_position(1)
df.columns_by_name('col2', 'col3')  # or receive a list instead of `*args`
...
df['col2']
df[1]
df['col2', 'col3']  # note that the param of `__getitem__` is a tuple, not a list like in df[['col2', 'col3']]

Not sure what the best option is, maybe something in between, but I'm biased towards the second set, while surely avoiding any ambiguity or anything too crazy. And I think requiring column names to be strings makes things easier. For example, if a tuple is accepted (it is in your code, as it's hashable), things become more complex and more easily ambiguous (see the last line of my example).

So, the exact API surely requires some discussion, but my main point was that I'd personally limit key to be a string. I understand that many people will prefer the flexibility over the simplicity, but that's my preference.


> It should be explicitly get_columns or get_rows to avoid ambiguity between implementations.

I think this is a very good point. Do we want this PR to be both the API for end users and for downstream projects accepting generic dataframes? From @devin-petersohn's comment, maybe we should be discussing the API only for the latter (downstream projects), and leave the end-user API to each project. If we do that, I'd avoid using any magic method, and prefix every method so there are no collisions with the public API of the implementations (e.g. __dataframe_columns__, __dataframe_num_rows__).

wesm (Owner Author):

My understanding of the discussion so far is that objects providing data frame-like data should have a single "export" method __dataframe__ that returns an implementation of a "data frame interface" object. I do not think that e.g. pandas.DataFrame should be a subclass of this object. This PR is intending to define some of the APIs that this object should contain

return self.column_by_name(key)

@abstractmethod
def column_by_name(self, key: Hashable) -> Column:


Is there any plan for an interface that allows getting multiple columns at a time?

wesm (Owner Author):

Yeah I think multiple-column-selection and row slicing should also be in the API. I didn't write anything down for those yet to see if there's agreement on the simplest concepts (getting access to the data in a single column)


should the API define the behaviour when the column name is not unique vs. when the key is not found?

wesm (Owner Author):

Good question, open to ideas. Though we should probably avoid polymorphic output types like in some pandas APIs.


I agree about avoiding polymorphic output types. In Modin this is solved by not having a Column type, so slicing/indexing always returns another DataFrame.

"""
Return the column whose name is the indicated key
"""
pass
@dhirschfeld (Apr 9, 2020):

From a Python dev perspective (who would like to use this functionality) the get/select methods seem a bit awkward.

If we instead extend the columns(self) -> ColumnCollection: model we can have an api like:

In [79]: df = DataFrame('A', 'B', 'C')

In [80]: df
Out[80]: DataFrame(['A', 'B', 'C'])

In [81]: df.columns
Out[81]: ColumnCollection(['A', 'B', 'C'])

In [82]: df.columns.idx(0)
Out[82]: Column(name='A')

In [83]: df.columns.idx(0, 2)
Out[83]: DataFrame(['A', 'C'])

In [84]: df.columns['A']
Out[84]: Column(name='A')

In [85]: df.columns['A','C']
Out[85]: DataFrame(['A', 'C'])

...which, at least to me, seems a lot more natural.

Code:
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Column(object):
    name: str


class ColumnCollection(object):
    _columns: List[Column]

    def __init__(self, *columns: Column):
        self._columns = list(columns)

    @property
    def names(self) -> List[str]:
        return [col.name for col in self._columns]

    def idx(self, *indices: int) -> Union[Column, DataFrame]:
        columns = [self._columns[idx] for idx in indices]
        if len(columns) == 1:
            return columns[0]
        return DataFrame(
            *(col.name for col in columns)
        )

    def __getitem__(self, names: Union[str, Tuple[str, ...]]) -> Union[Column, DataFrame]:
        columns = dict(zip(self.names, self._columns))
        if isinstance(names, str):
            return columns[names]
        return DataFrame(*names)

    def __len__(self):
        return len(self._columns)

    def __iter__(self):
        for col in self._columns:
            yield col

    def __repr__(self):
        columns = "', '".join(self.names)
        return f"ColumnCollection(['{columns}'])"


class DataFrame(object):
    _columns: ColumnCollection

    def __init__(self, *columns):
        self._columns = ColumnCollection(
            *(Column(col) for col in columns)
        )

    def __repr__(self):
        columns = "', '".join(self.columns.names)
        return f"DataFrame(['{columns}'])"

    @property
    def columns(self):
        return self._columns

wesm (Owner Author):

> From a Python dev perspective (who would like to use this functionality) the get/select methods seem a bit awkward.

I strongly feel that syntactic sugar / conveniences (which get non-objective very quickly) belong at a higher level than this API, which is about library developers implementing unambiguous data export APIs.


I'm interested in this as a potential way to write analytics functions which are DataFrame implementation agnostic. i.e. instead of coding to the pandas API I would code to this API hence its usage would be front-and-centre in my code in the same way that pandas now is. Maybe this isn't supposed to cater to that end-user concern though?



@abstractmethod
def column_by_index(self, i: int) -> Column:
"""
Return the column at the indicated position
"""
pass


class MutableDataFrame(DataFrame, MutableMapping):


I'm in favor of dropping this completely. Ideally this interface can discourage mutability because in general it's bad practice. It is still possible to use the immutable DataFrame for mutable legacy systems as is, but moving away from this will be better for dataframes as a whole in my opinion.


In Modin, the mutability is implemented at the API layer, but underneath is completely immutable.

wesm (Owner Author):

I also don't think that mutability makes sense in an interchange API like this. I put this here to "stir the pot" and elicit reactions from others.


I think when it comes to performance/memory usage, it's important for the users/libs to be able to modify values inplace. But I guess you could argue we could convert everything first to numpy arrays and then deal with that.

wesm (Owner Author):

The main problem I see with this is that an implementer of the protocol/API may have to deal with converting from foreign input. For example:

df[col] = data

so if data is a NumPy array and df doesn't use NumPy arrays internally, then it's a can of worms to account for all of the different potential kinds of data that the user may pass in


I think another downfall of inplace is that it is often deceptive and leaky. In pandas, the overwhelming majority of the time it is not really inplace, it is copying at some point, so users who think they are being more efficient or using best practices are often not getting the desired effect under the hood. Being functional (df.<op> returns another dataframe) is better because external libraries should ideally not have to think about optimizing calls for performance or memory, but rather trust that those who implemented the library did a good job.


That may be true for pandas, but some simple arithmetic operations in numpy show a different story. For instance, if a and b are random 10000x10000 float64 arrays, we have:

In [12]: %timeit np.add(a, b, out=a)                                                                                                                                                                                                   
116 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit np.add(a, b)                                                                                                                                                                                                          
483 ms ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This may also just indicate that we should preferably be operating on a lower-level layer (like numpy) rather than on the data frame implementations, if we want to do arithmetic.

@devin-petersohn (Mar 17, 2020):

There's one more thing to consider I did not mention before: From a library perspective, editing input data inplace is very destructive to the user's environment.

Often there is a disconnect between users and those who maintain or create libraries. You will hear and see a lot of arguments about performance, but in reality users just do not care as much about performance as they do other things. It is easy for a library maintainer to focus on performance, it is a raw number we can try to optimize. However the user cares more about their own productivity and about prediction accuracy. A real world difference of a few seconds or even minutes is noise to a user who is interpreting results.

Also a minor note: your timeit example does not make sense in practice because the In[12] trial updates a each time. The original a values are gone after the first iteration. It's a minor point because it doesn't actually affect performance. The outputs of In[12] and In[13] are not the same.

Ignoring this correctness issue for a moment, you can improve the performance of the In[13] line by almost a factor of 2 if you do the following (provided you would want to store the output in some variable c):

%%timeit
c = np.copy(a); np.add(a, b, out=c)

This gets to my point about performance. Optimizing for the behaviors of some other library that is out of your control is a dangerous business to be in, especially when you destroy a user's environment to do so. If you are the user destroying your own environment, that's perfectly fine, but the purpose of this interface is for libraries to avoid depending directly on pandas (forcing conversion into pandas). From a uniform interface perspective, we want to discourage in place updates because it is a leaky abstraction (should dataframe consumers really know about or rely on internal state?) and destructive to the user's environment.

Users may think they want this, but the same thing can be functionally achieved by df = df.dont_update_inplace(). The only argument to keep it is a performance argument, and I feel this is a very weak argument. If a dataframe library is not effectively managing memory or has performance issues that is an issue the dataframe library should fix, not something to be attempted to be worked around by manually changing internal state in place.


Performance is something we really care about in sklearn. We may not always be optimal, but we try, and when we do change things in place, it's not the user's data we're changing. It's an intermediate array we have in our methods. The optimizations we do even touch the level of an array being F-contiguous or C-contiguous, depending on the arithmetic we need to apply to the data. So at least for our use case, these details really do matter.

# TODO: Mutable data frames are fraught at this interface level and
# need more discussion
pass
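To make the shape of the protocol concrete, here is a minimal sketch of a producer implementing the abstract classes above with a plain dict of 1-D NumPy arrays. SimpleColumn and SimpleDataFrame are illustrative names and are not part of the PR; only a couple of dtypes are mapped, and a real producer would cover its full type system.

import numpy as np


class SimpleColumn(Column):
    """A column backed by a 1-D NumPy array (illustrative only)."""

    def __init__(self, name, values):
        self._name = name
        self._values = np.asarray(values)

    @property
    def name(self) -> Hashable:
        return self._name

    @property
    def type(self) -> DataType:
        # Map a couple of NumPy dtypes to the logical types defined above;
        # anything unrecognized is reported as Object.
        mapping = {np.dtype('bool'): Boolean(), np.dtype('int64'): Int64()}
        return mapping.get(self._values.dtype, Object())

    def to_numpy(self):
        return self._values


class SimpleDataFrame(DataFrame):
    """A data frame backed by a dict mapping names to 1-D NumPy arrays."""

    def __init__(self, data):
        self._data = dict(data)

    @property
    def num_columns(self):
        return len(self._data)

    @property
    def num_rows(self):
        return len(next(iter(self._data.values()))) if self._data else 0

    def iter_column_names(self) -> Iterable[Any]:
        return iter(self._data.keys())

    @property
    def column_names(self) -> Sequence[Any]:
        return list(self._data.keys())

    def column_by_name(self, key: Hashable) -> Column:
        return SimpleColumn(key, self._data[key])

    def column_by_index(self, i: int) -> Column:
        name = self.column_names[i]
        return SimpleColumn(name, self._data[name])


df = SimpleDataFrame({'a': np.array([1, 2, 3], dtype='int64'),
                      'flag': np.array([True, False, True])})
assert df.num_columns == 2 and df.num_rows == 3
assert df.column_names == ['a', 'flag']
assert str(df.column_by_name('a').type) == "int64"
assert df['flag'].to_numpy().dtype == np.dtype('bool')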