Mutability #10

Open
TomAugspurger opened this issue May 29, 2020 · 7 comments

@TomAugspurger

On the call yesterday, the topic of mutability came up in the vaex demo.

The short version is that it may be difficult or impossible for some systems to implement inplace mutation of dataframes. For example, I believe that neither vaex nor Dask implement the following:

In [8]: df = pd.DataFrame({"A": [1, 2]})

In [9]: df
Out[9]:
   A
0  1
1  2

In [10]: df.loc[0, 'A'] = 0

In [11]: df
Out[11]:
   A
0  0
1  2

I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.

There is one mutation-adjacent area that might be considered: using DataFrame.__setitem__ to add an additional column

In [12]: df['B'] = [1, 2]

In [13]: df
Out[13]:
   A  B
0  0  1
1  2  2

Or perhaps to update the contents of an entire column

In [14]: df['B'] = [3, 4]

In [15]: df
Out[15]:
   A  B
0  0  3
1  2  4

In these cases, no values are actually being mutated inplace. Is that acceptable?
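For what it's worth, pandas already has a purely functional spelling of this column-addition pattern, `assign`, which returns a new dataframe and sidesteps the mutation question (a small sketch):

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 2]})

# Functional alternative to df['B'] = [1, 2]: assign() returns a new
# dataframe with the extra column; the original df is left untouched.
df2 = df.assign(B=[1, 2])

assert "B" not in df.columns   # original unchanged
assert list(df2["B"]) == [1, 2]
```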

@devin-petersohn
Member

I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.

I agree completely. It is possible to simulate inplace behavior for legacy systems, but going forward we should encourage users to write their code in a more functional style.

There is one mutation-adjacent area that might be considered: using DataFrame.__setitem__ to add an additional column

Not to derail, but I would argue against __setitem__ in the API spec (more discussion in #6). Adding a column should be something like insert and updating a column should be perhaps update. If we do not support inplace mutations, we should do what we can in the API spec to avoid giving the idea that things are updating inplace.
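A rough sketch of what such names could look like (`insert` and `update` here are placeholder names, not part of any agreed spec), implemented functionally on top of pandas:

```python
import pandas as pd

# Hypothetical spellings for the names suggested above. Both return a
# *new* dataframe rather than mutating in place.

def insert(df: pd.DataFrame, name: str, values) -> pd.DataFrame:
    """Return a copy of `df` with a new column appended."""
    if name in df.columns:
        raise ValueError(f"column {name!r} already exists")
    return df.assign(**{name: values})

def update(df: pd.DataFrame, name: str, values) -> pd.DataFrame:
    """Return a copy of `df` with an existing column replaced."""
    if name not in df.columns:
        raise KeyError(name)
    return df.assign(**{name: values})

df = pd.DataFrame({"A": [1, 2]})
df2 = insert(df, "B", [3, 4])
df3 = update(df2, "B", [5, 6])
assert list(df.columns) == ["A"]   # original never touched
assert list(df3["B"]) == [5, 6]
```

Distinct verbs also make the error cases explicit: `insert` on an existing column and `update` on a missing one both fail loudly, where `__setitem__` silently does either.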

@TomAugspurger
Author

Sure, let's avoid the __setitem__ discussion for now and focus only on inplace expansion of columns.

I didn't state it explicitly, but I think inplace expansion of rows is out of scope, given the columnar focus.

@saulshanabrook

I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.

One possible issue with this approach is that if the DataFrame API uses the Array API as its column interface, then this could also mean that arrays must be immutable, which would mean that this kind of existing imperative code in sklearn would not work with the standard.
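The kind of imperative sklearn-style code in question looks roughly like this (an illustrative sketch, not actual sklearn source): values are overwritten in place, which presumes a mutable array type:

```python
import numpy as np

# Sketch of imperative preprocessing common in sklearn internals
# (illustrative only): missing values are imputed by writing into the
# array in place, which requires mutability.
X = np.array([[1.0, np.nan], [3.0, 4.0]])
col_means = np.nanmean(X, axis=0)   # per-column means ignoring NaN

mask = np.isnan(X)
X[mask] = np.take(col_means, np.nonzero(mask)[1])  # in-place write

assert not np.isnan(X).any()
assert X[0, 1] == 4.0   # imputed with the column mean
```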

It seems like there are at least a couple of different options here, assuming that we do want to keep supporting an imperative array API for things like sklearn to use and a functional dataframe API for things like Modin or Vaex to implement.

  1. Keep dataframes functional and arrays imperative. "An API for going from dataframe API standard -> array API standard might solve this issue." from @devin-petersohn
  2. Have two layers/levels for each API: a higher-level, messier version that allows mutation and non-functional operations, and a lower-level one where all operations are functional. The high-level dataframe would return a high-level array that allows mutation; the low-level dataframe would return the low-level array. This is in line with what Travis said: "One that is more dev-centric and another that is more user-centric. For clarity, and now that packaging is in a better situation, two libraries are often more effective than one."
  3. Instead of separating into high/low level, have separate mutable and immutable APIs. This would be like how collections.abc has Mutable versions of some of its ABCs.
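For reference, the stdlib precedent in the last option (the module is `collections.abc`) splits read-only and mutable interfaces like this:

```python
from collections.abc import Sequence, MutableSequence

# MutableSequence extends the read-only Sequence interface with
# mutation methods (__setitem__, insert, ...).
assert issubclass(MutableSequence, Sequence)

# tuple satisfies only the immutable interface; list satisfies both.
assert isinstance((1, 2), Sequence)
assert not isinstance((1, 2), MutableSequence)
assert isinstance([1, 2], MutableSequence)
```

A dataframe spec could mirror this: an immutable core interface, with a mutable extension interface for implementations that can support it.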

@devin-petersohn
Member

@TomAugspurger agree. Taking it a step further, I think no inplace updates of any kind should be supported: expansion/destruction of rows/columns, inplace point mutations, etc.

Back to your original question:

In these case, no values are actually being mutated inplace. Is that acceptable?

I believe that all APIs should result in a new dataframe, so I would argue that we don't want an API for creating a new column on an existing dataframe, or for modifying the data underlying an existing dataframe object. (Implementations are free to offer this individually, but it will be outside the spec.)

Often users chain operations together. Ideally every operation can be chained with another in the API. Example from pandas:

df.dropna(inplace=True)
df.groupby("c").count()

vs

df.dropna().groupby("c").count()

For general use, it is better not to make the user choose between inplace and not. The bottom example also has the nice benefit of being easier to debug.
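One concrete sense in which the chained form is easier to debug: every step returns a new object, so the chain can be broken at any point and the intermediate inspected:

```python
import pandas as pd

df = pd.DataFrame({"c": ["x", "x", None], "v": [1, 2, 3]})

# The chain df.dropna().groupby("c").count() can be split at any step,
# since each step returns a new dataframe and nothing is mutated.
step1 = df.dropna()
assert len(step1) == 2          # the row with the missing key was dropped

result = step1.groupby("c").count()
assert result.loc["x", "v"] == 2
```

With `inplace=True` there is no intermediate to inspect: by the time you look, the original data is already gone.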

I am generally against the idea of tightly coupling the array and dataframe APIs.

Is there a reason, other than trying to optimize for performance or memory, to make anything inplace?

@TomAugspurger
Author

TomAugspurger commented May 29, 2020 via email

@TomAugspurger
Author

Apologies for missing this @saulshanabrook

One possible issue with this approach is that if the DataFrame API uses the Array API as its column interface, then this could also mean that arrays must be immutable,

I'm still thinking this through; I'm having trouble working out the implications. I'll give it some more thought.


In general, it sounds like people prefer to avoid mutation of existing data, whether that's mutating individual values inplace or expanding an existing dataframe by adding new rows / columns. So I'll propose that we explicitly leave the following (pandas-style) APIs out of the spec:

  • df.iloc[0, 0] = 0 -- mutate an existing value inplace with __setitem__
  • df.method(inplace=True) -- Mutate an existing dataframe inplace with a method
  • df['A'] = [1, 2, 3] -- Expand a dataframe columnwise
  • df.append([1, 2]) -- Expand a dataframe rowwise

With the option that there might be a mutable level / extension to the standard, once we figure out how to handle the various levels.
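For illustration, each of the excluded APIs has a functional counterpart already available in pandas (a sketch; the eventual spec may well spell these differently):

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})

# df.iloc[0, 0] = 0  ->  rebuild the column with where(), keeping df intact
df1 = df.assign(A=df["A"].where(df.index != 0, 0.0))
assert df1.loc[0, "A"] == 0.0 and df.loc[0, "A"] == 1.0

# df.method(inplace=True)  ->  plain method call returning a new dataframe
df2 = df.dropna()

# df['B'] = [...]  ->  assign() returns a column-expanded copy
df3 = df.assign(B=[4.0, 5.0, 6.0])
assert "B" not in df.columns

# df.append(...)  ->  concat() builds a new, row-expanded dataframe
df4 = pd.concat([df, pd.DataFrame({"A": [4.0]})], ignore_index=True)
assert len(df4) == 4 and len(df) == 3
```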

@datapythonista
Member

Let me try to summarize the discussion, and see if there is agreement to make some decisions.

  • We don't want the API to provide ways to explicitly mutate information in place. My understanding is that we're talking about the API syntax, and it would still be possible for implementations to make operations in place if they want to. For example:
df2 = df1.fillna(0.)

This doesn't force the implementation to operate on df1 inplace, but it also doesn't prevent df1 from being mutated, with df2 being just a reference to df1. Nor does the syntax prevent df1 and df2 from being lazy objects with no memory actually allocated. So, for now, we want a syntax that allows copies in all operations, and we leave for later (or out of scope) what's really going on underneath.
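A toy sketch of that point: an implementation can expose copy-style syntax while sharing the underlying data and deferring all work (`LazyFrame` is a made-up name; Dask and Vaex each do this differently in practice):

```python
import pandas as pd

class LazyFrame:
    """Toy lazy dataframe: records operations, shares underlying data."""

    def __init__(self, df, ops=()):
        self._df = df      # shared, never mutated
        self._ops = ops    # pending operations

    def fillna(self, value):
        # Copy-style syntax, but nothing is copied or computed yet:
        # we just return a new handle with one more pending op.
        return LazyFrame(self._df, self._ops + (("fillna", value),))

    def collect(self):
        out = self._df
        for name, arg in self._ops:
            out = getattr(out, name)(arg)
        return out

df1 = LazyFrame(pd.DataFrame({"A": [1.0, None]}))
df2 = df1.fillna(0.0)                  # cheap: just records the op
assert df1._df is df2._df              # underlying data is shared
assert df2.collect().loc[1, "A"] == 0.0
```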

  • We want the syntax to consistently return dataframes, so methods can be chained.

  • If we want to build a friendlier API (as discussed here), we'll leave that for later (or out of scope). For now let's focus on a developer-oriented (more low-level) API, as agreed in How the API is expected to be used #18.

  • pandas mostly implements an API that allows method chaining. With an example:

import pandas

(pandas.read_csv('countries.csv')
       .rename(columns={'name': 'country'})
       .assign(area_km2=lambda df: df['area_m2'].astype(float) / 1_000_000)
       .query('(continent.str.lower() != "antarctica") | (population < area_km2)')
       .groupby('continent')
       [['population', 'area_km2']]
       .sum()
       .assign(density=lambda df: df['population'] / df['area_km2'])
       [['density']]
       .to_parquet('continent_density.parquet', index=True))

I think most operations are quite simple and non-controversial (feel free to disagree, of course). The main changes (from pandas to the standard API) would be:

  • Remove the syntax that is not consistent with this behaviour
  • Consider special cases, mainly __getitem__ (do we want to replace it with .select() and .filter() instead?)
  • How to refer to columns when creating new columns and in conditions. Is using a lambda, as pandas does, the best way? Or are there alternatives?
  • Obviously the specific methods to implement and their signatures are open to discussion
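To make the column-reference question concrete, here is the pandas lambda style next to a toy column-expression alternative (the `col` class below is a hand-rolled sketch, not a real API; polars has a production version of this idea):

```python
import pandas as pd

df = pd.DataFrame({"population": [10.0, 20.0], "area_km2": [2.0, 4.0]})

# Style 1: pandas-style lambda, as in the example above.
out1 = df.assign(density=lambda d: d["population"] / d["area_km2"])

# Style 2: a column-expression object that builds the computation
# without a lambda (toy sketch).
class col:
    def __init__(self, name):
        self.name = name
    def __truediv__(self, other):
        # Produce a callable that pandas' assign() will evaluate
        # against the dataframe.
        return lambda d: d[self.name] / d[other.name]

expr = col("population") / col("area_km2")
out2 = df.assign(density=expr)

assert list(out1["density"]) == [5.0, 5.0]
assert list(out2["density"]) == [5.0, 5.0]
```

The expression style reads more declaratively and is easier for lazy engines to inspect and optimize, at the cost of a larger API surface.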

Is there agreement in the above points, or are there things that should be discussed further?
