Mutability #10

Open
TomAugspurger opened this issue May 29, 2020 · 7 comments

@TomAugspurger

On the call yesterday, the topic of mutability came up in the vaex demo.

The short version is that it may be difficult or impossible for some systems to implement inplace mutation of dataframes. For example, I believe that neither vaex nor Dask implement the following:

In [8]: df = pd.DataFrame({"A": [1, 2]})

In [9]: df
Out[9]:
   A
0  1
1  2

In [10]: df.loc[0, 'A'] = 0

In [11]: df
Out[11]:
   A
0  0
1  2

I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.

There is one mutation-adjacent area that might be considered: using DataFrame.__setitem__ to add an additional column

In [12]: df['B'] = [1, 2]

In [13]: df
Out[13]:
   A  B
0  0  1
1  2  2

Or perhaps to update the contents of an entire column

In [14]: df['B'] = [3, 4]

In [15]: df
Out[15]:
   A  B
0  0  3
1  2  4

In these cases, no values are actually being mutated inplace. Is that acceptable?
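For what it's worth, pandas already has a purely functional spelling of this column-addition pattern, `assign`, which returns a new dataframe and sidesteps the mutation question (a small sketch):

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 2]})

# Functional alternative to df['B'] = [1, 2]: assign() returns a new
# dataframe with the extra column; the original df is left untouched.
df2 = df.assign(B=[1, 2])

assert "B" not in df.columns   # original unchanged
assert list(df2["B"]) == [1, 2]
```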

@devin-petersohn
Member

I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.

I agree completely. It is possible to simulate inplace behavior for legacy systems, but going forward we should encourage users to write their code in a more functional style.

There is one mutation-adjacent area that might be considered: using DataFrame.__setitem__ to add an additional column

Not to derail, but I would argue against __setitem__ in the API spec (more discussion in #6). Adding a column should be something like insert and updating a column should be perhaps update. If we do not support inplace mutations, we should do what we can in the API spec to avoid giving the idea that things are updating inplace.
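A rough sketch of what such names could look like (`insert` and `update` here are placeholder names, not part of any agreed spec), implemented functionally on top of pandas:

```python
import pandas as pd

# Hypothetical spellings for the names suggested above. Both return a
# *new* dataframe rather than mutating in place.

def insert(df: pd.DataFrame, name: str, values) -> pd.DataFrame:
    """Return a copy of `df` with a new column appended."""
    if name in df.columns:
        raise ValueError(f"column {name!r} already exists")
    return df.assign(**{name: values})

def update(df: pd.DataFrame, name: str, values) -> pd.DataFrame:
    """Return a copy of `df` with an existing column replaced."""
    if name not in df.columns:
        raise KeyError(name)
    return df.assign(**{name: values})

df = pd.DataFrame({"A": [1, 2]})
df2 = insert(df, "B", [3, 4])
df3 = update(df2, "B", [5, 6])
assert list(df.columns) == ["A"]   # original never touched
assert list(df3["B"]) == [5, 6]
```

Distinct verbs also make the error cases explicit: `insert` on an existing column and `update` on a missing one both fail loudly, where `__setitem__` silently does either.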

@TomAugspurger
Author

Sure, let's avoid the __setitem__ discussion for now and focus only on inplace expansion of columns.

I didn't state it explicitly, but I think inplace expansion of rows is out of scope, given the columnar focus.

@saulshanabrook

I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.

One possible issue with this approach is that if the DataFrame API uses the Array API as its column interface, then this could also mean that arrays must be immutable, which would mean that this kind of existing imperative code in sklearn would not work with the standard.
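The kind of imperative sklearn-style code in question looks roughly like this (an illustrative sketch, not actual sklearn source): values are overwritten in place, which presumes a mutable array type:

```python
import numpy as np

# Sketch of imperative preprocessing common in sklearn internals
# (illustrative only): missing values are imputed by writing into the
# array in place, which requires mutability.
X = np.array([[1.0, np.nan], [3.0, 4.0]])
col_means = np.nanmean(X, axis=0)   # per-column means ignoring NaN

mask = np.isnan(X)
X[mask] = np.take(col_means, np.nonzero(mask)[1])  # in-place write

assert not np.isnan(X).any()
assert X[0, 1] == 4.0   # imputed with the column mean
```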

It seems like there are at least a couple of different options here, assuming that we do want to keep supporting an imperative array API for things like sklearn to use and a functional dataframe API for things like Modin or Vaex to implement.

  1. Keep dataframes functional and arrays imperative. "An API for going from dataframe API standard -> array API standard might solve this issue." from @devin-petersohn
  2. Have two layers/levels for each API: a higher-level, messier version that allows mutation and non-functional operations, and a lower-level one where all operations are functional. The high-level dataframe would return a high-level array that allows mutation; the low-level dataframe would return the low-level array. This is in line with what Travis said: "One that is more dev-centric and another that is more user-centric. For clarity, and now that packaging is in a better situation, two libraries are often more effective than one."
  3. Instead of separating into high/low level, have separate mutable and immutable APIs. This would be like how collections.abc has Mutable versions of some of its ABCs.
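For reference, the stdlib precedent in the last option (the module is `collections.abc`) splits read-only and mutable interfaces like this:

```python
from collections.abc import Sequence, MutableSequence

# MutableSequence extends the read-only Sequence interface with
# mutation methods (__setitem__, insert, ...).
assert issubclass(MutableSequence, Sequence)

# tuple satisfies only the immutable interface; list satisfies both.
assert isinstance((1, 2), Sequence)
assert not isinstance((1, 2), MutableSequence)
assert isinstance([1, 2], MutableSequence)
```

A dataframe spec could mirror this: an immutable core interface, with a mutable extension interface for implementations that can support it.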

@devin-petersohn
Member

@TomAugspurger agree. Taking it a step further, I think no inplace updates of any kind should be supported: expansion/destruction of rows/columns, inplace point mutations, etc.

Back to your original question:

In these case, no values are actually being mutated inplace. Is that acceptable?

I believe that all APIs should result in a new dataframe, so I would argue that we don't want an API for creating a new column on an existing dataframe, or for modifying the data underlying an existing dataframe object. (Implementations are free to offer this individually, but it will be outside the spec.)

Often users chain operations together. Ideally every operation can be chained with another in the API. Example from pandas:

df.dropna(inplace=True)
df.groupby("c").count()

vs

df.dropna().groupby("c").count()

For general use, it is better not to make the user choose between inplace and not. The bottom example also has the nice benefit of being easier to debug.
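One concrete sense in which the chained form is easier to debug: every step returns a new object, so the chain can be broken at any point and the intermediate inspected:

```python
import pandas as pd

df = pd.DataFrame({"c": ["x", "x", None], "v": [1, 2, 3]})

# The chain df.dropna().groupby("c").count() can be split at any step,
# since each step returns a new dataframe and nothing is mutated.
step1 = df.dropna()
assert len(step1) == 2          # the row with the missing key was dropped

result = step1.groupby("c").count()
assert result.loc["x", "v"] == 2
```

With `inplace=True` there is no intermediate to inspect: by the time you look, the original data is already gone.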

I am generally against the idea of tightly coupling the array and dataframe APIs.

Is there a reason, other than trying to optimize for performance or memory, to make anything inplace?

@TomAugspurger
Author

TomAugspurger commented May 29, 2020 via email

@TomAugspurger
Author

Apologies for missing this @saulshanabrook

One possible issue with this approach is that if the DataFrame API uses the Array API as its column interface, then this could also mean that arrays must be immutable,

I'm still thinking this through; I'm having trouble working out the implications. I'll give it some more thought.


In general, it sounds like people prefer to avoid mutation of existing data, whether that's mutating individual values inplace or expanding an existing dataframe by adding new rows / columns. So I'll propose that we explicitly leave the following (pandas-style) APIs out of the spec:

  • df.iloc[0, 0] = 0 -- mutate an existing value inplace with __setitem__
  • df.method(inplace=True) -- Mutate an existing dataframe inplace with a method
  • df['A'] = [1, 2, 3] -- Expand a dataframe columnwise
  • df.append([1, 2]) -- Expand a dataframe rowwise

With the option that there might be a mutable level / extension to the standard, once we figure out how to handle the various levels.
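For illustration, each of the excluded APIs has a functional counterpart already available in pandas (a sketch; the eventual spec may well spell these differently):

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})

# df.iloc[0, 0] = 0  ->  rebuild the column with where(), keeping df intact
df1 = df.assign(A=df["A"].where(df.index != 0, 0.0))
assert df1.loc[0, "A"] == 0.0 and df.loc[0, "A"] == 1.0

# df.method(inplace=True)  ->  plain method call returning a new dataframe
df2 = df.dropna()

# df['B'] = [...]  ->  assign() returns a column-expanded copy
df3 = df.assign(B=[4.0, 5.0, 6.0])
assert "B" not in df.columns

# df.append(...)  ->  concat() builds a new, row-expanded dataframe
df4 = pd.concat([df, pd.DataFrame({"A": [4.0]})], ignore_index=True)
assert len(df4) == 4 and len(df) == 3
```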

@datapythonista
Member

Let me try to summarize the discussion, and see if there is agreement to make some decisions.

  • We don't want the API to provide ways to explicitly mutate information in place. My understanding is that we're talking about the API syntax, and it would still be possible for implementations to make operations in place if they want to. For example:
df2 = df1.fillna(0.)

This doesn't force the implementation to operate on df1 inplace, but it also doesn't prevent df1 from being mutated, with df2 being just a reference to df1. Nor does the syntax prevent df1 and df2 from being lazy objects with no memory actually allocated. So, for now, we want a syntax that allows copies in all operations, and we leave for later (or out of scope) what's really going on underneath.
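A toy sketch of that point: an implementation can expose copy-style syntax while sharing the underlying data and deferring all work (`LazyFrame` is a made-up name; Dask and Vaex each do this differently in practice):

```python
import pandas as pd

class LazyFrame:
    """Toy lazy dataframe: records operations, shares underlying data."""

    def __init__(self, df, ops=()):
        self._df = df      # shared, never mutated
        self._ops = ops    # pending operations

    def fillna(self, value):
        # Copy-style syntax, but nothing is copied or computed yet:
        # we just return a new handle with one more pending op.
        return LazyFrame(self._df, self._ops + (("fillna", value),))

    def collect(self):
        out = self._df
        for name, arg in self._ops:
            out = getattr(out, name)(arg)
        return out

df1 = LazyFrame(pd.DataFrame({"A": [1.0, None]}))
df2 = df1.fillna(0.0)                  # cheap: just records the op
assert df1._df is df2._df              # underlying data is shared
assert df2.collect().loc[1, "A"] == 0.0
```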

  • We want the syntax to consistently return dataframes, so methods can be chained.

  • If we want to build a friendlier API (as discussed here), we'll leave that for later (or out of scope). For now let's focus on a developer-oriented (more low-level) API, as agreed in How the API is expected to be used #18.

  • pandas mostly implements an API that allows method chaining. With an example:

import pandas

(pandas.read_csv('countries.csv')
       .rename(columns={'name': 'country'})
       .assign(area_km2=lambda df: df['area_m2'].astype(float) / 1_000_000)
       .query('(continent.str.lower() != "antarctica") | (population < area_km2)')
       .groupby('continent')
       [['population', 'area_km2']]
       .sum()
       .assign(density=lambda df: df['population'] / df['area_km2'])
       [['density']]
       .to_parquet('continent_density.parquet', index=True))

I think most operations are quite simple and non-controversial (feel free to disagree, of course). The main changes (from pandas to the standard API) would be:

  • Remove the syntax that is not consistent with this behaviour
  • Consider special cases, mainly __getitem__ (do we want to replace it with .select() and .filter() instead?)
  • How to refer to columns when creating new columns and in conditions. Is using a lambda, as pandas does, the best way? Or are there alternatives?
  • Obviously the specific methods to implement and their signatures are open to discussion
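To make the column-reference question concrete, here is the pandas lambda style next to a toy column-expression alternative (the `col` class below is a hand-rolled sketch, not a real API; polars has a production version of this idea):

```python
import pandas as pd

df = pd.DataFrame({"population": [10.0, 20.0], "area_km2": [2.0, 4.0]})

# Style 1: pandas-style lambda, as in the example above.
out1 = df.assign(density=lambda d: d["population"] / d["area_km2"])

# Style 2: a column-expression object that builds the computation
# without a lambda (toy sketch).
class col:
    def __init__(self, name):
        self.name = name
    def __truediv__(self, other):
        # Produce a callable that pandas' assign() will evaluate
        # against the dataframe.
        return lambda d: d[self.name] / d[other.name]

expr = col("population") / col("area_km2")
out2 = df.assign(density=expr)

assert list(out1["density"]) == [5.0, 5.0]
assert list(out2["density"]) == [5.0, 5.0]
```

The expression style reads more declaratively and is easier for lazy engines to inspect and optimize, at the cost of a larger API surface.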

Is there agreement in the above points, or are there things that should be discussed further?
