Mutability #10
I agree completely. It is possible to simulate
Not to derail, but I would argue against
Sure, let's avoid that. I didn't state it explicitly, but I think inplace expansion of rows is out of scope, given the columnar focus.
One possible issue with this approach is that if the DataFrame API uses the Array API as its column interface, then arrays would also have to be immutable, which would mean that this kind of existing imperative code in sklearn would not work with the standard. It seems like there are at least a couple of different options here, assuming that we do want to keep supporting an imperative array API for things like sklearn to use, and a functional dataframe API for things like Modin or Vaex to implement.
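As a minimal sketch of the kind of imperative array code at stake here (the specific names are illustrative, not taken from sklearn itself):

```python
import numpy as np

# Imperative, inplace style: the array is mutated after creation.
X = np.zeros((3, 2))
X[:, 0] = np.arange(3)   # write into an existing column of the array
X[1, 1] = 5.0            # point mutation of a single element

# If arrays were required to be immutable, patterns like the two
# assignments above would need to be rewritten functionally, e.g.
# by constructing a new array from the pieces instead.
```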
@TomAugspurger agree. Taking it a step further, I think any inplace-related updates should not be supported: expansion/destruction of rows/columns, inplace point mutations, etc. Back to your original question:

I believe that all APIs should result in a new dataframe, so I would argue that we don't want an API for creating a new column on an existing dataframe or for modifying the data underlying an existing dataframe object. (Implementations are free to have this individually, but it will be outside the spec.)

Often users chain operations together; ideally, every operation in the API can be chained with another. Example from pandas: `df.dropna(inplace=True)` followed by `df.groupby("c").count()`, vs `df.dropna().groupby("c").count()`. For general use, it is better not to make the user choose between inplace and not; the latter version also has the nice benefit of being easier to debug.

I am generally against the idea of tightly coupling the array and dataframe APIs. Is there a reason, other than trying to optimize for performance or memory, to make anything inplace?
I don't think performance should be a primary concern here, just matching (hypothetical) user expectations that something like `df['A'] = ...` would work.

I personally am completely fine leaving that out of the initial version, in favor of something that's easily supported across implementations like `df.assign(A=...)`.
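For concreteness, a small sketch of the two styles being contrasted (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"B": [1, 2, 3]})

# Mutating style (the pattern under discussion): df["A"] = df["B"] * 2
# Functional style: assign returns a new DataFrame; df itself is unchanged.
df2 = df.assign(A=df["B"] * 2)
```

After this, `df` still has only column `B`, while `df2` carries both `B` and the new `A`.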
Apologies for missing this @saulshanabrook
I'm still thinking this through... I'm having trouble working through the implications, so I'll give it some more thought.

In general, it sounds like people prefer to avoid mutation of existing data, whether that's mutating individual values inplace or expanding an existing dataframe by adding new rows/columns. So I'll propose that we explicitly leave the following (pandas-style) APIs out of the spec
With the option that there might be a mutable level / extension to the standard, once we figure out how to handle the various levels.
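To illustrate the split being proposed, here is a sketch of an inplace-style pandas call mentioned in this thread next to its functional equivalent (a toy frame; only the `dropna` pair is shown):

```python
import pandas as pd

df = pd.DataFrame({"c": [1.0, None, 2.0]})

# Inplace style: mutates the frame it is called on and returns None.
df_inplace = df.copy()
df_inplace.dropna(inplace=True)

# Functional style: returns a new frame; the input is left unchanged.
cleaned = df.dropna()
```

Under the proposal above, only the functional form would be part of the spec; the inplace variant would be an implementation-specific extra.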
Let me try to summarize the discussion, and see if there is agreement to make some decisions.
This doesn't force the implementation to operate on
import pandas

(pandas.read_csv('countries.csv')
 .rename(columns={'name': 'country'})
 .assign(area_km2=lambda df: df['area_m2'].astype(float) / 1_000_000)
 .query('(continent.str.lower() != "antarctica") | (population < area_km2)')
 .groupby('continent')
 [['population', 'area_km2']]
 .sum()
 .assign(density=lambda df: df['population'] / df['area_km2'])
 [['density']]
 .to_parquet('continent_density.parquet', index=True))

I think most operations are quite simple and non-controversial (feel free to disagree, of course). I think the main changes (from pandas to the standard API) would be:
Is there agreement on the above points, or are there things that should be discussed further?
On the call yesterday, the topic of mutability came up in the vaex demo.
The short version is that it may be difficult or impossible for some systems to implement inplace mutation of dataframes. For example, I believe that neither vaex nor Dask implement the following:
I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.
There is one mutation-adjacent area that might be considered: using `DataFrame.__setitem__` to add an additional column, or perhaps to update the contents of an entire column.
In these cases, no values are actually being mutated inplace. Is that acceptable?
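A small sketch of the two `__setitem__` uses in question (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# __setitem__ used to add a brand-new column:
df["B"] = df["A"] * 10

# __setitem__ used to replace the contents of an entire existing column:
df["A"] = df["A"] + 1

# In both cases a whole new column object can be bound to the name;
# no individual values of an existing column need to be mutated.
```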