API: DataFrame.assign #86

Closed
jbrockmendel opened this issue Sep 20, 2022 · 9 comments

Comments

@jbrockmendel
Contributor

The pandas implementation is very simple:

def assign(self, **kwargs) -> DataFrame:
    obj = self.copy()
    for key, value in kwargs.items():
        obj[key] = value
    return obj

We can't just re-use this directly bc we don't have __setitem__. Also, if assign is the main setitem-like method, it'd be nice to retain the option of supporting non-string keys. So maybe something like def assign(self, key: Hashable, value: ??) -> DataFrame: ...? If we have a Column object, could use that for value?

@rgommers
Member

> We can't just re-use this directly bc we don't have __setitem__.

I had to refresh my memory on that one - that was discussed in gh-10, and the arguments still seem convincing.

> Also, if assign is the main setitem-like method, it'd be nice to retain the option of supporting non-string keys

Isn't that again a "sure that's allowed in an implementation, the standard only deals with unique strings, but doesn't forbid going beyond that"?

key in the spec can be str I think, and libraries are free to extend it. value is indeed a little less obvious. Maybe both a 1-column dataframe and a Column should work? Or is there a more preferred method to add the former?

@jbrockmendel
Contributor Author

> Isn't that again a "sure that's allowed in an implementation, the standard only deals with unique strings, but doesn't forbid going beyond that"?

I was specifically referring to the signature def assign(self, **kwargs) -> DataFrame:, which doesn't allow for non-string keys, whereas def assign(self, key: Hashable, value: whatever) would.
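
To make the signature point concrete, here is a minimal sketch. Both functions are hypothetical stand-ins, not part of pandas or the standard; they only return the columns that would be added. Python rejects non-string keys when a mapping is unpacked into **kwargs, whereas an explicit key parameter accepts any hashable:

```python
from typing import Any, Hashable


def assign_kwargs(**kwargs: Any) -> dict:
    """Hypothetical stand-in for a **kwargs-style assign."""
    return dict(kwargs)


def assign_explicit(key: Hashable, value: Any) -> dict:
    """Hypothetical stand-in for an explicit (key, value) assign."""
    return {key: value}


# String keys work with both styles.
assert assign_kwargs(a=[1, 2]) == {"a": [1, 2]}
assert assign_explicit("a", [1, 2]) == {"a": [1, 2]}

# A non-string key cannot be passed through **kwargs ...
try:
    assign_kwargs(**{0: [1, 2]})
except TypeError as exc:
    kwargs_error = exc  # keyword names must be strings
else:
    kwargs_error = None

# ... but an explicit key parameter accepts any hashable, e.g. a tuple.
tuple_key = assign_explicit(("x", 1), [1, 2])
```

With the explicit form, the standard can still restrict key to str while leaving room for implementations to accept more.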

> Maybe both a 1-column dataframe and a Column should work? Or is there a more preferred method to add the former?

I'm not sure how settled the existence of a Column is, but conditional on that being agreed upon, I'd stick with just that. Also would exclude scalars.

@rgommers
Member

> I was specifically referring to the signature def assign(self, **kwargs) -> DataFrame:, which doesn't allow for non-string keys, whereas def assign(self, key: Hashable, value: whatever) would.

Okay, got it. **kwargs should be avoided wherever possible either way I'd think. The signature could also be even more specific:

def assign(self, key: str, value: column):

The Hashable can be an outside-the-standard addition, that is what I was getting at.

@mwaskom
Contributor

mwaskom commented Oct 23, 2022

Isn't the pandas implementation (slightly) more complicated than that because val can be a callable?

@jbrockmendel
Contributor Author

> Isn't the pandas implementation (slightly) more complicated than that because val can be a callable?

I guess. I don't think there's any interest in supporting that for the DataFrame API here.
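
For readers unfamiliar with the callable case, the following is a toy sketch of what it adds. It uses a plain dict of lists as a stand-in for a dataframe (the Frame alias and this assign function are illustrative assumptions, not pandas code): a callable value is evaluated against the frame built so far, so later keyword arguments can refer to columns created by earlier ones.

```python
from typing import Any, Callable, Dict, List, Union

# Toy stand-in for a dataframe: column name -> list of values.
Frame = Dict[str, List[Any]]
Value = Union[List[Any], Callable[[Frame], List[Any]]]


def assign(frame: Frame, **kwargs: Value) -> Frame:
    """Toy pandas-style assign that also accepts callables."""
    # Copy the columns so the input frame is never mutated.
    out = {name: list(col) for name, col in frame.items()}
    for key, value in kwargs.items():
        # A callable receives the frame built so far and returns the column.
        out[key] = value(out) if callable(value) else list(value)
    return out


df = {"a": [1, 2, 3]}
result = assign(
    df,
    b=lambda d: [x * 2 for x in d["a"]],
    c=lambda d: [x + 1 for x in d["b"]],  # refers to the new column "b"
)
```

Leaving this out of the standard keeps value to a single type, at the cost of users having to compute the column before passing it in.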

@jbrockmendel
Contributor Author

Summarizing the discussion from the call 2022-09-29, the agreed-upon API looks like:

def insert(self, loc: int, label: str, value: Column) -> DataFrame: ...
def drop_column(self, label: str) -> DataFrame: ...
def set_column(self, label: str, value: Column) -> DataFrame: ...

insert would insert the new column at the given position, akin to list.insert.
drop_column would behave like pd.DataFrame.drop(columns=label)
set_column would behave like calling df.assign(**{label: value}) on a pandas DataFrame (potential copy semantics notwithstanding)

There was discussion of plural analogues drop_columns and set_columns that would accept sequences. No one objected to these, but they were not made canonical.
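
As an illustration of the three signatures above, here is a minimal sketch on a toy, non-mutating frame. The Column and DataFrame classes below are hypothetical stand-ins, and the error handling (raising on a duplicate label in insert or a missing label in drop_column) is an assumption that the summary does not settle:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Column:
    """Hypothetical minimal column: an immutable tuple of values."""
    values: Tuple[Any, ...]


class DataFrame:
    """Toy frame with ordered columns; every method returns a new frame."""

    def __init__(self, columns: Dict[str, Column]) -> None:
        self._columns = dict(columns)  # defensive copy

    @property
    def column_names(self) -> List[str]:
        return list(self._columns)

    def insert(self, loc: int, label: str, value: Column) -> "DataFrame":
        if label in self._columns:  # assumption: duplicate labels are an error
            raise ValueError(f"column {label!r} already exists")
        items = list(self._columns.items())
        items.insert(loc, (label, value))  # same position rules as list.insert
        return DataFrame(dict(items))

    def drop_column(self, label: str) -> "DataFrame":
        if label not in self._columns:  # assumption: missing labels are an error
            raise KeyError(label)
        return DataFrame({k: v for k, v in self._columns.items() if k != label})

    def set_column(self, label: str, value: Column) -> "DataFrame":
        new = dict(self._columns)
        new[label] = value  # replaces an existing column, otherwise appends
        return DataFrame(new)


df = DataFrame({"a": Column((1, 2)), "b": Column((3, 4))})
df2 = df.insert(1, "x", Column((9, 9)))
df3 = df2.drop_column("a")
df4 = df3.set_column("b", Column((0, 0)))
```

Because each call returns a new object, df is unchanged after the whole chain, which matches the goal of avoiding mutation in the API.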

@mwaskom
Contributor

mwaskom commented Oct 29, 2022

Are the discussion notes meant to be private? I'd like to follow along with what's being decided here...

@rgommers
Member

rgommers commented Nov 9, 2022

> Are the discussion notes meant to be private? I'd like to follow along with what's being decided here...

@mwaskom good question. Those notes are more a literal transcript than regular meeting notes; those transcripts and the video recordings are meant to be private. The alternative is more "don't make them" rather than publishing them - I think of these sort of like hallway conversations or BoFs at conferences, recording such events would change the dynamics a lot. However, it should definitely be possible to follow along and participate. Anything discussed should be summarized with the key rationales for proposed decisions, so that we both have a public record and give everyone a chance to jump in and provide arguments for going a different route. And we should avoid linking to private notes, just like in any other open source project.

In this case it looks like there were no major conceptual issues here. A summary of the most important points made:

  • we want to avoid mutation in the API, and don't have __setitem__ or an API to delete columns in place
  • regarding naming: explicit is preferred over implicit, as is consistency in naming between similar methods. Therefore, after some back and forth, there was a preference for drop_column over drop.
  • the types that value and label take are important considerations as well. It'd be great to not have as many types as is typical in pandas (or NumPy), because that makes the API harder to implement, static typing difficult, etc. - it has generally proven to be a better idea to have APIs with clean static typing.
    • value, for example, could have been a list/iterable, or a 1-column dataframe. However, there was a preference to keep it simpler, so only Column.

@MarcoGorelli
Contributor

closing as there's now dataframe.assign
