Skip to content

Suggestion: Make assign accepts list of dictionaries #18797

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
datajanko opened this issue Dec 15, 2017 · 7 comments · Fixed by #18852
Closed

Suggestion: Make assign accepts list of dictionaries #18797

datajanko opened this issue Dec 15, 2017 · 7 comments · Fixed by #18852
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@datajanko
Copy link
Contributor

datajanko commented Dec 15, 2017

Problem description

I really like the assign function and it's ability to be applied in pipelines.
However, if you pass a dictionary via prefixed by **, the dictionary must only contain columns that already exist in the preceeding dataframe. So in a dataframe, that contains column 'A', and I want to construct column B as f(A) and column C = g(A, B), im forced to do

# Your code here
pd.DataFrame({'A':[1, 2]}).assign(**{'B': lambda x:f(x['A'])}).assign(**{'C': lambda x:g(x['A'], x['B])})

       A    B     C
0      1    f(1)  g(1, f(1))
1      2    f(2)  g(2, f(2))

for some f and g and obtain a result like seen above. In extreme cases, this can lead to a lot of chained assign statements.

For convenience we could change the signature slightly to accept also *args, but every element in args should be such that the original assign function could be applied. In particular args could be a list of dictionaries.

With this, we could write the previous code as

# Your code here
pd.DataFrame({'A':[1, 2]}).assign(*[{'B': lambda x:f(x['A']),{'C': lambda x:g(x['A'], x['B])}])

Of course (also as it is right now) the user is responsible to construct a correct "computational graph" here. Additionally, the implementation I currently think of would use len(args) (-1 intermediate) copies of the original dataframe. However, using the stacked procedure above, this also happens.

Thus, we obtain a simpler syntactic way of using assign and we don't break the original implementation.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: None
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@chris-b1
Copy link
Contributor

In my opinion, a cleaner solution to this would be #14207 (taking advantage of dict ordered-ness in 3.6+)

@TomAugspurger
Copy link
Contributor

Agreed with @chris-b1.

@datajanko
Copy link
Contributor Author

datajanko commented Dec 15, 2017

Actually, one of the first things I checked was if the dict-orderedness can be used for this task. Unfortunately, this is not the case.

However, if I understand the current source code correct, the naive implementation to use the dict-orderedness must perform a full data frame copy after each added keyword. In fact, in the worst case (e.g. as described above), this will inevitably be necessary. So for this case, we keep the notation simple and there is no speed tradeoff.
But what if we have lots of assign statements that only depend on the already existing columns? Then, it seems, that we must perform a copy after each computed column.This could degrade performance in many (major) use cases.

@TomAugspurger
Copy link
Contributor

However, if I understand the current source code correct, the naive implementation to use the dict-orderedness must perform a full data frame copy after each added keyword.

Not sure what you mean here, but df.assign(a=1) and df.assign(a=1, b=2) each perform one copy.

The current implementation does two steps:

  1. all the computation into an intermediate dictionary
  2. all the assignments from that intermediate dictionary

If we allowed dependent assignment, we would interleave those two, compute and assign for each key - value pair. There would still be a single copy per .assign, but you won't have to break up assignments into multiple as often (thus saving copies).

@datajanko
Copy link
Contributor Author

Sorry for not being precise enough:

my original idea was to wrap a for loop around

pandas/pandas/core/frame.py

Lines 2700 to 2715 in c28b624

data = self.copy()
# do all calculations first...
results = OrderedDict()
for k, v in kwargs.items():
results[k] = com._apply_if_callable(v, data)
# preserve order for 3.6 and later, but sort by key for 3.5 and earlier
if PY36:
results = results.items()
else:
results = sorted(results.items())
# ... and then assign
for k, v in results:
data[k] = v
return data

this would result in only a copy for each "list entry" of my suggestion.

Anyway, the "do all calculations first" section, will not be "valid" anymore for python 3.6, right?
for python 3.6, this could look like

for k, v in kwargs.items():
    data[k] = = com._apply_if_callable(v, data)

for older versions of python, everything should remain the same. So the version distinction should be made directly after the copy statement, right?
Mind, if I make a pull request for that in the next couple of days?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 16, 2017

this would result in only a copy for each "list entry" of my suggestion.

Thanks for the clarification, sorry I didn't get that sooner.

So the version distinction should be made directly after the copy statement, right? Mind, if I make a pull request for that in the next couple of days?

Yeah, that sounds about right. Please do!

@datajanko
Copy link
Contributor Author

I'm also sorry, for not explaining my thought process better. You couldn't know what I know/knew about this issues, specifically since my contribution track record is almost non existing ;)

So, I just rearranged some of the tests and also some of the code.
It works fine if the arguments are callables. However, the shortened notation we used (df.assign(a=1, b=2)) can turn out to be df.assign(b=df.a, c=df.b), then of course df.b is not defined at the moment of the definition of the kwargs dictionary. Using callables has simply the advantage of evaluating the column names lazily.

Would it be okay, to only work well with callables?

datajanko pushed a commit to datajanko/pandas that referenced this issue Dec 18, 2017
Specifically, 'df.assign(b=1, c=lambda x:x['b'])'
does not throw an exception in python 3.6 and above.
Further details are discussed in Issues pandas-dev#14207 and pandas-dev#18797.
datajanko pushed a commit to datajanko/pandas that referenced this issue Dec 19, 2017
Specifically, 'df.assign(b=1, c=lambda x:x['b'])'
does not throw an exception in python 3.6 and above.
Further details are discussed in Issues pandas-dev#14207 and pandas-dev#18797.
datajanko pushed a commit to datajanko/pandas that referenced this issue Dec 23, 2017
Specifically, 'df.assign(b=1, c=lambda x:x['b'])'
does not throw an exception in python 3.6 and above.
Further details are discussed in Issues pandas-dev#14207 and pandas-dev#18797.
@jreback jreback added API Design Indexing Related to indexing on series/frames, not to indexes themselves labels Dec 23, 2017
@jreback jreback added this to the 0.23.0 milestone Dec 23, 2017
datajanko pushed a commit to datajanko/pandas that referenced this issue Jan 9, 2018
Specifically, 'df.assign(b=1, c=lambda x:x['b'])'
does not throw an exception in python 3.6 and above.
Further details are discussed in Issues pandas-dev#14207 and pandas-dev#18797.
datajanko pushed a commit to datajanko/pandas that referenced this issue Feb 9, 2018
Specifically, 'df.assign(b=1, c=lambda x:x['b'])'
does not throw an exception in python 3.6 and above.
Further details are discussed in Issues pandas-dev#14207 and pandas-dev#18797.

populates dsintro and frame.py with examples and warning

- adds example to frame.py
- reworked warning in dsintro
- reworked Notes in frame.py

Remains open:

frame.py probably is responsible vor travis not passing: doc test that requires python 3.6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants