Suggestion: Make assign accept a list of dictionaries #18797
Comments
In my opinion, a cleaner solution to this would be #14207 (taking advantage of dict ordered-ness in Python 3.6+).
Agreed with @chris-b1.
Actually, one of the first things I checked was whether the dict-orderedness can be used for this task. Unfortunately, this is not the case. However, if I understand the current source code correctly, the naive implementation that uses the dict-orderedness must perform a full data frame copy after each added keyword. In fact, in the worst case (e.g. as described above), this will inevitably be necessary. So for this case, we keep the notation simple and there is no speed tradeoff.
Not sure what you mean here, but the current implementation does two steps:
If we allowed dependent assignment, we would interleave those two: compute and assign for each key-value pair. There would still be a single copy per
Sorry for not being precise enough: my original idea was to wrap a for loop around lines 2700 to 2715 in c28b624.
This would result in only one copy for each "list entry" of my suggestion. Anyway, the "do all calculations first" section will no longer be "valid" for Python 3.6, right?
For older versions of Python, everything should remain the same. So the version distinction should be made directly after the copy statement, right?
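To make the interleaved "compute and assign per key" idea concrete, here is a minimal sketch. It is not the pandas source; `assign_interleaved` is a hypothetical name used only for illustration, and it assumes Python 3.6+ so that keyword arguments keep their call order.

```python
import pandas as pd


def assign_interleaved(df, **kwargs):
    """Hypothetical helper: compute and assign each keyword in call order."""
    data = df.copy()  # a single copy per call, made up front
    for name, value in kwargs.items():  # kwargs preserve order on Python 3.6+
        # A callable is evaluated against the frame built so far,
        # so it can see columns created earlier in the same call.
        data[name] = value(data) if callable(value) else value
    return data


df = pd.DataFrame({"A": [1, 2, 3]})
out = assign_interleaved(
    df,
    B=lambda x: x["A"] * 2,
    C=lambda x: x["A"] + x["B"],  # depends on B from the same call
)
```

The original frame is left untouched because all assignments happen on the copy.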
Thanks for the clarification, sorry I didn't get that sooner.
Yeah, that sounds about right. Please do!
I'm also sorry for not explaining my thought process better. You couldn't know what I know/knew about these issues, especially since my contribution track record is almost nonexistent ;) So, I just rearranged some of the tests and also some of the code. Would it be okay to only work well with callables?
Specifically, 'df.assign(b=1, c=lambda x:x['b'])' does not throw an exception in python 3.6 and above. Further details are discussed in Issues pandas-dev#14207 and pandas-dev#18797.
Populates dsintro and frame.py with examples and a warning: adds an example to frame.py, reworks the warning in dsintro, and reworks the Notes in frame.py. Remains open: frame.py is probably responsible for Travis not passing (a doc test that requires Python 3.6).
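For reference, the behavior named in the commit message can be checked with a short snippet. This assumes a pandas release that includes the change (0.23 or later, as far as I know) running on Python 3.6+; the frame contents are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# 'b' is created in the same call, and the callable for 'c' can already see it.
out = df.assign(b=1, c=lambda x: x["b"])
```

On older pandas or Python versions the same call would be expected to fail, because 'b' does not exist yet when the callable is evaluated.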
Problem description
I really like the assign function and its ability to be applied in pipelines.
However, if you pass a dictionary prefixed by **, the dictionary may only reference columns that already exist in the preceding dataframe. So in a dataframe that contains column 'A', if I want to construct column B as f(A) and column C as g(A, B), I'm forced to do
for some f and g, and obtain a result like the one shown above. In extreme cases, this can lead to a lot of chained assign statements.
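A minimal sketch of the chained form described here, for readers without the original snippet. The bodies of `f` and `g` and the sample data are placeholders chosen for illustration; they are not from the original post.

```python
import pandas as pd


def f(a):
    # Placeholder for "some f": any function of column A.
    return a * 2


def g(a, b):
    # Placeholder for "some g": any function of columns A and B.
    return a + b


df = pd.DataFrame({"A": [1, 2, 3]})

# Each dependent column needs its own assign call.
result = (
    df.assign(**{"B": lambda x: f(x["A"])})
      .assign(**{"C": lambda x: g(x["A"], x["B"])})
)
```

Every extra level of dependency adds another `.assign(...)` link to the chain.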
For convenience we could change the signature slightly to also accept *args, where every element in args must be something the original assign function could be applied to. In particular, args could be a list of dictionaries.
With this, we could write the previous code as
Of course (as is already the case now) the user is responsible for constructing a correct "computational graph" here. Additionally, the implementation I currently have in mind would use len(args) copies of the original dataframe (len(args) - 1 of them intermediate). However, the stacked procedure above incurs the same copies.
Thus, we obtain a simpler syntactic way of using assign and we don't break the original implementation.
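The proposal can be approximated today with a small wrapper. This is only a sketch of the idea, not a pandas API: `assign_steps` is a hypothetical name, and it applies each dictionary as its own `assign` call, so it makes one copy per dictionary, as discussed above.

```python
import pandas as pd


def assign_steps(df, *steps):
    """Hypothetical helper: apply each dict in `steps` as a separate assign."""
    for step in steps:
        # Each call copies the frame, so later dicts see earlier columns.
        df = df.assign(**step)
    return df


df = pd.DataFrame({"A": [1, 2, 3]})

result = assign_steps(
    df,
    {"B": lambda x: x["A"] * 2},
    {"C": lambda x: x["A"] + x["B"]},  # may use B from the previous dict
)

# The helper also fits into a pipeline via DataFrame.pipe.
piped = df.pipe(assign_steps, {"B": lambda x: x["A"] * 2})
```

Keeping the steps as separate dictionaries makes the dependency order explicit without relying on dictionary or keyword ordering.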
Output of pd.show_versions()
pandas: 0.20.3
pytest: 3.2.1
pip: None
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None