QST: is the new behavior of df.apply(my_func, axis=1) in v1.1.0 intended? #35483

Closed
2 tasks done
manihamidi opened this issue Jul 30, 2020 · 2 comments
Labels
Duplicate Report Duplicate issue or pull request

Comments

manihamidi commented Jul 30, 2020

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.


Question about pandas

import pandas as pd

def test_func(row):
    # Writes two new entries into the row object that apply passes in
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})
df.apply(test_func, axis=1)

Run on pandas 1.1.0, the code above returns:

   a  b   c  d
0  1  i  1i  2
1  1  i  1i  2
2  1  i  1i  2

While in pandas 1.0.5 it returns:

   a  b   c  d
0  1  i  1i  2
1  2  j  2j  3
2  3  k  3k  4

Using Python 3.8.3 and IPython 7.16.1.

The Question:

❓ What is the right way of getting the v1.0.5 behavior in v1.1.0?

I did see this release note, but I can't tell whether this behavior is an intended or unintended side effect of it: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.1.0.html#apply-and-applymap-on-dataframe-evaluates-first-row-column-only-once

Thanks.

manihamidi added the Needs Triage and Usage Question labels Jul 30, 2020
Member

rhshadrach commented Jul 31, 2020

In general, one should not mutate a container while iterating over it. Copy the row before modifying it:

def test_func(row):
    # Work on a copy so the object handed over by apply is left untouched
    row = row.copy()
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row

gives

   a  b   c  d
0  1  i  1i  2
1  2  j  2j  3
2  3  k  3k  4
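
Another way to stay clear of mutation, sketched here as an illustration rather than something proposed in this thread, is to have the function return only the new values and join them back onto the frame. The name `derive_columns` is hypothetical, and the sketch assumes the frame has a unique index so that `join` lines the rows up one to one.

```python
import pandas as pd


def derive_columns(row):
    # Hypothetical helper: builds a new Series and never writes into `row`.
    return pd.Series({"c": str(row["a"]) + str(row["b"]), "d": row["a"] + 1})


df = pd.DataFrame({"a": [1, 2, 3], "b": ["i", "j", "k"]})

# With axis=1, the Series returned for each row become the rows of a new
# frame whose columns are "c" and "d".
new_cols = df.apply(derive_columns, axis=1)

# join aligns on the index and returns a new frame; df itself is unchanged.
result = df.join(new_cols)
print(result)
```

Because nothing is written into `row`, the result should not depend on whether pandas hands the function a fresh object for each row.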

Of course, the vectorized version of this will be much faster:

%%timeit

df['c'] = df['a'].astype(str) + df['b']
df['d'] = df['a'] + 1

This takes 564 µs ± 5.97 µs per loop, whereas your version takes 5.34 ms ± 16.9 µs per loop.
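
`%%timeit` is an IPython cell magic, so the comparison cannot be pasted into a plain Python script as is. A rough sketch of the same comparison using the standard-library `timeit` module might look like the following. The function names `row_wise`, `add_cols`, and `vectorized` are made up for this example, and the vectorized variant uses `DataFrame.assign` instead of in-place column assignment so that every timing run starts from the same two-column input. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["i", "j", "k"]})


def row_wise(frame):
    def add_cols(row):
        row = row.copy()  # work on a copy, as in the fix above
        row["c"] = str(row["a"]) + str(row["b"])
        row["d"] = row["a"] + 1
        return row

    return frame.apply(add_cols, axis=1)


def vectorized(frame):
    # assign returns a new frame and leaves `frame` untouched.
    return frame.assign(c=frame["a"].astype(str) + frame["b"], d=frame["a"] + 1)


# Sanity check: both approaches should produce the same values.
assert row_wise(df).values.tolist() == vectorized(df).values.tolist()

n = 200  # number of calls to average over
t_row = timeit.timeit(lambda: row_wise(df), number=n) / n
t_vec = timeit.timeit(lambda: vectorized(df), number=n) / n
print(f"row-wise:   {t_row * 1e6:.0f} µs per call")
print(f"vectorized: {t_vec * 1e6:.0f} µs per call")
```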

@simonjayhawkins
Member

Thanks @manihamidi for the report. Same issue as #35462 so closing as duplicate.

simonjayhawkins added the Duplicate Report label and removed the Needs Triage and Usage Question labels Jul 31, 2020
simonjayhawkins added this to the No action milestone Jul 31, 2020