
BUG: list output incorrectly reused when calling DataFrame.apply #54250


Open
3 tasks done
pkch opened this issue Jul 25, 2023 · 2 comments
Labels
Apply (Apply, Aggregate, Transform, Map); Bug; Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.)

Comments

@pkch

pkch commented Jul 25, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
print(df.apply(lambda s: [s], axis=1))

Issue Description

Pandas seems to incorrectly reuse the output list of the function passed to pd.DataFrame.apply().

Expected Behavior

Expected 3 different rows in the output. Instead got 3 identical rows.

Installed Versions

INSTALLED VERSIONS

commit : b5958ee
python : 3.10.11.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-smp-931.25.0.0
Version : #1 [v4.15.0-931.25.0.0] SMP @1683922095
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.1.5
numpy : 1.24.1
pytz : 2023.3
dateutil : 2.8.1
pip : None
setuptools : None
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 3.2.3
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2023.3.0
fastparquet : None
gcsfs : None
matplotlib : 3.6.1
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : 0.12.0
pyarrow : 10.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.8.1
sqlalchemy : None
tables : 3.6.2-dev
tabulate : 0.8.10
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

pkch added the Bug and Needs Triage labels on Jul 25, 2023
@dicristina
Contributor

The function in the example receives a Series corresponding to each row (this is what df.apply(..., axis=1) does). The problem is that internally the same Series object is reused and its values are swapped out on each iteration, so every list the function returns ends up holding a reference to that one object. I think this was introduced in #34909.

pandas/pandas/core/apply.py

Lines 1096 to 1101 in 0e8c730

for arr, name in zip(values, self.index):
    # GH#35462 re-pin mgr in case setitem changed it
    ser._mgr = mgr
    mgr.set_values(arr)
    object.__setattr__(ser, "_name", name)
    yield ser
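
The aliasing can be reproduced without pandas. The following is only an analogy, not the pandas implementation: `Row` and `row_generator` are hypothetical names standing in for the reused Series and the generator in the excerpt above.

```python
class Row:
    """Hypothetical stand-in for the Series object that gets reused."""

    def __init__(self):
        self.name = None
        self.values = None

    def __repr__(self):
        return f"Row(name={self.name!r}, values={self.values!r})"


def row_generator(rows):
    row = Row()  # created once, like `ser` in the excerpt
    for name, values in enumerate(rows):
        row.name = name  # mutated in place on every iteration
        row.values = values
        yield row


data = [(1, "x"), (2, "y"), (3, "z")]

# Storing the yielded object itself keeps three references to one Row,
# so all entries show the state left by the last iteration.
aliased = [[row] for row in row_generator(data)]

# Snapshotting the state during the iteration keeps the rows distinct.
snapshots = [[(row.name, row.values)] for row in row_generator(data)]

print(aliased)
print(snapshots)
```

Every entry of `aliased` points at the same object, which is the pattern behind the three identical rows in the report.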

When the function returns a Series, the problem is avoided by making a copy, but that branch is not taken in the example code because the function returns a list.

pandas/pandas/core/apply.py

Lines 971 to 977 in 0e8c730

for i, v in enumerate(series_gen):
    # ignore SettingWithCopy here in case the user mutates
    results[i] = self.func(v, *self.args, **self.kwargs)
    if isinstance(results[i], ABCSeries):
        # If we have a view on v, we need to make a copy because
        # series_generator will swap out the underlying data
        results[i] = results[i].copy(deep=False)
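
For comparison, here is a small sketch of the case that the copy above protects, using the DataFrame from the report: the function returns the row Series itself rather than a list wrapping it. This only exercises the public API and assumes nothing about the internals beyond what the excerpt shows.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# The function returns a Series, so each result is copied before the
# generator swaps in the next row's data.
result = df.apply(lambda s: s, axis=1)
print(result)
```

The rows stay distinct here, in contrast to the list-returning function from the report.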

A trivial workaround would be returning [s.copy()] instead of [s].
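
A minimal sketch of that workaround, applied to the DataFrame from the report. The `rows` variable is only there for inspection and is not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Copy the row before storing it so later iterations cannot overwrite it.
result = df.apply(lambda s: [s.copy()], axis=1)

# Each element of `result` is a one-item list holding an independent Series.
rows = [cell[0].tolist() for cell in result]
print(rows)
```

Because each list holds its own copy, the three rows no longer depend on whether the Series passed to the function is reused internally.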

mroeschke added the Apply and Nested Data labels and removed the Needs Triage label on Jul 17, 2024
@toihr

toihr commented Jul 23, 2024

Bumping this up. It took me a long time to identify this issue; I thought everything else in my pipeline was wrong until I examined the apply function. I don't think this is how it should work.


4 participants