DataFrame Groupby Apply Returning Unexpected Result? #12652

Closed
jaradc opened this issue Mar 17, 2016 · 6 comments · Fixed by #41431
Labels
good first issue, Needs Tests (Unit test(s) needed to prevent regressions)
Comments

@jaradc

jaradc commented Mar 17, 2016

Why doesn't applying the addColumn function to a DataFrameGroupBy object return the expected output below?

Code Sample, a copy-pasteable example if possible

import pandas as pd

df = pd.DataFrame({
  ('C', 'julian'): [0.258185, 0.52591899999999991, 0.17491099999999998, 0.94083099999999997, 0.70193700000000003, 0.189361, 0.90364500000000003, 0.56848199999999993, 0.44919799999999993, 0.39054899999999998],
  ('B', 'geoffrey'): [0.27970200000000001, 0.54119799999999996, 0.36436499999999999, 0.73802900000000005, 0.85527000000000009, 0.37441099999999999, 0.87378500000000003, 0.062140000000000001, 0.008404, 0.171458], 
  ('A', 'julian'): [0.20408199999999999, 0.263235, 0.196243, 0.52878500000000006, 0.85351699999999997, 0.23979699999999998, 0.98073399999999999, 0.59194199999999997, 0.81816699999999998, 0.21742399999999998], 
  ('B', 'julian'): [0.79572500000000002, 0.507324, 0.65340799999999999, 0.65416000000000007, 0.803087, 0.94354400000000005, 0.85009699999999988, 0.56629799999999997, 0.28205000000000002, 0.47193299999999999], 
  ('A', 'geoffrey'): [0.073676000000000005, 0.096733, 0.028613, 0.831569, 0.26324999999999998, 0.069519000000000011, 0.29041400000000001, 0.088387000000000007, 0.061483000000000003, 0.42760200000000004], 
  ('C', 'geoffrey'): [0.25811200000000001, 0.75765199999999999, 0.92473300000000003, 0.29447299999999998, 0.26469799999999999, 0.84664699999999993, 0.11871300000000001, 0.87206399999999995, 0.65837000000000001, 0.23442600000000002]},
  columns=pd.MultiIndex.from_tuples([('A','julian'),('A','geoffrey'), ('B','julian'),('B','geoffrey'), ('C','julian'),('C','geoffrey')]))

def addColumn(grouped):
  name = grouped.columns[0][1]
  grouped['sum', name] = grouped.sum(axis=1)
  #print(grouped)
  return grouped

result = df.groupby(level=1, axis=1).apply(addColumn)

Expected Output

      A         B         C       sum         A         B         C       sum  
   geoffrey  geoffrey  geoffrey  geoffrey    julian    julian    julian    julian  
0  0.073676  0.279702  0.258112  0.611491  0.204082  0.795725  0.258185  1.257992  
1  0.096733  0.541198  0.757652  1.395584  0.263235  0.507324  0.525919  1.296478  
2  0.028613  0.364365  0.924733  1.317710  0.196243  0.653408  0.174911  1.024561  
3  0.831569  0.738029  0.294473  1.864071  0.528785  0.654160  0.940831  2.123776  
4  0.263250  0.855270  0.264698  1.383219  0.853517  0.803087  0.701937  2.358542  
5  0.069519  0.374411  0.846647  1.290578  0.239797  0.943544  0.189361  1.372703  
6  0.290414  0.873785  0.118713  1.282912  0.980734  0.850097  0.903645  2.734476  
7  0.088387  0.062140  0.872064  1.022590  0.591942  0.566298  0.568482  1.726721  
8  0.061483  0.008404  0.658370  0.728257  0.818167  0.282050  0.449198  1.549415  
9  0.427602  0.171458  0.234426  0.833486  0.217424  0.471933  0.390549  1.079906  

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: None
pip: None
setuptools: 15.0
Cython: 0.22
numpy: 1.9.3
scipy: 0.16.0c1
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: None
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.7.3
lxml: 3.4.2
bs4: 4.1.0
html5lib: 0.999
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
Jinja2: 2.8

@jreback
Contributor

jreback commented Mar 17, 2016

Your post is not copy-pastable; something is wrong in the formatting.

Further, you REALLY REALLY should not be mutating INSIDE an .apply. We have to ban this; it is wholly bad practice.
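
For illustration only, one way to get the per-name sums without mutating anything inside .apply might look like the sketch below. It is not taken from this thread: the small stand-in frame and the names sums and result are assumptions, and it transposes before grouping because newer pandas versions deprecate groupby(axis=1).

```python
import numpy as np
import pandas as pd

# Small stand-in frame (assumed data) with the same two-level column layout
# as the report: first level A/B/C, second level julian/geoffrey.
columns = pd.MultiIndex.from_product([["A", "B", "C"], ["julian", "geoffrey"]])
df = pd.DataFrame(np.arange(12, dtype=float).reshape(2, 6), columns=columns)

# Sum across the columns that share a second-level label.
# Grouping the transposed frame by row level avoids groupby(axis=1).
sums = df.T.groupby(level=1).sum().T

# Relabel the sums as ('sum', <name>) so they fit the column MultiIndex.
sums.columns = pd.MultiIndex.from_product([["sum"], sums.columns])

# Build a new frame instead of mutating each group, then sort the columns
# by name first so each person's A, B, C and sum columns sit together.
result = pd.concat([df, sums], axis=1).sort_index(axis=1, level=[1, 0])

print(result)
```

The original df is left untouched here; the sum columns only exist in result.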

@jaradc
Author

jaradc commented Mar 17, 2016

It actually is copy/pasteable. Would you like a screenshot proving that?

Anyway, the question is why this doesn't work, not whether it is considered good or bad practice. It comes from a question on Stack Overflow, and I'm trying to understand:

  1. Is my understanding of groupby apply incorrect,
  2. or is this a bug?

@jreback jreback added this to the 0.18.1 milestone Mar 17, 2016
@jreback
Contributor

jreback commented Mar 17, 2016

This is partially a bug; will update soon.

@jreback jreback added the Bug label Mar 17, 2016
@jreback jreback modified the milestones: 0.18.2, 0.18.1 Apr 26, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 Sep 1, 2016
@mroeschke mroeschke added the Apply (Apply, Aggregate, Transform, Map) label and removed the Usage Question label Oct 9, 2019
@rhshadrach
Member

I'm seeing the expected output on master. That said, I don't think tests should be added for this. One should not mutate in an apply in the first place.

@jreback
Contributor

jreback commented Sep 26, 2020

Note that we do have a couple of tests that lock down this behavior.

When we want to change it, these kinds of tests are helpful.

@rhshadrach
Member

Alright - I'll mark as such.

@rhshadrach rhshadrach added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels Sep 26, 2020
@mroeschke mroeschke removed the Apply (Apply, Aggregate, Transform, Map), Bug, and Groupby labels Apr 23, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.3 May 12, 2021