Group-by/apply unexpected output with some operations when as_index=False #14547

Closed
anandtrex opened this issue Oct 31, 2016 · 6 comments
Labels: Bug, Duplicate Report, Groupby

Comments

@anandtrex

anandtrex commented Oct 31, 2016

A small, complete example of the issue

import pandas as pd
import numpy as np
dd = dict(a=np.arange(9), b=np.repeat(np.arange(3), 3))
df = pd.DataFrame(dd)

# Problematic line below
df.groupby('b', as_index=False).apply(np.std)
# The line below has the same problem
df.groupby('b', as_index=False).std()
# When as_index=False is not passed in, it works as expected.

# The following two lines work exactly as expected
# (an example of the group-by/apply working for some operations)
# df.groupby('b', as_index=False).apply(np.mean)
# df.groupby('b', as_index=False).mean()

Expected Output

          a    b
0  0.816497  0.0
1  0.816497  1.0
2  0.816497  2.0

Actual Output

          a    b
0  0.816497  0.0
1  0.816497  0.0
2  0.816497  0.0

When as_index=False is passed into groupby, finding the standard deviation doesn't work as expected. When as_index=True, everything works as expected.
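A possible workaround, sketched here and not proposed anywhere in the thread, is to group with the default `as_index=True`, aggregate only column `a`, and then move the group key back into a regular column. Passing `ddof=0` is an assumption made only to match the 0.816497 shown in the expected output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(9), "b": np.repeat(np.arange(3), 3)})

# Group with the default as_index=True and aggregate only column 'a',
# so the group key 'b' is never passed through the std computation.
# ddof=0 is an assumption chosen to match np.std's default.
result = df.groupby("b")["a"].std(ddof=0).reset_index()

print(result)
```

The result has the columns in the order `b`, `a`, with `b` holding the untouched group keys.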

Finding the mean works as expected in both cases.

I have also been able to reproduce the problem on Linux with the same version of pandas.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US
LANG: en_US.UTF8

pandas: 0.18.0
nose: None
pip: 8.1.2
setuptools: 28.5.0
Cython: None
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.3.5
patsy: None
dateutil: 2.5.3
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None

@chris-b1
Contributor

chris-b1 commented Oct 31, 2016

Yeah, I don't think as_index=False is that well tested; xref #13217, though that seems to be a separate issue. PR to fix welcome!

chris-b1 added this to the Next Major Release milestone Oct 31, 2016
@jorisvandenbossche
Member

Note that the two cases (.apply(np.std) and .std()) give a different result:

In [59]: df.groupby('b', as_index=False).apply(np.std)
Out[59]: 
          a    b
0  0.816497  0.0
1  0.816497  0.0
2  0.816497  0.0

In [60]: df.groupby('b', as_index=False).std()
Out[60]: 
          b    a
0  0.000000  1.0
1  1.000000  1.0
2  1.414214  1.0

I think the .apply(np.std) case is actually correct. It also applies the function to the 'b' column, whose values are constant within each group, so the result there is all 0's. So it is mainly the 'b' column in the .std() case that is wrong.
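The reasoning about the 'b' column can be checked directly. The sketch below rebuilds the DataFrame from the report and computes np.std on the 'b' values of each group by hand; the variable name `b_stds` is introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(9), "b": np.repeat(np.arange(3), 3)})

# Within each group every 'b' value equals the group key,
# so the standard deviation of 'b' is zero for every group.
b_stds = {}
for key, group in df.groupby("b"):
    b_stds[int(key)] = float(np.std(group["b"].to_numpy()))

print(b_stds)
```

Every entry is 0.0, which matches the 'b' column in the .apply(np.std) output.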

@jreback
Contributor

jreback commented Oct 31, 2016

These use different degrees of freedom: np.std defaults to ddof=0, while pandas' .std() defaults to ddof=1.
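To make the degrees-of-freedom point concrete, here is a minimal sketch using the 'a' values of the first group (0, 1, 2). np.std divides by n by default (ddof=0), while pandas' Series.std divides by n - 1 (ddof=1).

```python
import numpy as np
import pandas as pd

values = [0, 1, 2]  # column 'a' for the group where b == 0

population_std = np.std(values)       # NumPy default: ddof=0
sample_std = pd.Series(values).std()  # pandas default: ddof=1

print(round(float(population_std), 6))  # 0.816497
print(float(sample_std))                # 1.0
```

Passing `ddof=1` to np.std, or `ddof=0` to the pandas method, makes the two agree.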

@jorisvandenbossche
Member

@jreback Yes, that explains the 1 vs 0.816, but not the 0.00, 1.00, 1.41 for the 'b' column.
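One observation about those numbers: 0.00, 1.00 and 1.41 are exactly the square roots of the group keys 0, 1 and 2. The sketch below only checks that arithmetic. That the key column is passed through a square root inside .std() is an assumption and is not confirmed in this thread.

```python
import numpy as np

group_keys = np.array([0, 1, 2])

# Square root of each group key, for comparison with the
# 'b' column printed by .std() with as_index=False.
rooted = np.sqrt(group_keys)

print(np.round(rooted, 6))
```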

@jreback
Contributor

jreback commented Nov 1, 2016

this is a dupe of #10355

jreback closed this as completed Nov 1, 2016
jreback added the Duplicate Report label Nov 1, 2016
jorisvandenbossche modified the milestones: No action, Next Major Release Nov 1, 2016
@xieyuheng

#25315
