observed keyword for SeriesGroupBy Ignored #24880

Closed
tsdev opened this issue Jan 23, 2019 · 4 comments · Fixed by #26463

@tsdev

tsdev commented Jan 23, 2019

Code Sample, a copy-pastable example if possible

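import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': ['a', 'b', 'a'], 'c': [7, 8, 9]})
# Both grouping keys are categorical; the combination ('y', 'b') never occurs.
df['a'] = df['a'].astype('category')
df['b'] = df['b'].astype('category')
# SeriesGroupBy path: select column c first, then aggregate.
result1 = df.groupby(['a', 'b']).c.agg('sum')
# DataFrameGroupBy path: aggregate the whole frame.
result2 = df.groupby(['a', 'b']).agg('sum')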

Problem description

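The calculated result1 (a Series) and result2 (a DataFrame) have different numbers of rows: result1 drops the unobserved (y, b) combination, while result2 keeps it as NaN.
Result1: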

a  b
x  a    7
   b    8
y  a    9
Name: c, dtype: int64

Result2:

       c
a b     
x a  7.0
  b  8.0
y a  9.0
  b  NaN

Expected Output

I expect both results to have 4 rows, since the observed option is False by default.
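The expectation above can be written as a small check. This is only a sketch: the names series_result and frame_result are illustrative, and observed=False is passed explicitly so the outcome does not depend on the default of the installed pandas version. On a version that includes the fix referenced in this issue, both lengths should agree; on an affected version such as the reported 0.23.4, the Series path would be expected to return one row fewer.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': ['a', 'b', 'a'], 'c': [7, 8, 9]})
df['a'] = df['a'].astype('category')
df['b'] = df['b'].astype('category')

# Pass observed explicitly instead of relying on the version's default.
series_result = df.groupby(['a', 'b'], observed=False)['c'].agg('sum')
frame_result = df.groupby(['a', 'b'], observed=False).agg('sum')

# With observed=False, every combination of categories should get a row
# in both the SeriesGroupBy and the DataFrameGroupBy result.
print(len(series_result), len(frame_result))
```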

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@WillAyd
Member

WillAyd commented Jan 23, 2019

Hmm, OK, thanks for the report. Something must be going awry with the observed keyword in the former case (which is a SeriesGroupBy).

In [20]: df.groupby(['a','b'], observed=True).c.sum()                           
Out[20]: 
a  b
x  a    7
   b    8
y  a    9
Name: c, dtype: int64

In [21]: df.groupby(['a','b'], observed=False).c.sum()                          
Out[21]: 
a  b
x  a    7
   b    8
y  a    9
Name: c, dtype: int64

Investigation and PRs would certainly be welcome

WillAyd added this to the Contributions Welcome milestone Jan 23, 2019
WillAyd changed the title from "Grouping-aggregation on categorical columns gives inconsistent behaviour" to "observed keyword for SeriesGroupBy Ignored" Jan 23, 2019
@jreback
Contributor

jreback commented Jan 23, 2019

check this on master
we just merged a patch that affected this

@WillAyd
Member

WillAyd commented Jan 23, 2019

Confirmed on master

@rock321987

This issue still persists in pandas 0.25.3. I asked about it on Stack Overflow and was led here, but found that this issue is closed.
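For versions where the Series path still drops unobserved combinations, one possible workaround is to reindex the result onto the full product of the declared categories. This is a sketch, not something proposed in this thread: full_index and filled are illustrative names, and the missing combinations come back as NaN rather than as an aggregated value.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': ['a', 'b', 'a'], 'c': [7, 8, 9]})
df['a'] = df['a'].astype('category')
df['b'] = df['b'].astype('category')

# Aggregate over observed combinations only, which matches what the
# affected Series path returned regardless of the observed setting.
observed_only = df.groupby(['a', 'b'], observed=True)['c'].sum()

# Build every (a, b) combination from the declared categories.
full_index = pd.MultiIndex.from_product(
    [df['a'].cat.categories, df['b'].cat.categories], names=['a', 'b']
)

# Reindexing adds the missing combinations as NaN (the dtype becomes float).
filled = observed_only.reindex(full_index)
print(filled)
```

If a fill value such as 0 is more appropriate than NaN for the aggregation in question, reindex accepts a fill_value argument.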
