DOC: update groupby NA group handling / workaround #5456

Closed
mkeller-upb opened this issue Nov 6, 2013 · 6 comments

@mkeller-upb

Add more explicit docs / work-around for dealing with groupby and NA groups

(see comments)

Changelog: 07.Nov.2013: Add line to example below to preprocess table content.

I expect the following behavior: DataFrame.groupby splits the dataframe/table into subtables according to the grouping condition. A column name as the grouping condition gives me one subtable for each distinct value in that column. Similarly, grouping by multiple columns (a list of column names) gives me one group for each occurring combination of values in these columns (put differently, the unique "values" of multiple grouping columns are tuples).

If my expectations are wrong, then the documentation lacks clarification: I could not read a different meaning or expected behavior from it (e.g. pandas.DataFrame.groupby.__doc__).

Otherwise I have found a bug and need a fix: some existing combinations do not get a group or split subtable (I checked this with drop_duplicates). Finally, grouped.__iter__ skips more (or different) combinations than grouped.groups.keys() does; I would expect both to follow the same implementation.

I tracked it down into pandas, to pandas.core.Grouper._get_group_keys, or rather _KeyMapper.get_key. self.levels looks good, but the list-comprehension/get-method/zip step goes wrong, or possibly pandas.core.Grouper.group_info provides an ngroups value that is too small, or something else is off.

pandas.__version__ : 0.12.0-1062-g3c57949 (from 6.11.2013)
numpy.__version__ : 1.7.2
MacOSX 10.9

Test Example:

import pickle
import sys
import os

import pandas as pd

grp_cols = ['algorithm', 'customalpha']
df = "ccopy_reg\n_reconstructor\np0\n(cpandas.core.frame\nDataFrame\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\ng0\n(cpandas.core.internals\nBlockManager\np5\ng2\nNtp6\nRp7\n((lp8\ncnumpy.core.multiarray\n_reconstruct\np9\n(cpandas.core.index\nIndex\np10\n(I0\ntp11\nS'b'\np12\ntp13\nRp14\n((I1\n(I2\ntp15\ncnumpy\ndtype\np16\n(S'O8'\np17\nI0\nI1\ntp18\nRp19\n(I3\nS'|'\np20\nNNNI-1\nI-1\nI63\ntp21\nbI00\n(lp22\nS'algorithm'\np23\naS'customalpha'\np24\natp25\n(Ntp26\ntp27\nbag9\n(cpandas.core.index\nInt64Index\np28\n(I0\ntp29\ng12\ntp30\nRp31\n((I1\n(I13\ntp32\ng16\n(S'i8'\np33\nI0\nI1\ntp34\nRp35\n(I3\nS'<'\np36\nNNNI-1\nI-1\nI0\ntp37\nbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x04\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x05\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x08\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x0c\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\r\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x11\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x15\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x17\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x18\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x1a\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x1e\\x00\\x00\\x00\\x00\\x00\\x00\\x00 
\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np38\ntp39\n(Ntp40\ntp41\nba(lp42\ng9\n(cnumpy\nndarray\np43\n(I0\ntp44\ng12\ntp45\nRp46\n(I1\n(I2\nI13\ntp47\ng16\n(S'O8'\np48\nI0\nI1\ntp49\nRp50\n(I3\nS'|'\np51\nNNNI-1\nI-1\nI63\ntp52\nbI00\n(lp53\nS'ScenarioAlgoLocalHeuristicM'\np54\naS'ScenarioAlgoLocalHeuristicM'\np55\naS'ScenarioAlgoLocalHeuristicM'\np56\naS'ScenarioAlgoLocalHeuristicM'\np57\naS'ScenarioAlgoLocalHeuristicMC'\np58\naS'ScenarioAlgoCMTFLP'\np59\naS'ScenarioAlgoLocalHeuristicMC'\np60\naS'ScenarioAlgoLocalHeuristicMC'\np61\naS'ScenarioAlgoLocalHeuristicMC'\np62\naS'ScenarioAlgoLocalHeuristicMC'\np63\naS'ScenarioAlgoLocalHeuristicM'\np64\naS'ScenarioAlgoLocalHeuristicM'\np65\naS'ScenarioAlgoLocalHeuristicMC'\np66\naS'exp'\np67\naS'r100'\np68\naNaS'r333'\np69\naNaNaS'r333'\np70\naS'r100'\np71\naS'linear'\np72\naS'exp'\np73\naS'r10'\np74\naS'linear'\np75\naS'r10'\np76\natp77\nba(lp78\ng9\n(g10\n(I0\ntp79\ng12\ntp80\nRp81\n((I1\n(I2\ntp82\ng19\nI00\n(lp83\ng23\nag24\natp84\n(Ntp85\ntp86\nbatp87\nbb."
df = pickle.loads(df)

# Unexpected behavior was caused by None values (which are treated as NaN values); thanks, jreback
df.fillna("default", inplace=True) # replaces None/NaN values

print "raw data: (", len(df), ")\n", df
print
print

df_grps1 = df[grp_cols].drop_duplicates()
df_grps2 = df.groupby(grp_cols)
df_grps3 = [grp for grp, _ in df.groupby(grp_cols)]

print "df_grps1 (#", len(df_grps1), "): \n", df_grps1
print
print "df_grps2 (#", len(df_grps2), "): "
for tpl in df_grps2.groups.keys():
    print tpl
print
print "df_grps3 (#", len(df_grps3), "): "
for tpl in df_grps3:
    print tpl

assert len(df_grps1) == len(df_grps2), "baad bug !!!"
assert len(df_grps2) == len(df_grps3), "baad bug !!!"
assert len(df_grps1) == len(df_grps3), "baad bug!!!"

print "passed without error"
@jreback
Contributor

jreback commented Nov 6, 2013

you have None in your groups, which are dropped, see here

If you add df = df.fillna('foo') right after you unpickle, your script will work fine.

The way to 'solve' this problem is to fill the missing group keys with a string, group, and perform your operation. Then, if you really, really want a NaN in an index (which is generally allowed, but makes indexing almost impossible), you can set those strings back to NaN.
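The fill / group / restore steps described above can be sketched as follows. This is a minimal illustration, not the documented pandas recipe: the small frame, the `value` column, and the `PLACEHOLDER` string are all made up for the example, and the placeholder must be a string that never occurs as a real key.

```python
import pandas as pd

# Toy data (hypothetical): two of the five rows have a missing grouping key.
df = pd.DataFrame({
    "algorithm": ["M", "M", "MC", "MC", "MC"],
    "customalpha": ["exp", None, None, "r100", "r100"],
    "value": [1, 2, 3, 4, 5],
})
grp_cols = ["algorithm", "customalpha"]

# By default, rows whose key contains None/NaN are dropped from the result.
dropped = df.groupby(grp_cols)["value"].sum()

# Hypothetical sentinel; it must not collide with a real key value.
PLACEHOLDER = "__missing__"

# 1) Fill the missing keys with the sentinel string.
filled = df.copy()
filled[grp_cols] = filled[grp_cols].fillna(PLACEHOLDER)

# 2) Group and perform the operation; every row now belongs to a group.
result = filled.groupby(grp_cols)["value"].sum().reset_index()

# 3) Optionally turn the sentinel back into NaN (here in a regular column,
#    not in the index, which keeps later indexing simple).
result["customalpha"] = result["customalpha"].mask(
    result["customalpha"] == PLACEHOLDER
)

print(len(dropped), len(result))
```

Restoring the NaN in a plain column after `reset_index()` sidesteps the indexing trouble mentioned above; setting it back inside an index is possible too, but harder to work with.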

@mkeller-upb
Author

That's good. So it's not a bug, and everything is much easier. Two remaining points:
a) I recommend adding a hint about this behavior to pandas.DataFrame.groupby.__doc__.
b) Also mention in all three documentation places (your tutorial link, DataFrame, fillna) that None, NaN, and NA are treated alike.

Thanks a lot for your quick answer, jreback.

@jreback
Contributor

jreback commented Nov 7, 2013

ok...will convert this issue to a doc updating one then...thanks for the comments

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 4, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@springcoil
Contributor

I'm commenting just to bring this back up the list. What exactly has to be done: does the change go in the docs themselves, or in the docstring as well?

@jreback
Contributor

jreback commented Aug 15, 2015

would add an example of how to work around (like the above), here

@mroeschke
Member

As described in #47337 (review), there is now dropna=False, which keeps the NA groups, so closing.
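For reference, a short sketch of how `dropna=False` behaves on the data from the original report. The frame below is transcribed by hand from the pickled DataFrame in the first post, so treat the values as an assumption; the pickle itself targets pandas 0.12 and is not loaded here. It also assumes a pandas version that has the `dropna` parameter on `groupby`.

```python
import pandas as pd

# Values transcribed by hand from the pickle in the report (assumption).
M = "ScenarioAlgoLocalHeuristicM"
MC = "ScenarioAlgoLocalHeuristicMC"
CMTFLP = "ScenarioAlgoCMTFLP"

df = pd.DataFrame(
    {
        "algorithm": [M, M, M, M, MC, CMTFLP, MC, MC, MC, MC, M, M, MC],
        "customalpha": ["exp", "r100", None, "r333", None, None, "r333",
                        "r100", "linear", "exp", "r10", "linear", "r10"],
    },
    index=[1, 4, 5, 8, 12, 13, 17, 21, 23, 24, 26, 30, 32],
)
grp_cols = ["algorithm", "customalpha"]

# Distinct key combinations, counting rows with a missing key.
n_unique = len(df[grp_cols].drop_duplicates())

# Default behavior: rows with a missing key form no group.
n_default = len(df.groupby(grp_cols).size())

# dropna=False keeps the groups whose key contains None/NaN.
n_keep_na = len(df.groupby(grp_cols, dropna=False).size())

print(n_unique, n_default, n_keep_na)
```

With `dropna=False`, the group count matches the `drop_duplicates` count, which is the equality the asserts in the original test script were checking.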

7 participants