DOC: update groupby NA group handling / workaround #5456

mkeller-upb · 2013-11-06T21:32:28Z

Add more explicit docs / work-around for dealing with groupby and NA groups

(see comments)

Changelog: 07.Nov.2013: Add line to example below to preprocess table content.

I expect the following behavior: DataFrame.groupby splits the DataFrame into subtables according to the grouping condition. A column name as the grouping condition gives me one subtable for each distinct value in that column. Similarly, grouping by multiple columns (a list of column names) gives me one group for each occurring combination of values in these columns (put differently, the unique "values" of a multi-column grouping are tuples).

If my expectations are wrong, then there is a lack of clarification: I couldn't read a different meaning or expected behavior from the documentation (e.g. pandas.DataFrame.groupby.__doc__).

Otherwise I found a bug and need a fix: some existing combinations do not get a group or split subtable (I checked this with drop_duplicates). And, finally, grouped.__iter__ skips more/other combinations than grouped.groups.keys() does. Here I would also expect both to follow the same implementation...

I tracked it into the depths of pandas, to pandas.core.Grouper._get_group_keys or rather _KeyMapper.get_key: self.levels looks good, but the list-comprehension/get-method/zip step goes wrong, or possibly pandas.core.Grouper.group_info provides an ngroups value that is too small, or something else.

pandas.__version__ : 0.12.0-1062-g3c57949 (from 6.11.2013)
numpy.__version__ : 1.7.2
MacOSX 10.9

Test Example:

import pickle
import sys
import os

import pandas as pd

grp_cols = ['algorithm', 'customalpha']
df = "ccopy_reg\n_reconstructor\np0\n(cpandas.core.frame\nDataFrame\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\ng0\n(cpandas.core.internals\nBlockManager\np5\ng2\nNtp6\nRp7\n((lp8\ncnumpy.core.multiarray\n_reconstruct\np9\n(cpandas.core.index\nIndex\np10\n(I0\ntp11\nS'b'\np12\ntp13\nRp14\n((I1\n(I2\ntp15\ncnumpy\ndtype\np16\n(S'O8'\np17\nI0\nI1\ntp18\nRp19\n(I3\nS'|'\np20\nNNNI-1\nI-1\nI63\ntp21\nbI00\n(lp22\nS'algorithm'\np23\naS'customalpha'\np24\natp25\n(Ntp26\ntp27\nbag9\n(cpandas.core.index\nInt64Index\np28\n(I0\ntp29\ng12\ntp30\nRp31\n((I1\n(I13\ntp32\ng16\n(S'i8'\np33\nI0\nI1\ntp34\nRp35\n(I3\nS'<'\np36\nNNNI-1\nI-1\nI0\ntp37\nbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x04\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x05\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x08\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x0c\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\r\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x11\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x15\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x17\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x18\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x1a\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x1e\\x00\\x00\\x00\\x00\\x00\\x00\\x00 
\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np38\ntp39\n(Ntp40\ntp41\nba(lp42\ng9\n(cnumpy\nndarray\np43\n(I0\ntp44\ng12\ntp45\nRp46\n(I1\n(I2\nI13\ntp47\ng16\n(S'O8'\np48\nI0\nI1\ntp49\nRp50\n(I3\nS'|'\np51\nNNNI-1\nI-1\nI63\ntp52\nbI00\n(lp53\nS'ScenarioAlgoLocalHeuristicM'\np54\naS'ScenarioAlgoLocalHeuristicM'\np55\naS'ScenarioAlgoLocalHeuristicM'\np56\naS'ScenarioAlgoLocalHeuristicM'\np57\naS'ScenarioAlgoLocalHeuristicMC'\np58\naS'ScenarioAlgoCMTFLP'\np59\naS'ScenarioAlgoLocalHeuristicMC'\np60\naS'ScenarioAlgoLocalHeuristicMC'\np61\naS'ScenarioAlgoLocalHeuristicMC'\np62\naS'ScenarioAlgoLocalHeuristicMC'\np63\naS'ScenarioAlgoLocalHeuristicM'\np64\naS'ScenarioAlgoLocalHeuristicM'\np65\naS'ScenarioAlgoLocalHeuristicMC'\np66\naS'exp'\np67\naS'r100'\np68\naNaS'r333'\np69\naNaNaS'r333'\np70\naS'r100'\np71\naS'linear'\np72\naS'exp'\np73\naS'r10'\np74\naS'linear'\np75\naS'r10'\np76\natp77\nba(lp78\ng9\n(g10\n(I0\ntp79\ng12\ntp80\nRp81\n((I1\n(I2\ntp82\ng19\nI00\n(lp83\ng23\nag24\natp84\n(Ntp85\ntp86\nbatp87\nbb."
df = pickle.loads(df)

# Unexpected behavior was caused by None values (which are treated as NaN values), thanks jreback
df.fillna("default", inplace=True) # replaces None/NaN values

print "raw data: (", len(df), ")\n", df
print
print

df_grps1 = df[grp_cols].drop_duplicates()
df_grps2 = df.groupby(grp_cols)
df_grps3 = [grp for grp, _ in df.groupby(grp_cols)]

print "df_grps1 (#", len(df_grps1), "): \n", df_grps1
print
print "df_grps2 (#", len(df_grps2), "): "
for tpl in df_grps2.groups.keys():
    print tpl
print
print "df_grps3 (#", len(df_grps3), "): "
for tpl in df_grps3:
    print tpl

assert len(df_grps1) == len(df_grps2), "baad bug !!!"
assert len(df_grps2) == len(df_grps3), "baad bug !!!"
assert len(df_grps1) == len(df_grps3), "baad bug!!!"

print "passed without error"
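
The pickled frame above was written by pandas 0.12 and may not unpickle on other versions. A minimal sketch of the same comparison follows, assuming a recent pandas on Python 3. The frame is hand-built with made-up values, and count_groups is a hypothetical helper, not part of pandas. It compares only the drop_duplicates count with the number of groups obtained by iterating.

```python
import pandas as pd

grp_cols = ["algorithm", "customalpha"]

# Hypothetical stand-in for the pickled frame: two rows have no customalpha.
df = pd.DataFrame({
    "algorithm": ["M", "M", "MC", "MC", "CMTFLP"],
    "customalpha": ["exp", "r100", None, "r333", None],
})


def count_groups(frame):
    """Return (unique key combinations, groups seen when iterating)."""
    n_unique = len(frame[grp_cols].drop_duplicates())
    n_iterated = sum(1 for _ in frame.groupby(grp_cols))
    return n_unique, n_iterated


# With the default settings, rows whose key contains None/NaN get no group.
raw_counts = count_groups(df)

# After filling the missing keys, both counts agree.
filled_counts = count_groups(df.fillna("default"))

print("raw:", raw_counts)
print("filled:", filled_counts)
```

With this data the raw counts differ, because the two rows with a missing customalpha are left out of the groups, while the filled counts match.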


jreback · 2013-11-06T22:01:15Z

you have None in your groups, which are dropped, see here

if you add df = df.fillna('foo') right after you unpickle, your script will work fine.

The way to 'solve' this problem is to fill the groups with a string, group, perform your operation, and then, if you really, really want a nan in an index (which in general is allowed, but makes indexing almost impossible), set those strings back to nan.
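
The fill / group / restore steps described above could look roughly like this. It is only a sketch under assumptions: the data, the "runtime" column and the PLACEHOLDER string are invented for illustration, and the result is kept as ordinary columns rather than an index.

```python
import pandas as pd

# Hypothetical data; "runtime" is an invented value column.
df = pd.DataFrame({
    "algorithm": ["M", "M", "MC", "MC"],
    "customalpha": ["exp", None, None, "exp"],
    "runtime": [1.0, 2.0, 3.0, 4.0],
})

# Assumption: any string that cannot collide with a real key value.
PLACEHOLDER = "missing"

# 1. Fill the NA keys so that they form their own groups.
filled = df.assign(customalpha=df["customalpha"].fillna(PLACEHOLDER))

# 2. Group and perform the operation.
result = (
    filled.groupby(["algorithm", "customalpha"])["runtime"]
    .sum()
    .reset_index()
)

# 3. Optionally turn the placeholder back into NaN. The keys stay in
#    regular columns here, since NaN in an index makes lookups awkward.
is_placeholder = result["customalpha"] == PLACEHOLDER
result["customalpha"] = result["customalpha"].mask(is_placeholder)

print(result)
```

The placeholder has to be a value that never occurs as a real key, otherwise step 3 would also blank out genuine entries.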

mkeller-upb · 2013-11-07T16:05:57Z

That's good. So it's not a bug and everything is much easier. Two remaining points:
a) I recommend adding a hint about this behavior to pandas.DataFrame.groupby.__doc__.
b) And mention in all three documentation places (your tutorial link, DataFrame, fillna) that None, NaN, and NA are treated similarly.
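
For illustration, a small sketch of what "treated similarly" means in practice, assuming a recent pandas. The keys and values are made up, and pd.NaT is included as pandas' datetime missing marker.

```python
import numpy as np
import pandas as pd

# pandas reports each of these markers as missing.
missing_markers = [None, np.nan, pd.NaT]
flags = [bool(pd.isna(value)) for value in missing_markers]

# Made-up keys: two real entries, one None and one NaN.
keys = pd.Series(["a", None, "a", np.nan], dtype=object)
values = pd.Series([1, 2, 3, 4])

# By default, rows whose key is None or NaN end up in no group.
sizes = values.groupby(keys).size()

print(flags)
print(sizes)
```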

Thanks a lot for your quick answer, jreback.

jreback · 2013-11-07T16:12:33Z

ok...will convert this issue to a doc updating one then...thanks for the comments

springcoil · 2015-08-13T20:33:31Z

I'm adding something to this, just to bring it up the list. So what exactly has to be done: does there need to be a change to the docs themselves, or to the docstring as well?

jreback · 2015-08-15T17:23:02Z

would add an example of how to work around (like the above), here

mroeschke · 2022-06-14T16:36:10Z

As described in #47337 (review), there is dropna=False which will keep the NA groups now so closing
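
A minimal sketch of the dropna=False option mentioned above, assuming pandas 1.1 or later (where groupby accepts dropna) and using made-up data:

```python
import pandas as pd

# Made-up data: two rows have no customalpha.
df = pd.DataFrame({
    "algorithm": ["M", "M", "MC", "MC"],
    "customalpha": ["exp", None, None, "exp"],
})
grp_cols = ["algorithm", "customalpha"]

# Default (dropna=True): rows with a missing key are left out.
default_sizes = df.groupby(grp_cols).size()

# dropna=False: missing keys get their own groups.
with_na_sizes = df.groupby(grp_cols, dropna=False).size()

print(default_sizes)
print(with_na_sizes)
```

With dropna=False no placeholder string is needed, and the group count matches what drop_duplicates reports for the key columns.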

jreback modified the milestones: 0.15.0, 0.14.0 Apr 4, 2014

jreback mentioned this issue Apr 28, 2014

BUG: Groupby NaT Handling #6992

Closed

jreback mentioned this issue Sep 19, 2014

Bloomberg Hackathon #8323

Closed

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback mentioned this issue Apr 19, 2015

groupby/transform with NaNs in grouped column #9941

Closed

nbonnotte mentioned this issue Jun 19, 2016

WIP: ENH: pivot/groupby index with nan #12607

Closed

36 tasks

TomAugspurger added the good first issue label Oct 11, 2017

jreback removed the Difficulty Novice label Dec 15, 2017

datapythonista added the Effort Low label Jul 6, 2018

jbrockmendel removed the Effort Low label Oct 21, 2019

aamnv mentioned this issue Jun 13, 2022

DOC: GH5456 Adding workaround info on NA / NaT handling for groupby #47337

Closed

2 tasks

mroeschke closed this as completed Jun 14, 2022
