Iterating over groups skips groups with nan in the name #10468

Closed
fbnrst opened this issue Jun 29, 2015 · 8 comments

Comments


fbnrst commented Jun 29, 2015

Today I discovered a strange behaviour when iterating over groups where the group name contains a nan. Take the following DataFrame and group it. As expected, there are two groups:

In [1]: df = pd.DataFrame([['a', 1, 1], ['b', sp.nan, 2]], columns = ['a', 'b', 'c'])

In [2]: grouped = df.groupby(['a', 'b'])

In [3]: print grouped.groups
{('a', 1.0): [0], ('b', nan): [1]}

Now, if I iterate over the groups the second group is missing:

In [4]: for name, group in grouped:
   ...:         print name
('a', 1.0)

I would expect the iteration to work like this:

In [5]: for name, group in grouped.groups.items():
   ...:         print name
('a', 1.0)
('b', nan)
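
For readers on a current pandas (Python 3), the behaviour can be reproduced as below. This is a sketch using numpy's nan in place of the `sp.nan` alias above; the `dropna` keyword shown here was only added later (pandas 1.1) and makes the skipping explicit and optional:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['b', np.nan, 2]], columns=['a', 'b', 'c'])

# Default behaviour: the group whose key contains nan is skipped in iteration.
default_names = [name for name, _ in df.groupby(['a', 'b'])]

# With dropna=False (pandas >= 1.1), the nan group is kept in iteration too.
kept_names = [name for name, _ in df.groupby(['a', 'b'], dropna=False)]
```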

pandas version:

In [6]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-55-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.1
Cython: 0.20.1post0
numpy: 1.9.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 3.2.0
sphinx: None
patsy: 0.3.0
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: 2.3.9
sqlalchemy: 0.8.4
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext)

fbnrst commented Jun 29, 2015

I just confirmed the issue for the current version of pandas:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-55-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.2
nose: 1.3.1
Cython: 0.20.1post0
numpy: 1.9.2
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 3.2.0
sphinx: None
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: 0.8.4
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext)


jreback commented Jun 29, 2015

See the docs here.

By definition, groupby skips nan groups.


fbnrst commented Jun 29, 2015

I see the point, and that makes sense.
But the behaviour of groupby is still a little inconsistent: if I group by just one column containing nan, the corresponding rows are immediately removed:

In [9]: grouped2 = df.groupby(['b'])

In [10]: grouped2.groups
Out[10]: {1.0: [0]}

But in the example I posted above where I grouped using multiple columns the nan entry is still listed as a group:

In [11]: grouped = df.groupby(['a', 'b'])

In [12]: print grouped.groups
{('a', 1.0): [0], ('b', nan): [1]}

Shouldn't the second entry be absent as well, since it contains a nan?


jreback commented Jun 29, 2015

No, the 2nd is not nan but a tuple (that happens to contain nan). Having nan in your groupers is pretty odd. Can you show a full example of what you are trying to do?


fbnrst commented Jul 1, 2015

Exactly, it's a tuple containing nan. Sorry, I was a bit sloppy.

The example in which I encountered nan in my groupers goes as follows. I have some (biological) data x measured under two different conditions, condition1 and condition2. For condition2 the spatial position also mattered, so that data was categorized into two subcategories, zone1 and zone2 (the spatial position itself was not recorded). For condition1 the spatial position was not measured, so that data is missing. The DataFrame looked like this:

>>> df = pd.DataFrame()
>>> df['condition'] = 3*['condition1'] + 6*['condition2']
>>> df['zone'] = 3*[sp.nan] + 3*['zone1'] + 3*['zone2']
>>> df['x'] = [1, 3, 2, 1, 4, 3, 2, 1, 2]
>>> print df

    condition   zone  x
0  condition1    NaN  1
1  condition1    NaN  3
2  condition1    NaN  2
3  condition2  zone1  1
4  condition2  zone1  4
5  condition2  zone1  3
6  condition2  zone2  2
7  condition2  zone2  1
8  condition2  zone2  2

I did univariate scatter plots for the measurements from the 3 conditions "condition1", "condition2, zone1" and "condition2, zone2":

grouped = df.groupby(['condition', 'zone'])
for i, (name, group) in enumerate(grouped):
    plt.scatter(len(group)*[i], group['x'])

Then I wanted to set the xticks according to the group names by

plt.xticks(range(len(group)), grouped.groups.keys())
plt.show()

To my surprise, this produced 3 labels while only 2 groups had data plotted.
Now, I understand that this comes from the fact that groups with a nan in the group name are skipped in the loop but are still present in the grouped.groups dict. I think the behavior would be more consistent if groups with a nan in the name were also absent from the grouped.groups dict.
Would you agree?
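
One way to keep the tick labels in sync with the plotted groups is to build the labels inside the same loop that draws the points. A sketch with the matplotlib calls elided, using numpy's nan and the later `dropna` keyword (pandas >= 1.1) so the nan group is plotted too:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'condition': 3 * ['condition1'] + 6 * ['condition2'],
    'zone': 3 * [np.nan] + 3 * ['zone1'] + 3 * ['zone2'],
    'x': [1, 3, 2, 1, 4, 3, 2, 1, 2],
})

labels = []
# dropna=False keeps the nan group, so all 3 conditions appear in the loop.
for i, (name, group) in enumerate(df.groupby(['condition', 'zone'], dropna=False)):
    labels.append(', '.join(str(v) for v in name))
    # plt.scatter(len(group) * [i], group['x'])  # plotting elided

# labels now has exactly one entry per plotted group, so
# plt.xticks(range(len(labels)), labels) cannot go out of sync.
```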


jreback commented Jul 1, 2015

The usual way to do this is simply to fill your missing groups (with a value or ''):

In [34]: df.fillna('missing').groupby(['condition','zone']).sum()
Out[34]: 
                    x
condition  zone      
condition1 missing  6
condition2 zone1    8
           zone2    5
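
Restated as a self-contained sketch, using the column names from the example above. Only the zone column is filled, which is equivalent here since it is the only one with missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'condition': 3 * ['condition1'] + 6 * ['condition2'],
    'zone': 3 * [np.nan] + 3 * ['zone1'] + 3 * ['zone2'],
    'x': [1, 3, 2, 1, 4, 3, 2, 1, 2],
})

# After filling, every row belongs to a real group, so nothing is skipped.
summed = df.fillna({'zone': 'missing'}).groupby(['condition', 'zone'])['x'].sum()
```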


jreback commented Jul 1, 2015

I put up an example in #10484

This could be a bug; the state is inconsistent.

@bear24rw

I also just hit this. I understand that nan != nan, but I would still like to be able to group by nans without having to fill them with some random value.

In [1]: df = pd.DataFrame({'a':[0,0,0,1,1,1],'b':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 'c':[9,8,7,6,5,54]})
In [2]: df.groupby(['a','b']).groups
Out[2]:
{(0, nan): Int64Index([0, 1, 2], dtype='int64'),
 (1, nan): Int64Index([3, 4, 5], dtype='int64')}
In [3]: for group, group_df in df.groupby(['a','b']):
    ...:     print(group)
    ...:
In [4]: for group, group_df in df.fillna('missing').groupby(['a','b']):
    ...:     print(group)
    ...:
(0, 'missing')
(1, 'missing')

I'd really like the group tuple to contain a nan instead of having to fill it with a value. There are times I want to write group_df to a csv, and having to replace 'missing' with nan again before doing so is annoying.
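
The behaviour asked for here was eventually added: with dropna=False (pandas >= 1.1), the group tuples keep their nan and each group_df retains the original missing values, so it can be written to csv without any round-trip filling. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 1, 1],
                   'b': [np.nan] * 6,
                   'c': [9, 8, 7, 6, 5, 54]})

names = []
for name, group_df in df.groupby(['a', 'b'], dropna=False):
    names.append(name)
    # group_df still holds the original nan, so no refilling before export:
    csv_text = group_df.to_csv(index=False)
```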
