Iterating over groups skips groups with nan in the name #10468

Closed
fbnrst opened this issue Jun 29, 2015 · 8 comments

Comments


fbnrst commented Jun 29, 2015

Today I discovered a strange behaviour when iterating over groups where the group name contains a nan. Take the following DataFrame and group it. As expected, there are two groups:

In [1]: df = pd.DataFrame([['a', 1, 1], ['b', sp.nan, 2]], columns = ['a', 'b', 'c'])

In [2]: grouped = df.groupby(['a', 'b'])

In [3]: print grouped.groups
{('a', 1.0): [0], ('b', nan): [1]}

Now, if I iterate over the groups the second group is missing:

In [4]: for name, group in grouped:
   ...:         print name
('a', 1.0)

I would expect the iteration to work like this:

In [5]: for name, group in grouped.groups.items():
   ...:         print name
('a', 1.0)
('b', nan)
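
For readers on a current pandas (Python 3), the behaviour can be reproduced as below. This is a sketch using numpy's nan in place of the `sp.nan` alias above; the `dropna` keyword shown here was only added later (pandas 1.1) and makes the skipping explicit and optional:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['b', np.nan, 2]], columns=['a', 'b', 'c'])

# Default behaviour: the group whose key contains nan is skipped in iteration.
default_names = [name for name, _ in df.groupby(['a', 'b'])]

# With dropna=False (pandas >= 1.1), the nan group is kept in iteration too.
kept_names = [name for name, _ in df.groupby(['a', 'b'], dropna=False)]
```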

pandas version:

In [6]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-55-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.1
Cython: 0.20.1post0
numpy: 1.9.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 3.2.0
sphinx: None
patsy: 0.3.0
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: 2.3.9
sqlalchemy: 0.8.4
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext)

fbnrst commented Jun 29, 2015

I just confirmed the issue for the current version of pandas:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-55-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.2
nose: 1.3.1
Cython: 0.20.1post0
numpy: 1.9.2
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 3.2.0
sphinx: None
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: 0.8.4
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext)


jreback commented Jun 29, 2015

See the docs here.

By definition, groupby skips nan groups.


fbnrst commented Jun 29, 2015

I see the point, and that makes sense.
But the behaviour of groupby is still a little inconsistent: if I group by just one column containing nan, the corresponding rows are immediately removed:

In [9]: grouped2 = df.groupby(['b'])

In [10]: grouped2.groups
Out[10]: {1.0: [0]}

But in the example I posted above where I grouped using multiple columns the nan entry is still listed as a group:

In [11]: grouped = df.groupby(['a', 'b'])

In [12]: print grouped.groups
{('a', 1.0): [0], ('b', nan): [1]}

Shouldn't the second entry be absent as well, since it contains a nan?


jreback commented Jun 29, 2015

No, the 2nd is not nan but a tuple (that happens to contain nan). Having nan in your groupers is pretty odd. Can you show a full example of what you are trying to do?


fbnrst commented Jul 1, 2015

Exactly, it's a tuple containing nan. Sorry, I was a bit sloppy.

The example in which I encountered nan in my groupers goes as follows. I have some (biological) data x measured under two different conditions, condition1 and condition2. For condition2 the spatial position also mattered, so that data was categorized into two subcategories, zone1 and zone2 (the spatial position itself was not recorded). For condition1 the spatial position was not measured, so that data is missing. The DataFrame looked like this:

>>> df = pd.DataFrame()
>>> df['condition'] = 3*['condition1'] + 6*['condition2']
>>> df['zone'] = 3*[sp.nan] + 3*['zone1'] + 3*['zone2']
>>> df['x'] = [1, 3, 2, 1, 4, 3, 2, 1, 2]
>>> print df

    condition   zone  x
0  condition1    NaN  1
1  condition1    NaN  3
2  condition1    NaN  2
3  condition2  zone1  1
4  condition2  zone1  4
5  condition2  zone1  3
6  condition2  zone2  2
7  condition2  zone2  1
8  condition2  zone2  2

I did univariate scatter plots for the measurements from the 3 conditions "condition1", "condition2, zone1" and "condition2, zone2":

grouped = df.groupby(['condition', 'zone'])
for i, (name, group) in enumerate(grouped):
    plt.scatter(len(group)*[i], group['x'])

Then I wanted to set the xticks according to the group names by

plt.xticks(range(len(group)), grouped.groups.keys())
plt.show()

To my surprise, this produced 3 labels while only 2 groups had data plotted.
Now, I understand that this comes from the fact that groups with a nan in the group name are skipped in the loop but are still present in the grouped.groups dict. I think the behavior would be more consistent if groups with a nan in the name were also absent from the grouped.groups dict.
Would you agree?
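
One way to keep the tick labels in sync with the plotted groups is to build the labels inside the same loop that draws the points. A sketch with the matplotlib calls elided, using numpy's nan and the later `dropna` keyword (pandas >= 1.1) so the nan group is plotted too:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'condition': 3 * ['condition1'] + 6 * ['condition2'],
    'zone': 3 * [np.nan] + 3 * ['zone1'] + 3 * ['zone2'],
    'x': [1, 3, 2, 1, 4, 3, 2, 1, 2],
})

labels = []
# dropna=False keeps the nan group, so all 3 conditions appear in the loop.
for i, (name, group) in enumerate(df.groupby(['condition', 'zone'], dropna=False)):
    labels.append(', '.join(str(v) for v in name))
    # plt.scatter(len(group) * [i], group['x'])  # plotting elided

# labels now has exactly one entry per plotted group, so
# plt.xticks(range(len(labels)), labels) cannot go out of sync.
```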


jreback commented Jul 1, 2015

The usual way to do this is simply to fill your missing groups (with a value or ''):

In [34]: df.fillna('missing').groupby(['condition','zone']).sum()
Out[34]: 
                    x
condition  zone      
condition1 missing  6
condition2 zone1    8
           zone2    5
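
Restated as a self-contained sketch, using the column names from the example above. Only the zone column is filled, which is equivalent here since it is the only one with missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'condition': 3 * ['condition1'] + 6 * ['condition2'],
    'zone': 3 * [np.nan] + 3 * ['zone1'] + 3 * ['zone2'],
    'x': [1, 3, 2, 1, 4, 3, 2, 1, 2],
})

# After filling, every row belongs to a real group, so nothing is skipped.
summed = df.fillna({'zone': 'missing'}).groupby(['condition', 'zone'])['x'].sum()
```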


jreback commented Jul 1, 2015

I put up an example in #10484

This could be a bug; the state is inconsistent.

@bear24rw

I also just hit this. I understand that nan != nan, but I would still like to be able to group by nans without having to fill them with some random value.

In [1]: df = pd.DataFrame({'a':[0,0,0,1,1,1],'b':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 'c':[9,8,7,6,5,54]})
In [2]: df.groupby(['a','b']).groups
Out[2]:
{(0, nan): Int64Index([0, 1, 2], dtype='int64'),
 (1, nan): Int64Index([3, 4, 5], dtype='int64')}
In [3]: for group, group_df in df.groupby(['a','b']):
    ...:     print(group)
    ...:
In [4]: for group, group_df in df.fillna('missing').groupby(['a','b']):
    ...:     print(group)
    ...:
(0, 'missing')
(1, 'missing')

I'd really like the group tuple to contain a nan instead of having to fill it with a value. There are times I want to write group_df to a csv, and having to replace 'missing' with nan again before doing so is annoying.
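
The behaviour asked for here was eventually added: with dropna=False (pandas >= 1.1), the group tuples keep their nan and each group_df retains the original missing values, so it can be written to csv without any round-trip filling. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1, 1, 1],
                   'b': [np.nan] * 6,
                   'c': [9, 8, 7, 6, 5, 54]})

names = []
for name, group_df in df.groupby(['a', 'b'], dropna=False):
    names.append(name)
    # group_df still holds the original nan, so no refilling before export:
    csv_text = group_df.to_csv(index=False)
```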
