Iterating over groups skips groups with nan in the name #10468
Comments
I just confirmed the issue for the current version of pandas:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-55-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.16.2
nose: 1.3.1
Cython: 0.20.1post0
numpy: 1.9.2
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 3.2.0
sphinx: None
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: 0.8.4
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext) |
See the docs here: by definition, groupby skips NaN groups. |
I see the point and that makes sense.
In [9]: grouped2 = df.groupby(['b'])
In [10]: grouped2.groups
Out[10]: {1.0: [0]}
But in the example I posted above, where I grouped using multiple columns, the nan entry is still listed as a group:
In [11]: grouped = df.groupby(['a', 'b'])
In [12]: print grouped.groups
{('a', 1.0): [0], ('b', nan): [1]}
Shouldn't the second entry be absent because it contains a nan? |
no, the 2nd is not |
Exactly, it's a tuple containing nan. The example in which I encountered this:
>>> df = pd.DataFrame()
>>> df['condition'] = 3*['condition1'] + 6*['condition2']
>>> df['zone'] = 3*[sp.nan] + 3*['zone1'] + 3*['zone2']
>>> df['x'] = [1, 3, 2, 1, 4, 3, 2, 1, 2]
>>> print df
    condition   zone  x
0  condition1    NaN  1
1  condition1    NaN  3
2  condition1    NaN  2
3  condition2  zone1  1
4  condition2  zone1  4
5  condition2  zone1  3
6  condition2  zone2  2
7  condition2  zone2  1
8  condition2  zone2  2
I did univariate scatter plots for the measurements from the 3 conditions "condition1", "condition2, zone1" and "condition2, zone2":
grouped = df.groupby(['condition', 'zone'])
for i, (name, group) in enumerate(grouped):
    plt.scatter(len(group)*[i], group['x'])
Then I wanted to set the xticks according to the group names by
plt.xticks(range(len(group)), grouped.groups.keys())
plt.show()
To my surprise I produced 3 labels but only had data in 2 groups. |
The usual way to do this is simply to fill your missing groups (with a value or '').
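A minimal sketch of that workaround, using the condition/zone frame from the comment above. The placeholder string "none" and the use of `np.nan` in place of `sp.nan` are assumptions made for illustration. Collecting the labels from the same iteration that yields the data keeps the two in step, instead of reading them from `grouped.groups.keys()`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "condition": 3 * ["condition1"] + 6 * ["condition2"],
    "zone": 3 * [np.nan] + 3 * ["zone1"] + 3 * ["zone2"],
    "x": [1, 3, 2, 1, 4, 3, 2, 1, 2],
})

# Replace missing zone values with a placeholder (an arbitrary choice here)
# so that no row is dropped when grouping.
filled = df.fillna({"zone": "none"})
grouped = filled.groupby(["condition", "zone"])

# Take labels and group sizes from the same iteration, so every label
# corresponds to a group that actually yields data.
labels = [name for name, _ in grouped]
sizes = [len(group) for _, group in grouped]
```

With the placeholder in place, all three conditions show up during iteration, so `labels` can be passed to `plt.xticks` together with `range(len(labels))`.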
|
I put up an example in #10484. This could be a bug; the state is inconsistent. |
I also just hit this. I understand that
In [1]: df = pd.DataFrame({'a':[0,0,0,1,1,1],'b':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 'c':[9,8,7,6,5,54]})
In [2]: df.groupby(['a','b']).groups
Out[2]:
{(0, nan): Int64Index([0, 1, 2], dtype='int64'),
 (1, nan): Int64Index([3, 4, 5], dtype='int64')}
In [3]: for group, group_df in df.groupby(['a','b']):
   ...:     print(group)
   ...:
In [4]: for group, group_df in df.fillna('missing').groupby(['a','b']):
   ...:     print(group)
   ...:
(0, 'missing')
(1, 'missing')
I'd really like to have the group tuple contain a |
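For readers on a newer pandas than the 0.16.2 discussed in this thread: assuming pandas 1.1 or later, `groupby` accepts a `dropna` parameter, and `dropna=False` keeps NaN as a group key without a placeholder. A short sketch on the frame from the comment above (the variable names `default_keys` and `kept` are made up for this example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [0, 0, 0, 1, 1, 1],
    "b": [np.nan] * 6,
    "c": [9, 8, 7, 6, 5, 54],
})

# Default behaviour: rows whose key contains NaN are dropped,
# so iterating yields no groups for this frame.
default_keys = [key for key, _ in df.groupby(["a", "b"])]

# dropna=False (assumes pandas >= 1.1) keeps NaN as part of the group key.
kept = [(key, len(group)) for key, group in df.groupby(["a", "b"], dropna=False)]
```

Because NaN does not compare equal to itself, check the NaN part of a key with `np.isnan` or `pd.isna` rather than `==`.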
Today I discovered a strange behaviour when iterating over groups where the group name contains a nan. Take the following data frame and group it. As expected there are two groups:
Now, if I iterate over the groups the second group is missing:
I would expect that the iteration works like this:
pandas version: