
Groupby iteration fails when one of the key's values is None #14841


Closed
nsfinkelstein opened this issue Dec 9, 2016 · 2 comments
Labels: Duplicate Report, Groupby, Missing-data


nsfinkelstein commented Dec 9, 2016

Code Sample

import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 2, 3],
    'b': [None, 1, 2, 3],
    'c': [1, 2, 3, 4],
})

# Group by both columns; a list of keys is unambiguous across pandas versions.
grouped = df.groupby(['a', 'b'])
print('Number of groups:', len(grouped))
# Number of groups: 4

num_iterations = 0
for _ in grouped:
    num_iterations += 1

print('Number of groups iterated over:', num_iterations) 
# Number of groups iterated over: 3

Problem description

Currently, when a GroupBy object is iterated over, any group whose key contains None in one of the grouping columns is skipped.

This is a problem because iterating over groups should visit every group.

It is also inconsistent: the object reports a length of x, yet iterating over it performs only y < x iterations.
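
A possible workaround, sketched here rather than taken from the report: replace the missing key values with a sentinel before grouping, so that the reported length and the number of iterations agree. The name SENTINEL and the value -1 are assumptions made for illustration; any value that cannot occur in column b would do.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 2, 3],
    'b': [None, 1, 2, 3],
    'c': [1, 2, 3, 4],
})

# SENTINEL is a hypothetical placeholder; it must not collide with real values in 'b'.
SENTINEL = -1

# Fill missing values in the key column only, then group as before.
filled = df.fillna({'b': SENTINEL})
grouped = filled.groupby(['a', 'b'])

# Collect the keys that iteration actually visits.
iterated_keys = [key for key, _ in grouped]

print('Number of groups:', len(grouped))
print('Number of groups iterated over:', len(iterated_keys))
```

The cost of this approach is that the sentinel appears in the group keys and has to be mapped back to a missing value afterwards if that distinction matters.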

Expected Output

I'd expect iterating over a GroupBy object to iterate over all groups, regardless of the value in the key columns.
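
For readers on newer releases than the 0.18.1 reported here: pandas 1.1 and later accept a dropna argument to groupby, and passing dropna=False keeps groups whose key contains a missing value. The snippet below is a minimal sketch of the expected behaviour under that assumption; it does not apply to the version in this report.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 2, 3],
    'b': [None, 1, 2, 3],
    'c': [1, 2, 3, 4],
})

# dropna=False (pandas >= 1.1) keeps groups whose key contains a missing value.
grouped = df.groupby(['a', 'b'], dropna=False)

# Count the groups that iteration actually visits.
num_iterations = sum(1 for _ in grouped)

print('Number of groups iterated over:', num_iterations)
```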

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-504.12.2.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.1
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext)
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None


chris-b1 commented Dec 9, 2016

Thanks for the report. This is a duplicate of #3729, with a WIP fix at #12607; please try it out and comment if you'd like.

chris-b1 closed this as completed Dec 9, 2016
chris-b1 added this to the No action milestone Dec 9, 2016
chris-b1 added the Duplicate Report, Groupby, and Missing-data labels Dec 9, 2016

nsfinkelstein (author) commented

@chris-b1 Thanks - apologies for the duplication, I didn't notice that post when I was looking through issues.
