Inconsistant behaviour of empty groups when grouping with one vs. many #23865
Comments
Can you produce a more minimal example to reproduce the error? I'm assuming |
Here's the most minimal example I think I can come up with:
df = pd.DataFrame({'foo': [1, 1, 3, 3], 'bar': ['A', 'B', 'A', 'B']})
# gives zero count for the (1, 2] bin as expected
df.groupby(pd.cut(df['foo'], bins=[0, 1, 2, 3])).size()
# the (1, 2] bin is not shown
df.groupby([pd.cut(df['foo'], bins=[0, 1, 2, 3]), 'bar']).size()
Does this make sense? I can't think of a meaningful example that doesn't use |
I think this is specific to the logic in .size(). Investigation welcome.
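For reference, the two calls above can be run side by side as a self-contained script. This is only a sketch: the names `bins`, `single` and `multi` are illustrative, and `observed=False` is passed explicitly so the result does not depend on the library default. Whether `multi` contains the (1, 2] bin depends on the installed pandas version; on 0.23.4, the version reported in this issue, it is dropped.

```python
import pandas as pd

df = pd.DataFrame({'foo': [1, 1, 3, 3], 'bar': ['A', 'B', 'A', 'B']})
bins = pd.cut(df['foo'], bins=[0, 1, 2, 3])

# One grouper: every bin appears, including the empty (1, 2] bin.
single = df.groupby(bins, observed=False).size()

# Two groupers: on affected versions the empty bin is missing here.
multi = df.groupby([bins, 'bar'], observed=False).size()

print(pd.__version__)
print(single)
print(multi)
```

Comparing `len(single)` and `len(multi)` across versions is a quick way to see whether the empty bin survives the second grouping.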
Coming back to this issue as I recently ran into another example in real life data. Another way of looking at the problem that occurred to me is as a difference between calling
df.groupby([pd.cut(df['foo'], bins=[0,1,2,3]), 'bar']).count()['foo']
# output
foo bar
(0, 1] A 1.0
B 1.0
(1, 2] A NaN
B NaN
(2, 3] A 1.0
B 1.0
Name: foo, dtype: float64
and taking the column first then calling
df.groupby([pd.cut(df['foo'], bins=[0,1,2,3]), 'bar'])['foo'].count()
# output
foo bar
(0, 1] A 1
B 1
(2, 3] A 1
B 1
Name: foo, dtype: int64
Which gives the output where the I notice that
[x for x,y in df.groupby([pd.cut(df['foo'], bins=[0,1,2,3]), 'bar'])]
# output
[(Interval(0, 1, closed='right'), 'A'),
(Interval(0, 1, closed='right'), 'B'),
(Interval(2, 3, closed='right'), 'A'),
(Interval(2, 3, closed='right'), 'B')]
[x for x,y in df.groupby([pd.cut(df['foo'], bins=[0,1,2,3]), 'bar'])['foo']]
# output
[(Interval(0, 1, closed='right'), 'A'),
(Interval(0, 1, closed='right'), 'B'),
(Interval(2, 3, closed='right'), 'A'),
(Interval(2, 3, closed='right'), 'B')]
i.e. missing the empty groups. For my use case, the ideal solution might be to have a |
This seems to have been fixed between 0.25.0 and 1.0.0. Not sure if it's appropriate to close this in case it gets regressed, but please do so if it is. |
master includes the (1,2] bin
can confirm (using code sample in #23865 (comment)) that this issue was resolved in #29690 |
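A quick way to check a given installation is to compare the two orderings discussed earlier in the thread: aggregating the frame and then selecting the column, versus selecting the column and then aggregating. The sketch below assumes nothing beyond the example data already posted; `frame_first`, `column_first` and `same_groups` are illustrative names, and `observed=False` is passed explicitly.

```python
import pandas as pd

df = pd.DataFrame({'foo': [1, 1, 3, 3], 'bar': ['A', 'B', 'A', 'B']})
bins = pd.cut(df['foo'], bins=[0, 1, 2, 3])
grouped = df.groupby([bins, 'bar'], observed=False)

# Aggregate the whole frame, then pick the column...
frame_first = grouped.count()['foo']
# ...versus pick the column, then aggregate.
column_first = grouped['foo'].count()

# If the behaviour is consistent, both results cover the same groups.
same_groups = len(frame_first) == len(column_first)
print(pd.__version__, same_groups)
```

On the versions described as affected above, the two results have different lengths, so `same_groups` would be `False`.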
take |
Not sure if this is a bug or expected behaviour, but it's something that catches me out constantly.
Problem description
I want to look at the distribution of my percentage columns, so I do groupby so that I can use sensible bins. This is my first groupby in the code example. The output is as expected:
Including the bins for which there were zero observations (50-60% and 80-90%).
Now I want to also group by place. This is my second groupby. Now my empty groups disappear:
When I unstack this to make a summary table the empty percentage bins are missing:
This will cause a problem when I go to plot this data, as the spacing on my 'percentage' axis will not be consistent. If I'm just counting the size of the bins then I can get round the problem by reindexing with the original categories:
but this seems like a complicated solution for what should be a relatively common problem. And if I want to do some other type of aggregation, e.g. calculate the mean number of cups of coffee for each group, I can't figure it out:
I have missing groups, but I can't fix it with stack/fillna/unstack as I don't want to fill in a value - I want to leave it as missing data, but still have the group appear.
Reading through previous issues, it sounds like I am describing this:
#8138
but the thread says it's fixed, so I can't figure out what is different in my situation.
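One possible workaround for aggregations such as the mean, where the empty groups should appear but stay missing, is to aggregate the observed groups and then reindex against every (bin, group) combination. The sketch below is not the code from this report: it reuses the small `foo`/`bar` frame from the comments, and the string bin `labels` are a hypothetical choice made only to keep the index easy to read.

```python
import pandas as pd

df = pd.DataFrame({'foo': [1, 1, 3, 3], 'bar': ['A', 'B', 'A', 'B']})
labels = ['0-1', '1-2', '2-3']  # hypothetical bin labels, not from the report
binned = pd.cut(df['foo'], bins=[0, 1, 2, 3], labels=labels)

# Aggregate only the groups that actually occur in the data.
means = df.groupby([binned, 'bar'], observed=True)['foo'].mean()

# Build every (bin, bar) combination and reindex against it.
# No fill_value is given, so the empty groups appear as NaN.
full_index = pd.MultiIndex.from_product(
    [labels, sorted(df['bar'].unique())], names=['foo', 'bar']
)
means_all = means.reindex(full_index)

print(means_all.unstack('bar'))
```

For a plain count, passing `fill_value=0` to `reindex` would give zeros instead of missing values.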
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.8.1
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None