
BUG: groupby.agg coercing booleans #14873


Closed
enfeizhan opened this issue Dec 13, 2016 · 5 comments · Fixed by #25327
Labels: Dtype Conversions, good first issue, Groupby, Needs Tests

@enfeizhan

Code Sample, a copy-pastable example if possible

import pandas as pd

dat = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [0, 1, 2, 3], 'c': [None, None, 1, 1]})

gp = dat.groupby('a')

case_one = gp['b'].aggregate(lambda x: (x != 0).all())

case_two = gp['c'].aggregate(lambda x: x.isnull().all())

Problem description

Comparing case_one and case_two shows that aggregate sometimes converts boolean results to the numbers 1.0 or 0.0 (dtype float64) instead of keeping them as booleans.

For case_one we get,

a
1    False
2     True
Name: b, dtype: bool

However for case_two we get

a
1    1.0
2    0.0
Name: c, dtype: float64

Expected Output

For case_two we should get

a
1     True
2    False
Name: c, dtype: bool
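
A possible workaround, not taken from this thread, is to cast the aggregated result explicitly. The sketch below reuses the example above and assumes that every group result is a plain True/False value, so that `astype(bool)` is safe. The name `case_two_bool` is introduced here for illustration only.

```python
import pandas as pd

dat = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [0, 1, 2, 3], 'c': [None, None, 1, 1]})
gp = dat.groupby('a')

# Column 'c' contains None, so pandas stores it as float64 (None becomes NaN).
print(dat.dtypes)

case_two = gp['c'].aggregate(lambda x: x.isnull().all())

# Explicit cast (illustrative workaround): converts 1.0/0.0 to True/False on
# affected versions and leaves an already-boolean result unchanged.
case_two_bool = case_two.astype(bool)
print(case_two_bool)
```

The cast only papers over the symptom. It would be wrong for an aggregation that can legitimately return non-boolean values.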

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

This same issue appears in a different environment as well.

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.1
setuptools: 23.0.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Dec 13, 2016

this is a duplicate of this: #14423

.groupby tries to cast back to the original types, which is generally the correct result. Since you are performing an arbitrary opaque function (this is by definition what a lambda is), it is not clear what the correct result should be.

I suppose we could be a little softer in the casting w.r.t. boolean results. (but of course someone will have an issue with that :)
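
To illustrate the effect described in this comment, here is a minimal sketch of what casting a boolean result back to the source column's dtype looks like. It does not reproduce the pandas internals. It only applies `astype` by hand, and the names `agg_result`, `original_dtype`, and `coerced` are made up for the illustration.

```python
import pandas as pd

# What the lambda returns per group, before any cast back.
agg_result = pd.Series([True, False], index=pd.Index([1, 2], name='a'), name='c')

# Column 'c' in the report is [None, None, 1, 1], which pandas stores as float64.
original_dtype = pd.Series([None, None, 1, 1]).dtype

# Casting the boolean result to the original dtype turns True/False into 1.0/0.0.
coerced = agg_result.astype(original_dtype)
print(coerced)
```

This matches the float64 output reported for case_two. Column 'b' in case_one holds integers rather than floats, and the report shows it coming back as bool.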

jreback closed this as completed Dec 13, 2016
jreback added the Dtype Conversions, Duplicate Report, and Groupby labels Dec 13, 2016
jreback added this to the No action milestone Dec 13, 2016
jreback modified the milestones: 0.20.0, No action Mar 14, 2017
jreback reopened this Mar 14, 2017
@jreback
Contributor

jreback commented Mar 14, 2017

this is independent of #14423

looks like _try_cast needs to respect boolean

jreback added the Difficulty Intermediate label and removed the Duplicate Report label Mar 14, 2017
jreback changed the title from "aggregate function turns boolean to integer randomly" to "BUG: groupb.agg coercing booleans" Mar 14, 2017
jreback changed the title from "BUG: groupb.agg coercing booleans" to "BUG: groupby.agg coercing booleans" Mar 14, 2017
jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@mroeschke
Member

This looks fixed. Could use a test.

In [12]: case_two
Out[12]:
a
1     True
2    False
Name: c, dtype: bool

In [13]: pd.__version__
Out[13]: '0.25.0.dev0+50.gf75a220ff'
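
A regression test for this case could look roughly like the sketch below. It is not the test that was actually added to pandas. The function name `test_agg_lambda_keeps_bool_dtype` is hypothetical, and the test is expected to pass only on versions that contain the fix.

```python
import pandas as pd
import pandas.testing as tm


def test_agg_lambda_keeps_bool_dtype():
    # Hypothetical test name; data taken from the original report.
    dat = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [0, 1, 2, 3], 'c': [None, None, 1, 1]})

    result = dat.groupby('a')['c'].aggregate(lambda x: x.isnull().all())

    # The lambda returns booleans, so the result should stay bool
    # rather than being cast back to the float64 dtype of column 'c'.
    expected = pd.Series([True, False], index=pd.Index([1, 2], name='a'), name='c')
    tm.assert_series_equal(result, expected)


# Run directly when not using a test runner such as pytest.
test_agg_lambda_keeps_bool_dtype()
```

`assert_series_equal` compares dtype as well as values, so a float64 result of 1.0/0.0 would fail the check.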

mroeschke added the good first issue and Needs Tests labels Feb 3, 2019
@TrigonaMinima

@mroeschke is this issue really fixed? In the #21240 issue referenced just above, @jreback says this issue is the root cause of that issue. And when I try to run the code part from that issue, the problem still seems to be there.

@mroeschke
Member

#21240 isn't fixed but this specific case is fixed

jreback modified the milestones: Contributions Welcome, 0.25.0 Feb 16, 2019

4 participants