BUG: grouped.last() will sometimes turn a boolean column into Int64 #46409

Closed
2 of 3 tasks
dalejung opened this issue Mar 18, 2022 · 5 comments · Fixed by #47736
Assignees
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@dalejung
Contributor

dalejung commented Mar 18, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(
    {
        'id': [1, 2, 3, 4],
        'test': [True, pd.NA, pd.NA, False]
    }
).convert_dtypes()

grouped = df.groupby('id')
bad = grouped.last()
assert bad.test.dtype == pd.BooleanDtype() # fails

Issue Description

On the latest master this returns an Int64 column.

    test
id
1      1
2   <NA>
3   <NA>
4      0

I checked 1.4.0 and it properly returns the boolean dtype.

     test
id
1    True
2    <NA>
3    <NA>
4   False

Oddly, changing the value for id 3 to True or False gives the proper dtype.

Expected Behavior

Retain the boolean dtype from df.test.

Installed Versions

INSTALLED VERSIONS

commit : 663147e
python : 3.10.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.16.12-arch1-1
Version : #1 SMP PREEMPT Wed, 02 Mar 2022 12:22:51 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0.dev0+545.g663147edd3
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.5.3
Cython : 0.29.28
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.0.1
matplotlib : None
numba : 0.55.1
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 8.0.0.dev230+gb2ae3d74d
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 2.0.0b1
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None

@dalejung dalejung added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 18, 2022
@ivan-ngchakming
Contributor

Did some digging and I believe this is why this is happening:

The groupby.last() call will execute these lines to get the result

result = maybe_fill(np.empty(out_shape, dtype=out_dtype))
if self.kind == "aggregate":
    counts = np.zeros(ngroups, dtype=np.int64)
    if self.how in ["min", "max", "mean", "last", "first"]:
        func(
            out=result,
            counts=counts,
            values=values,
            labels=comp_ids,
            min_count=min_count,
            mask=mask,
            result_mask=result_mask,
            is_datetimelike=is_datetimelike,
        )

Running the reproducible example gives these runtime values for each argument:

# pandas.core.groupby.ops.WrappedCythonOp._call_cython_op() line 523
# func is group_last attr of pandas._libs.groupby.libgroupby

func(
    out=result,                         # array([[0],[1],[2],[3]], dtype=int64)
    counts=counts,                      # array([0, 0, 0, 0], dtype=int64)
    values=values,                      # array([[1],[0],[0],[0]], dtype=int64)
    labels=comp_ids,                    # array([0, 1, 2, 3], dtype=int64)
    min_count=min_count,                # -1
    mask=mask,                          # array([[False],[True],[True],[False]])
    result_mask=result_mask,            # array([[False],[False],[False],[False]])
    is_datetimelike=is_datetimelike,    # False
)
>>> array([[1],[1],[2],[0]], dtype=int64)

This creates an issue because pandas then fails to convert the result back to the original bool type: the np.allclose(new_result, result, rtol=0) check in maybe_downcast_numeric() does not pass.

i.e. [[1],[1],[2],[0]] != [[True],[True],[True],[False]]

op_result = maybe_downcast_to_dtype(result, res_dtype)

converted = maybe_downcast_numeric(result, dtype, do_round)

if np.allclose(new_result, result, rtol=0):
    return new_result
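The failing round trip can be reproduced with NumPy alone. This is a minimal sketch that mirrors only the np.allclose(new_result, result, rtol=0) comparison quoted above; the helper name survives_bool_round_trip is hypothetical and is not part of pandas.

```python
import numpy as np

# The two int64 results discussed in this comment.
garbage = np.array([[1], [1], [2], [0]], dtype=np.int64)  # leftover 2 in group 2
clean = np.array([[1], [1], [1], [0]], dtype=np.int64)    # happens to look boolean


def survives_bool_round_trip(result):
    """Hypothetical helper: cast to bool and compare with the original values."""
    new_result = result.astype(bool)
    return bool(np.allclose(new_result, result, rtol=0))


print(survives_bool_round_trip(garbage))  # False: True (1) is not close to 2
print(survives_bool_round_trip(clean))    # True: every entry is already 0 or 1
```

Any leftover value other than 0 or 1 in an unwritten slot is enough to make the comparison fail, so the downcast to bool is skipped.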

Why Does This Only Happen Sometimes?

This only happens sometimes because the result variable is initialized using np.empty,
which returns a new array of the given shape and type without initializing its entries.

result = maybe_fill(np.empty(out_shape, dtype=out_dtype))

For example, if result happens to be initialized as array([[1],[1],[1],[0]], dtype=int64), the check passes:

# pandas.core.groupby.ops.WrappedCythonOp._call_cython_op() line 523
func(
    out=result,                         # array([[1],[1],[1],[0]], dtype=int64)  <- this is different
    counts=counts,                      # array([0, 0, 0, 0], dtype=int64)
    values=values,                      # array([[1],[0],[0],[0]], dtype=int64)
    labels=comp_ids,                    # array([0, 1, 2, 3], dtype=int64)
    min_count=min_count,                # -1
    mask=mask,                          # array([[False],[True],[True],[False]])
    result_mask=result_mask,            # array([[False],[False],[False],[False]])
    is_datetimelike=is_datetimelike,    # False
)
>>> array([[1],[1],[1],[0]], dtype=int64)  # this can pass the `np.allclose()` check
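The two traces above can be modelled in plain Python. The function below is a simplified, hypothetical stand-in for the Cython group_last kernel (it is not the pandas implementation): it assumes only that masked values are skipped and that groups with no unmasked value are never written, which is consistent with the inputs and outputs shown in the traces. The starting arrays stand in for whatever np.empty happened to return.

```python
import numpy as np


def group_last_sketch(out, values, labels, mask):
    """Simplified model: keep the last unmasked value per group.

    Groups whose values are all masked are never written, so whatever
    was already in `out` for those groups is left in place.
    """
    for i, lab in enumerate(labels):
        if not mask[i, 0]:
            out[lab, 0] = values[i, 0]
    return out


values = np.array([[1], [0], [0], [0]], dtype=np.int64)
labels = np.array([0, 1, 2, 3], dtype=np.int64)
mask = np.array([[False], [True], [True], [False]])

# Stand-ins for two possible contents of the uninitialized buffer.
first_start = np.array([[0], [1], [2], [3]], dtype=np.int64)
second_start = np.array([[1], [1], [1], [0]], dtype=np.int64)

first_out = group_last_sketch(first_start.copy(), values, labels, mask)
second_out = group_last_sketch(second_start.copy(), values, labels, mask)

print(first_out.ravel().tolist())   # [1, 1, 2, 0]
print(second_out.ravel().tolist())  # [1, 1, 1, 0]
```

Only groups 1 and 2 (the all-NA groups) keep their starting contents, which is why the final dtype depends on what was in memory beforehand.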

@sydneyeh

sydneyeh commented Apr 5, 2022

take

@ayeshaam

ayeshaam commented Apr 5, 2022

take

@sydneyeh

sydneyeh commented Apr 5, 2022

@dalejung @ivan-ngchakming Hi! I have just claimed this project and was wondering what files are used to test the project? Thank you in advance!

sydneyeh added a commit to sydneyeh/pandas that referenced this issue Apr 15, 2022
… file added in pandas/tests/groupby/test_groupby_last.py, issue pandas-dev#46409
@mroeschke
Member

Looks like this works on main. Could use a test

In [6]: import pandas as pd
   ...:
   ...: df = pd.DataFrame(
   ...:     {
   ...:         'id': [1, 2, 3, 4],
   ...:         'test': [True, pd.NA, pd.NA, False]
   ...:     }
   ...: ).convert_dtypes()
   ...:
   ...: grouped = df.groupby('id')
   ...: bad = grouped.last()

In [7]: bad
Out[7]:
     test
id
1    True
2    <NA>
3    <NA>
4   False

In [8]: pd.__version__
Out[8]: '1.5.0.dev0+1088.g37e6239f61'

In [9]: bad.dtypes
Out[9]:
test    boolean
dtype: object
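A regression test for this could look roughly like the sketch below. It reuses the reproducible example from the report; the test name is hypothetical and this is not necessarily the test that was eventually merged. It assumes a pandas version where the behaviour is fixed, and it compares only the test column so that it does not depend on the dtype of the resulting index.

```python
import pandas as pd
import pandas.testing as tm


def test_groupby_last_keeps_boolean_dtype():
    # Same frame as in the original report.
    df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4],
            "test": [True, pd.NA, pd.NA, False],
        }
    ).convert_dtypes()

    result = df.groupby("id").last()

    # The nullable boolean dtype should be retained, including the NA groups.
    expected = pd.array([True, None, None, False], dtype="boolean")
    assert result["test"].dtype == pd.BooleanDtype()
    tm.assert_extension_array_equal(result["test"].array, expected)


test_groupby_last_keeps_boolean_dtype()
```

Because the original failure depended on uninitialized memory, a single passing run on an affected version would not prove much; the test is mainly useful as a guard against the dtype changing again.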

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 6, 2022
mroeschke pushed a commit that referenced this issue Jul 18, 2022
…46409) (#47736)

* TST: add test for last method on dataframe grouped by on boolean column (#46409)

* BUG: PeriodIndex fails to handle NA, rather than putting NaT in its place (#46673)
