BUG: grouped.last() will sometimes turn a boolean column into Int64 #46409
Comments
I did some digging and I believe this is why it happens. The relevant code is pandas/pandas/core/groupby/ops.py, lines 519 to 532 in 6033ed4.
Running the reproducible example gives the following runtime values for each argument:
# pandas.core.groupby.ops.WrappedCythonOp._call_cython_op() line 523
# func is group_last attr of pandas._libs.groupby.libgroupby
func(
out=result, # array([[0],[1],[2],[3]], dtype=int64)
counts=counts, # array([0, 0, 0, 0], dtype=int64)
values=values, # array([[1],[0],[0],[0]], dtype=int64)
labels=comp_ids, # array([0, 1, 2, 3], dtype=int64)
min_count=min_count, # -1
mask=mask, # array([[False],[True],[True],[False]])
result_mask=result_mask, # array([[False],[False],[False],[False]])
is_datetimelike=is_datetimelike, # False
)
>>> array([[1],[1],[2],[0]], dtype=int64)
This creates an issue because pandas will fail to convert the result back to the original type; see:
pandas/pandas/core/groupby/ops.py Line 589 in 6033ed4
pandas/pandas/core/dtypes/cast.py Line 284 in 6033ed4
pandas/pandas/core/dtypes/cast.py Lines 362 to 363 in 6033ed4
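To illustrate why the cast back to boolean can fail, here is a minimal pure-Python sketch. It assumes the check referenced above is a round-trip comparison (cast to bool, cast back, compare with the original), which is how the `np.allclose()` check is described in this thread. The helper name `can_cast_back_to_bool` is hypothetical and is not a pandas function.

```python
def can_cast_back_to_bool(int_values):
    """Return True if casting to bool and back reproduces every value.

    Hypothetical stand-in for the round-trip check discussed above;
    this is not the actual pandas implementation.
    """
    round_tripped = [int(bool(v)) for v in int_values]
    return round_tripped == list(int_values)


# The result reported above contains a stray 2, so the round trip changes it.
print(can_cast_back_to_bool([1, 1, 2, 0]))  # False

# A result containing only 0 and 1 survives the round trip.
print(can_cast_back_to_bool([1, 1, 1, 0]))  # True
```

Under this assumption, any leftover value other than 0 or 1 in the result is enough to block the conversion back to the boolean dtype.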
Why Does This Only Happen Sometimes?
This only happens sometimes because the result variable is initialized using np.empty, which leaves its initial contents arbitrary:
pandas/pandas/core/groupby/ops.py Line 519 in 6033ed4
For example, suppose result happens to be initialized differently:
# pandas.core.groupby.ops.WrappedCythonOp._call_cython_op() line 523
func(
out=result, # array([[1],[1],[1],[0]], dtype=int64) <- this is different
counts=counts, # array([0, 0, 0, 0], dtype=int64)
values=values, # array([[1],[0],[0],[0]], dtype=int64)
labels=comp_ids, # array([0, 1, 2, 3], dtype=int64)
min_count=min_count, # -1
mask=mask, # array([[False],[True],[True],[False]])
result_mask=result_mask, # array([[False],[False],[False],[False]])
is_datetimelike=is_datetimelike, # False
)
>>> array([[1],[1],[1],[0]], dtype=int64) # this can pass the `np.allclose()` check
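The two runs above can be reproduced with a simplified pure-Python model. This is a sketch under stated assumptions, not the actual Cython kernel: it assumes the kernel writes the last unmasked value for each group into the output buffer and leaves the slot untouched when every value in the group is masked, which is what the reported inputs and outputs suggest. The function name `group_last_model` is hypothetical.

```python
def group_last_model(out, values, labels, mask):
    """Simplified 1-D stand-in for a 'last value per group' kernel.

    Assumption: slots for groups whose values are all masked are never
    written, so they keep whatever the output buffer held beforehand.
    """
    result = list(out)  # copy so the caller's buffer is left alone
    for value, label, is_masked in zip(values, labels, mask):
        if not is_masked:
            result[label] = value  # later rows overwrite earlier ones
    return result


values = [1, 0, 0, 0]
labels = [0, 1, 2, 3]
mask = [False, True, True, False]

# First run: the buffer happened to hold 0, 1, 2, 3.
print(group_last_model([0, 1, 2, 3], values, labels, mask))  # [1, 1, 2, 0]

# Second run: the buffer happened to hold 1, 1, 1, 0.
print(group_last_model([1, 1, 1, 0], values, labels, mask))  # [1, 1, 1, 0]
```

In both runs the slots for groups 1 and 2 are simply whatever the buffer contained before the call, which is why the outcome depends on the uninitialized memory.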
take
take
@dalejung @ivan-ngchakming Hi! I have just claimed this issue and was wondering which files are used to test the project. Thank you in advance!
… file added in pandas/tests/groupby/test_groupby_last.py, issue pandas-dev#46409
Looks like this works on main. Could use a test
…46409) (#47736)
* TST: add test for last method on dataframe grouped by on boolean column (#46409)
* BUG: PeriodIndex fails to handle NA, rather than putting NaT in its place (#46673)
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
On the latest master this returns an Int64 column.
I checked 1.4.0 and it properly returns the boolean dtype.
What is weird is that changing 3. to True/False will give the proper dtype.
Expected Behavior
Retain the boolean dtype from df.test.
Installed Versions
INSTALLED VERSIONS
commit : 663147e
python : 3.10.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.16.12-arch1-1
Version : #1 SMP PREEMPT Wed, 02 Mar 2022 12:22:51 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.0.dev0+545.g663147edd3
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.5.3
Cython : 0.29.28
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.0.1
matplotlib : None
numba : 0.55.1
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 8.0.0.dev230+gb2ae3d74d
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 2.0.0b1
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None