-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: GroupBy.ngroup dropna=False inconsistency when using Categorical #50100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
take |
I noticed you didn't mention whether this issue exists on the main branch - I was able to replicate this issue on the main branch |
In the code in the OP, there are three cases presented. Case 3 appears to me to clearly be a bug. For Case 2, this was changed in #46367. It was done to be consistent with other transforming ops, e.g.
In general, this comes from (a) dropping the null groups, (b) computing the result without nulls, and then (c) aligning the transform result with the input's index. On this last step, alignment, any missing values are filled in with null. For ngroup however, we make an explicit replacement instead of aligning. I agree coercing to float is undesirable, and that there is an argument that -1 is the correct result (the ngroup in general is the code, and null groups are coded as -1), however it seems to me it disagrees with how the user should reason about transforms with This is tangential, but my preferred solution is to remove |
…andas-dev#50100 Wrote test case explicitly exposing issue where the result is nan
…andas-dev#50100 Added explicit nan check for test case to make it more readable on failure and implemented the fix.
…andas-dev#50100 added style fixes for previous commit
Hello! About
I don´t now you guys, but every group by I made I usually set "dropna = False" and "observed = True". I was even searching if therer is a way to change de default argument here. |
There are safe bits of categoricals and less safe bits :-). It used to be much worse, but I think nowadays most Pandas functions don't go horribly wrong when you use categoricals. ngroup is definitely a bit of an obscure function so it's not surprising that it has issues. Using an object dtype instead is definitely going to be safer though, so if you aren't working with huge data and can afford a bit of memory bloat, I would recommend you do that. (I agree with you that the default |
Ref: #43999 |
moving off the 2.0 milestone |
The 3rd case seems to return 0 now. I suppose could use a test
|
Hey @mroeschke, I'm new contributing to Pandas! The reported bug seems to be fixed, right? We just need tests to ensure that now? |
Also, at least now the documentation says:
so that the behavior in case 2 of the OP is correct according to the docs. |
Take |
I've linked a pull-request that adds this test case in |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Probably related to the recent categorical/groupby changes e.g. #48702. I'm raising this as a separate issue because I don't see any ticket specifically discussing the impact on
ngroup
.Expected Behavior
with
dropna=False
groups containing missing values should get their ownngroup
number. Thengroup
should never be a float i.e. the group number may contain -1 if dropna=True and any group contains a missing value.Installed Versions
pandas : 1.5.2
numpy : 1.23.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None
/home/mboling/opt/conda/envs/pandas1.5.2/lib/python3.10/site-packages/_distutils_hack/init.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
The text was updated successfully, but these errors were encountered: