BUG: GroupBy.ngroup dropna=False inconsistency when using Categorical #50100

batterseapower · 2022-12-07T08:46:01Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

# This returns 0 on any Pandas version. That's as expected.
pd.DataFrame.from_dict({'a': [np.nan], 'b': [1]}).groupby(['a', 'b'], dropna=False).ngroup().iloc[0]

# These return -1 on Pandas 1.4.4, and nan in Pandas 1.5.0 or 1.5.2
# From the docs I'm not sure what is meant to happen. Arguably returning -1 is the right behaviour (since in particular it avoids casting the group numbers to float), so this is a bug.
pd.DataFrame.from_dict({'a': [np.nan], 'b': [1]}).groupby(['a', 'b']).ngroup().iloc[0]
pd.DataFrame.from_dict({'a': pd.Categorical([np.nan]), 'b': [1]}).groupby(['a', 'b']).ngroup().iloc[0]

# This returns -1 on Pandas 1.4.4, 0 on Pandas 1.5.0 and nan on 1.5.2
# I'm pretty sure that the correct answer is 0 (for consistency with the float case above)
pd.DataFrame.from_dict({'a': pd.Categorical([np.nan]), 'b': [1]}).groupby(['a', 'b'], dropna=False).ngroup().iloc[0]
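To compare the cases side by side across pandas versions, the calls above could be wrapped in a small script like the following. This is only a sketch: the helper name `first_ngroup` and the result labels are made up for illustration, and `observed=False` is passed explicitly for the categorical cases on the assumption that it matches the default of the versions discussed here. No expected values are shown for the later cases because they differ between releases.

```python
import numpy as np
import pandas as pd


def first_ngroup(key_values, **groupby_kwargs):
    """Hypothetical helper: ngroup label of the only row in a one-row frame."""
    df = pd.DataFrame({"a": key_values, "b": [1]})
    return df.groupby(["a", "b"], **groupby_kwargs).ngroup().iloc[0]


# observed=False is spelled out for the categorical keys (assumption: this
# was the default on the versions discussed in this issue).
results = {
    "float key, dropna=False": first_ngroup([np.nan], dropna=False),
    "float key, dropna=True": first_ngroup([np.nan]),
    "categorical key, dropna=True": first_ngroup(
        pd.Categorical([np.nan]), observed=False
    ),
    "categorical key, dropna=False": first_ngroup(
        pd.Categorical([np.nan]), dropna=False, observed=False
    ),
}

print("pandas", pd.__version__)
for label, value in results.items():
    print(f"{label}: {value!r}")
```

Running it under each installed version makes the differences described in the comments above easy to see in one place.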

Issue Description

Probably related to the recent categorical/groupby changes e.g. #48702. I'm raising this as a separate issue because I don't see any ticket specifically discussing the impact on ngroup.

Expected Behavior

With dropna=False, groups containing missing values should get their own ngroup number. The ngroup should never be a float, i.e. the group number may be -1 if dropna=True and any group contains a missing value.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 8dab54d
python : 3.10.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-348.20.1.el8_5.x86_64
Version : #1 SMP Thu Mar 10 20:59:28 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.5.2
numpy : 1.23.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

/home/mboling/opt/conda/envs/pandas1.5.2/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")


codestorm177 · 2022-12-07T18:54:30Z

take

codestorm177 · 2022-12-07T19:23:00Z

I noticed you didn't mention whether this issue exists on the main branch - I was able to replicate this issue on the main branch

rhshadrach · 2022-12-08T04:37:13Z

In the code in the OP, there are three cases presented. Case 3 appears to me to clearly be a bug.

For Case 2, this was changed in #46367. It was done to be consistent with other transforming ops, e.g.

df = pd.DataFrame.from_dict({'a': [1, 1, 2, np.nan, 3], 'b': range(5)})
gb = df.groupby('a', dropna=True)
result = gb.cumsum()
print(result)
#      b
# 0  0.0
# 1  1.0
# 2  2.0
# 3  NaN
# 4  4.0

result = gb.transform('mean')
print(result)
#      b
# 0  0.5
# 1  0.5
# 2  2.0
# 3  NaN
# 4  4.0

In general, this comes from (a) dropping the null groups, (b) computing the result without nulls, and then (c) aligning the transform result with the input's index. On this last step, alignment, any missing values are filled in with null.
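The three steps can be reproduced by hand. The following is a rough sketch of the idea using the same frame as above, not the actual pandas internals; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, np.nan, 3], "b": range(5)})

# (a) drop the rows whose group key is null
kept = df[df["a"].notna()]

# (b) compute the transform on the remaining rows only
partial = kept.groupby("a")["b"].cumsum()

# (c) align with the input's index; row 3 has no value, so it is filled with NaN
aligned = partial.reindex(df.index)

print(partial.dtype)  # still an integer dtype, since no nulls are present yet
print(aligned.dtype)  # float64, because the NaN forces an upcast
print(aligned)
```

The upcast in step (c) is the same float coercion that shows up in the cumsum and transform outputs above.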

For ngroup, however, we make an explicit replacement instead of aligning. I agree that coercing to float is undesirable, and that there is an argument that -1 is the correct result (the ngroup is in general the code, and null groups are coded as -1). However, it seems to me that this disagrees with how the user should reason about how transforms with dropna=True work.
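To illustrate what "null groups are coded as -1" means, here is a small sketch using `pd.factorize` and `Categorical.codes`. It shows the coding convention only and is not the code path that ngroup itself takes.

```python
import numpy as np
import pandas as pd

keys = pd.Series([1, 1, 2, np.nan, 3])

# factorize assigns an integer code to each distinct value; nulls get -1
codes, uniques = pd.factorize(keys)
print(codes)  # [ 0  0  1 -1  2]

# a Categorical stores missing values with the same sentinel
print(pd.Categorical([np.nan]).codes)  # [-1]

# replacing the -1 sentinel with NaN is what forces the float dtype
as_int = pd.Series(codes)
as_float = as_int.where(as_int != -1)
print(as_int.dtype, as_float.dtype)
```

Keeping -1 preserves the integer dtype, while replacing it with NaN matches the other transforms at the cost of a float result.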

This is tangential, but my preferred solution is to remove dropna entirely from groupby (so it always behaves as dropna=False), but this would be much farther out if it is pursued at all; I still think we should come to a resolution here.

…andas-dev#50100 Wrote test case explicitly exposing issue where the result is nan

…andas-dev#50100 Added explicit nan check for test case to make it more readable on failure and implemented the fix.

…andas-dev#50100 added style fixes for previous commit

frbelotto · 2023-02-07T22:32:47Z

Hello!
I've just run into this bug!
My question is: is it "dangerous" to use categorical types? I was lucky to spot the failure before sending an important report, so now I am asking myself whether using categoricals is a safe option. For now, I am only using them as a simple way to reduce the DataFrame size.

About

This is tangential, but my preferred solution is to remove dropna entirely from groupby (so it always behaves as dropna=False), but this would be much farther out if it is pursued at all; I still think we should come to a resolution here.

I don't know about you guys, but in every groupby I write I usually set dropna=False and observed=True. I was even searching for a way to change the default arguments here.
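As far as I know, pandas has no global option for changing these defaults, so one workaround is a thin wrapper. The sketch below assumes a hypothetical helper named `groupby_keep_na`; it is not part of pandas, and callers can still override either argument.

```python
import numpy as np
import pandas as pd


def groupby_keep_na(obj, by, **kwargs):
    """Hypothetical helper: groupby with dropna=False and observed=True by default."""
    kwargs.setdefault("dropna", False)
    kwargs.setdefault("observed", True)
    return obj.groupby(by, **kwargs)


df = pd.DataFrame({"a": [1.0, np.nan, 1.0], "b": [10, 20, 30]})

# The NaN key gets its own group instead of being dropped
sizes = groupby_keep_na(df, "a").size()
print(sizes)
```

Calling `groupby_keep_na(df, "a", dropna=True)` falls back to the usual pandas behaviour for that call.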

batterseapower · 2023-02-08T02:58:24Z

There are safe bits of categoricals and less safe bits :-). It used to be much worse, but I think nowadays most Pandas functions don't go horribly wrong when you use categoricals. ngroup is definitely a bit of an obscure function so it's not surprising that it has issues. Using an object dtype instead is definitely going to be safer though, so if you aren't working with huge data and can afford a bit of memory bloat, I would recommend you do that.

(I agree with you that the default observed=False behaviour is probably the biggest footgun in the Pandas categorical support today.)

rhshadrach · 2023-02-08T03:33:48Z

Ref: #43999

MarcoGorelli · 2023-03-27T18:41:13Z

moving off the 2.0 milestone

mroeschke · 2023-08-22T23:41:37Z

The 3rd case seems to return 0 now. I suppose this could use a test

In [4]: pd.DataFrame.from_dict({'a': pd.Categorical([np.nan]), 'b': [1]}).groupby(['a', 'b'], dropna=False, observed=False).ngroup().iloc[0]
Out[4]: 0

josemayer · 2023-08-27T04:38:09Z

Hey @mroeschke, I'm new to contributing to pandas!

The reported bug seems to be fixed, right? We just need tests to ensure that now?

rhshadrach · 2023-08-29T05:44:56Z

Also, at least now the documentation says:

Groups with missing keys (where pd.isna() is True) will be labeled with NaN and will be skipped from the count.

so that the behavior in case 2 of the OP is correct according to the docs.

josemayer · 2023-09-02T14:31:01Z

Take

josemayer · 2023-09-05T15:52:02Z

The 3rd case seems to return 0 now. I suppose this could use a test

In [4]: pd.DataFrame.from_dict({'a': pd.Categorical([np.nan]), 'b': [1]}).groupby(['a', 'b'], dropna=False, observed=False).ngroup().iloc[0]
Out[4]: 0

I've linked a pull request that adds this test case in pandas/tests/groupby/test_groupby.py (#54966). This is my first pull request to pandas. Can any maintainer review it?
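For readers who want to see roughly what such a regression test looks like, here is a sketch. It is not the test merged in #54966; the function name is hypothetical, and it assumes a pandas version where the 3rd case returns 0, as reported above.

```python
import numpy as np
import pandas as pd


def test_ngroup_categorical_nan_dropna_false():
    # Hypothetical test name; the frame mirrors the snippet quoted above.
    df = pd.DataFrame({"a": pd.Categorical([np.nan]), "b": [1]})
    result = df.groupby(["a", "b"], dropna=False, observed=False).ngroup()

    # With dropna=False the NaN key forms its own group, numbered 0,
    # and the result keeps an integer dtype.
    expected = pd.Series([0])
    pd.testing.assert_series_equal(result, expected)
```

Because `assert_series_equal` also compares dtypes, the test fails on versions where the result is coerced to float.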

batterseapower added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Dec 7, 2022

github-actions bot assigned codestorm177 Dec 7, 2022

rhshadrach added the Groupby, Regression (functionality that used to work in a prior pandas version) and Needs Discussion (requires discussion from core team before further action) labels and removed the Needs Triage label Dec 8, 2022

rhshadrach added this to the 1.5.3 milestone Dec 8, 2022

darren4 pushed a commit to codestorm177/panda-dev-local that referenced this issue Dec 11, 2022

BUG: GroupBy.ngroup dropna=False inconsistency when using Categorical p…

b4a7947

…andas-dev#50100 Wrote test case explicitly exposing issue where the result is nan

darren4 pushed a commit to codestorm177/panda-dev-local that referenced this issue Dec 11, 2022

BUG: GroupBy.ngroup dropna=False inconsistency when using Categorical p…

965bf08

…andas-dev#50100 added style fixes for previous commit

datapythonista modified the milestones: 1.5.3, 1.5.4 Jan 18, 2023

datapythonista modified the milestones: 1.5.4, 2.0 Feb 27, 2023

MarcoGorelli modified the milestones: 2.0, 2.1 Mar 27, 2023

mroeschke added the good first issue and Needs Tests (unit test(s) needed to prevent regressions) labels and removed the Bug, Regression and Needs Discussion labels Aug 22, 2023

mroeschke removed this from the 2.1 milestone Aug 22, 2023

github-actions bot assigned josemayer Sep 2, 2023

josemayer mentioned this issue Sep 2, 2023

TST: add test case of ngroup with NaN value #54966

Merged

4 tasks

mroeschke closed this as completed in #54966 Sep 6, 2023
