Skip to content

BUG: GroupBy.ngroup dropna=False inconsistency when using Categorical #50100

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
batterseapower opened this issue Dec 7, 2022 · 12 comments · Fixed by #54966
Closed
2 of 3 tasks

BUG: GroupBy.ngroup dropna=False inconsistency when using Categorical #50100

batterseapower opened this issue Dec 7, 2022 · 12 comments · Fixed by #54966
Assignees
Labels
good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions

Comments

@batterseapower
Copy link
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# This returns 0 on any Pandas version. That's as expected.
pd.DataFrame.from_dict({'a': [np.nan], 'b': [1]}).groupby(['a', 'b'], dropna=False).ngroup().iloc[0]

# These return -1 on Pandas 1.4.4, and nan in Pandas 1.5.0 or 1.5.2
# From the docs I'm not sure what is meant to happen. Arguably returning -1 is the right behaviour (since in particular it avoids casting the group numbers to float), so this is a bug.
pd.DataFrame.from_dict({'a': [np.nan], 'b': [1]}).groupby(['a', 'b']).ngroup().iloc[0]
pd.DataFrame.from_dict({'a': pd.Categorical([np.nan]), 'b': [1]}).groupby(['a', 'b']).ngroup().iloc[0]

# This returns -1 on Pandas 1.4.4, 0 on Pandas 1.5.0 and nan on 1.5.2
# I'm pretty sure that the correct answer is 0 (for consistency with the float case above)
pd.DataFrame.from_dict({'a': pd.Categorical([np.nan]), 'b': [1]}).groupby(['a', 'b'], dropna=False).ngroup().iloc[0]

Issue Description

Probably related to the recent categorical/groupby changes e.g. #48702. I'm raising this as a separate issue because I don't see any ticket specifically discussing the impact on ngroup.

Expected Behavior

with dropna=False groups containing missing values should get their own ngroup number. The ngroup should never be a float i.e. the group number may contain -1 if dropna=True and any group contains a missing value.

Installed Versions

INSTALLED VERSIONS ------------------ commit : 8dab54d python : 3.10.8.final.0 python-bits : 64 OS : Linux OS-release : 4.18.0-348.20.1.el8_5.x86_64 Version : #1 SMP Thu Mar 10 20:59:28 UTC 2022 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_GB.UTF-8 LOCALE : en_GB.UTF-8

pandas : 1.5.2
numpy : 1.23.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

/home/mboling/opt/conda/envs/pandas1.5.2/lib/python3.10/site-packages/_distutils_hack/init.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

@batterseapower batterseapower added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 7, 2022
@codestorm177
Copy link

take

@codestorm177
Copy link

I noticed you didn't mention whether this issue exists on the main branch - I was able to replicate this issue on the main branch

@rhshadrach
Copy link
Member

In the code in the OP, there are three cases presented. Case 3 appears to me to clearly be a bug.

For Case 2, this was changed in #46367. It was done to be consistent with other transforming ops, e.g.

df = pd.DataFrame.from_dict({'a': [1, 1, 2, np.nan, 3], 'b': range(5)})
gb = df.groupby('a', dropna=True)
result = gb.cumsum()
print(result)
#      b
# 0  0.0
# 1  1.0
# 2  2.0
# 3  NaN
# 4  4.0

result = gb.transform('mean')
print(result)
#      b
# 0  0.5
# 1  0.5
# 2  2.0
# 3  NaN
# 4  4.0

In general, this comes from (a) dropping the null groups, (b) computing the result without nulls, and then (c) aligning the transform result with the input's index. On this last step, alignment, any missing values are filled in with null.

For ngroup however, we make an explicit replacement instead of aligning. I agree coercing to float is undesirable, and that there is an argument that -1 is the correct result (the ngroup in general is the code, and null groups are coded as -1), however it seems to me it disagrees with how the user should reason about transforms with dropna=True works.

This is tangential, but my preferred solution is to remove dropna entirely from groupby (so it always behaves as dropna=False), but this would be much farther out if it is pursued at all; I still think we should come to a resolution here.

@rhshadrach rhshadrach added Groupby Regression Functionality that used to work in a prior pandas version Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 8, 2022
@rhshadrach rhshadrach added this to the 1.5.3 milestone Dec 8, 2022
darren4 pushed a commit to codestorm177/panda-dev-local that referenced this issue Dec 11, 2022
…andas-dev#50100

Wrote test case explicitly exposing issue where the result is nan
darren4 pushed a commit to codestorm177/panda-dev-local that referenced this issue Dec 11, 2022
…andas-dev#50100

Added explicit nan check for test case to make it more readable on failure and implemented the fix.
darren4 pushed a commit to codestorm177/panda-dev-local that referenced this issue Dec 11, 2022
@datapythonista datapythonista modified the milestones: 1.5.3, 1.5.4 Jan 18, 2023
@frbelotto
Copy link

Hello!
I´ve been just facing this bug!
My question is. Is "dangerous" to use categorical types? I was luck to see the failure before sending an important report, so I am now asking myself if using categorical is a safe option. For now, I am only using it as a simple way to reduce the DF size.

About

This is tangential, but my preferred solution is to remove dropna entirely from groupby (so it always behaves as dropna=False), but this would be much farther out if it is pursued at all; I still think we should come to a resolution here.

I don´t now you guys, but every group by I made I usually set "dropna = False" and "observed = True". I was even searching if therer is a way to change de default argument here.

@batterseapower
Copy link
Contributor Author

There are safe bits of categoricals and less safe bits :-). It used to be much worse, but I think nowadays most Pandas functions don't go horribly wrong when you use categoricals. ngroup is definitely a bit of an obscure function so it's not surprising that it has issues. Using an object dtype instead is definitely going to be safer though, so if you aren't working with huge data and can afford a bit of memory bloat, I would recommend you do that.

(I agree with you that the default observed=False behaviour is probably the biggest footgun in the Pandas categorical support today.)

@rhshadrach
Copy link
Member

Ref: #43999

@datapythonista datapythonista modified the milestones: 1.5.4, 2.0 Feb 27, 2023
@MarcoGorelli
Copy link
Member

moving off the 2.0 milestone

@MarcoGorelli MarcoGorelli modified the milestones: 2.0, 2.1 Mar 27, 2023
@mroeschke
Copy link
Member

The 3rd case seems to return 0 now. I suppose could use a test

In [4]: pd.DataFrame.from_dict({'a': pd.Categorical([np.nan]), 'b': [1]}).groupby(['a', 'b'], dropna=False, observed=False).ngroup().iloc[0]
Out[4]: 0

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Regression Functionality that used to work in a prior pandas version Needs Discussion Requires discussion from core team before further action labels Aug 22, 2023
@mroeschke mroeschke removed this from the 2.1 milestone Aug 22, 2023
@josemayer
Copy link
Contributor

Hey @mroeschke, I'm new contributing to Pandas!

The reported bug seems to be fixed, right? We just need tests to ensure that now?

@rhshadrach
Copy link
Member

Also, at least now the documentation says:

Groups with missing keys (where pd.isna() is True) will be labeled with NaN and will be skipped from the count.

so that the behavior in case 2 of the OP is correct according to the docs.

@josemayer
Copy link
Contributor

Take

@josemayer
Copy link
Contributor

The 3rd case seems to return 0 now. I suppose could use a test

In [4]: pd.DataFrame.from_dict({'a': pd.Categorical([np.nan]), 'b': [1]}).groupby(['a', 'b'], dropna=False, observed=False).ngroup().iloc[0]
Out[4]: 0

I've linked a pull-request that adds this test case in pandas/tests/groupby/test_groupby.py (#54966). This is my first pull-request to Pandas. Can any mantainer review it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants