BUG: groupby with dropna=False and 'unique' aggregate creates duplicated index #42016

mhabets · 2021-06-15T11:06:26Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd

df_list = [[1, 1, 1, 'A'], [1, None, 1, 'A'], [1, None, 2, 'A'], [1, None, 3, 'A']]
df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c", 'partner'])

# With 'unique', the groupby result index contains duplicates
partners = (
    df_dropna.groupby(['a', 'b', 'c'], dropna=False)
    .agg({'partner': ['nunique', 'unique']})
)
print(partners.index.is_unique)

# With an aggregate function other than 'unique', the result is as expected
partners = (
    df_dropna.groupby(['a', 'b', 'c'], dropna=False)
    .agg({'partner': ['nunique', 'sum']})
)
print(partners.index.is_unique)

Problem description

When 'unique' is combined with another aggregate function in a groupby whose keys contain NA values, the result index contains duplicates, which it should not:

        partner       
        nunique unique
a b   c               
1 1.0 1       1    [A]
  NaN 1       1    [A]
      2       1    [A]
      3       1    [A]
      1       1    [A]
      2       1    [A]
      3       1    [A]

Expected Output

The groupby result index should not contain duplicates, as shown below:

        partner       
        nunique unique
a b   c               
1 1.0 1       1    [A]
  NaN 1       1    [A]
      2       1    [A]
      3       1    [A]

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : AMD64 Family 23 Model 96 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en
LOCALE : English_United Kingdom.1252

pandas : 1.2.4
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 49.6.0.post20210108
Cython : None
pytest : 6.2.2
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : 7.24.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.22
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

phofl · 2021-06-15T14:25:41Z

Hi, thanks for your report. Could you please show your result and what you would expect?

mhabets · 2021-06-15T15:15:29Z

Hi @phofl - I have updated the issue description as requested.

LustigePerson · 2022-01-07T13:08:21Z

I missed this issue when I opened #45226.
I just want to add that this only happens when nunique is used first.

Correct:

print(df.groupby(["a", "b"], dropna=False)[["c"]].agg(["unique", "nunique"]))

              c        
         unique nunique
a   b                  
1.0 1.0  [a, b]       2
    2.0     [c]       1
NaN NaN     [d]       1

Incorrect:

print(df.groupby(["a", "b"], dropna=False)[["c"]].agg(["nunique", "unique"]))

              c        
        nunique  unique
a   b                  
1.0 1.0       2  [a, b]
    2.0       1     [c]
NaN NaN       1     [d]
    NaN       1     [d]
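
A minimal sketch for reproducing the ordering difference described in this comment and, on affected versions, working around it. The frame `df` is an assumption reconstructed from the printed output, since the comment does not show how it was built. `drop_duplicated_index` is a hypothetical helper, not part of pandas; it relies only on `Index.duplicated` and hides the symptom rather than fixing the underlying bug.

```python
import numpy as np
import pandas as pd

# Assumed input: reconstructed from the printed output in the comment,
# not taken from the original report.
df = pd.DataFrame(
    {
        "a": [1.0, 1.0, 1.0, np.nan],
        "b": [1.0, 1.0, 2.0, np.nan],
        "c": ["a", "b", "c", "d"],
    }
)


def drop_duplicated_index(result: pd.DataFrame) -> pd.DataFrame:
    """Keep only the first row for each index entry (workaround, not a fix)."""
    return result[~result.index.duplicated(keep="first")]


grouped = df.groupby(["a", "b"], dropna=False)[["c"]]
good = grouped.agg(["unique", "nunique"])  # order reported as correct
bad = grouped.agg(["nunique", "unique"])   # order reported as producing duplicates

# Whether the second check reports duplicates depends on the pandas version.
print(good.index.is_unique)
print(bad.index.is_unique)
print(drop_duplicated_index(bad).index.is_unique)
```

On versions where the bug is already fixed, the helper leaves the result unchanged, so it should be safe to keep in place while supporting several pandas versions.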

rhshadrach · 2024-03-02T21:40:08Z

I'm now seeing the expected output on main. Could use a test.
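
A rough sketch of what such a regression test could look like, using the data from the original report. This is an illustration only and is not the test that was later merged; the function names `grouped_index_is_unique` and `test_nunique_then_unique_keeps_index_unique` are hypothetical.

```python
import pandas as pd


def grouped_index_is_unique(frame, keys, funcs, dropna=False):
    """Return True if the aggregated result has no repeated index entries."""
    result = frame.groupby(keys, dropna=dropna).agg({"partner": funcs})
    return result.index.is_unique


def test_nunique_then_unique_keeps_index_unique():
    # Data from the original report; None in column "b" becomes NaN.
    frame = pd.DataFrame(
        [[1, 1, 1, "A"], [1, None, 1, "A"], [1, None, 2, "A"], [1, None, 3, "A"]],
        columns=["a", "b", "c", "partner"],
    )
    # Expected to fail on versions affected by this issue.
    assert grouped_index_is_unique(frame, ["a", "b", "c"], ["nunique", "unique"])
```

The test function is only defined here, not called, so it can be collected by pytest without running at import time.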

mhabets added the Bug and Needs Triage labels on Jun 15, 2021

mroeschke added the Apply and Groupby labels and removed the Needs Triage label on Aug 21, 2021

phofl mentioned this issue Jan 7, 2022

BUG: Groupby with aggregation nunique and NaN leads to strange behavior (duplicated rows in output) #45226

Closed

3 tasks

rhshadrach added the good first issue and Needs Tests labels on Mar 2, 2024

luke396 mentioned this issue Mar 3, 2024

TST: Add test for grouby after dropna agg nunique and unique #57711

Merged

5 tasks

mroeschke closed this as completed in #57711 Mar 5, 2024
