BUG: .groupby() produces MultiIndex with NaN in levels instead of encoded in codes #54347

Tolker-KU · 2023-08-01T06:54:17Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, float("nan"), 2, float("nan")], "C": [2, 4, 6, 8]})
df_grouped = df.groupby(["A", "B"], dropna=False).sum()

index = df_grouped.index
pd.testing.assert_index_equal(index, pd.MultiIndex.from_frame(index.to_frame()))

Issue Description

The snippet above results in an AssertionError, because NaN in the MultiIndex is encoded in levels when returned from .groupby(), but in codes when returned from pd.MultiIndex.from_frame().

AssertionError: Index are different

Index values are different (50.0 %)
[left]:  MultiIndex([(1, 1.0),
            (2, nan),
            (3, 2.0),
            (4, nan)],
           names=['A', 'B'])
[right]: MultiIndex([(1, 1.0),
            (2, nan),
            (3, 2.0),
            (4, nan)],
           names=['A', 'B'])

See

pandas/pandas/core/groupby/grouper.py

Lines 767 to 779 in 4357621

    
           if np.any(na_mask): 
        
               if self._sort: 
        
                   # Replace NA codes with `largest code + 1` 
        
                   na_code = len(categories) 
        
                   codes = np.where(na_mask, na_code, codes) 
        
               else: 
        
                   # Insert NA code into the codes based on first appearance 
        
                   # A negative code must exist, no need to check codes[na_idx] < 0 
        
                   na_idx = na_mask.argmax() 
        
                   # count number of unique codes that comes before the nan value 
        
                   na_code = algorithms.nunique_ints(codes[:na_idx]) 
        
                   codes = np.where(codes >= na_code, codes + 1, codes) 
        
                   codes = np.where(na_mask, na_code, codes)

Expected Behavior

I expect the assertion to pass.

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.11.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 2.0.3
numpy : 1.25.1
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2
Cython : None
pytest : 7.4.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : 2.0.19
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Process finished with exit code 0

The text was updated successfully, but these errors were encountered:

Tolker-KU · 2023-08-01T07:36:39Z

I'm not that familiar with the code base so I can't tell if the following is the correct solultion, but a possible fix is to call MultiIndex with verify_integrity=True here.

pandas/pandas/core/groupby/ops.py

Lines 766 to 775 in 8067a01

    
           @cache_readonly 
        
           def result_index(self) -> Index: 
        
               if len(self.groupings) == 1: 
        
                   return self.groupings[0].result_index.rename(self.names[0]) 
        
               codes = self.reconstructed_codes 
        
               levels = [ping.result_index for ping in self.groupings] 
        
               return MultiIndex( 
        
                   levels=levels, codes=codes, verify_integrity=False, names=self.names 
        
               )

This results in the codes being verifed and NaN encodeded as -1

pandas/pandas/core/indexes/multi.py

Lines 361 to 363 in 8067a01

    
           if verify_integrity: 
        
               new_codes = result._verify_integrity() 
        
               result._codes = new_codes

The above might not be desirable due to performance overhead. Another solution is to override the codes just before they are passed to MultiIndex using logic like

pandas/pandas/core/indexes/multi.py

Lines 370 to 392 in 8067a01

    
               def _validate_codes(self, level: list, code: list): 
        
                   """ 
        
                   Reassign code values as -1 if their corresponding levels are NaN. 
        
                   Parameters 
        
                   ---------- 
        
                   code : list 
        
                       Code to reassign. 
        
                   level : list 
        
                       Level to check for missing values (NaN, NaT, None). 
        
                   Returns 
        
                   ------- 
        
                   new code where code value = -1 if it corresponds 
        
                   to a level with missing values (NaN, NaT, None). 
        
                   """ 
        
                   null_mask = isna(level) 
        
                   if np.any(null_mask): 
        
                       # error: Incompatible types in assignment 
        
                       # (expression has type "ndarray[Any, dtype[Any]]", 
        
                       # variable has type "List[Any]") 
        
                       code = np.where(null_mask[code], -1, code)  # type: ignore[assignment] 
        
                   return code

phofl · 2023-08-06T12:06:46Z

cc @rhshadrach

this does look buggy to me

rhshadrach · 2023-08-06T15:07:52Z

I'm not able to reproduce on main. The OP example does not raise. Agreed that if it does raise, there is a bug.

phofl · 2023-08-06T15:27:30Z

Oh good point, looks like I tried on 2.0.3

MurzNN · 2024-06-25T15:52:21Z

Starting work on this, will try to implement the test.

Tolker-KU added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 1, 2023

phofl added Groupby MultiIndex and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 6, 2023

phofl added Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Aug 6, 2023

Tolker-KU mentioned this issue Dec 10, 2023

TST: Test .groupby() encodes NaNs correctly #56437

Closed

5 tasks

phofl mentioned this issue Jun 23, 2024

Add regression tests phofl/pydata-yerevan-sprint#1

Open

9 tasks

MurzNN mentioned this issue Jun 25, 2024

Added a test for the groupby MultiIndex with codes #59102

Merged

5 tasks

mroeschke closed this as completed in #59102 Jun 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: .groupby() produces MultiIndex with NaN in levels instead of encoded in codes #54347

BUG: .groupby() produces MultiIndex with NaN in levels instead of encoded in codes #54347

Tolker-KU commented Aug 1, 2023 •

edited

Loading

INSTALLED VERSIONS

Tolker-KU commented Aug 1, 2023

phofl commented Aug 6, 2023

rhshadrach commented Aug 6, 2023 •

edited

Loading

phofl commented Aug 6, 2023

MurzNN commented Jun 25, 2024

BUG: .groupby() produces MultiIndex with NaN in levels instead of encoded in codes #54347

BUG: .groupby() produces MultiIndex with NaN in levels instead of encoded in codes #54347

Comments

Tolker-KU commented Aug 1, 2023 • edited Loading

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

Tolker-KU commented Aug 1, 2023

phofl commented Aug 6, 2023

rhshadrach commented Aug 6, 2023 • edited Loading

phofl commented Aug 6, 2023

MurzNN commented Jun 25, 2024

Tolker-KU commented Aug 1, 2023 •

edited

Loading

rhshadrach commented Aug 6, 2023 •

edited

Loading