Skip to content

BUG: .groupby() produces MultiIndex with NaN in levels instead of encoded in codes #54347

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
Tracked by #1
Tolker-KU opened this issue Aug 1, 2023 · 5 comments · Fixed by #59102
Closed
2 of 3 tasks
Tracked by #1
Labels
Groupby MultiIndex Needs Tests Unit test(s) needed to prevent regressions

Comments

@Tolker-KU
Copy link
Contributor

Tolker-KU commented Aug 1, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, float("nan"), 2, float("nan")], "C": [2, 4, 6, 8]})
df_grouped = df.groupby(["A", "B"], dropna=False).sum()

index = df_grouped.index
pd.testing.assert_index_equal(index, pd.MultiIndex.from_frame(index.to_frame()))

Issue Description

The snippet above results in an AssertionError, because NaN in the MultiIndex is encoded in levels when returned from .groupby(), but in codes when returned from pd.MultiIndex.from_frame().

AssertionError: Index are different

Index values are different (50.0 %)
[left]:  MultiIndex([(1, 1.0),
            (2, nan),
            (3, 2.0),
            (4, nan)],
           names=['A', 'B'])
[right]: MultiIndex([(1, 1.0),
            (2, nan),
            (3, 2.0),
            (4, nan)],
           names=['A', 'B'])

See

if np.any(na_mask):
if self._sort:
# Replace NA codes with `largest code + 1`
na_code = len(categories)
codes = np.where(na_mask, na_code, codes)
else:
# Insert NA code into the codes based on first appearance
# A negative code must exist, no need to check codes[na_idx] < 0
na_idx = na_mask.argmax()
# count number of unique codes that comes before the nan value
na_code = algorithms.nunique_ints(codes[:na_idx])
codes = np.where(codes >= na_code, codes + 1, codes)
codes = np.where(na_mask, na_code, codes)

Expected Behavior

I expect the assertion to pass.

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.11.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 2.0.3
numpy : 1.25.1
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2
Cython : None
pytest : 7.4.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : 2.0.19
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Process finished with exit code 0

@Tolker-KU Tolker-KU added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 1, 2023
@Tolker-KU
Copy link
Contributor Author

I'm not that familiar with the code base so I can't tell if the following is the correct solultion, but a possible fix is to call MultiIndex with verify_integrity=True here.

@cache_readonly
def result_index(self) -> Index:
if len(self.groupings) == 1:
return self.groupings[0].result_index.rename(self.names[0])
codes = self.reconstructed_codes
levels = [ping.result_index for ping in self.groupings]
return MultiIndex(
levels=levels, codes=codes, verify_integrity=False, names=self.names
)

This results in the codes being verifed and NaN encodeded as -1

if verify_integrity:
new_codes = result._verify_integrity()
result._codes = new_codes

The above might not be desirable due to performance overhead. Another solution is to override the codes just before they are passed to MultiIndex using logic like

def _validate_codes(self, level: list, code: list):
"""
Reassign code values as -1 if their corresponding levels are NaN.
Parameters
----------
code : list
Code to reassign.
level : list
Level to check for missing values (NaN, NaT, None).
Returns
-------
new code where code value = -1 if it corresponds
to a level with missing values (NaN, NaT, None).
"""
null_mask = isna(level)
if np.any(null_mask):
# error: Incompatible types in assignment
# (expression has type "ndarray[Any, dtype[Any]]",
# variable has type "List[Any]")
code = np.where(null_mask[code], -1, code) # type: ignore[assignment]
return code

@phofl phofl added Groupby MultiIndex and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 6, 2023
@phofl
Copy link
Member

phofl commented Aug 6, 2023

cc @rhshadrach

this does look buggy to me

@rhshadrach
Copy link
Member

rhshadrach commented Aug 6, 2023

I'm not able to reproduce on main. The OP example does not raise. Agreed that if it does raise, there is a bug.

@phofl phofl added Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Aug 6, 2023
@phofl
Copy link
Member

phofl commented Aug 6, 2023

Oh good point, looks like I tried on 2.0.3

@MurzNN
Copy link
Contributor

MurzNN commented Jun 25, 2024

Starting work on this, will try to implement the test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby MultiIndex Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
4 participants