BUG: assert_frame_equals fails when MultiIndex elements are None #54521

Open
2 of 3 tasks
attack68 opened this issue Aug 13, 2023 · 3 comments
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Testing pandas testing functions or related to the test suite

Comments

@attack68
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

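from datetime import datetime as dt  # assumption: `dt` is datetime.datetime

import numpy as np
from pandas import DataFrame, Index, MultiIndex
from pandas.testing import assert_frame_equal

df = DataFrame({
    "Currency": ["USD", "USD", "USD"],
    "Collateral": [None, None, None],
    "Payment": [dt(2023, 1, 1), dt(2023, 1, 1), dt(2023, 2, 1)],
    "Cashflow": [100.0, 100.0, 200.0],
})
# The multi-line method chain must be parenthesized to be valid Python.
gf = (
    df.groupby(["Currency", "Collateral", "Payment"], dropna=False)
    .sum()
    .unstack([0, 1])
    .droplevel(0, axis=1)
)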

Issue Description


This concerns using assert_frame_equal as part of a test suite.

The first problem when testing gf is that I cannot create a MultiIndex from tuples and set the dtypes of its levels directly.

expected = DataFrame(
    [200.0, 200.0],
    columns=MultiIndex.from_tuples([("USD", np.nan)], names=["Currency", "Collateral"]),
    index=[dt(2023, 1, 1), dt(2023, 2, 1)]
)
assert_frame_equal(gf, expected, atol=1e-5)

>> Traceback..
>       assert_attr_equal("dtype", left, right, obj=obj)
E       AssertionError: MultiIndex level [1] are different
E       
E       Attribute "dtype" are different
E       [left]:  float64
E       [right]: object

But I can manipulate it like this:

expected = DataFrame(
    [200.0, 200.0],
    columns=MultiIndex.from_arrays([
        Index(["USD"], name="Currency"), 
        Index([None], name="Collateral", dtype="float64")
    ]),
    index=[dt(2023, 1, 1), dt(2023, 2, 1)]
)
assert_frame_equal(gf, expected, atol=1e-5)

>> Traceback..
>       assert_frame_equal(result, expected, atol=1e-4)
E       AssertionError: DataFrame.iloc[:, 0] (column name="('USD', nan)") are different
E       
E       Attribute "name" are different
E       [left]:  ('USD', nan)
E       [right]: ('USD', nan)

There is an NA check for name, but apparently it does not recurse into the tuple, so this fails.
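To illustrate what a nested NA check could look like, here is a minimal sketch using only the standard library. The helpers `is_missing` and `names_equal` are hypothetical names invented for this example; they are not pandas functions and do not reflect how pandas implements its name comparison. The sketch also assumes that None and NaN both count as missing and are equal to each other.

```python
import math


def is_missing(value):
    """Simplified stand-in for an NA check: None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def names_equal(left, right):
    """Compare two labels, treating missing values as equal, also inside tuples."""
    if isinstance(left, tuple) and isinstance(right, tuple):
        # Recurse element by element so NaN inside a tuple is handled too.
        return len(left) == len(right) and all(
            names_equal(a, b) for a, b in zip(left, right)
        )
    if is_missing(left) and is_missing(right):
        return True
    return left == right


left = ("USD", float("nan"))
right = ("USD", float("nan"))

plain_equal = left == right              # plain tuple comparison, NaN != NaN
nested_equal = names_equal(left, right)  # NA-aware comparison of each element
print(plain_equal, nested_equal)
```

The plain comparison fails because the two NaN objects are distinct and NaN never compares equal to itself, which matches the confusing `('USD', nan)` versus `('USD', nan)` message above.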

Expected Behavior

The assertion should pass. NA values nested inside a tuple name should be checked and treated as equal.

Installed Versions

2.0.3

@attack68 attack68 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 13, 2023
@phofl
Member

phofl commented Aug 15, 2023

Is groupby necessary to reproduce? We fixed some bugs in the MultiIndex construction before the 2.1 rc came out

@attack68
Contributor Author

attack68 commented Aug 16, 2023 via email

@jbrockmendel jbrockmendel added the Testing pandas testing functions or related to the test suite label Nov 1, 2023
@chaoyihu
Contributor

chaoyihu commented Jul 13, 2024

> Is groupby necessary to reproduce?

The intention here is to assert equality of two MultiIndex objects whose Collateral level has a float dtype. I think groupby is used here to circumvent initializing the MultiIndex directly with NA values, which results in an inferred object dtype.
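A small sketch of the two direct construction routes discussed in this thread may help when comparing them with the groupby result. It only prints what each route produces; the variable names `inferred` and `explicit` are chosen for this example, and the level dtype that gets inferred may differ between pandas versions, so no dtype is asserted here.

```python
import numpy as np
import pandas as pd

# Direct construction from tuples: the NA entry is not stored in the level,
# it is represented by a code of -1.
inferred = pd.MultiIndex.from_tuples(
    [("USD", np.nan)], names=["Currency", "Collateral"]
)

# Passing an explicitly typed Index lets the caller choose the level dtype.
explicit = pd.MultiIndex.from_arrays(
    [
        pd.Index(["USD"], name="Currency"),
        pd.Index([np.nan], name="Collateral", dtype="float64"),
    ]
)

for label, mi in [("inferred", inferred), ("explicit", explicit)]:
    level = mi.levels[1]
    print(label, "dtype:", level.dtype, "level:", list(level), "codes:", list(mi.codes[1]))
```

Printing `levels` and `codes` side by side like this is a quick way to see whether two MultiIndex objects that display identically are stored differently.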

> We fixed some bugs in the MultiIndex construction before the 2.1 rc came out

This still exists on main as of 1165859. The assertion failure says:

Attribute "name" are different
[left]:  ('USD', nan)
[right]: ('USD', nan)

because the levels are not the same:

print(gf.columns)
print(gf.columns.levels)
print(gf.columns.codes)
print("===========")
print(expected.columns)
print(expected.columns.levels)
print(expected.columns.codes)
print("===========")
print(gf.columns == expected.columns)
assert_frame_equal(gf, expected)

MultiIndex([('USD', nan)],
           names=['Currency', 'Collateral'])
[['USD'], [nan]]  # here it is a float nan in the level list
[[0], [0]]
===========
MultiIndex([('USD', nan)],
           names=['Currency', 'Collateral'])
[['USD'], []]  # here it is an empty list
[[0], [-1]]
===========
[False]
AssertionError: DataFrame.iloc[:, 0] (column name="('USD', nan)") are different
Attribute "name" are different
[left]:  ('USD', nan)
[right]: ('USD', nan)

The fix should be similar to #59069: give NA values a location index of -1 in the codes of the MultiIndex instance, rather than an entry in its levels, when it is created from groupby.

@mroeschke mroeschke added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 17, 2024