TST: Multiindex slicing with NaNs, unexpected results for #25154 #39356

Merged
merged 7 commits into from
Feb 5, 2021
20 changes: 20 additions & 0 deletions pandas/tests/indexing/multiindex/test_getitem.py
@@ -229,6 +229,26 @@ def test_frame_getitem_nan_multiindex(nulls_fixture):
    tm.assert_frame_equal(result, expected)


def test_frame_getitem_nan_cols_multiindex(nulls_fixture):
Contributor

Can you add all of the cases in the original post (working and non-working)? There are 5 cases, I think. Please parameterize.

Contributor Author

@theodorju theodorju Feb 4, 2021

Thank you for the suggestion. The additional cases were added and parameterized.

The keyword argument check_column_type was needed when slicing with (["b"], [np.nan]) in test case 5. The reason is explained below:

Test Case 5:
Slicing out (["b"], [np.nan]):

assert_frame_equal compares the column types of the actual and expected results level by level, through _check_types inside assert_index_equal (asserters.py). For the second level of the MultiIndex, the left argument of _check_types, which comes from the sliced MultiIndex, is:

  • left: Index(['bar', 'foo'], dtype='object')

while the right argument that is based on the expected result is:

  • right: Index([], dtype='object')

The left level therefore has "string" as its inferred type, while the right level is empty. This mismatch causes the test to fail, even though the resulting DataFrames are identical.

I think this happens because, when a DataFrame with MultiIndex columns is sliced, the column levels of the result are still the ones from the original DataFrame and are not updated to drop unused values.

To skip that comparison, I passed check_column_type=False as a keyword argument.
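
The retained-levels behaviour described above can be reproduced outside the test suite. The following is a minimal sketch, not the code from this PR: the frame, the column labels, and the variable names are made up for illustration, and it uses no NaN labels so that it isolates only the "levels are not updated" effect.

```python
import pandas as pd

# Hypothetical two-level columns; "bar" appears only under "a".
columns = pd.MultiIndex.from_tuples([("a", "bar"), ("b", "foo")])
df = pd.DataFrame([[1, 2], [3, 4]], columns=columns)

# Select only the "b" columns, then build the "expected" columns from scratch.
sliced = df[["b"]]
expected_columns = pd.MultiIndex.from_tuples([("b", "foo")])

# The labels are the same...
same_labels = list(sliced.columns) == list(expected_columns)

# ...but the sliced index still stores the original second-level values,
# while the freshly built index only stores the values it uses.
sliced_levels = list(sliced.columns.levels[1])
fresh_levels = list(expected_columns.levels[1])

# remove_unused_levels() rebuilds the levels from the labels actually present.
pruned_levels = list(sliced.columns.remove_unused_levels().levels[1])

print(same_labels, sliced_levels, fresh_levels, pruned_levels)
```

A level-by-level type check sees different level contents on the two sides, which is the comparison that check_column_type=False skips. Calling remove_unused_levels() on the sliced columns before comparing is another possible way to align the levels; whether that would be preferable in this test is a judgment call.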

    # Slicing MultiIndex including levels with nan values, for more information
    # see GH#25154
    data = [[1, 2, 3], [4, 5, 6]]
    index = ["First", nulls_fixture]
Member

Define the index only once if it is the same. You could remove the other definitions and write the data directly into the object.

Contributor Author

Thanks for the comments.
Suggested changes were implemented.

    columns = MultiIndex.from_tuples([("a", "foo"), ("b", "foo"), ("b", nulls_fixture)])
    df = DataFrame(data=data, columns=columns, index=index, dtype="int64")

    # Slicing out 'b', ['foo', nan]
    cols = (["b"], ["foo", nulls_fixture])
    result = df.loc[:, cols]
    expected_columns = MultiIndex.from_tuples([("b", "foo"), ("b", nulls_fixture)])
    expected_index = ["First", nulls_fixture]
    expected = DataFrame(
        [[2, 3], [5, 6]], columns=expected_columns, index=expected_index, dtype="int64"
    )

    tm.assert_frame_equal(result, expected)


# ----------------------------------------------------------------------------
# test indexing of DataFrame with multi-level Index with duplicates
# ----------------------------------------------------------------------------