BUG: Added test cases to check loc on multiindex with NaNs #29751 #38772

Merged
merged 6 commits into pandas-dev:master
Jan 7, 2021

Conversation

kasim95
Contributor

@kasim95 kasim95 commented Dec 29, 2020

Added test cases that check loc on a MultiIndex containing NaN values, using np.nan, pd.NA, and None.

@kasim95 kasim95 changed the title BUG: Added test cases to check loc on a multiindex with nan values #… BUG: Added test cases to check loc on multiindex with NaNs #29751 Dec 29, 2020
@gfyoung gfyoung added the Indexing (related to indexing on series/frames, not to indexes themselves) and MultiIndex labels Dec 29, 2020
expected = DataFrame(arr[:2], columns=cols, dtype="int").set_index(["a", "b"])
tm.assert_frame_equal(result, expected)

result = df.loc[idx:, :]
Member

Could you try to parametrize with slices?

Contributor Author

I have parametrized the slices in the most recent commit.
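
For readers following the thread, a parametrization over slices could look roughly like the sketch below. It is not the code from the commit: the test name, the `first`/`last` parameters, and the module-level `ROWS`/`COLS` constants are made up for illustration, and only np.nan is used as the null value. The start and end labels are read from the index by position, mirroring the `df.index[idx1]` pattern quoted later in this review.

```python
import numpy as np
import pandas as pd
import pytest

# Hypothetical data: the second index level is NaN in every row.
ROWS = [[11, np.nan, 13], [21, np.nan, 23], [31, np.nan, 33], [41, np.nan, 43]]
COLS = ["a", "b", "c"]


@pytest.mark.parametrize("first, last", [(0, 1), (1, 3), (1, 2)])
def test_loc_label_slice(first, last):
    df = pd.DataFrame(ROWS, columns=COLS).set_index(["a", "b"])

    # Look up the tuple labels by position, then slice by label.
    start, end = df.index[first], df.index[last]
    result = df.loc[start:end]

    # Label slices include both endpoints, hence last + 1 here.
    expected = pd.DataFrame(ROWS[first : last + 1], columns=COLS).set_index(
        ["a", "b"]
    )
    pd.testing.assert_frame_equal(result, expected)
```

Each `(first, last)` pair becomes its own test case under pytest, so a failure points at the specific slice that misbehaved.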

@@ -279,3 +280,32 @@ def test_loc_empty_multiindex():
result = df
expected = DataFrame([1, 2, 3, 4], index=index, columns=["value"])
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("nan", [np.nan, pd.NA, None])
Member

There is a fixture for all NaNs

Contributor

nulls_fixture

Contributor Author

I fixed the test case to use nulls_fixture in the most recent commit
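
As a rough illustration of what a nulls fixture does, the sketch below defines a hypothetical `null_value` fixture that stands in for pandas' `nulls_fixture`. The list of values is only the three from the original `parametrize` call; the fixture in the pandas test suite may cover more null-like objects, which this sketch does not try to reproduce.

```python
import numpy as np
import pandas as pd
import pytest

# The three null-like values from the original parametrize decorator.
NULL_VALUES = [np.nan, pd.NA, None]


@pytest.fixture(params=NULL_VALUES)
def null_value(request):
    """Hypothetical stand-in for pandas' nulls_fixture."""
    return request.param


def test_isna_recognizes_null(null_value):
    # pytest runs this test once per entry in NULL_VALUES.
    assert pd.isna(null_value)
```

Any test that names the fixture as an argument is run once per null value, which replaces a hand-written `@pytest.mark.parametrize("nan", ...)` on every test.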

[51, nan, 53],
]
cols = ["a", "b", "c"]
df = DataFrame(arr, columns=cols, dtype="int").set_index(["a", "b"])
Contributor

use dtype='int64' to avoid the 32-bit failures

Contributor Author

Changed dtype to int64 in the most recent commit.
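
The likely reason for this request, stated here as an assumption rather than something the thread spells out, is that the "int" alias resolves to a platform-dependent width, while "int64" is always 8 bytes. A small check with NumPy shows the difference:

```python
import numpy as np

# "int" follows the platform's default integer, so its width can vary.
platform_int = np.dtype("int")

# "int64" is explicit and always 8 bytes wide.
fixed_int = np.dtype("int64")

print(platform_int, platform_int.itemsize)
print(fixed_int, fixed_int.itemsize)
```

On a build where the first line reports 4 bytes, a frame created with dtype="int" would not compare equal to one holding int64 data, which is the kind of mismatch an explicit dtype avoids.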

start = df.index[idx1]
end = df.index[idx2]
result = df.loc[start:end, :]
expected = DataFrame(arr[idx1 : (idx2 + 1)], columns=cols, dtype="int64").set_index(
Contributor

Can you assert each of the 4 sliced results in the OP? Also, please construct the expected result by hard-coding as much as possible: the nan clearly needs to come from the fixture, but hard-code the actual values.

Contributor Author

The original post contains the first two sliced results (a.loc) as the reference for the next two sliced results (b.loc). I have replaced the previous tests with the two relevant sliced tests from the original post.

I also had to remove the parameterized test cases because the mypy type checker fails when using a tuple to slice a dataframe. The code I used is as follows:

@pytest.mark.parametrize(
    "indexer,expected_arr",
    [
        (
            lambda df, null: df.loc[:(21, null)],
            lambda null: [[11, null, 13], [21, null, 23]]
        ),
        (
            lambda df, null: df.loc[(21, null):],
            lambda null: [[21, null, 23], [31, null, 33], [41, null, 43]]
        ),
        (
            lambda df, null: df.loc[(21, null):(31, null)],
            lambda null: [[21, null, 23], [31, null, 33]]
        ),
    ],
)
def test_frame_getitem_nan_multiindex(nulls_fixture, indexer, expected_arr):
    # GH#29751
    # loc on a multiindex containing nan values
    arr = [
        [11, nulls_fixture, 13],
        [21, nulls_fixture, 23],
        [31, nulls_fixture, 33],
        [41, nulls_fixture, 43]
    ]
    cols = ["a", "b", "c"]
    df = DataFrame(arr, columns=cols, dtype="int64").set_index(["a", "b"])

    result = indexer(df, nulls_fixture)
    arr1 = expected_arr(nulls_fixture)
    expected = DataFrame(arr1, columns=cols, dtype="int64").set_index(
        ["a", "b"]
    )
    tm.assert_frame_equal(result, expected)

The mypy type checker failed on lines 5, 9, and 13 of the snippet above, which are the three lambdas that slice df.loc with a tuple literal.
I would love to hear of any workaround that would let me keep these MultiIndex slicing tests parametrized.
For now, I assign the tuple to a variable and slice the dataframe with that variable, which passes the mypy type check.

Also, I hardcoded the expected array and the index tuple as requested.
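
A minimal sketch of that approach, with the key bound to a name and the expected frame hard-coded, might look like this. The function name is made up for illustration, only np.nan is used instead of the fixture, and the sketch shows runtime behaviour only; it makes no claim about what any particular mypy version accepts.

```python
import numpy as np
import pandas as pd


def check_loc_slice_with_nan_key():
    cols = ["a", "b", "c"]
    df = pd.DataFrame(
        [[11, np.nan, 13], [21, np.nan, 23], [31, np.nan, 33], [41, np.nan, 43]],
        columns=cols,
    ).set_index(["a", "b"])

    # Bind the tuple key to a name first, then slice with the name.
    idx = (21, np.nan)
    result = df.loc[:idx]

    # Expected rows are written out by hand rather than derived from df.
    expected = pd.DataFrame(
        [[11, np.nan, 13], [21, np.nan, 23]], columns=cols
    ).set_index(["a", "b"])
    pd.testing.assert_frame_equal(result, expected)
    return result


result = check_loc_slice_with_nan_key()
```

The same pattern extends to `df.loc[idx:]` and to a two-sided slice with two named keys.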

@jreback jreback added this to the 1.3 milestone Jan 7, 2021
@jreback jreback merged commit 10bdde6 into pandas-dev:master Jan 7, 2021
@jreback
Contributor

jreback commented Jan 7, 2021

thanks @kasim95

Labels
Indexing (related to indexing on series/frames, not to indexes themselves), MultiIndex
Development

Successfully merging this pull request may close these issues.

Inconsistent .loc slicing behaviour with NaNs in MultiIndex dataframe
4 participants