Skip to content

TST: Multiindex slicing with NaNs, unexpected results for #25154 #39356

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Feb 5, 2021

Conversation

theodorju
Copy link
Contributor

@theodorju theodorju commented Jan 23, 2021

Added test case for Multiindex slicing with NaNs.
Executed linting test based on the instructions provided here and 23 errors were reported, however none regarding the python test file modified.

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx :)

# Slicing MultiIndex including levels with nan values, for more information
# see GH#25154
data = [[1, 2, 3], [4, 5, 6]]
index = ["First", nulls_fixture]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define the index only once if it is the same. You could remove the other definitions and write the data directly into the object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments.
Suggested changes were implemented.

@theodorju
Copy link
Contributor Author

I see that one python test, namely pandas/tests/io/parser/common/test_chunksize.py::test_warn_if_chunks_have_mismatched_type, fails, however I am not sure that my addition affects that test and I'm currently not able to replicate the failure localy.

@phofl
Copy link
Member

phofl commented Jan 24, 2021

This test is unreliable, you can ignore this one

@jreback jreback added MultiIndex Testing pandas testing functions or related to the test suite Indexing Related to indexing on series/frames, not to indexes themselves labels Jan 24, 2021
@jreback jreback added this to the 1.3 milestone Jan 24, 2021
@@ -229,6 +229,30 @@ def test_frame_getitem_nan_multiindex(nulls_fixture):
tm.assert_frame_equal(result, expected)


def test_frame_getitem_nan_cols_multiindex(nulls_fixture):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add all of the cases in the original post (working an non-working), there are 5 cases i think. pls parameterize

Copy link
Contributor Author

@theodorju theodorju Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion. The additional cases were added and parameterized.

Keyword argument check_column_type was needed when slicing with (["b"], [np.nan]), in test case 5, the reason is explained below:

Test Case 5:
Slicing out (["b"], [np.nan]):

When asserting the types of the columns of the actual and expected result for the second level of the multiindex, on _check_types of assert_index_equal on asserters.py the left argument that is based on the sliced multiindex (argument left in _check_types) is:

  • left: Index(['bar', 'foo'], dtype='object')

while the right argument that is based on the expected result is:

  • right: Index([], dtype='object')

Based on that the left has "string" as inferred type, while the right one is empty which causes the test to fail, even though the resulting dataframes are identical.

I think this is because when slicing a dataframe with a multi-index the resulting levels of columns are the initial ones present in the original dataframe and are not updated.

In order to avoid that comparison I passed check_column_type=False as keyword argument.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry 6 cases

if isinstance(result, DataFrame):
tm.assert_frame_equal(result, expected, check_column_type=False)
elif isinstance(result, Series):
tm.assert_series_equal(result, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use tm.assert_equal(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, but unfortunately I couldn't pass check_column_type=False because it's not accepted in assert_series_equal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use
MultiIndex(codes=[[1], [-1]], levels=[["a", "b"], ["bar", "foo"]])

to create your columns

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion. Requested changes were implemented.

@@ -87,7 +86,8 @@ def test_series_getitem_returns_scalar(
(lambda s: s[(2000, 3, 4)], KeyError, r"^\(2000, 3, 4\)$"),
(lambda s: s.loc[(2000, 3, 4)], KeyError, r"^\(2000, 3, 4\)$"),
(lambda s: s.loc[(2000, 3, 4, 5)], IndexingError, "Too many indexers"),
(lambda s: s.__getitem__(len(s)), KeyError, ""), # match should include len(s)
(lambda s: s.__getitem__(len(s)), KeyError, ""),
# match should include len(s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you are doing this and removed the line above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was changed by accident. Sorry for missing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I revert those changes in this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out. Requested changes were implemented.

@jreback jreback merged commit c5a1348 into pandas-dev:master Feb 5, 2021
@jreback
Copy link
Contributor

jreback commented Feb 5, 2021

thanks @theodorju

@theodorju theodorju deleted the pandas-25154 branch August 17, 2022 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multiindex slicing with NaNs, unexpected results
3 participants