added test to indexing on groupby, #32464 #44046

Closed
wants to merge 16 commits into from

gabrieldi95
Contributor

This makes sure the index of a groupby on more than one column is always a MultiIndex.

a = pd.DataFrame({"a": [], "b": [], "c": []})

index_1 = a.groupby(["a", "b"]).sum().index
index_2 = a.groupby(["a", "b", "c"]).sum().index
Member

Instead could you construct the full DataFrame result of a.groupby(["a", "b", "c"]).sum() and a.groupby(["a", "b"]).sum() and use tm.assert_frame_equal?

Also could you move this test to a more appropriate groupby testing file?
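A minimal sketch of what a full-frame comparison with tm.assert_frame_equal could look like. It uses a small nonempty frame with made-up values; the data and variable names are illustrative assumptions, not the test from this pull request. `tm` is the `pandas._testing` alias used in the pandas test suite.

```python
import pandas as pd
import pandas._testing as tm

# Illustrative data (assumption): two distinct (a, b) groups.
df = pd.DataFrame({"a": [1, 1, 2], "b": [5, 5, 6], "c": [1.0, 2.0, 3.0]})

# Grouping by two columns leaves "c" as the only aggregated column.
result = df.groupby(["a", "b"]).sum()

# Build the whole expected frame, including its two-level index.
expected_index = pd.MultiIndex.from_arrays([[1, 2], [5, 6]], names=["a", "b"])
expected = pd.DataFrame({"c": [3.0, 3.0]}, index=expected_index)

# Compares values, dtypes, column labels, and the index in one call.
tm.assert_frame_equal(result, expected)
```

Comparing the whole frame also covers the index, so a separate index check is not needed in this nonempty case.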

Contributor Author

Hi @mroeschke, I changed it so it uses tm.assert_frame_equal, but I'm not sure what the appropriate testing file would be.

Member

Probably pandas/tests/groupby/test_function.py

@alimcmaster1 alimcmaster1 added the Groupby, MultiIndex, and Testing (pandas testing functions or related to the test suite) labels Oct 16, 2021
@alimcmaster1 alimcmaster1 added this to the 1.3.4 milestone Oct 16, 2021
@jreback jreback modified the milestones: 1.3.4, 1.4 Oct 16, 2021
@gabrieldi95
Contributor Author

gabrieldi95 commented Oct 17, 2021

I tried using tm.assert_frame_equal to test if a.groupby(["a", "b", "c"]).sum() has a MultiIndex like this:

    a = DataFrame({"a": [], "b": [], "c": []})
    result = a.groupby(["a", "b", "c"]).sum()

    expected_index = MultiIndex.from_arrays([[], [], []], names=("a", "b", "c"))
    expected = DataFrame(
        np.ndarray((0, 0)), dtype="float64", index=expected_index, columns=Index([])
    )

    # Tests if groupby with all columns has a multiindex
    tm.assert_frame_equal(result, expected)

But it didn't work, so I'm keeping only the other assertion, which tests that both groupby results are equal.

a = DataFrame({"a": [], "b": [], "c": []})

agg_1 = a.groupby(["a", "b"]).sum().iloc[:, :0]
agg_2 = a.groupby(["a", "b", "c"]).sum().droplevel("c")
Member

Okay, could you change the assertion back to testing the MultiIndex result if the new assertion can't test the whole frame?

Contributor Author

Sure! Like this?

# GH 32464
# Test if index after groupby with more than one column is always MultiIndex
a = DataFrame({"a": [], "b": [], "c": []})

agg_1 = a.groupby(["a", "b"]).sum()
agg_2 = a.groupby(["a", "b", "c"]).sum()

# Tests if group by with all columns has a MultiIndex
assert isinstance(agg_2.index, pd.core.indexes.multi.MultiIndex)

result = agg_1.iloc[:, :0]
expected = agg_2.droplevel("c")

# Tests if both aggregations have a MultiIndex
tm.assert_frame_equal(result, expected)

Member

You can use tm.assert_index_equal

agg_2 = a.groupby(["a", "b", "c"]).sum()

# Tests if group by with all columns has a MultiIndex
assert isinstance(agg_2.index, pd.core.indexes.multi.MultiIndex)
Member

assert_index_equal already checks this
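A short sketch of why a separate isinstance check is redundant: with its default arguments, tm.assert_index_equal rejects a flat Index compared against a MultiIndex. The helper `raises_assertion` and the sample values are hypothetical, added only for illustration.

```python
import pandas as pd
import pandas._testing as tm


def raises_assertion(left: pd.Index, right: pd.Index) -> bool:
    """Hypothetical helper: True if tm.assert_index_equal rejects the pair."""
    try:
        tm.assert_index_equal(left, right)
    except AssertionError:
        return True
    return False


# Illustrative values (assumption): one flat index, one two-level index.
flat = pd.Index([1, 2], name="a")
multi = pd.MultiIndex.from_arrays([[1, 2], [5, 6]], names=("a", "b"))

# The class mismatch alone is enough to fail the comparison.
class_mismatch_detected = raises_assertion(flat, multi)
print(class_mismatch_detected)
```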

# Test if index after groupby with more than one column is always MultiIndex
a = DataFrame({"a": [], "b": [], "c": []})

agg_1 = a.groupby(["a", "b"]).sum()
Member

Instead, assert the index of these two results independently.

Contributor Author

Sorry, I don't get what you mean...
Do you mean something like this?

tm.assert_index_equal(a.groupby(["a", "b", "c"]).sum().index, pd.MultiIndex.from_arrays([[], [], []], names=("a", "b", "c")))
tm.assert_index_equal(a.groupby(["a", "b"]).sum().index, pd.MultiIndex.from_arrays([[], []], names=("a", "b")))

I tried, but it doesn't work; it fails with AssertionError: MultiIndex level [0] are different
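One plausible cause, offered as an assumption that the thread does not confirm, is a dtype mismatch between the levels. Empty Python lists carry no dtype information, so the levels built by MultiIndex.from_arrays fall back to object, while the levels produced by groupby follow the dtypes of the grouping columns. The sketch below only shows how the level dtype depends on the input arrays; the helper `indexes_match` is hypothetical.

```python
import numpy as np
import pandas as pd
import pandas._testing as tm


def indexes_match(left: pd.Index, right: pd.Index) -> bool:
    """Hypothetical helper: True if tm.assert_index_equal accepts the pair."""
    try:
        tm.assert_index_equal(left, right)
    except AssertionError:
        return False
    return True


# Empty lists: no dtype to infer, so each level is object.
from_lists = pd.MultiIndex.from_arrays([[], []], names=("a", "b"))

# Typed empty arrays: each level keeps the array dtype.
typed_empty = np.array([], dtype="float64")
from_typed = pd.MultiIndex.from_arrays([typed_empty, typed_empty], names=("a", "b"))

print(from_lists.levels[0].dtype)
print(from_typed.levels[0].dtype)

# Both indexes are empty with the same names, yet they do not compare equal.
print(indexes_match(from_lists, from_typed))
```

If this is indeed the cause, building the expected index from typed empty arrays that match the dtypes of the grouping columns may make the comparison pass.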

Contributor Author

But I tried this and it passed:

def test_if_is_multiindex():
    # GH 32464
    # Test if index after groupby with more than one column is always MultiIndex
    a = DataFrame({"a": [1, 2], "b": [5, 6], "c": [8, 9]})

    result = a.groupby(["a", "b"]).sum().index
    expected = pd.MultiIndex.from_arrays([[1, 2], [5, 6]], names=("a", "b"))
    
    tm.assert_index_equal(result, expected)

    result = a.groupby(["a", "b", "c"]).sum().index
    expected = pd.MultiIndex.from_arrays([[1, 2], [5, 6], [8, 9]], names=("a", "b", "c"))

    tm.assert_index_equal(result, expected)

Member

You can build the index using the pd.MultiIndex constructor directly.

Member

Can you test only the empty-frame case here? The nonempty case is covered elsewhere. It is indeed tricky to construct an empty MultiIndex of the expected type; this works:

df = DataFrame({"a": [1], "b": [2], "c": [3]}).set_index(['a', 'b', 'c'])
result = df.groupby(["a", "b", "c"]).sum().index[:0]
expected = df.index[:0]
tm.assert_index_equal(result, expected)
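The slicing trick above works because taking `[:0]` of a MultiIndex drops the rows but keeps the level structure. A small sketch of what the sliced index retains, reusing the same illustrative frame:

```python
import pandas as pd

# Same illustrative frame as in the comment above.
df = pd.DataFrame({"a": [1], "b": [2], "c": [3]}).set_index(["a", "b", "c"])

# Slicing to zero rows keeps the class, the names, and the typed levels.
empty = df.index[:0]

print(type(empty).__name__)
print(len(empty), empty.nlevels, list(empty.names))
print([level.dtype for level in empty.levels])
```

This avoids spelling out each level's dtype by hand when building the expected index.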

Contributor

@gabrieldi95 can you replace the test with basically this example

@jreback
Contributor

jreback commented Nov 14, 2021

can you merge master and update to comments

@gabrieldi95
Contributor Author

Yes, I was working on other stuff, will work on it now!

@gabrieldi95
Contributor Author

Hm, I don't really know why the check fails. Can someone help me?

@jreback
Contributor

jreback commented Nov 28, 2021

can you merge master again and ping on green

@gabrieldi95
Contributor Author

@jreback can you help me with the error?

@jreback
Contributor

jreback commented Dec 23, 2021

is worthwhile but needs to be passing (and pls merge master)

@jreback jreback removed this from the 1.4 milestone Dec 23, 2021
@rhshadrach
Member

@gabrieldi95 - The CI failure appears unrelated to me; can you try merging master again

@jreback
Contributor

jreback commented Jan 31, 2022

@gabrieldi95 can you merge master and address comments

@rhshadrach
Member

Closing to take this off the queue. @gabrieldi95 please let me know if you'd like to continue this!

@rhshadrach rhshadrach closed this Feb 15, 2022
@gabrieldi95
Contributor Author

@rhshadrach sorry for taking so long! Yes, I would like to continue

@rhshadrach rhshadrach reopened this Feb 15, 2022
@@ -1167,6 +1167,38 @@ def test_groupby_sum_below_mincount_nullable_integer():
tm.assert_frame_equal(result, expected)


def test_if_is_multiindex():
Member

Can you rename to something like test_empty_multiindex

@jreback
Contributor

jreback commented Feb 27, 2022

can you merge master

@jreback
Contributor

jreback commented Mar 6, 2022

can you merge master and update to comments

@jreback jreback added this to the 1.5 milestone Apr 6, 2022
@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in main and address the review and we can reopen.

@mroeschke mroeschke closed this May 7, 2022
@mroeschke mroeschke removed this from the 1.5 milestone May 7, 2022
Labels
Groupby, MultiIndex, Testing (pandas testing functions or related to the test suite)
Development

Successfully merging this pull request may close these issues.

Grouping by all columns of an empty DataFrame should produce MultiIndex, but doesn't
5 participants