Skip to content

BUG: handle multi-dimensional grouping better #36605

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Oct 31, 2020

Conversation

dimithras
Copy link
Contributor

@dimithras dimithras commented Sep 24, 2020

As described in issue #36158 multi-dimensional groups prepared with lambda or named function raise error on indices as isnan() is not defined for MultiIndex.

Grouping already has the check for MultiIndex on line 440, but it's not triggered on init when grouper is represented by lambda function and gets casted to MultiIndex later where MultiIndex stays as is.

Tested on the code provided in #36158, yet not sure if significant testing is needed as this is a relatively small change to code.
Grouper code has not been changed for a while, should be save to merge.

Tests for this particular issue provided.

Concerns: probably it will be better to make isinstance(grouper, MultiIndex): a function as after this change this piece of code occurs twice in init.
Those operate on different grouper object. First one is passed on init, second one is self, where grouper is changed from lambda to MultiIndex and needs further change added in PR.

Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR!

Could you add tests illustrating the new behavior? That's the first thing we look for when reviewing

dimithras added a commit to dimithras/pandas that referenced this pull request Sep 24, 2020
Bug in issue pandas-dev#36158 appeared both for indices and get_group().
@dimithras
Copy link
Contributor Author

Hi @arw2019 and thanks for initial review!
Updated PR, tests provided.

@dimithras
Copy link
Contributor Author

All tasks successful, tests passed, ready for review.

@jreback jreback changed the title Resolve issue #36158 BUG: handle multi-dimensional grouping better Sep 24, 2020
@dimithras dimithras requested a review from jreback September 24, 2020 19:03
@AlexanderCecile
Copy link

I thought I would mention this here, for visibility: The error also occurs with regular functions, not just lambda, see my comment on the issue.

@dimithras
Copy link
Contributor Author

dimithras commented Sep 24, 2020

@AlexanderCecile thank you very much for your input, highly appreciated. Fair point, behavior is the same both for lambda and defined function.
I've just tested a function from your comment. Current solution also covers this case.

Nature of a bug:
With tuples in cells example using groupby by a column, grouper is initialized with actual values during grouper init (line 413), this way MultiIndex is casted to an array on line 440.
However if lambda or function is used, grouper is assigned lambda or named function respectively and line 440 is skipped as it does not see MultiIndex behind it.
Later on line 512 changes are triggered and line 518 assigns actual value (MultiIndex) instead of a function, however this case skips line 440 casting and bug appears.

@dimithras dimithras marked this pull request as ready for review September 25, 2020 01:01
@dimithras dimithras marked this pull request as draft September 26, 2020 16:52
dimithras added a commit to dimithras/pandas that referenced this pull request Sep 26, 2020
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 27, 2020
@rhshadrach
Copy link
Member

@dimithras - Sorry, I missed your ping. If this is read for review, can you merge master and take this off draft status?

@dimithras
Copy link
Contributor Author

Hi @rhshadrach , sorry for letting this hanged. I looked at the failing test and issues connected to it, I'm more or less sure it should not be a problem to remove the test. Yet would be better to have a green light from you or someone else more familiar with pandas dev than me.

I'll remove the failing test tomorrow (CEST) so that it will be ready to merge.

dimithras and others added 9 commits October 28, 2020 23:25
Bug in issue pandas-dev#36158 appeared both for indices and get_group().
Missed `self.` during making PR, tests failed. Now corrected!
As described in issue pandas-dev#36158 multi-dimensional groups prepared with lambda raise error on indices as isnan() is not defined for MultiIndex.

Grouping already has the check for MultiIndex on line 440, but it's not triggered on __init__ when grouper is represented by lambda function and gets casted to MultiIndex later where MT stays as is.
Previous assertion replaced with a new check.
Added description to linking issue.

Co-Authored-By: Richard Shadrach <[email protected]>
dimithras and others added 4 commits October 28, 2020 23:46
Bug in issue pandas-dev#36158 for indices and get_group().
Previous assertion replaced with a new check.
Added description to linking issue.

Co-Authored-By: Richard Shadrach <[email protected]>
@dimithras dimithras marked this pull request as ready for review October 28, 2020 21:25
@dimithras
Copy link
Contributor Author

Master merged with no conflicts, I also did a rebase to shrink the number of commits.
@rhshadrach got it out of draft for now. Let me know if something needs to be done on Github's failing tests, do not see any clue why this is happening.

@dimithras dimithras marked this pull request as draft October 29, 2020 11:37
@dimithras
Copy link
Contributor Author

Converted to draft, test_counting.py test_groupby_count_dateparseerror is failing with same type error. Going to have a look at it.

After merging latest master issue is gone without applying the suggested fixes both by me and @rhshadrach .

Removing all changes yet leaving newly added tests referring to original issue.
@dimithras
Copy link
Contributor Author

After merging with latest master the issue is gone without making changes to grouper.

@rhshadrach unfortunately _transform_index breaks the test on test_counting.py in test_groupby_count_dateparseerror where lambda function is passed all multiindex items one by one instead of tuples. Instead of 10 tuples to 10 rows it gets 20 items where multiindex has two values.

All changes reverted, PR left with added tests for original issue.

@dimithras dimithras marked this pull request as ready for review October 29, 2020 13:20
@rhshadrach
Copy link
Member

py37_32bit failed from out of memory error, Windows py38_np18 has one unrelated failure:

FAILED pandas/tests/indexing/test_partial.py::TestPartialSetting::test_slice_irregular_datetime_index_with_nan

@rhshadrach
Copy link
Member

/azp run

@azure-pipelines
Copy link
Contributor

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating, changes look good.

@rhshadrach rhshadrach added Needs Tests Unit test(s) needed to prevent regressions and removed Bug Stale labels Oct 29, 2020
@dimithras
Copy link
Contributor Author

@jreback will you have a look if you have time?

@jreback jreback added this to the 1.2 milestone Oct 31, 2020
@jreback jreback merged commit e44744e into pandas-dev:master Oct 31, 2020
@jreback
Copy link
Contributor

jreback commented Oct 31, 2020

thanks @dimithras

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
ukarroum pushed a commit to ukarroum/pandas that referenced this pull request Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
5 participants