-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: group with multiple named results #21171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
As of version 0.23.0 MultiIndex throws an exception in case it contains duplicated level names. This can happen as a result of various groupby operations (21075). This commit changes the behavior of groupby slightly: In case there are duplicated names contained in the index these names get suffixed by there corresonding position (i.e. [name,name] => [name0,name1])
Hello @guenteru! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on May 23, 2018 at 09:46 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #21171 +/- ##
==========================================
+ Coverage 91.84% 91.84% +<.01%
==========================================
Files 153 153
Lines 49505 49510 +5
==========================================
+ Hits 45466 45471 +5
Misses 4039 4039
Continue to review full report at Codecov.
|
pandas/core/groupby/groupby.py
Outdated
return orig_names | ||
# in case duplicates are contained rename all of them | ||
if len(set(orig_names)) < len(orig_names): | ||
orig_names = [''.join([str(x), str(i)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use .format
instead of .join
. I'd also suggest adding an underscore like we do in merges
pandas/core/groupby/groupby.py
Outdated
@@ -2298,7 +2298,18 @@ def levels(self): | |||
|
|||
@property | |||
def names(self): | |||
return [ping.name for ping in self.groupings] | |||
# GH 19029 | |||
# add suffix to level name in case they contain duplicates (GH 19029): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can remove ref to GH here, since it's on the line above
# GH18872: conflicting names in desired index | ||
with pytest.raises(ValueError): | ||
# GH 19029: conflicitng names should not raise a value error anymore | ||
raised = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than doing this you should say what the expected
object is and compare to that the result of the groupby
pandas/tests/groupby/test_groupby.py
Outdated
|
||
def test_dup_index_names(): | ||
# dup. index names in groupby operations should be renamed (GH 19029): | ||
df = pd.DataFrame({'date': list(pd.date_range('5.1.2018', '5.3.2018')), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think you need to wrap the date_range in a list (?)
pandas/tests/groupby/test_groupby.py
Outdated
|
||
failed = False | ||
try: | ||
result = df.groupby([df.date.dt.month, df.date.dt.day])['vals'].sum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before - don't need this try...catch block. For what it's worth though let's not duplicate tests, so either have this check here in its own method and remove from the other test, or just update the other test and leave this. See which one makes more sense
pandas/tests/groupby/test_groupby.py
Outdated
tm.assert_series_equal(result, expected) | ||
|
||
|
||
def test_empty_index_names(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if say 2 out of 3 levels use None
as a name - would these tests still hold?
Only duplicates get suffixed by their corresponding enumeration value: ['name', None, 'name'] gets transformed into ['name_0', None, 'name_1'] Superfluous test cases have been deleted and some additonal test statements have been added.
@@ -2298,7 +2298,28 @@ def levels(self): | |||
|
|||
@property | |||
def names(self): | |||
return [ping.name for ping in self.groupings] | |||
# add suffix to level name in case they contain duplicates (GH 19029): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is super complicated, what exactly are you trying to do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi. The goal here is to add a suffix to duplicate entries:
# takes care of multiplicity:
['x', 'x', 'y', 'y'] => ['x_0', 'x_1', 'y_0', 'y_1']
Before the current version the code just enumerated all the entries regardless of their multiplicity:
['x', 'x', 'y', 'y'] => ['x_0', 'x_1', 'y_2', 'y_3']
For some reason I thought that the current version would be better suited.
I can switch back to the old (and probably less confusing) version if you want.
|
||
|
||
def test_dup_index_names(): | ||
# dup. index names in groupby operations should be renamed (GH 19029): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize this
s = pd.Series(range(3), name='foo') | ||
pytest.raises(ValueError, pd.crosstab, s, s) | ||
failed = False | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are asserting that this works, simply collect the result and compare vs expected
closing as stale. if you'd like to continue pls ping. |
git diff upstream/master -u -- "*.py" | flake8 --diff
This bugfix gets rid of duplicated names that can be the result of groupby operations (#19029).
I opted to implement one of the ideas proposed by toobaz: duplicated names get suffixed by their corresponding position, i.e. ['name','name'] gets transformed into ['name0', 'name1'].