Skip to content

BUG: group with multiple named results #21171

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

guenteru
Copy link

This bugfix gets rid of duplicated names that can be the result of groupby operations (#19029).
I opted to implement one of the ideas proposed by toobaz: duplicated names get suffixed by their corresponding position, i.e. ['name','name'] gets transformed into ['name0', 'name1'].

  • A few testcases have been added.
  • One particular testcase had to be changed (test_crosstab_dup_index_names)
    • This is because with the new bugfix crosstab does not yield a ValueError anymore.

guenteru added 3 commits May 22, 2018 13:39
As of version 0.23.0 MultiIndex throws an exception in case it contains
duplicated level names. This can happen as a result of various groupby
operations (21075). This commit changes the behavior of groupby slightly: In
case there are duplicated names contained in the index these names get suffixed by there
corresonding position (i.e. [name,name] => [name0,name1])
@pep8speaks
Copy link

pep8speaks commented May 22, 2018

Hello @guenteru! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 23, 2018 at 09:46 Hours UTC

@codecov
Copy link

codecov bot commented May 22, 2018

Codecov Report

Merging #21171 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21171      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%     
==========================================
  Files         153      153              
  Lines       49505    49510       +5     
==========================================
+ Hits        45466    45471       +5     
  Misses       4039     4039
Flag Coverage Δ
#multiple 90.24% <100%> (ø) ⬆️
#single 41.87% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 92.67% <100%> (+0.01%) ⬆️
pandas/core/indexes/interval.py 93.14% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update be90d49...7cd448a. Read the comment docs.

return orig_names
# in case duplicates are contained rename all of them
if len(set(orig_names)) < len(orig_names):
orig_names = [''.join([str(x), str(i)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use .format instead of .join. I'd also suggest adding an underscore like we do in merges

@@ -2298,7 +2298,18 @@ def levels(self):

@property
def names(self):
return [ping.name for ping in self.groupings]
# GH 19029
# add suffix to level name in case they contain duplicates (GH 19029):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can remove ref to GH here, since it's on the line above

# GH18872: conflicting names in desired index
with pytest.raises(ValueError):
# GH 19029: conflicitng names should not raise a value error anymore
raised = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than doing this you should say what the expected object is and compare to that the result of the groupby


def test_dup_index_names():
# dup. index names in groupby operations should be renamed (GH 19029):
df = pd.DataFrame({'date': list(pd.date_range('5.1.2018', '5.3.2018')),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think you need to wrap the date_range in a list (?)


failed = False
try:
result = df.groupby([df.date.dt.month, df.date.dt.day])['vals'].sum()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before - don't need this try...catch block. For what it's worth though let's not duplicate tests, so either have this check here in its own method and remove from the other test, or just update the other test and leave this. See which one makes more sense

tm.assert_series_equal(result, expected)


def test_empty_index_names():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if say 2 out of 3 levels use None as a name - would these tests still hold?

guenteru added 2 commits May 23, 2018 11:45
Only duplicates get suffixed by their corresponding enumeration value:
['name', None, 'name'] gets transformed into ['name_0', None, 'name_1']

Superfluous test cases have been deleted and some additonal test
statements have been added.
@jreback jreback changed the title Bug19029 COMPAT: warn on duplicated names in MI construction May 23, 2018
@jreback jreback changed the title COMPAT: warn on duplicated names in MI construction BUG: group with multiple named results May 23, 2018
@@ -2298,7 +2298,28 @@ def levels(self):

@property
def names(self):
return [ping.name for ping in self.groupings]
# add suffix to level name in case they contain duplicates (GH 19029):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is super complicated, what exactly are you trying to do here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. The goal here is to add a suffix to duplicate entries:

# takes care of multiplicity:
['x', 'x', 'y', 'y']  => ['x_0', 'x_1', 'y_0', 'y_1']

Before the current version the code just enumerated all the entries regardless of their multiplicity:

['x', 'x', 'y', 'y']  => ['x_0', 'x_1', 'y_2', 'y_3']

For some reason I thought that the current version would be better suited.
I can switch back to the old (and probably less confusing) version if you want.



def test_dup_index_names():
# dup. index names in groupby operations should be renamed (GH 19029):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this

s = pd.Series(range(3), name='foo')
pytest.raises(ValueError, pd.crosstab, s, s)
failed = False
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are asserting that this works, simply collect the result and compare vs expected

@jreback
Copy link
Contributor

jreback commented Nov 1, 2018

closing as stale. if you'd like to continue pls ping.

@jreback jreback closed this Nov 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Warn on duplicate names in MI?
4 participants