BUG: group with multiple named results #21171

guenteru · 2018-05-22T11:53:57Z

- closes Warn on duplicate names in MI? #19029
- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry

This bugfix removes the duplicated names that can result from groupby operations (#19029).
I opted to implement one of the ideas proposed by toobaz: duplicated names are suffixed with their corresponding position, i.e. ['name','name'] is transformed into ['name0', 'name1'].

A few test cases have been added.
One particular test case had to be changed (test_crosstab_dup_index_names):
- With this bugfix, crosstab no longer raises a ValueError.

As of version 0.23.0, MultiIndex raises an exception if it contains duplicated level names. This can happen as a result of various groupby operations (21075). This commit changes the behavior of groupby slightly: if the index contains duplicated names, each of these names is suffixed with its corresponding position (i.e. [name,name] => [name0,name1]).

pep8speaks · 2018-05-22T11:54:03Z

Hello @guenteru! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 23, 2018 at 09:46 Hours UTC

codecov · 2018-05-22T12:33:32Z

Codecov Report

Merging #21171 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21171      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%     
==========================================
  Files         153      153              
  Lines       49505    49510       +5     
==========================================
+ Hits        45466    45471       +5     
  Misses       4039     4039

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.24% <100%> (ø)` | ⬆️ |
| #single | `41.87% <0%> (-0.01%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/groupby/groupby.py | `92.67% <100%> (+0.01%)` | ⬆️ |
| pandas/core/indexes/interval.py | `93.14% <0%> (ø)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update be90d49...7cd448a.

WillAyd · 2018-05-22T16:11:41Z

pandas/core/groupby/groupby.py

+            return orig_names
+        # in case duplicates are contained rename all of them
+        if len(set(orig_names)) < len(orig_names):
+            orig_names = [''.join([str(x), str(i)])


Use .format instead of .join. I'd also suggest adding an underscore like we do in merges
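
As a rough illustration of this suggestion (a sketch, not the code from the PR), the positional renaming could be written with str.format and an underscore separator. The function name suffix_by_position is hypothetical and exists only for this example; it is not part of pandas.

```python
def suffix_by_position(names):
    """Append "_<position>" to every name, but only if duplicates exist.

    Hypothetical helper for illustration only; not part of pandas.
    """
    names = list(names)
    # No duplicates: return the names unchanged.
    if len(set(names)) == len(names):
        return names
    # Duplicates present: rename all entries by their position.
    return ["{}_{}".format(name, i) for i, name in enumerate(names)]


print(suffix_by_position(["name", "name"]))  # ['name_0', 'name_1']
print(suffix_by_position(["a", "b"]))        # ['a', 'b']
```

Compared with ''.join([str(x), str(i)]), the format string keeps the separator visible in one place, so switching between 'name0' and 'name_0' is a one-character change.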

WillAyd · 2018-05-22T16:12:24Z

pandas/core/groupby/groupby.py

@@ -2298,7 +2298,18 @@ def levels(self):

    @property
    def names(self):
-        return [ping.name for ping in self.groupings]
+        # GH 19029
+        # add suffix to level name in case they contain duplicates (GH 19029):


Can remove ref to GH here, since it's on the line above

WillAyd · 2018-05-22T16:13:28Z

pandas/tests/groupby/test_categorical.py

-    # GH18872: conflicting names in desired index
-    with pytest.raises(ValueError):
+    # GH 19029: conflicitng names should not raise a value error anymore
+    raised = False


Rather than doing this, you should state what the expected object is and compare the result of the groupby against it

WillAyd · 2018-05-22T16:14:01Z

pandas/tests/groupby/test_groupby.py

+
+def test_dup_index_names():
+    # dup. index names in groupby operations should be renamed (GH 19029):
+    df = pd.DataFrame({'date': list(pd.date_range('5.1.2018', '5.3.2018')),


Don't think you need to wrap the date_range in a list (?)

WillAyd · 2018-05-22T16:14:53Z

pandas/tests/groupby/test_groupby.py

+
+    failed = False
+    try:
+        result = df.groupby([df.date.dt.month, df.date.dt.day])['vals'].sum()


Same comment as before: you don't need this try...except block. For what it's worth, though, let's not duplicate tests, so either keep this check here in its own method and remove it from the other test, or just update the other test and leave this one out. See which one makes more sense

WillAyd · 2018-05-22T16:16:59Z

pandas/tests/groupby/test_groupby.py

+    tm.assert_series_equal(result, expected)
+
+
+def test_empty_index_names():


What happens if say 2 out of 3 levels use None as a name - would these tests still hold?

Only duplicates get suffixed by their corresponding enumeration value: ['name', None, 'name'] gets transformed into ['name_0', None, 'name_1']. Superfluous test cases have been deleted and some additional test statements have been added.

jreback · 2018-05-23T10:51:43Z

pandas/core/groupby/groupby.py

@@ -2298,7 +2298,28 @@ def levels(self):

    @property
    def names(self):
-        return [ping.name for ping in self.groupings]
+        # add suffix to level name in case they contain duplicates (GH 19029):


this is super complicated, what exactly are you trying to do here?

Hi. The goal here is to add a suffix to duplicate entries:

# takes care of multiplicity: ['x', 'x', 'y', 'y'] => ['x_0', 'x_1', 'y_0', 'y_1']

Before the current version, the code simply enumerated all entries regardless of their multiplicity:

['x', 'x', 'y', 'y'] => ['x_0', 'x_1', 'y_2', 'y_3']

For some reason I thought that the current version would be better suited.
I can switch back to the old (and probably less confusing) version if you want.
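
For readers following the thread, the multiplicity-aware renaming described in this reply can be sketched in a few lines. This is an illustrative reconstruction, not the code from the PR: the name suffix_duplicates is hypothetical, and leaving None and unique names untouched is an assumption based on the ['name', None, 'name'] example given earlier in the review.

```python
from collections import Counter


def suffix_duplicates(names):
    """Suffix only duplicated names, counting occurrences per name.

    Hypothetical helper for illustration only; not part of pandas.
    """
    names = list(names)
    totals = Counter(names)  # how often each name occurs overall
    seen = Counter()         # occurrences of each name encountered so far
    result = []
    for name in names:
        # Assumption: None and unique names are left untouched.
        if name is None or totals[name] == 1:
            result.append(name)
        else:
            result.append("{}_{}".format(name, seen[name]))
            seen[name] += 1
    return result


print(suffix_duplicates(["x", "x", "y", "y"]))     # ['x_0', 'x_1', 'y_0', 'y_1']
print(suffix_duplicates(["name", None, "name"]))   # ['name_0', None, 'name_1']
```

The per-name counter is what makes the suffix restart at 0 for 'y'; the earlier scheme used the position in the whole list instead, which is shorter to write but yields 'y_2' and 'y_3' for the same input.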

jreback · 2018-05-23T10:52:35Z

pandas/tests/groupby/test_groupby.py

+
+
+def test_dup_index_names():
+    # dup. index names in groupby operations should be renamed (GH 19029):


can you parameterize this

jreback · 2018-05-23T10:53:04Z

pandas/tests/reshape/test_pivot.py

        s = pd.Series(range(3), name='foo')
-        pytest.raises(ValueError, pd.crosstab, s, s)
+        failed = False
+        try:


if you are asserting that this works, simply collect the result and compare vs expected

jreback · 2018-11-01T01:38:19Z

closing as stale. if you'd like to continue pls ping.

guenteru added 3 commits May 22, 2018 13:39

update old testcase to satisfy new behavior

117872f

add additional groupby testcases (19029)

32e44c3

guenteru force-pushed the bug19029 branch from 9372c69 to 4ae01f2 Compare May 22, 2018 15:54

WillAyd requested changes May 22, 2018

guenteru added 2 commits May 23, 2018 11:45

resolve flake8 conflicts

c2a3fa5

change groupby-behaviour (duplicates) & tests

7cd448a

Only duplicates get suffixed by their corresponding enumeration value: ['name', None, 'name'] gets transformed into ['name_0', None, 'name_1']. Superfluous test cases have been deleted and some additional test statements have been added.

guenteru force-pushed the bug19029 branch from 4ae01f2 to 7cd448a Compare May 23, 2018 09:46

jreback changed the title ~~Bug19029~~ COMPAT: warn on duplicated names in MI construction May 23, 2018

jreback changed the title ~~COMPAT: warn on duplicated names in MI construction~~ BUG: group with multiple named results May 23, 2018

jreback requested changes May 23, 2018

jreback added Bug Groupby labels May 23, 2018

jreback closed this Nov 1, 2018
