
Commit ac18401

jonmmease authored and ischurov committed
ENH: Allow the groupby by param to handle columns and index levels (GH5677) (pandas-dev#14432)
1 parent 2f1a5b9 commit ac18401

5 files changed (+252, -15 lines)

doc/source/groupby.rst

+71-12
@@ -94,11 +94,21 @@ The mapping can be specified many different ways:
   - For DataFrame objects, a string indicating a column to be used to group. Of
     course ``df.groupby('A')`` is just syntactic sugar for
     ``df.groupby(df['A'])``, but it makes life simpler
+  - For DataFrame objects, a string indicating an index level to be used to group.
   - A list of any of the above things
 
 Collectively we refer to the grouping objects as the **keys**. For example,
 consider the following DataFrame:
 
+.. note::
+
+   .. versionadded:: 0.20
+
+   A string passed to ``groupby`` may refer to either a column or an index level.
+   If a string matches both a column name and an index level name then a warning is
+   issued and the column takes precedence. This will result in an ambiguity error
+   in a future version.
+
 .. ipython:: python
 
    df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
@@ -237,17 +247,6 @@ the length of the ``groups`` dict, so it is largely just a convenience:
    gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var
    gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
 
-
-.. ipython:: python
-   :suppress:
-
-   df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
-                             'foo', 'bar', 'foo', 'foo'],
-                      'B' : ['one', 'one', 'two', 'three',
-                             'two', 'two', 'one', 'three'],
-                      'C' : np.random.randn(8),
-                      'D' : np.random.randn(8)})
-
 .. _groupby.multiindex:
 
 GroupBy with MultiIndex
@@ -289,7 +288,9 @@ chosen level:
 
    s.sum(level='second')
 
-Also as of v0.6, grouping with multiple levels is supported.
+.. versionadded:: 0.6
+
+Grouping with multiple levels is supported.
 
 .. ipython:: python
    :suppress:
@@ -306,15 +307,73 @@ Also as of v0.6, grouping with multiple levels is supported.
    s
    s.groupby(level=['first', 'second']).sum()
 
+.. versionadded:: 0.20
+
+Index level names may be supplied as keys.
+
+.. ipython:: python
+
+   s.groupby(['first', 'second']).sum()
+
 More on the ``sum`` function and aggregation later.
 
+Grouping DataFrame with Index Levels and Columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+A DataFrame may be grouped by a combination of columns and index levels by
+specifying the column names as strings and the index levels as ``pd.Grouper``
+objects.
+
+.. ipython:: python
+
+   arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
+             ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
+
+   index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
+
+   df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
+                      'B': np.arange(8)},
+                     index=index)
+
+   df
+
+The following example groups ``df`` by the ``second`` index level and
+the ``A`` column.
+
+.. ipython:: python
+
+   df.groupby([pd.Grouper(level=1), 'A']).sum()
+
+Index levels may also be specified by name.
+
+.. ipython:: python
+
+   df.groupby([pd.Grouper(level='second'), 'A']).sum()
+
+.. versionadded:: 0.20
+
+Index level names may be specified as keys directly to ``groupby``.
+
+.. ipython:: python
+
+   df.groupby(['second', 'A']).sum()
+
 DataFrame column selection in GroupBy
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Once you have created the GroupBy object from a DataFrame, for example, you
 might want to do something different for each of the columns. Thus, using
 ``[]`` similar to getting a column from a DataFrame, you can do:
 
+.. ipython:: python
+   :suppress:
+
+   df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
+                             'foo', 'bar', 'foo', 'foo'],
+                      'B' : ['one', 'one', 'two', 'three',
+                             'two', 'two', 'one', 'three'],
+                      'C' : np.random.randn(8),
+                      'D' : np.random.randn(8)})
+
 .. ipython:: python
 
    grouped = df.groupby(['A'])
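
The documentation change above states that an index level name passed as a plain string groups the same way as the equivalent ``pd.Grouper(level=...)`` key. A minimal sketch of that equivalence, reusing the frame from the new docs section (the variable names ``by_name`` and ``by_grouper`` are illustrative and not part of the commit):

```python
import numpy as np
import pandas as pd

# Same frame as in the new "Grouping DataFrame with Index Levels and Columns" section.
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
                   'B': np.arange(8)},
                  index=index)

# 'second' is an index level name, 'A' is a column name.
by_name = df.groupby(['second', 'A']).sum()

# The explicit spelling that was required before this change.
by_grouper = df.groupby([pd.Grouper(level='second'), 'A']).sum()

# Both spellings are expected to give the same grouped result.
print(by_name.equals(by_grouper))
```

The string form is shorter; the ``pd.Grouper`` form remains useful when a key should be resolved explicitly as an index level.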

doc/source/whatsnew/v0.20.0.txt

+14
@@ -51,6 +51,20 @@ Other enhancements
 - ``Series.sort_index`` accepts parameters ``kind`` and ``na_position`` (:issue:`13589`, :issue:`14444`)
 
 - ``pd.read_excel`` now preserves sheet order when using ``sheetname=None`` (:issue:`9930`)
+- Strings passed to ``DataFrame.groupby()`` as the ``by`` parameter may now reference either column names or index level names (:issue:`5677`)
+
+  .. ipython:: python
+
+     arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
+               ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
+
+     index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
+
+     df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
+                        'B': np.arange(8)},
+                       index=index)
+
+     df.groupby(['second', 'A']).sum()
 
 
 - Multiple offset aliases with decimal points are now supported (e.g. '0.5min' is parsed as '30s') (:issue:`8419`)

pandas/core/generic.py

+1-1
@@ -4007,7 +4007,7 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
         Parameters
         ----------
         by : mapping function / list of functions, dict, Series, or tuple /
-            list of column names.
+            list of column names or index level names.
             Called on each element of the object index to determine the groups.
             If a dict or Series is passed, the Series or dict VALUES will be
             used to determine the groups

pandas/core/groupby.py

+14-2
@@ -2459,8 +2459,20 @@ def is_in_obj(gpr):
             exclusions.append(name)
 
         elif is_in_axis(gpr):  # df.groupby('name')
-            in_axis, name, gpr = True, gpr, obj[gpr]
-            exclusions.append(name)
+            if gpr in obj:
+                if gpr in obj.index.names:
+                    warnings.warn(
+                        ("'%s' is both a column name and an index level.\n"
+                         "Defaulting to column but "
+                         "this will raise an ambiguity error in a "
+                         "future version") % gpr,
+                        FutureWarning, stacklevel=2)
+                in_axis, name, gpr = True, gpr, obj[gpr]
+                exclusions.append(name)
+            elif gpr in obj.index.names:
+                in_axis, name, level, gpr = False, None, gpr, None
+            else:
+                raise KeyError(gpr)
         elif isinstance(gpr, Grouper) and gpr.key is not None:
             # Add key to exclusions
             exclusions.append(gpr.key)
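
The branch order in the hunk above (column first, then index level, otherwise ``KeyError``) can be sketched in isolation. This is only an illustration of the lookup order: ``classify_key`` is a hypothetical helper, not a pandas function, and it returns a label instead of building a grouping.

```python
import warnings

import numpy as np
import pandas as pd


def classify_key(frame, key):
    """Hypothetical helper mirroring the column-before-level lookup order."""
    if key in frame:  # `in` on a DataFrame tests the column labels
        if key in frame.index.names:
            # Same situation the commit warns about: the name is ambiguous.
            warnings.warn(
                "%r is both a column name and an index level; "
                "defaulting to column" % key,
                FutureWarning, stacklevel=2)
        return "column"
    elif key in frame.index.names:
        return "level"
    raise KeyError(key)


index = pd.MultiIndex.from_arrays(
    [['bar', 'bar', 'baz', 'baz'], ['one', 'two', 'one', 'two']],
    names=['first', 'second'])
df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': np.arange(4)}, index=index)

print(classify_key(df, 'A'))       # resolved as a column
print(classify_key(df, 'second'))  # resolved as an index level
```

A name found in neither place raises ``KeyError``, matching the final ``else`` branch of the hunk.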

pandas/tests/test_groupby.py

+152
@@ -521,6 +521,158 @@ def test_grouper_column_and_index(self):
         expected = df_single.reset_index().groupby(['inner', 'B']).mean()
         assert_frame_equal(result, expected)
 
+    def test_grouper_index_level_as_string(self):
+        # GH 5677, allow strings passed as the `by` parameter to reference
+        # columns or index levels
+
+        idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 3),
+                                         ('b', 1), ('b', 2), ('b', 3)])
+        idx.names = ['outer', 'inner']
+        df_multi = pd.DataFrame({"A": np.arange(6),
+                                 'B': ['one', 'one', 'two',
+                                       'two', 'one', 'one']},
+                                index=idx)
+
+        df_single = df_multi.reset_index('outer')
+
+        # Column and Index on MultiIndex
+        result = df_multi.groupby(['B', 'inner']).mean()
+        expected = df_multi.groupby(['B', pd.Grouper(level='inner')]).mean()
+        assert_frame_equal(result, expected)
+
+        # Index and Column on MultiIndex
+        result = df_multi.groupby(['inner', 'B']).mean()
+        expected = df_multi.groupby([pd.Grouper(level='inner'), 'B']).mean()
+        assert_frame_equal(result, expected)
+
+        # Column and Index on single Index
+        result = df_single.groupby(['B', 'inner']).mean()
+        expected = df_single.groupby(['B', pd.Grouper(level='inner')]).mean()
+        assert_frame_equal(result, expected)
+
+        # Index and Column on single Index
+        result = df_single.groupby(['inner', 'B']).mean()
+        expected = df_single.groupby([pd.Grouper(level='inner'), 'B']).mean()
+        assert_frame_equal(result, expected)
+
+        # Single element list of Index on MultiIndex
+        result = df_multi.groupby(['inner']).mean()
+        expected = df_multi.groupby(pd.Grouper(level='inner')).mean()
+        assert_frame_equal(result, expected)
+
+        # Single element list of Index on single Index
+        result = df_single.groupby(['inner']).mean()
+        expected = df_single.groupby(pd.Grouper(level='inner')).mean()
+        assert_frame_equal(result, expected)
+
+        # Index on MultiIndex
+        result = df_multi.groupby('inner').mean()
+        expected = df_multi.groupby(pd.Grouper(level='inner')).mean()
+        assert_frame_equal(result, expected)
+
+        # Index on single Index
+        result = df_single.groupby('inner').mean()
+        expected = df_single.groupby(pd.Grouper(level='inner')).mean()
+        assert_frame_equal(result, expected)
+
+    def test_grouper_column_index_level_precedence(self):
+        # GH 5677, when a string passed as the `by` parameter
+        # matches a column and an index level the column takes
+        # precedence
+
+        idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 3),
+                                         ('b', 1), ('b', 2), ('b', 3)])
+        idx.names = ['outer', 'inner']
+        df_multi_both = pd.DataFrame({"A": np.arange(6),
+                                      'B': ['one', 'one', 'two',
+                                            'two', 'one', 'one'],
+                                      'inner': [1, 1, 1, 1, 1, 1]},
+                                     index=idx)
+
+        df_single_both = df_multi_both.reset_index('outer')
+
+        # Group MultiIndex by single key
+        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+            result = df_multi_both.groupby('inner').mean()
+
+        expected = df_multi_both.groupby([pd.Grouper(key='inner')]).mean()
+        assert_frame_equal(result, expected)
+        not_expected = df_multi_both.groupby(pd.Grouper(level='inner')).mean()
+        self.assertFalse(result.index.equals(not_expected.index))
+
+        # Group single Index by single key
+        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+            result = df_single_both.groupby('inner').mean()
+
+        expected = df_single_both.groupby([pd.Grouper(key='inner')]).mean()
+        assert_frame_equal(result, expected)
+        not_expected = df_single_both.groupby(pd.Grouper(level='inner')).mean()
+        self.assertFalse(result.index.equals(not_expected.index))
+
+        # Group MultiIndex by single key list
+        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+            result = df_multi_both.groupby(['inner']).mean()
+
+        expected = df_multi_both.groupby([pd.Grouper(key='inner')]).mean()
+        assert_frame_equal(result, expected)
+        not_expected = df_multi_both.groupby(pd.Grouper(level='inner')).mean()
+        self.assertFalse(result.index.equals(not_expected.index))
+
+        # Group single Index by single key list
+        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+            result = df_single_both.groupby(['inner']).mean()
+
+        expected = df_single_both.groupby([pd.Grouper(key='inner')]).mean()
+        assert_frame_equal(result, expected)
+        not_expected = df_single_both.groupby(pd.Grouper(level='inner')).mean()
+        self.assertFalse(result.index.equals(not_expected.index))
+
+        # Group MultiIndex by two keys (1)
+        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+            result = df_multi_both.groupby(['B', 'inner']).mean()
+
+        expected = df_multi_both.groupby(['B',
+                                          pd.Grouper(key='inner')]).mean()
+        assert_frame_equal(result, expected)
+        not_expected = df_multi_both.groupby(['B',
+                                              pd.Grouper(level='inner')
+                                              ]).mean()
+        self.assertFalse(result.index.equals(not_expected.index))
+
+        # Group MultiIndex by two keys (2)
+        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+            result = df_multi_both.groupby(['inner', 'B']).mean()
+
+        expected = df_multi_both.groupby([pd.Grouper(key='inner'),
+                                          'B']).mean()
+        assert_frame_equal(result, expected)
+        not_expected = df_multi_both.groupby([pd.Grouper(level='inner'),
+                                              'B']).mean()
+        self.assertFalse(result.index.equals(not_expected.index))
+
+        # Group single Index by two keys (1)
+        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+            result = df_single_both.groupby(['B', 'inner']).mean()
+
+        expected = df_single_both.groupby(['B',
+                                           pd.Grouper(key='inner')]).mean()
+        assert_frame_equal(result, expected)
+        not_expected = df_single_both.groupby(['B',
+                                               pd.Grouper(level='inner')
+                                               ]).mean()
+        self.assertFalse(result.index.equals(not_expected.index))
+
+        # Group single Index by two keys (2)
+        with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
+            result = df_single_both.groupby(['inner', 'B']).mean()
+
+        expected = df_single_both.groupby([pd.Grouper(key='inner'),
+                                           'B']).mean()
+        assert_frame_equal(result, expected)
+        not_expected = df_single_both.groupby([pd.Grouper(level='inner'),
+                                               'B']).mean()
+        self.assertFalse(result.index.equals(not_expected.index))
+
     def test_grouper_getting_correct_binner(self):
 
         # GH 10063
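
The precedence test above relies on a ``FutureWarning`` when a name matches both a column and an index level, and the warning text says a future version will raise an ambiguity error instead. A version-tolerant sketch for observing which behaviour an installed pandas has (``describe_ambiguous_groupby`` is a hypothetical helper, and the assumption that the eventual error is a ``ValueError`` is mine, not stated in the commit):

```python
import warnings

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 3),
                                 ('b', 1), ('b', 2), ('b', 3)],
                                names=['outer', 'inner'])
# 'inner' is deliberately both a column and an index level.
df_both = pd.DataFrame({'A': np.arange(6),
                        'inner': [1, 1, 1, 1, 1, 1]},
                       index=idx)


def describe_ambiguous_groupby(frame, key):
    """Report how grouping by an ambiguous name behaves on this pandas."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            frame.groupby(key).sum()
        except ValueError:
            # Assumed error type for versions that reject the ambiguity.
            return "error"
    if any(issubclass(w.category, FutureWarning) for w in caught):
        return "warning"
    return "silent"


outcome = describe_ambiguous_groupby(df_both, 'inner')
print(outcome)
```

Code that must run on several pandas versions can avoid the ambiguity altogether by renaming either the column or the index level before grouping.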
