Commit 5dea811

Merge pull request #10931 from nickeubank/test_groupby_sort_preservation
Add tests to ensure sort preserved by groupby, add docs
2 parents f5b85ca + 3b49fad commit 5dea811

3 files changed: +55 -10 lines changed

doc/source/groupby.rst

+25 -8
@@ -160,6 +160,31 @@ only verifies that you've passed a valid mapping.
 GroupBy operations (though can't be guaranteed to be the most
 efficient). You can get quite creative with the label mapping functions.
 
+.. _groupby.sorting:
+
+GroupBy sorting
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default the group keys are sorted during the ``groupby`` operation. You may however pass ``sort=False`` for potential speedups:
+
+.. ipython:: python
+
+   df2 = pd.DataFrame({'X' : ['B', 'B', 'A', 'A'], 'Y' : [1, 2, 3, 4]})
+   df2.groupby(['X']).sum()
+   df2.groupby(['X'], sort=False).sum()
+
+
+Note that ``groupby`` will preserve the order in which *observations* are sorted *within* each group. For example, the rows of each group created by ``groupby()`` below are in the order they appeared in the original ``DataFrame``:
+
+.. ipython:: python
+
+   df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
+   df3.groupby(['X']).get_group('A')
+
+   df3.groupby(['X']).get_group('B')
+
+
+
 .. _groupby.attributes:
 
 GroupBy object attributes
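
As a rough illustration of the "GroupBy sorting" section added in this hunk, the sketch below reuses the ``df2`` frame from the new docs and compares the order of the group keys with and without ``sort=False``. The variable names ``sorted_keys`` and ``unsorted_keys`` are illustrative only and do not appear in the commit.

```python
import pandas as pd

# Same frame as the docs example: 'B' appears before 'A' in the data.
df2 = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})

# Default (sort=True): group keys come back in sorted order.
sorted_keys = df2.groupby("X").sum()

# sort=False: group keys come back in order of first appearance.
unsorted_keys = df2.groupby("X", sort=False).sum()

print(list(sorted_keys.index))    # ['A', 'B']
print(list(unsorted_keys.index))  # ['B', 'A']
```

Only the order of the result's index differs; the aggregated values per key are the same in both cases.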
@@ -183,14 +208,6 @@ the length of the ``groups`` dict, so it is largely just a convenience:
    grouped.groups
    len(grouped)
 
-By default the group keys are sorted during the groupby operation. You may
-however pass ``sort=False`` for potential speedups:
-
-.. ipython:: python
-
-   df2 = pd.DataFrame({'X' : ['B', 'B', 'A', 'A'], 'Y' : [1, 2, 3, 4]})
-   df2.groupby(['X'], sort=True).sum()
-   df2.groupby(['X'], sort=False).sum()
 
 .. _groupby.tabcompletion:
 
pandas/core/generic.py

+4 -2
@@ -3252,11 +3252,13 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
             index. Only relevant for DataFrame input. as_index=False is
             effectively "SQL-style" grouped output
         sort : boolean, default True
-            Sort group keys. Get better performance by turning this off
+            Sort group keys. Get better performance by turning this off.
+            Note this does not influence the order of observations within each group.
+            groupby preserves the order of rows within each group.
         group_keys : boolean, default True
             When calling apply, add group keys to index to identify pieces
         squeeze : boolean, default False
-            reduce the dimensionaility of the return type if possible,
+            reduce the dimensionality of the return type if possible,
             otherwise return a consistent type
 
         Examples
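
The docstring change above says that ``sort`` affects only the order of the group keys, not the order of rows inside a group. A minimal sketch of that claim, assuming the ``df3`` frame from the docs hunk (the ``order`` dict is a hypothetical helper, not part of the commit):

```python
import pandas as pd

df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

# Record the row order seen inside each group for both settings of ``sort``.
order = {}
for sort in (True, False):
    grouped = df3.groupby("X", sort=sort)
    a_rows = grouped.get_group("A")
    b_rows = grouped.get_group("B")
    order[sort] = (list(a_rows.index), list(b_rows["Y"]))

# Rows keep their original relative order in either case.
print(order[True])   # ([0, 2], [4, 2])
print(order[False])  # ([0, 2], [4, 2])
```

Group ``B`` keeps ``Y`` as ``[4, 2]`` rather than ``[2, 4]``, because ``groupby`` does not re-sort the rows of a group.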

pandas/tests/test_groupby.py

+26
@@ -5436,6 +5436,32 @@ def test_first_last_max_min_on_time_data(self):
         assert_frame_equal(grouped_ref.first(),grouped_test.first())
         assert_frame_equal(grouped_ref.last(),grouped_test.last())
 
+    def test_groupby_preserves_sort(self):
+        # Test to ensure that groupby always preserves sort order of original
+        # object. Issue #8588 and #9651
+
+        df = DataFrame({'int_groups':[3,1,0,1,0,3,3,3],
+                        'string_groups':['z','a','z','a','a','g','g','g'],
+                        'ints':[8,7,4,5,2,9,1,1],
+                        'floats':[2.3,5.3,6.2,-2.4,2.2,1.1,1.1,5],
+                        'strings':['z','d','a','e','word','word2','42','47']})
+
+        # Try sorting on different types and with different group types
+        for sort_column in ['ints', 'floats', 'strings', ['ints','floats'],
+                            ['ints','strings']]:
+            for group_column in ['int_groups', 'string_groups',
+                                 ['int_groups','string_groups']]:
+
+                df = df.sort_values(by=sort_column)
+
+                g = df.groupby(group_column)
+
+                def test_sort(x):
+                    assert_frame_equal(x, x.sort_values(by=sort_column))
+
+                g.apply(test_sort)
+
+
 def assert_fp_equal(a, b):
     assert (np.abs(a - b) < 1e-12).all()
 
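
The idea behind ``test_groupby_preserves_sort`` can be sketched outside the test suite as follows. This is not the test from the commit: it uses only two of the test's columns, iterates over the groups instead of calling ``apply``, and checks monotonicity rather than frame equality. The helper name ``groups_keep_order`` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "int_groups": [3, 1, 0, 1, 0, 3, 3, 3],
    "ints": [8, 7, 4, 5, 2, 9, 1, 1],
})

# Sort the whole frame first, as the test does before grouping.
df_sorted = df.sort_values(by="ints", kind="mergesort")


def groups_keep_order(frame, group_column, sort_column):
    """Return True if every group is still ordered by ``sort_column``."""
    return all(
        group[sort_column].is_monotonic_increasing
        for _, group in frame.groupby(group_column)
    )


# Each group of the pre-sorted frame is still sorted by 'ints'.
print(groups_keep_order(df_sorted, "int_groups", "ints"))  # True

# The unsorted frame fails the check: group 3 holds 8, 9, 1, 1 in row order.
print(groups_keep_order(df, "int_groups", "ints"))  # False
```

Iterating over the groups avoids relying on how ``apply`` handles the grouping columns, which is assumed here to vary between pandas versions.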