
API: update nth to use _set_selection_from_grouper; makes first==nth(0) and last==nth(-1) #7044

Merged: 2 commits, May 12, 2014

4 changes: 2 additions & 2 deletions doc/source/groupby.rst
@@ -397,7 +397,7 @@ index are the group names and whose values are the sizes of each group.
named *columns*.

Aggregating functions are ones that reduce the dimension of the returned objects,
-for example: ``mean, sum, size, count, std, var, describe, first, last, min, max``. This is
+for example: ``mean, sum, size, count, std, var, describe, first, last, nth, min, max``. This is
what happens when you do for example ``DataFrame.sum()`` and get back a ``Series``.
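
A minimal sketch of this dimension reduction (added for illustration, not part of the diff; the frame and column names are placeholders, and ``nth`` behaves as shown only under the behaviour introduced by this PR):

    import pandas as pd

    df = pd.DataFrame({'A': [1, 1, 5], 'B': [10, 20, 30]})

    df.sum()                # the DataFrame reduces to a Series of column totals
    df.groupby('A').sum()   # one aggregated row per group, indexed by the group key 'A'
    df.groupby('A').nth(0)  # with this change, nth reduces to one row per group as well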

.. _groupby.aggregate.multifunc:
@@ -613,7 +613,7 @@ For dataframes with multiple columns, filters should explicitly specify a column
a reduced shape of the original (potentially eliminating groups), but with the index unchanged.
Passing ``as_index=False`` will not affect these transformation methods.

-For example: ``head, tail nth``.
+For example: ``head, tail``.

.. ipython:: python

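The example under the ``ipython`` directive above is collapsed in this view; a hedged sketch of the filter-like behaviour it describes (assuming pandas >= 0.14):

    import pandas as pd

    df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
    g = df.groupby('A')

    g.head(1)   # first row of each group; the original row index is preserved
    g.tail(1)   # last row of each group
    # Because the index is unchanged, as_index=False makes no difference here.
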
2 changes: 1 addition & 1 deletion doc/source/release.rst
@@ -190,7 +190,7 @@ API Changes
validation warnings in :func:`read_csv`/:func:`read_table` (:issue:`6607`)
- Raise a ``TypeError`` when ``DataFrame`` is passed an iterator as the
``data`` argument (:issue:`5357`)
-- groupby will now not return the grouped column for non-cython functions (:issue:`5610`, :issue:`5614`),
+- groupby will now not return the grouped column for non-cython functions (:issue:`5610`, :issue:`5614`, :issue:`6732`),
  as it is already the index
- ``DataFrame.plot`` and ``Series.plot`` now supports area plot with specifying ``kind='area'`` (:issue:`6656`)
- Line plot can be stacked by ``stacked=True``. (:issue:`6656`)
109 changes: 58 additions & 51 deletions doc/source/v0.14.0.txt
@@ -22,6 +22,8 @@ users upgrade to this version.

- :ref:`API Changes <whatsnew_0140.api>`

- :ref:`Groupby API Changes <whatsnew_0140.groupby>`

- :ref:`Performance Improvements <whatsnew_0140.performance>`

- :ref:`Prior Deprecations <whatsnew_0140.prior_deprecations>`
@@ -95,57 +97,6 @@ API changes

- Add ``is_month_start``, ``is_month_end``, ``is_quarter_start``, ``is_quarter_end``, ``is_year_start``, ``is_year_end`` accessors for ``DateTimeIndex`` / ``Timestamp`` which return a boolean array of whether the timestamp(s) are at the start/end of the month/quarter/year defined by the frequency of the ``DateTimeIndex`` / ``Timestamp`` (:issue:`4565`, :issue:`6998`)

- More consistent behaviour for some groupby methods:

groupby ``head`` and ``tail`` now act more like ``filter`` than an aggregation:

.. ipython:: python

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
g = df.groupby('A')
g.head(1) # filters DataFrame

g.apply(lambda x: x.head(1)) # used to simply fall-through

groupby head and tail respect column selection:

.. ipython:: python

g[['B']].head(1)

groupby ``nth`` now filters by default, with an optional ``dropna`` argument to ignore
NaN (to replicate the previous behaviour). See :ref:`the docs <groupby.nth>`.

.. ipython:: python

df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
g = df.groupby('A')
g.nth(0) # can also use negative ints

g.nth(0, dropna='any') # similar to old behaviour

groupby will now not return the grouped column for non-cython functions (:issue:`5610`, :issue:`5614`),
as it is already the index

.. ipython:: python

df = DataFrame([[1, np.nan], [1, 4], [5, 6], [5, 8]], columns=['A', 'B'])
g = df.groupby('A')
g.count()
g.describe()

passing ``as_index=False`` will leave the grouped column in-place (this is not a change in 0.14.0)

.. ipython:: python

df = DataFrame([[1, np.nan], [1, 4], [5, 6], [5, 8]], columns=['A', 'B'])
g = df.groupby('A',as_index=False)
g.count()
g.describe()

- Allow specification of a more complex groupby via ``pd.Grouper``, such as grouping
by a Time and a string field simultaneously. See :ref:`the docs <groupby.specify>`. (:issue:`3794`)

- Local variable usage has changed in
:func:`pandas.eval`/:meth:`DataFrame.eval`/:meth:`DataFrame.query`
(:issue:`5987`). For the :class:`~pandas.DataFrame` methods, two things have
@@ -247,6 +198,62 @@ API changes
from 0.13.1
- Added ``factorize`` functions to ``Index`` and ``Series`` to get indexer and unique values (:issue:`7090`)

.. _whatsnew_0140.groupby:

Groupby API Changes
~~~~~~~~~~~~~~~~~~~

More consistent behaviour for some groupby methods:

- groupby ``head`` and ``tail`` now act more like ``filter`` than an aggregation:

.. ipython:: python

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
g = df.groupby('A')
g.head(1) # filters DataFrame

g.apply(lambda x: x.head(1)) # used to simply fall-through

- groupby head and tail respect column selection:

.. ipython:: python

g[['B']].head(1)

- groupby ``nth`` now filters by default, with an optional ``dropna`` argument to ignore
  NaN (to replicate the previous behaviour). See :ref:`the docs <groupby.nth>`. A short
  sketch of how ``first``/``last`` and ``nth`` now relate follows this list.

.. ipython:: python

df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
g = df.groupby('A')
g.nth(0) # can also use negative ints

g.nth(0, dropna='any') # similar to old behaviour

- groupby will now not return the grouped column for non-cython functions (:issue:`5610`, :issue:`5614`, :issue:`6732`),
  as it is already the index

.. ipython:: python

df = DataFrame([[1, np.nan], [1, 4], [5, 6], [5, 8]], columns=['A', 'B'])
g = df.groupby('A')
g.count()
g.describe()

- passing ``as_index=False`` will leave the grouped column in-place (this is not a change in 0.14.0)

.. ipython:: python

df = DataFrame([[1, np.nan], [1, 4], [5, 6], [5, 8]], columns=['A', 'B'])
g = df.groupby('A',as_index=False)
g.count()
g.describe()

- Allow specification of a more complex groupby via ``pd.Grouper``, such as grouping
by a Time and a string field simultaneously. See :ref:`the docs <groupby.specify>`. (:issue:`3794`)
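
As referenced in the ``nth`` bullet above, here is a small sketch (added for illustration, not part of the diff) of how ``first``/``last`` and ``nth`` now relate, reusing the frame from the examples above:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
    g = df.groupby('A')

    g.nth(0)                # indexed by 'A' like an aggregation; keeps the NaN row of group 1
    g.first()               # skips NaN, so here it matches g.nth(0, dropna='any')
    g.nth(0, dropna='any')  # replicates the pre-0.14 nth behaviour
    g.last()                # matches g.nth(-1) for this frame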

.. _whatsnew_0140.sql:

SQL
26 changes: 23 additions & 3 deletions pandas/core/groupby.py
@@ -99,6 +99,7 @@ class SpecificationError(GroupByError):
def _groupby_function(name, alias, npfunc, numeric_only=True,
_convert=False):
def f(self):
self._set_selection_from_grouper()
try:
return self._cython_agg_general(alias, numeric_only=numeric_only)
except AssertionError as e:
@@ -356,6 +357,7 @@ class GroupBy(PandasObject):
_apply_whitelist = _common_apply_whitelist
_internal_names = ['_cache']
_internal_names_set = set(_internal_names)
_group_selection = None

def __init__(self, obj, keys=None, axis=0, level=None,
grouper=None, exclusions=None, selection=None, as_index=True,
@@ -454,18 +456,20 @@ def _selection_list(self):
def _selected_obj(self):

if self._selection is None or isinstance(self.obj, Series):
if self._group_selection is not None:
return self.obj[self._group_selection]
return self.obj
else:
return self.obj[self._selection]

def _set_selection_from_grouper(self):
""" we may need create a selection if we have non-level groupers """
grp = self.grouper
if self._selection is None and self.as_index and getattr(grp,'groupings',None) is not None:
if self.as_index and getattr(grp,'groupings',None) is not None:
ax = self.obj._info_axis
groupers = [ g.name for g in grp.groupings if g.level is None and g.name is not None and g.name in ax ]
if len(groupers):
self._selection = (ax-Index(groupers)).tolist()
self._group_selection = (ax-Index(groupers)).tolist()

def _local_dir(self):
return sorted(set(self.obj._local_dir() + list(self._apply_whitelist)))
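
A deliberately simplified sketch of the idea behind the change above (the class below is invented for illustration and is not the real ``GroupBy`` internals): the explicit, user-made column selection and the implicit "drop the grouping key" selection are now stored separately, so computing one no longer overwrites the other.

    import pandas as pd

    class TinyGroupBy(object):
        """Toy model only; not the pandas implementation."""

        def __init__(self, obj, key):
            self.obj = obj                 # the DataFrame being grouped
            self.key = key                 # name of the grouping column
            self._selection = None         # set by an explicit g[['B']]-style selection
            self._group_selection = None   # set implicitly; never clobbers _selection

        def _set_selection_from_grouper(self):
            # only the implicit selection drops the grouping column
            if self._selection is None and self._group_selection is None:
                self._group_selection = [c for c in self.obj.columns if c != self.key]

        @property
        def _selected_obj(self):
            if self._selection is not None:
                return self.obj[self._selection]
            if self._group_selection is not None:
                return self.obj[self._group_selection]
            return self.obj

    df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
    g = TinyGroupBy(df, 'A')
    g._set_selection_from_grouper()
    g._selected_obj   # only column 'B'; 'A' is reserved for the result index
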
@@ -776,6 +780,7 @@ def nth(self, n, dropna=None):

"""

self._set_selection_from_grouper()
if not dropna: # good choice
m = self.grouper._max_groupsize
if n >= m or n < -m:
@@ -787,7 +792,21 @@
else:
rng[- n - 1] = True
is_nth = self._cumcount_array(rng, ascending=False)
return self._selected_obj[is_nth]

result = self._selected_obj[is_nth]

# the result index
if self.as_index:
ax = self.obj._info_axis
names = self.grouper.names
if all([ n in ax for n in names ]):
result.index = Index(self.obj[names][is_nth].values.ravel()).set_names(names)
elif self._group_selection is not None:
result.index = self.obj._get_axis(self.axis)[is_nth]

result = result.sort_index()

return result

if (isinstance(self._selected_obj, DataFrame)
and dropna not in ['any', 'all']):
@@ -853,6 +872,7 @@ def cumcount(self, **kwargs):
dtype: int64

"""
self._set_selection_from_grouper()
ascending = kwargs.pop('ascending', True)

index = self._selected_obj.index
66 changes: 45 additions & 21 deletions pandas/tests/test_groupby.py
@@ -166,18 +166,27 @@ def test_first_last_nth(self):
# tests for first / last / nth
grouped = self.df.groupby('A')
first = grouped.first()
expected = self.df.ix[[1, 0], ['B', 'C', 'D']]
expected.index = ['bar', 'foo']
assert_frame_equal(first, expected, check_names=False)
expected = self.df.ix[[1, 0], ['B','C','D']]
expected.index = Index(['bar', 'foo'],name='A')
expected = expected.sort_index()
assert_frame_equal(first, expected)

nth = grouped.nth(0)
assert_frame_equal(nth, expected)

last = grouped.last()
expected = self.df.ix[[5, 7], ['B', 'C', 'D']]
expected.index = ['bar', 'foo']
assert_frame_equal(last, expected, check_names=False)
expected = self.df.ix[[5, 7], ['B','C','D']]
expected.index = Index(['bar', 'foo'],name='A')
assert_frame_equal(last, expected)

nth = grouped.nth(-1)
assert_frame_equal(nth, expected)

nth = grouped.nth(1)
expected = self.df.iloc[[2, 3]]
assert_frame_equal(nth, expected, check_names=False)
expected = self.df.ix[[2, 3],['B','C','D']].copy()
expected.index = Index(['foo', 'bar'],name='A')
expected = expected.sort_index()
assert_frame_equal(nth, expected)

# it works!
grouped['B'].first()
@@ -189,6 +198,17 @@ def test_first_last_nth(self):
self.assert_(com.isnull(grouped['B'].last()['foo']))
self.assert_(com.isnull(grouped['B'].nth(0)[0])) # not sure what this is testing

# v0.14.0 whatsnew
df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
g = df.groupby('A')
result = g.first()
expected = df.iloc[[1,2]].set_index('A')
assert_frame_equal(result, expected)

expected = df.iloc[[1,2]].set_index('A')
result = g.nth(0,dropna='any')
assert_frame_equal(result, expected)

def test_first_last_nth_dtypes(self):

df = self.df_mixed_floats.copy()
Expand All @@ -199,17 +219,21 @@ def test_first_last_nth_dtypes(self):
grouped = df.groupby('A')
first = grouped.first()
expected = df.ix[[1, 0], ['B', 'C', 'D', 'E', 'F']]
expected.index = ['bar', 'foo']
assert_frame_equal(first, expected, check_names=False)
expected.index = Index(['bar', 'foo'], name='A')
expected = expected.sort_index()
assert_frame_equal(first, expected)

last = grouped.last()
expected = df.ix[[5, 7], ['B', 'C', 'D', 'E', 'F']]
expected.index = ['bar', 'foo']
assert_frame_equal(last, expected, check_names=False)
expected.index = Index(['bar', 'foo'], name='A')
expected = expected.sort_index()
assert_frame_equal(last, expected)

nth = grouped.nth(1)
expected = df.iloc[[2, 3]]
assert_frame_equal(nth, expected, check_names=False)
expected = df.ix[[3, 2],['B', 'C', 'D', 'E', 'F']]
expected.index = Index(['bar', 'foo'], name='A')
expected = expected.sort_index()
assert_frame_equal(nth, expected)

# GH 2763, first/last shifting dtypes
idx = lrange(10)
@@ -223,15 +247,15 @@ def test_nth(self):
df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
g = df.groupby('A')

assert_frame_equal(g.nth(0), df.iloc[[0, 2]])
assert_frame_equal(g.nth(1), df.iloc[[1]])
assert_frame_equal(g.nth(2), df.loc[[]])
assert_frame_equal(g.nth(-1), df.iloc[[1, 2]])
assert_frame_equal(g.nth(-2), df.iloc[[0]])
assert_frame_equal(g.nth(-3), df.loc[[]])
assert_frame_equal(g.nth(0), df.iloc[[0, 2]].set_index('A'))
assert_frame_equal(g.nth(1), df.iloc[[1]].set_index('A'))
assert_frame_equal(g.nth(2), df.loc[[],['B']])
assert_frame_equal(g.nth(-1), df.iloc[[1, 2]].set_index('A'))
assert_frame_equal(g.nth(-2), df.iloc[[0]].set_index('A'))
assert_frame_equal(g.nth(-3), df.loc[[],['B']])
assert_series_equal(g.B.nth(0), df.B.iloc[[0, 2]])
assert_series_equal(g.B.nth(1), df.B.iloc[[1]])
assert_frame_equal(g[['B']].nth(0), df.ix[[0, 2], ['B']])
assert_frame_equal(g[['B']].nth(0), df.ix[[0, 2], ['A', 'B']].set_index('A'))

exp = df.set_index('A')
assert_frame_equal(g.nth(0, dropna='any'), exp.iloc[[1, 2]])
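
A plain-usage view of what ``test_nth`` above exercises (an illustrative sketch, not part of the diff; assumes pandas >= 0.14):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
    g = df.groupby('A')

    g.nth(0)                # frame indexed by 'A'; the NaN row of group 1 is kept
    g.nth(0, dropna='any')  # drop NaN rows per group first, like the old nth behaviour
    g.B.nth(0)              # a selected Series keeps its original row index
    g[['B']].nth(0)         # a selected frame is still indexed by the group key 'A'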