
Commit 5a8883b

Diego Fernandez authored and jreback committed
BUG: Ensure the right values are set in SeriesGroupBy.nunique
closes #13453

Author: Diego Fernandez <[email protected]>

Closes #15418 from aiguofer/gh_13453 and squashes the following commits:

c53bd70 [Diego Fernandez] Add test for #13453 in test_resample and add note to whatsnew
0daab80 [Diego Fernandez] Ensure the right values are set in SeriesGroupBy.nunique
1 parent ddb22f5 commit 5a8883b
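
For context, the user-facing scenario looks roughly like the sketch below. It is adapted from the test added to test_groupby.py in this commit; the df and grouped names are mine, and pd.TimeGrouper was later deprecated in favour of pd.Grouper. With a datetimelike grouper most hourly bins are empty, and before this fix the per-bin unique counts could land in the wrong bins:

    import pandas as pd
    from pandas import Timestamp

    # Three timestamps spread across hourly bins, most of which are empty
    # (data adapted from the test added in this commit).
    df = pd.DataFrame({
        'time': [Timestamp('2016-06-28 09:35:35'),
                 Timestamp('2016-06-28 16:09:30'),
                 Timestamp('2016-06-28 16:46:28')],
        'data': ['1', '2', '3']}).set_index('time')

    grouped = df.groupby(pd.TimeGrouper(freq='h'))['data']
    # With the fix, the optimized nunique agrees with a plain per-group apply.
    print(grouped.nunique().equals(grouped.apply(pd.Series.nunique)))
    # expected: True on a pandas build that includes this fix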

File tree: 4 files changed (+38, -4 lines)

doc/source/whatsnew/v0.20.0.txt (+4, -3)

@@ -418,6 +418,7 @@ New Behavior:
 Other API Changes
 ^^^^^^^^^^^^^^^^^

+- ``numexpr`` version is now required to be >= 2.4.6 and it will not be used at all if this requisite is not fulfilled (:issue:`15213`).
 - ``CParserError`` has been renamed to ``ParserError`` in ``pd.read_csv`` and will be removed in the future (:issue:`12665`)
 - ``SparseArray.cumsum()`` and ``SparseSeries.cumsum()`` will now always return ``SparseArray`` and ``SparseSeries`` respectively (:issue:`12855`)
 - ``DataFrame.applymap()`` with an empty ``DataFrame`` will return a copy of the empty ``DataFrame`` instead of a ``Series`` (:issue:`8222`)

@@ -428,9 +429,8 @@ Other API Changes
 - ``inplace`` arguments now require a boolean value, else a ``ValueError`` is thrown (:issue:`14189`)
 - ``pandas.api.types.is_datetime64_ns_dtype`` will now report ``True`` on a tz-aware dtype, similar to ``pandas.api.types.is_datetime64_any_dtype``
 - ``DataFrame.asof()`` will return a null filled ``Series`` instead the scalar ``NaN`` if a match is not found (:issue:`15118`)
-- The :func:`pd.read_gbq` method now stores ``INTEGER`` columns as ``dtype=object`` if they contain ``NULL`` values. Otherwise they are stored as ``int64``. This prevents precision lost for integers greather than 2**53. Furthermore ``FLOAT`` columns with values above 10**4 are no more casted to ``int64`` which also caused precision lost (:issue: `14064`, :issue:`14305`).
+- The :func:`pd.read_gbq` method now stores ``INTEGER`` columns as ``dtype=object`` if they contain ``NULL`` values. Otherwise they are stored as ``int64``. This prevents precision lost for integers greather than 2**53. Furthermore ``FLOAT`` columns with values above 10**4 are no longer casted to ``int64`` which also caused precision loss (:issue:`14064`, :issue:`14305`).
 - Reorganization of timeseries development tests (:issue:`14854`)
-- ``numexpr`` version is now required to be >= 2.4.6 and it will not be used at all if this requisite is not fulfilled (:issue:`15213`).

 .. _whatsnew_0200.deprecations:

@@ -473,7 +473,7 @@ Performance Improvements
   (or with ``compat_x=True``) (:issue:`15073`).
 - Improved performance of ``groupby().cummin()`` and ``groupby().cummax()`` (:issue:`15048`, :issue:`15109`)
 - Improved performance and reduced memory when indexing with a ``MultiIndex`` (:issue:`15245`)
-- When reading buffer object in ``read_sas()`` method without specified format, filepath string is inferred rather than buffer object.
+- When reading buffer object in ``read_sas()`` method without specified format, filepath string is inferred rather than buffer object. (:issue:`14947`)


@@ -553,6 +553,7 @@ Bug Fixes

 - Bug in ``DataFrame.groupby().describe()`` when grouping on ``Index`` containing tuples (:issue:`14848`)
 - Bug in creating a ``MultiIndex`` with tuples and not passing a list of names; this will now raise ``ValueError`` (:issue:`15110`)
+- Bug in ``groupby().nunique()`` with a datetimelike-grouper where bins counts were incorrect (:issue:`13453`)

 - Bug in catching an overflow in ``Timestamp`` + ``Timedelta/Offset`` operations (:issue:`15126`)
 - Bug in the HTML display with with a ``MultiIndex`` and truncation (:issue:`14882`)

pandas/core/groupby.py (+1, -1)

@@ -3032,7 +3032,7 @@ def nunique(self, dropna=True):
         # we might have duplications among the bins
         if len(res) != len(ri):
             res, out = np.zeros(len(ri), dtype=out.dtype), res
-            res[ids] = out
+            res[ids[idx]] = out

         return Series(res,
                       index=ri,
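
To see why the one-character change matters: in the surrounding function (whose earlier lines are not shown in this hunk), idx marks where each group's block of sorted rows starts, so ids[idx] yields one bin number per observed group and lines up element-for-element with out, while ids alone has one entry per row. A rough NumPy sketch of the scatter step, with made-up numbers and illustrative only (not the actual pandas internals):

    import numpy as np

    # Bin id of each (sorted) row; bins 1, 2, 5 and 6 are empty.
    ids = np.array([0, 0, 3, 4, 4])
    # Start position of each group's block of rows.
    idx = np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]]   # [0, 2, 3]
    # Pretend per-group nunique results, one entry per observed group.
    out = np.array([2, 1, 2])

    res = np.zeros(7, dtype=out.dtype)   # one slot per bin, empty bins included
    res[ids[idx]] = out                  # ids[idx] -> [0, 3, 4]
    print(res)                           # [2 0 0 1 2 0 0]
    # The pre-fix res[ids] = out indexed with one entry per row (5 here),
    # which does not line up with out (3 entries) once some bins are empty.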

pandas/tests/groupby/test_groupby.py (+13)

@@ -4159,6 +4159,19 @@ def test_nunique_with_empty_series(self):
         expected = pd.Series(name='name', dtype='int64')
         tm.assert_series_equal(result, expected)

+    def test_nunique_with_timegrouper(self):
+        # GH 13453
+        test = pd.DataFrame({
+            'time': [Timestamp('2016-06-28 09:35:35'),
+                     Timestamp('2016-06-28 16:09:30'),
+                     Timestamp('2016-06-28 16:46:28')],
+            'data': ['1', '2', '3']}).set_index('time')
+        result = test.groupby(pd.TimeGrouper(freq='h'))['data'].nunique()
+        expected = test.groupby(
+            pd.TimeGrouper(freq='h')
+        )['data'].apply(pd.Series.nunique)
+        tm.assert_series_equal(result, expected)
+
     def test_numpy_compat(self):
         # see gh-12811
         df = pd.DataFrame({'A': [1, 2, 1], 'B': [1, 2, 3]})

pandas/tests/tseries/test_resample.py (+20)

@@ -1939,6 +1939,26 @@ def test_resample_nunique(self):
         result = df.ID.groupby(pd.Grouper(freq='D')).nunique()
         assert_series_equal(result, expected)

+    def test_resample_nunique_with_date_gap(self):
+        # GH 13453
+        index = pd.date_range('1-1-2000', '2-15-2000', freq='h')
+        index2 = pd.date_range('4-15-2000', '5-15-2000', freq='h')
+        index3 = index.append(index2)
+        s = pd.Series(range(len(index3)), index=index3)
+        r = s.resample('M')
+
+        # Since all elements are unique, these should all be the same
+        results = [
+            r.count(),
+            r.nunique(),
+            r.agg(pd.Series.nunique),
+            r.agg('nunique')
+        ]
+
+        assert_series_equal(results[0], results[1])
+        assert_series_equal(results[0], results[2])
+        assert_series_equal(results[0], results[3])
+
     def test_resample_group_info(self):  # GH10914
         for n, k in product((10000, 100000), (10, 100, 1000)):
             dr = date_range(start='2015-08-27', periods=n // 10, freq='T')
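
As a quick interactive check of the resample path exercised by the new test (a sketch assuming a pandas build that includes this fix; idx, s and r are my own names), the per-month unique count should equal the per-month element count when every value is unique, even across the empty March bin:

    import pandas as pd

    # Hourly data with a gap from mid-February to mid-April, so resampling
    # by month produces an empty bin in between.
    idx = pd.date_range('2000-01-01', '2000-02-15', freq='h').append(
        pd.date_range('2000-04-15', '2000-05-15', freq='h'))
    s = pd.Series(range(len(idx)), index=idx)

    r = s.resample('M')
    print(r.nunique().equals(r.count()))
    # expected: True, since all values are unique in every non-empty bin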
