
Commit 5a8883b

Diego Fernandez authored and jreback committed
BUG: Ensure the right values are set in SeriesGroupBy.nunique
closes #13453

Author: Diego Fernandez <[email protected]>

Closes #15418 from aiguofer/gh_13453 and squashes the following commits:

c53bd70 [Diego Fernandez] Add test for #13453 in test_resample and add note to whatsnew
0daab80 [Diego Fernandez] Ensure the right values are set in SeriesGroupBy.nunique
1 parent ddb22f5 commit 5a8883b
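
For context, the user-facing scenario looks roughly like the sketch below. It is adapted from the test added to test_groupby.py in this commit; the df and grouped names are mine, and pd.TimeGrouper was later deprecated in favour of pd.Grouper. With a datetimelike grouper most hourly bins are empty, and before this fix the per-bin unique counts could land in the wrong bins:

    import pandas as pd
    from pandas import Timestamp

    # Three timestamps spread across hourly bins, most of which are empty
    # (data adapted from the test added in this commit).
    df = pd.DataFrame({
        'time': [Timestamp('2016-06-28 09:35:35'),
                 Timestamp('2016-06-28 16:09:30'),
                 Timestamp('2016-06-28 16:46:28')],
        'data': ['1', '2', '3']}).set_index('time')

    grouped = df.groupby(pd.TimeGrouper(freq='h'))['data']
    # With the fix, the optimized nunique agrees with a plain per-group apply.
    print(grouped.nunique().equals(grouped.apply(pd.Series.nunique)))
    # expected: True on a pandas build that includes this fix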

File tree: 4 files changed (+38, -4 lines)

doc/source/whatsnew/v0.20.0.txt (+4, -3)

@@ -418,6 +418,7 @@ New Behavior:
 Other API Changes
 ^^^^^^^^^^^^^^^^^

+- ``numexpr`` version is now required to be >= 2.4.6 and it will not be used at all if this requisite is not fulfilled (:issue:`15213`).
 - ``CParserError`` has been renamed to ``ParserError`` in ``pd.read_csv`` and will be removed in the future (:issue:`12665`)
 - ``SparseArray.cumsum()`` and ``SparseSeries.cumsum()`` will now always return ``SparseArray`` and ``SparseSeries`` respectively (:issue:`12855`)
 - ``DataFrame.applymap()`` with an empty ``DataFrame`` will return a copy of the empty ``DataFrame`` instead of a ``Series`` (:issue:`8222`)

@@ -428,9 +429,8 @@ Other API Changes
 - ``inplace`` arguments now require a boolean value, else a ``ValueError`` is thrown (:issue:`14189`)
 - ``pandas.api.types.is_datetime64_ns_dtype`` will now report ``True`` on a tz-aware dtype, similar to ``pandas.api.types.is_datetime64_any_dtype``
 - ``DataFrame.asof()`` will return a null filled ``Series`` instead the scalar ``NaN`` if a match is not found (:issue:`15118`)
-- The :func:`pd.read_gbq` method now stores ``INTEGER`` columns as ``dtype=object`` if they contain ``NULL`` values. Otherwise they are stored as ``int64``. This prevents precision lost for integers greather than 2**53. Furthermore ``FLOAT`` columns with values above 10**4 are no more casted to ``int64`` which also caused precision lost (:issue: `14064`, :issue:`14305`).
+- The :func:`pd.read_gbq` method now stores ``INTEGER`` columns as ``dtype=object`` if they contain ``NULL`` values. Otherwise they are stored as ``int64``. This prevents precision lost for integers greather than 2**53. Furthermore ``FLOAT`` columns with values above 10**4 are no longer casted to ``int64`` which also caused precision loss (:issue:`14064`, :issue:`14305`).
 - Reorganization of timeseries development tests (:issue:`14854`)
-- ``numexpr`` version is now required to be >= 2.4.6 and it will not be used at all if this requisite is not fulfilled (:issue:`15213`).

 .. _whatsnew_0200.deprecations:

@@ -473,7 +473,7 @@ Performance Improvements
   (or with ``compat_x=True``) (:issue:`15073`).
 - Improved performance of ``groupby().cummin()`` and ``groupby().cummax()`` (:issue:`15048`, :issue:`15109`)
 - Improved performance and reduced memory when indexing with a ``MultiIndex`` (:issue:`15245`)
-- When reading buffer object in ``read_sas()`` method without specified format, filepath string is inferred rather than buffer object.
+- When reading buffer object in ``read_sas()`` method without specified format, filepath string is inferred rather than buffer object. (:issue:`14947`)


@@ -553,6 +553,7 @@ Bug Fixes

 - Bug in ``DataFrame.groupby().describe()`` when grouping on ``Index`` containing tuples (:issue:`14848`)
 - Bug in creating a ``MultiIndex`` with tuples and not passing a list of names; this will now raise ``ValueError`` (:issue:`15110`)
+- Bug in ``groupby().nunique()`` with a datetimelike-grouper where bins counts were incorrect (:issue:`13453`)

 - Bug in catching an overflow in ``Timestamp`` + ``Timedelta/Offset`` operations (:issue:`15126`)
 - Bug in the HTML display with with a ``MultiIndex`` and truncation (:issue:`14882`)

pandas/core/groupby.py (+1, -1)

@@ -3032,7 +3032,7 @@ def nunique(self, dropna=True):
         # we might have duplications among the bins
         if len(res) != len(ri):
             res, out = np.zeros(len(ri), dtype=out.dtype), res
-            res[ids] = out
+            res[ids[idx]] = out

         return Series(res,
                       index=ri,
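
To see why the one-character change matters: in the surrounding function (whose earlier lines are not shown in this hunk), idx marks where each group's block of sorted rows starts, so ids[idx] yields one bin number per observed group and lines up element-for-element with out, while ids alone has one entry per row. A rough NumPy sketch of the scatter step, with made-up numbers and illustrative only (not the actual pandas internals):

    import numpy as np

    # Bin id of each (sorted) row; bins 1, 2, 5 and 6 are empty.
    ids = np.array([0, 0, 3, 4, 4])
    # Start position of each group's block of rows.
    idx = np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]]   # [0, 2, 3]
    # Pretend per-group nunique results, one entry per observed group.
    out = np.array([2, 1, 2])

    res = np.zeros(7, dtype=out.dtype)   # one slot per bin, empty bins included
    res[ids[idx]] = out                  # ids[idx] -> [0, 3, 4]
    print(res)                           # [2 0 0 1 2 0 0]
    # The pre-fix res[ids] = out indexed with one entry per row (5 here),
    # which does not line up with out (3 entries) once some bins are empty.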

pandas/tests/groupby/test_groupby.py (+13)

@@ -4159,6 +4159,19 @@ def test_nunique_with_empty_series(self):
         expected = pd.Series(name='name', dtype='int64')
         tm.assert_series_equal(result, expected)

+    def test_nunique_with_timegrouper(self):
+        # GH 13453
+        test = pd.DataFrame({
+            'time': [Timestamp('2016-06-28 09:35:35'),
+                     Timestamp('2016-06-28 16:09:30'),
+                     Timestamp('2016-06-28 16:46:28')],
+            'data': ['1', '2', '3']}).set_index('time')
+        result = test.groupby(pd.TimeGrouper(freq='h'))['data'].nunique()
+        expected = test.groupby(
+            pd.TimeGrouper(freq='h')
+        )['data'].apply(pd.Series.nunique)
+        tm.assert_series_equal(result, expected)
+
     def test_numpy_compat(self):
         # see gh-12811
         df = pd.DataFrame({'A': [1, 2, 1], 'B': [1, 2, 3]})

pandas/tests/tseries/test_resample.py (+20)

@@ -1939,6 +1939,26 @@ def test_resample_nunique(self):
         result = df.ID.groupby(pd.Grouper(freq='D')).nunique()
         assert_series_equal(result, expected)

+    def test_resample_nunique_with_date_gap(self):
+        # GH 13453
+        index = pd.date_range('1-1-2000', '2-15-2000', freq='h')
+        index2 = pd.date_range('4-15-2000', '5-15-2000', freq='h')
+        index3 = index.append(index2)
+        s = pd.Series(range(len(index3)), index=index3)
+        r = s.resample('M')
+
+        # Since all elements are unique, these should all be the same
+        results = [
+            r.count(),
+            r.nunique(),
+            r.agg(pd.Series.nunique),
+            r.agg('nunique')
+        ]
+
+        assert_series_equal(results[0], results[1])
+        assert_series_equal(results[0], results[2])
+        assert_series_equal(results[0], results[3])
+
     def test_resample_group_info(self):  # GH10914
         for n, k in product((10000, 100000), (10, 100, 1000)):
             dr = date_range(start='2015-08-27', periods=n // 10, freq='T')
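
As a quick interactive check of the resample path exercised by the new test (a sketch assuming a pandas build that includes this fix; idx, s and r are my own names), the per-month unique count should equal the per-month element count when every value is unique, even across the empty March bin:

    import pandas as pd

    # Hourly data with a gap from mid-February to mid-April, so resampling
    # by month produces an empty bin in between.
    idx = pd.date_range('2000-01-01', '2000-02-15', freq='h').append(
        pd.date_range('2000-04-15', '2000-05-15', freq='h'))
    s = pd.Series(range(len(idx)), index=idx)

    r = s.resample('M')
    print(r.nunique().equals(r.count()))
    # expected: True, since all values are unique in every non-empty bin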
