Commit 2740fb4

BUG: Groupby quantiles incorrect bins #33200 (#33644)
1 parent d75100d commit 2740fb4

3 files changed: +31 -7 lines changed

doc/source/whatsnew/v1.1.0.rst

+1
@@ -826,6 +826,7 @@ Groupby/resample/rolling
 - Bug in :meth:`DataFrame.resample` where an ``AmbiguousTimeError`` would be raised when the resulting timezone aware :class:`DatetimeIndex` had a DST transition at midnight (:issue:`25758`)
 - Bug in :meth:`DataFrame.groupby` where a ``ValueError`` would be raised when grouping by a categorical column with read-only categories and ``sort=False`` (:issue:`33410`)
 - Bug in :meth:`GroupBy.first` and :meth:`GroupBy.last` where None is not preserved in object dtype (:issue:`32800`)
+- Bug in :meth:`GroupBy.quantile` causes the quantiles to be shifted when the ``by`` axis contains ``NaN`` (:issue:`33200`, :issue:`33569`)
 - Bug in :meth:`Rolling.min` and :meth:`Rolling.max`: Growing memory usage after multiple calls when using a fixed window (:issue:`30726`)
 - Bug in :meth:`Series.groupby` would raise ``ValueError`` when grouping by :class:`PeriodIndex` level (:issue:`34010`)
 - Bug in :meth:`GroupBy.agg`, :meth:`GroupBy.transform`, and :meth:`GroupBy.resample` where subclasses are not preserved (:issue:`28330`)

pandas/_libs/groupby.pyx

+7 -1
@@ -777,7 +777,13 @@ def group_quantile(ndarray[float64_t] out,
                 non_na_counts[lab] += 1

     # Get an index of values sorted by labels and then values
-    order = (values, labels)
+    if labels.any():
+        # Put '-1' (NaN) labels as the last group so it does not interfere
+        # with the calculations.
+        labels_for_lexsort = np.where(labels == -1, labels.max() + 1, labels)
+    else:
+        labels_for_lexsort = labels
+    order = (values, labels_for_lexsort)
     sort_arr = np.lexsort(order).astype(np.int64, copy=False)

     with nogil:
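
The effect of the remapping in this hunk can be reproduced outside Cython. The following is a minimal NumPy sketch, not the pandas implementation: `sort_by_group_then_value` and `group_medians` are hypothetical helpers written for illustration, and the slicing in `group_medians` assumes that the code after the sort reads each group's rows as consecutive runs of the sorted index, starting at offset 0. Under that assumption, rows labelled `-1` that sort first push every real group's window onto the wrong rows.

```python
import numpy as np


def sort_by_group_then_value(values, labels):
    """Return indices ordering rows by group label, then by value.

    Hypothetical helper mirroring the patched lines: labels of -1 (missing
    group key) are remapped to one past the largest label so that they sort
    after every real group.
    """
    labels = np.asarray(labels)
    if labels.any():
        labels_for_lexsort = np.where(labels == -1, labels.max() + 1, labels)
    else:
        labels_for_lexsort = labels
    # np.lexsort treats the LAST key as the primary sort key.
    return np.lexsort((values, labels_for_lexsort))


def group_medians(values, labels, order):
    """Median per group, reading consecutive slices of `order` (assumed layout)."""
    counts = np.bincount(labels[labels >= 0])  # rows per real group
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    return [float(np.median(values[order[s:s + c]])) for s, c in zip(starts, counts)]


values = np.array([0.0, 1.0, 2.0, 3.0])
labels = np.array([0, -1, 1, 1])  # corresponds to key = [1.0, NaN, 2.0, 2.0]

naive_order = np.lexsort((values, labels))       # -1 rows sort first
fixed_order = sort_by_group_then_value(values, labels)

print(naive_order.tolist())                          # [1, 0, 2, 3]
print(fixed_order.tolist())                          # [0, 2, 3, 1]
print(group_medians(values, labels, naive_order))    # [1.0, 1.0]
print(group_medians(values, labels, fixed_order))    # [0.0, 2.5]
```

With the remapped labels the medians are `[0.0, 2.5]`, the values the parametrized test expects for this key/value pair. One side effect of the `labels.any()` guard is that `labels.max()` is never called on an empty array, where it would raise.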

pandas/tests/groupby/test_quantile.py

+23 -6
@@ -181,15 +181,32 @@ def test_quantile_missing_group_values_no_segfaults():
         grp.quantile()


-def test_quantile_missing_group_values_correct_results():
-    # GH 28662
-    data = np.array([1.0, np.nan, 3.0, np.nan])
-    df = pd.DataFrame(dict(key=data, val=range(4)))
+@pytest.mark.parametrize(
+    "key, val, expected_key, expected_val",
+    [
+        ([1.0, np.nan, 3.0, np.nan], range(4), [1.0, 3.0], [0.0, 2.0]),
+        ([1.0, np.nan, 2.0, 2.0], range(4), [1.0, 2.0], [0.0, 2.5]),
+        (["a", "b", "b", np.nan], range(4), ["a", "b"], [0, 1.5]),
+        ([0], [42], [0], [42.0]),
+        ([], [], np.array([], dtype="float64"), np.array([], dtype="float64")),
+    ],
+)
+def test_quantile_missing_group_values_correct_results(
+    key, val, expected_key, expected_val
+):
+    # GH 28662, GH 33200, GH 33569
+    df = pd.DataFrame({"key": key, "val": val})

-    result = df.groupby("key").quantile()
     expected = pd.DataFrame(
-        [1.0, 3.0], index=pd.Index([1.0, 3.0], name="key"), columns=["val"]
+        expected_val, index=pd.Index(expected_key, name="key"), columns=["val"]
     )
+
+    grp = df.groupby("key")
+
+    result = grp.quantile(0.5)
+    tm.assert_frame_equal(result, expected)
+
+    result = grp.quantile()
     tm.assert_frame_equal(result, expected)
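
For readers who want to check the behavior through the public API rather than the test suite, here is a small standalone sketch of one parametrized case (string keys with a trailing `NaN`). It assumes a pandas version that already contains this fix; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data as the ["a", "b", "b", np.nan] test case: the last row has no key.
df = pd.DataFrame({"key": ["a", "b", "b", np.nan], "val": range(4)})

# Rows with a missing key are dropped from the grouping by default.
result = df.groupby("key").quantile(0.5)

medians = result["val"].tolist()
groups = result.index.tolist()
print(groups)   # ['a', 'b']
print(medians)  # [0.0, 1.5]
```

Group `"a"` holds only `val = 0` and group `"b"` holds `val = 1, 2`, so the medians are `0.0` and `1.5`; the row with the `NaN` key contributes to neither.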