-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fixed groupby quantile for listlike q #27827
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fixed groupby quantile for listlike q #27827
Conversation
@@ -118,6 +118,7 @@ Plotting | |||
Groupby/resample/rolling | |||
^^^^^^^^^^^^^^^^^^^^^^^^ | |||
|
|||
- Fixed regression in :meth:`pands.core.groupby.DataFrameGroupBy.quantile` raising when multiple quantiles are given (:issue:`27526`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: pandas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple -> list-like
pandas/core/groupby/groupby.py
Outdated
) | ||
for qi in q | ||
] | ||
# fix levels to place quantiles on the inside |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would welcome improvements to this. Essentially, we have two things to do after concating the results
- Reorder the levels. By default
concat
places thekeys
on the outermost level, we need it on the innermost - Re-sort things. Initially we're sorted by quantile (the
q
-th quantile for every group is together, then the next quantile). We need to be grouped by group key, then quantiles. We can't just usesort_index
, since we need to preserve the user-provided group key order withsort=False
and the order of the quintile list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yah this isn't fun. just checking: we can have mixed dtypes here? if we had a single dtype then stack/unstack would be cheap and there might be a prettier workaround
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
concat(results, axis=1, keys=q).stack(0)
gives the same answer for at least one of the test cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately that sorts the axis. I theory, we should be able to follow it up with a loc
to restore the correct order on that level,
diff --git a/pandas/core/groupby/groupby.py b/pandas/core/groupby/groupby.py
index c00bd8398..617466b8d 100644
--- a/pandas/core/groupby/groupby.py
+++ b/pandas/core/groupby/groupby.py
@@ -1928,23 +1928,10 @@ class GroupBy(_GroupBy):
for qi in q
]
# fix levels to place quantiles on the inside
- result = concat(results, axis=0, keys=q)
- order = np.roll(list(range(result.index.nlevels)), -1)
- result = result.reorder_levels(order)
- result = result.reindex(q, level=-1)
-
- # fix order.
- hi = len(q) * self.ngroups
- arr = np.arange(0, hi, self.ngroups)
- arrays = []
-
- for i in range(self.ngroups):
- arr = arr + i
- arrays.append(arr)
-
- indices = np.concatenate(arrays)
- assert len(indices) == len(result)
- return result.take(indices)
+ result = concat(results, axis=1, keys=q).stack(0)
+ slices = [slice(None)] * result.index.ndim
+ result = result.loc[tuple(slices), :]
+ return result
@Substitution(name="groupby")
def ngroup(self, ascending=True):
But that runs into the (I think known) issue with loc & a MultiIndex
In [21]: s = pd.Series(list(range(12)), index=pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2, 3, 4]]))
In [22]: s.loc[pd.IndexSlice[:, [3, 1, 2, 4]]]
Out[22]:
a 1 0
2 1
3 2
4 3
b 1 4
2 5
3 6
4 7
c 1 8
2 9
3 10
4 11
dtype: int64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Darn. Can you add a my-idea-doesnt-work case to the tests
@WillAyd @jbrockmendel do you have time to take a look at this? |
will look in AM |
@jbrockmendel I added your alternative implementation as a TODO once the MultiIndex issue (#10710) is fixed.
It's covered. |
CI failure is just codecov. |
Merging in 1 hour if there aren't any objections (only CI failure is codecov). |
* BUG: Fixed groupby quantile for listlike q Closes pandas-dev#27526
* master: (40 commits) DOC: Fix GL01 and GL02 errors in the docstrings (pandas-dev#27988) Remove Encoding of values in char** For Labels (pandas-dev#27618) TYPING: more type hints for io.formats.printing (pandas-dev#27765) TST: fix compression tests when run without virtualenv/condaenv (pandas-dev#28051) DOC: Start 0.25.2 (pandas-dev#28111) DOC: Fix docstrings lack of punctuation (pandas-dev#28031) DOC: Remove alias for numpy.random.randn from the docs (pandas-dev#28082) DOC: update GroupBy.head()/tail() documentation (pandas-dev#27844) BUG: timedelta merge asof with tolerance (pandas-dev#27650) BUG: Series.rename raises error on values accepted by Series construc… (pandas-dev#27814) Preserve index when setting new column on empty dataframe. (pandas-dev#26471) BUG: Fixed groupby quantile for listlike q (pandas-dev#27827) BUG: iter with readonly values, closes pandas-dev#28055 (pandas-dev#28074) TST: non-strict xfail for period test (pandas-dev#28072) DOC: Update whatsnew (pandas-dev#28073) CI: disable codecov (pandas-dev#28065) CI: Set SHA for codecov upload (pandas-dev#28067) BUG: Correct the previous bug fixing on xlim for plotting (pandas-dev#28059) CI: Add pip dependence explicitly (pandas-dev#28008) DOC: Change document code prun in a row (pandas-dev#28029) ...
Closes #27526
Just a WIP right now. This is under tested for
groupby(sort=False)
q
groupby(multiple levels)