Skip to content

BUG: Fixed groupby quantile for listlike q #27827

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

TomAugspurger
Copy link
Contributor

Closes #27526

Just a WIP right now. This is under tested for

  • groupby(sort=False)
  • non-sorted q
  • groupby(multiple levels)

@TomAugspurger TomAugspurger added this to the 0.25.1 milestone Aug 8, 2019
@@ -118,6 +118,7 @@ Plotting
Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^

- Fixed regression in :meth:`pands.core.groupby.DataFrameGroupBy.quantile` raising when multiple quantiles are given (:issue:`27526`)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: pandas

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple -> list-like

)
for qi in q
]
# fix levels to place quantiles on the inside
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would welcome improvements to this. Essentially, we have two things to do after concating the results

  1. Reorder the levels. By default concat places the keys on the outermost level, we need it on the innermost
  2. Re-sort things. Initially we're sorted by quantile (the q-th quantile for every group is together, then the next quantile). We need to be grouped by group key, then quantiles. We can't just use sort_index, since we need to preserve the user-provided group key order with sort=False and the order of the quintile list.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah this isn't fun. just checking: we can have mixed dtypes here? if we had a single dtype then stack/unstack would be cheap and there might be a prettier workaround

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concat(results, axis=1, keys=q).stack(0) gives the same answer for at least one of the test cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately that sorts the axis. I theory, we should be able to follow it up with a loc to restore the correct order on that level,

diff --git a/pandas/core/groupby/groupby.py b/pandas/core/groupby/groupby.py
index c00bd8398..617466b8d 100644
--- a/pandas/core/groupby/groupby.py
+++ b/pandas/core/groupby/groupby.py
@@ -1928,23 +1928,10 @@ class GroupBy(_GroupBy):
                 for qi in q
             ]
             # fix levels to place quantiles on the inside
-            result = concat(results, axis=0, keys=q)
-            order = np.roll(list(range(result.index.nlevels)), -1)
-            result = result.reorder_levels(order)
-            result = result.reindex(q, level=-1)
-
-            # fix order.
-            hi = len(q) * self.ngroups
-            arr = np.arange(0, hi, self.ngroups)
-            arrays = []
-
-            for i in range(self.ngroups):
-                arr = arr + i
-                arrays.append(arr)
-
-            indices = np.concatenate(arrays)
-            assert len(indices) == len(result)
-            return result.take(indices)
+            result = concat(results, axis=1, keys=q).stack(0)
+            slices = [slice(None)] * result.index.ndim
+            result = result.loc[tuple(slices), :]
+            return result
 
     @Substitution(name="groupby")
     def ngroup(self, ascending=True):

But that runs into the (I think known) issue with loc & a MultiIndex

In [21]: s = pd.Series(list(range(12)), index=pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2, 3, 4]]))

In [22]: s.loc[pd.IndexSlice[:, [3, 1, 2, 4]]]
Out[22]:
a  1     0
   2     1
   3     2
   4     3
b  1     4
   2     5
   3     6
   4     7
c  1     8
   2     9
   3    10
   4    11
dtype: int64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Darn. Can you add a my-idea-doesnt-work case to the tests

@TomAugspurger
Copy link
Contributor Author

@WillAyd @jbrockmendel do you have time to take a look at this?

@jbrockmendel
Copy link
Member

will look in AM

@TomAugspurger
Copy link
Contributor Author

@jbrockmendel I added your alternative implementation as a TODO once the MultiIndex issue (#10710) is fixed.

Darn. Can you add a my-idea-doesnt-work case to the tests

It's covered.

@TomAugspurger
Copy link
Contributor Author

CI failure is just codecov.

@TomAugspurger
Copy link
Contributor Author

Merging in 1 hour if there aren't any objections (only CI failure is codecov).

@TomAugspurger TomAugspurger merged commit 8f6118c into pandas-dev:master Aug 22, 2019
@TomAugspurger TomAugspurger deleted the 27526-groupby-quantile-array branch August 22, 2019 11:28
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 22, 2019
galuhsahid pushed a commit to galuhsahid/pandas that referenced this pull request Aug 25, 2019
* BUG: Fixed groupby quantile for listlike q

Closes pandas-dev#27526
galuhsahid added a commit to galuhsahid/pandas that referenced this pull request Aug 25, 2019
* master: (40 commits)
  DOC: Fix GL01 and GL02 errors in the docstrings (pandas-dev#27988)
  Remove Encoding of values in char** For Labels (pandas-dev#27618)
  TYPING: more type hints for io.formats.printing (pandas-dev#27765)
  TST: fix compression tests when run without virtualenv/condaenv (pandas-dev#28051)
  DOC: Start 0.25.2 (pandas-dev#28111)
  DOC: Fix docstrings lack of punctuation (pandas-dev#28031)
  DOC: Remove alias for numpy.random.randn from the docs (pandas-dev#28082)
  DOC: update GroupBy.head()/tail() documentation (pandas-dev#27844)
  BUG: timedelta merge asof with tolerance (pandas-dev#27650)
  BUG: Series.rename raises error on values accepted by Series construc… (pandas-dev#27814)
  Preserve index when setting new column on empty dataframe. (pandas-dev#26471)
  BUG: Fixed groupby quantile for listlike q (pandas-dev#27827)
  BUG: iter with readonly values, closes pandas-dev#28055 (pandas-dev#28074)
  TST: non-strict xfail for period test (pandas-dev#28072)
  DOC: Update whatsnew (pandas-dev#28073)
  CI: disable codecov (pandas-dev#28065)
  CI: Set SHA for codecov upload (pandas-dev#28067)
  BUG: Correct the previous bug fixing on xlim for plotting (pandas-dev#28059)
  CI: Add pip dependence explicitly (pandas-dev#28008)
  DOC: Change document code prun in a row (pandas-dev#28029)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Groupby Array-Type Quantiles Broken in 0.25.0
2 participants