BUG in DataFrameGroupBy.rank raising an obscure TypeError #11918


Closed · wants to merge 3 commits

Conversation

nbonnotte (Contributor)

Fix issue #11759

Two things:

  1. Series.rank and DataFrame.rank have been moved to generic.py and now share the same signature [done in PR #11924: CLN: Moving Series.rank and DataFrame.rank to generic.py].
  2. It can happen that aggregate functions are applied to a grouped object recursively. Instead of raising a plain ValueError when that fails and catching it indiscriminately afterwards, I propose raising an _ErrorContainer exception containing the last meaningful error. Thus, when everything else has failed, we can still recover a meaningful exception and throw it back at the user (see the sketch below).
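
A minimal sketch of the second point, assuming hypothetical helper names rather than the actual pandas code path: the last real error is remembered while items are tried, and re-raised inside an _ErrorContainer when nothing succeeded.

    class _ErrorContainer(Exception):
        """Carries the last meaningful error raised during a failed aggregation."""

        def __init__(self, original):
            super(_ErrorContainer, self).__init__(str(original))
            self.original = original

    def aggregate_item_by_item(items, func):
        # Hypothetical recursive fallback: try each item and remember the
        # last real failure instead of silently discarding it.
        results, last_error = {}, None
        for key, item in items:
            try:
                results[key] = func(item)
            except Exception as err:
                last_error = err
        if not results and last_error is not None:
            # Everything failed: surface the meaningful error to the caller
            # instead of returning an empty result.
            raise _ErrorContainer(last_error)
        return results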

@nbonnotte (Contributor Author)

@jreback a test fails because test_rank_methods_series() in tests/test_stats.py calls Series.rank with a non-keyword first argument, which is now mistakenly bound to axis (the first parameter of the new signature, inherited from DataFrame.rank) instead of method (the first parameter of the old Series.rank signature).

So there is a change in the API. Can I alter the test? Is there a warning I should put somewhere?
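
To illustrate the binding change with schematic stand-ins (these are not the real pandas methods), a positional first argument that used to bind to method now binds to axis:

    # Schematic stand-ins for the old and new signatures (not the real methods)
    def rank_old(method='average', ascending=True):          # old Series.rank order
        return 'method', method

    def rank_new(axis=0, method='average', ascending=True):  # new shared order
        return 'axis', axis

    print(rank_old('min'))  # ('method', 'min') -- what the old test relied on
    print(rank_new('min'))  # ('axis', 'min')   -- the same call now binds to axis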


# this method does not make sense when ndim > 2
if self.ndim > 2:
    raise NotImplementedError
Review comment (Contributor):

use a nice message
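
A possible concrete form of the suggested change, mirroring the hunk above (the message wording is illustrative):

    # this method does not make sense when ndim > 2
    if self.ndim > 2:
        raise NotImplementedError("rank is not supported for objects with ndim > 2")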

jreback added the Groupby and Error Reporting labels on Dec 29, 2015.
@nbonnotte (Contributor Author)

@jreback The problem is that, to perform the aggregation, different methods are tried (when one fails, the exception is caught and the next is tried), and as a last resort the aggregation is attempted "item by item", recursively. For each item we thus start over, until we obtain something or there is no item left to work with in the final recursive attempt. If any exception is raised, the current item is discarded.

Currently, when there is no child item left to work with, a ValueError is raised, which is caught like any other error by the previous level in the recursion. The parent item which caused it is thus discarded. In fact, in the test I wrote (test_rank() in pandas/tests/test_groupby.py) all items are discarded and we end up with an empty frame instead of a meaningful error.

Fundamentally, the issue is that we need to be able to tell the difference between a real ValueError and a ValueError that means "sorry, no item left", so that we can avoid returning an empty frame and raise a good error instead.
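
One way to make that distinction explicit, sketched with illustrative names rather than the actual pandas internals, is a dedicated subclass for the "no item left" signal so it cannot be confused with a genuine ValueError:

    class NoItemsLeftError(ValueError):
        """Internal signal: the recursive item-by-item attempt ran out of items."""

    def aggregate_items(items, func):
        if not items:
            raise NoItemsLeftError("no items left to aggregate")
        results = {}
        for key, item in items:
            try:
                results[key] = func(item)   # may recurse into aggregate_items
            except NoItemsLeftError:
                continue                    # expected: discard just this item
            # any other ValueError propagates, so a real failure is not swallowed
        return results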

I'm going to propose another implementation, which I hope will be clearer.


jreback commented Dec 29, 2015

@nbonnotte yes, don't mix 2 issues within a PR like this. too confusing what you are trying to do.

@nbonnotte (Contributor Author)

@jreback Ok, I will do another PR with only the part where I put Series.rank and DataFrame.rank into generic.py, so that we can move on from this.

Note that it will not solve issue #11759, but it will be the first step.


jreback commented Dec 29, 2015

@nbonnotte prob the way to actually solve this is to define .rank as a groupby function which handles this in the correct manner.

@nbonnotte (Contributor Author)

@jreback before defining .rank as a groupby function as you suggested, I'm trying this last approach, which seems to work.

But to be honest, I'm not sure I understand the reason behind each piece of code I touched, and I'm trusting Travis to tell me if I did something wrong. So let me know if I've done something I shouldn't have.

But if it seems OK to you, I'll squash the commits, with a proper commit message.

(Uh, the numpy_dev tests fail, but I can't see if that's related to my modifications)

@nbonnotte (Contributor Author)

All green

                             *args, **kwargs)
        except (AttributeError):
            raise ValueError
        # this can be called recursively, so need to raise a
Review comment (Contributor):

I don't like this at all, what exactly are you trying to do here? why are you not simply defining .rank on the groupby methods and passing thru args?

Review comment (Contributor Author):

If you don't like this proposition, I'll do that... but I wanted to try this last thing first, because I'd feel better having made the general code somewhat better rather than patching it with an ad-hoc solution for .rank.

Because, right now, I feel the code is flawed. Maybe it's just that I am not experienced enough to understand the whole architecture, but isn't it a bit hazardous to use generic exceptions such as ValueError to signal things internally, and then catch every ValueError blindly? Here the result (if I'm not wrong) is that we try to rank each item, every attempt fails, and we catch all the exceptions blindly, because we cannot distinguish the things that fail because they should fail (and therefore need to be discarded) from the things that fail because there is a problem that needs to be signalled to the user. So we end up with an empty dataframe.

So what I did here was just make sure that we would not lose any meaningful error message (hence the ValueError(e) below), and then, when we end up with an empty dataframe, reraise the last error. This is the simplest way I could find, so if you find it weird I'm not going to keep pushing; it just means I still have to learn a lot :)

What is it that you don't like?

BTW, I understand I'm taking your time by trying stuff like this and expecting you to have a look each time, and I appreciate it. So feel free to tell me if you think my approach is taking too much energy for both of us; I'll do a patch for .rank, and maybe open another issue for a more comprehensive solution. But maybe, if I can understand better what it is that seems wrong to you, and if it is a minor thing, I can still fix it? ^^


jreback commented Mar 12, 2016

ok, think we are ready for this. pls rebase/update

@nbonnotte (Contributor Author)

@jreback done (and it's green)


jreback commented Mar 13, 2016

you haven't changed it like I indicated. you can simply add a .rank(...) function to groupby.

@nbonnotte (Contributor Author)

@jreback uh, sorry. Like that?

I still need to include the docstring, though. I guess I can pick it up from somewhere?...


jreback commented Mar 13, 2016

you need to define a .rank(..) method on groupby.

@nbonnotte (Contributor Author)

So that I can put a docstring just for GroupBy.rank? I see.

But I still need to have different SeriesGroupBy.rank, DataFrameGroupBy.rank, PanelGroupBy.rank methods.

I've added a commit, let me know what you think of it.


jreback commented Mar 14, 2016

@nbonnotte this needs to define a .rank function and remove all of the changes you have made. IOW the try: except: stuff.

@nbonnotte (Contributor Author)

I don't understand how to do that without creating inconsistencies.

Currently on master:

In [4]: df = pd.DataFrame({'a':['A1', 'A1', 'A1'], 'b':['B1','B1','B2'], 'c':1})

In [5]: df.set_index('a').groupby('c').rank(method='first')
Out[5]:
Empty DataFrame
Columns: []
Index: []

In [6]: df.set_index('a').groupby('c').mean()
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-6-b948f8cac388> in <module>()
----> 1 df.set_index('a').groupby('c').mean()

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in mean(self)
    933         """
    934         try:
--> 935             return self._cython_agg_general('mean')
    936         except GroupByError:
    937             raise

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in _cython_agg_general(self, how, numeric_only)
   3046     def _cython_agg_general(self, how, numeric_only=True):
   3047         new_items, new_blocks = self._cython_agg_blocks(
-> 3048             how, numeric_only=numeric_only)
   3049         return self._wrap_agged_blocks(new_items, new_blocks)
   3050

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in _cython_agg_blocks(self, how, numeric_only)
   3092
   3093         if len(new_blocks) == 0:
-> 3094             raise DataError('No numeric types to aggregate')
   3095
   3096         return data.items, new_blocks

DataError: No numeric types to aggregate

To be consistent, we'd like to throw an error. But things get tricky when we add another numeric column. Currently:

In [12]: dg = pd.DataFrame({'a': ['A1', 'A1', 'A1'],
                         'b': ['B1', 'B1', 'B2'],
                         'c': 1, 'd':2})

In [13]: dg.set_index('a').groupby('c').mean()
Out[13]:
   d
c
1  2

In [14]: dg.set_index('a').groupby('c').rank(method='first')
Out[14]:
    d
a
A1  1
A1  2
A1  3

But, with a simple GroupBy.rank without any try ... except ... as you suggest, I don't see how to avoid:

In [22]: dg = pd.DataFrame({'a': ['A1', 'A1', 'A1'],
                         'b': ['B1', 'B1', 'B2'],
                         'c': 1, 'd':2})

In [23]: dg.set_index('a').groupby('c').mean()
Out[23]:
   d
c
1  2

In [24]: dg.set_index('a').groupby('c').rank(method='first')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-8fd2be978f65> in <module>()
----> 1 dg.set_index('a').groupby('c').rank(method='first')

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in rank(self, axis, method, numeric_only, na_option, ascending, pct)
   1426                               ascending=ascending, pct=pct)
   1427
-> 1428         return self.apply(curried)
   1429
   1430

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in apply(self, func, *args, **kwargs)
    638         # ignore SettingWithCopy here in case the user mutates
    639         with option_context('mode.chained_assignment', None):
--> 640             return self._python_apply_general(f)
    641
    642     def _python_apply_general(self, f):

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in _python_apply_general(self, f)
    642     def _python_apply_general(self, f):
    643         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 644                                                    self.axis)
    645
    646         return self._wrap_applied_output(keys, values,

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in apply(self, f, data, axis)
   1534             # group might be modified
   1535             group_axes = _get_axes(group)
-> 1536             res = f(group)
   1537             if not _is_indexed_like(res, group_axes):
   1538                 mutated = True

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in f(g)
    634         @wraps(func)
    635         def f(g):
--> 636             return func(g, *args, **kwargs)
    637
    638         # ignore SettingWithCopy here in case the user mutates

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in curried(x)
   1424             return klass.rank(x, axis=0, method=method,
   1425                               numeric_only=numeric_only, na_option=na_option,
-> 1426                               ascending=ascending, pct=pct)
   1427
   1428         return self.apply(curried)

/Users/bonnotte/Git/pandas/pandas/core/generic.py in rank(self, axis, method, numeric_only, na_option, ascending, pct)
   4065         if numeric_only is None:
   4066             try:
-> 4067                 return ranker(self)
   4068             except TypeError:
   4069                 numeric_only = True

/Users/bonnotte/Git/pandas/pandas/core/generic.py in ranker(data)
   4057             ranks = algos.rank(data.values, axis=axis, method=method,
   4058                                ascending=ascending, na_option=na_option,
-> 4059                                pct=pct)
   4060             ranks = self._constructor(ranks, **data._construct_axes_dict())
   4061             return ranks.__finalize__(self)

/Users/bonnotte/Git/pandas/pandas/core/algorithms.py in rank(values, axis, method, na_option, ascending, pct)
    397         f, values = _get_data_algo(values, _rank2d_functions)
    398         ranks = f(values, axis=axis, ties_method=method,
--> 399                   ascending=ascending, na_option=na_option, pct=pct)
    400
    401     return ranks

/Users/bonnotte/Git/pandas/pandas/algos.pyx in pandas.algos.rank_2d_generic (pandas/algos.c:18472)()
    693                         ranks[i, argsorted[i, z]] = j + 1
    694                 elif tiebreak == TIEBREAK_FIRST:
--> 695                     raise ValueError('first not supported for '
    696                                      'non-numeric data')
    697                 elif tiebreak == TIEBREAK_DENSE:

ValueError: first not supported for non-numeric data

What do you suggest I do, precisely?


jreback commented Mar 14, 2016

as I said you need to add a .rank(...) method on groupby that handles the kw args. see how .first() works now

@nbonnotte (Contributor Author)

Now I'm lost. Are you talking about NDFrame.first?


jreback commented Mar 14, 2016

@nbonnotte (Contributor Author)

Ok... I don't really understand how all this works (yet), but I'll try.

So basically .first() is defined with _groupby_function. If I try to do the same thing for rank, since that is not a cython function (is it?) I'm gonna end up with a variation of self.aggregate(lambda x: algos.rank(x, ...)). Right?

Now, I have a difficulty: aggregate seems to expect that rank will yield an aggregated value. But it is not an aggregation function... Clearly I'm not going in the right direction.

Could you be a little more specific, please?


jreback commented Mar 14, 2016

@nbonnotte just trace some other functions thru. .rank is a transformer.

@nbonnotte (Contributor Author)

The other functions are sum, prod, min, max, last, which all give aggregated values. What other functions are you alluding to?


jreback commented Mar 14, 2016

search for transform. eg. cumprod
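
For illustration, "rank is a transformer" means the result is aligned to the original rows rather than reduced to one value per group; with a plain groupby this looks like (column names are just an example, not the PR's code):

    import pandas as pd

    df = pd.DataFrame({'b': [2, 1, 1, 2, 1, 1],
                       'c': [1., 1., 1., 2., 2., 2.]})

    # One value per original row (a transform), not one value per group.
    ranked = df.groupby('c')['b'].transform(lambda s: s.rank(method='first'))
    print(ranked.tolist())  # [3.0, 1.0, 2.0, 3.0, 1.0, 2.0]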

try:
    return self.transform(f)
except ValueError as e:
    obj = self._obj_with_exclusions
Review comment (Contributor):

why are you adding this????

you simply need to handle numeric_only THEN rank.

@nbonnotte (Contributor Author)

@jreback what do you think of this last version?

In any case, there are still a few other things I'd like to test. I'm not sure this is backward compatible, and at least I'd like to know if, how, and why.

    if data.size == 0:
        raise DataError('No numeric types to aggregate')
    data = data.groupby(self.grouper)
else:
Review comment (Contributor):

shouldn't data = data.groupby(self.grouper) happen irrespective of numeric_only?

Review comment (Contributor Author):

What do you mean?

Review comment (Contributor):

this numeric_only flag is already handled at a higher level. There is no reason to duplicate this here.

Review comment (Contributor Author):

I'm not sure I understand what you are saying. If I remove this section, so that the method becomes

    def rank(self, axis=0, method='average', numeric_only=True,
             na_option='keep', ascending=True, pct=False):

        data = self

        def wrapper(values):
            return values.rank(axis=axis, method=method, na_option=na_option,
                               ascending=ascending, pct=pct)

        return data.transform(wrapper)

then I get:

ERROR: test_rank (pandas.tests.test_groupby.TestGroupBy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/nicolas/Projects/pandas/pandas/tests/test_groupby.py", line 3603, in test_rank
    self.assertRaises(DataError, dg.rank, method='first')
  File "/home/nicolas/Projects/pandas/pandas/util/testing.py", line 2300, in assertRaises
    _callable(*args, **kwargs)
  File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 1372, in rank
    return data.transform(wrapper)
  File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 3496, in transform
    return self._transform_general(func, *args, **kwargs)
  File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 3444, in _transform_general
    raise ValueError(msg)
ValueError: transform must return a scalar value for each group

If instead I define the method as:

    def rank(self, axis=0, method='average', numeric_only=True,
             na_option='keep', ascending=True, pct=False):

        if numeric_only:
            data = self._obj_with_exclusions._get_numeric_data()
            if data.size == 0:
                raise DataError('No numeric types to aggregate')

        def wrapper(values):
            return values.rank(axis=axis, method=method, na_option=na_option,
                               ascending=ascending, pct=pct)

        return self.transform(wrapper)

then I get

  File "/home/nicolas/Projects/pandas/pandas/tests/test_groupby.py", line 3612, in test_rank
    result = df.groupby('c').rank(method='first')
  File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 1369, in rank
    return self.transform(wrapper)
  File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 3493, in transform
    return self._transform_general(func, *args, **kwargs)
  File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 3441, in _transform_general
    raise ValueError(msg)
ValueError: transform must return a scalar value for each group

What do you mean when you say the flag is handled at a higher level?

Review comment (Contributor):

exactly what I said. The data is already selected by the time this is called (it's called on each block). So the numeric_only flag is basically useless for this. You can skip things, but you are just creating more complicated code.

Review comment (Contributor Author):

How can I handle the ValueError that I manage to produce in my tests then?

Review comment (Contributor):

no idea, you have to have a simpler soln.

Review comment (Contributor Author):

There is still something I don't understand, but maybe that's because I'm not familiar enough with how pandas works internally.

Correct me if I'm wrong, but from what I've understood the following piece of code should be useless

data = self._obj_with_exclusions._get_numeric_data()
data = data.groupby(self.grouper)

because at that point we should already have numeric values only.

And yet, with

dg = DataFrame({'a': ['A1', 'A1', 'A1'], 'b': ['B1', 'B1', 'B2'], 'c': 1.}).set_index('a').groupby('c')

that I use in the tests, if I add the following prints inside the method

print 'self', self._obj_with_exclusions
print 'data', data._obj_with_exclusions

I get

-------------------- >> begin captured stdout << ---------------------
self      b
a     
A1  B1
A1  B1
A1  B2
data Empty DataFrame
Columns: []
Index: [A1, A1, A1]
--------------------- >> end captured stdout << ----------------------

So now I'm really confused :S


jreback commented Mar 22, 2016

ok looking a lot better! a few comments


jreback commented Apr 4, 2016

can you rebase


codecov-io commented May 5, 2016

Current coverage is 85.32% (diff: 94.11%)

Merging #11918 into master will increase coverage by <.01%

@@             master     #11918   diff @@
==========================================
  Files           141        141          
  Lines         50679      50696    +17   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43240      43256    +16   
- Misses         7439       7440     +1   
  Partials          0          0          

Powered by Codecov. Last update 474fd05...f432b2b

@nbonnotte (Contributor Author)

I've rebased, but I still need to work a bit on this.


jreback commented May 25, 2016

can you rebase / update?


jreback commented May 31, 2016

can you update

@nbonnotte (Contributor Author)

I've just updated. I'll resume working on this soon. :)

BTW, I don't understand the previous coverage report: it says some files that are not in the diff were modified/created, but those are not the ones I've touched...


jreback commented Jun 2, 2016

not sure, we went thru a bunch of iterations of the coverage. just rebase on master and should be ok.

@nbonnotte (Contributor Author)

@jreback I've added more tests, and slightly redesigned the .rank method. I hope this addresses your previous remarks.


jreback commented Jun 19, 2016

you are adding way too much complication here. something like this should work.

    def rank(self, axis=0, method='average', numeric_only=True,
             na_option='keep', ascending=True, pct=False):
        """Compute numerical data ranks (1 through n) along axis.
        """

        return self._selected_obj.rank(axis=axis, method=method, na_option=na_option,
                                       ascending=ascending, pct=pct)

@nbonnotte (Contributor Author)

I don't understand: isn't self._selected_obj the initial dataframe? The one on which .groupby has been called?

If I define

        df = DataFrame({'a': ['A1', 'A1', 'A2', 'A1', 'A1', 'A3'],
                        'b': [2, 1, 1, 2, 1, 1],
                        'c': [1., 1., 1., 2., 2., 2.]})
        df = df.set_index('a')

then df.groupby('c').rank(method='first') should give me

       b
a      
A1  3.0
A1  1.0
A2  2.0
A1  3.0
A1  1.0
A3  2.0

while what you're suggesting gives

       b    c
a           
A1  5.0  1.0
A1  1.0  2.0
A2  2.0  3.0
A1  6.0  4.0
A1  3.0  5.0
A3  4.0  6.0

which is indeed the same as df.rank(method='first')

So maybe I need to do something slightly different, but you'll have to point me more precisely in the right direction...
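
For reference, the per-group result expected above can be reproduced by ranking within each group via transform; this is only a sketch of the intended behaviour, not what the PR implements:

    import pandas as pd

    df = pd.DataFrame({'a': ['A1', 'A1', 'A2', 'A1', 'A1', 'A3'],
                       'b': [2, 1, 1, 2, 1, 1],
                       'c': [1., 1., 1., 2., 2., 2.]}).set_index('a')

    # The grouping column 'c' is excluded; 'b' is ranked within each group of
    # 'c', and the original 'a' index is preserved, matching the table above.
    result = df.groupby('c').transform(lambda x: x.rank(method='first'))
    print(result)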


jreback commented Jun 19, 2016

what I pointed at is very close to the solution. You are jumping thru hoops here. You are using a very high-level API in lower-level code, which is not how the code works. Follow how the other methods work, maybe .tail/.head, which do filtering, and model after those.


def wrapper(values):
    return values.rank(axis=axis, method=method, na_option=na_option,
                       ascending=ascending, pct=pct)
Review comment (Contributor):

I don't think you need to try this first; rather, just act on the numeric_only selection FIRST (e.g. where you have the if numeric_only, that should be it).

Review comment (Contributor Author):

I don't understand what you are suggesting, exactly.

If I do

        if numeric_only:
            data = self._obj_with_exclusions._get_numeric_data()
            if data.size == 0:
                raise DataError('No numeric types to aggregate')
            data = data.groupby(self.grouper)
            return data.transform(wrapper)

        try:
            return self.transform(wrapper)
        except ValueError:
            if not numeric_only and method == 'first':
                raise ValueError('first not supported for non-numeric data')
                # such a ValueError is raised by pandas.algos.rank_2d_generic
                # for regular (non-grouped) dataframes
            raise

then the following test fails:

        df = DataFrame({'a': ['A1', 'A1', 'A1'],
                        'b': [datetime(2002, 2, 2), datetime(2001, 1, 1),
                              datetime(2001, 1, 1)],
                        'c': 1.})
        df = df.set_index('a')
        dg = df.groupby('c')

        result = dg.rank(method='first')

because DataError('No numeric types to aggregate') is raised
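
The datetime case is the crux: datetime values are rankable even though _get_numeric_data() excludes them, so gating on a numeric-only selection rejects valid input. A small standalone illustration, not tied to the PR's code:

    import pandas as pd
    from datetime import datetime

    s = pd.Series([datetime(2002, 2, 2), datetime(2001, 1, 1), datetime(2001, 1, 1)])

    # Ranking datetime data works, including with method='first', even though
    # a datetime column does not count as "numeric".
    print(s.rank(method='first'))  # 3.0, 1.0, 2.0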


jreback commented Sep 9, 2016

can you rebase / update?


jreback commented Nov 22, 2016

I am not sure of the relevance of this PR, as the issue seems OK. At this point, let's re-evaluate the original issue.

jreback closed this on Nov 22, 2016.