BUG in DataFrameGroupBy.rank raising an obscure TypeError #11918


Closed · wants to merge 3 commits

Conversation

nbonnotte (Contributor)

Fix issue #11759

Two things:

  1. Series.rank and DataFrame.rank have been moved to generic.py and now share the same signature [done in PR #11924: CLN: Moving Series.rank and DataFrame.rank to generic.py].
  2. It can happen that aggregate functions are applied to a grouped object recursively. Instead of raising a plain ValueError when that fails and catching it indiscriminately afterwards, I propose raising an _ErrorContainer exception containing the last meaningful error. Thus, when everything else has failed, we can still recover a meaningful exception and throw it back at the user (see the sketch below).
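
A minimal sketch of the second point, assuming hypothetical helper names rather than the actual pandas code path: the last real error is remembered while items are tried, and re-raised inside an _ErrorContainer when nothing succeeded.

    class _ErrorContainer(Exception):
        """Carries the last meaningful error raised during a failed aggregation."""

        def __init__(self, original):
            super(_ErrorContainer, self).__init__(str(original))
            self.original = original

    def aggregate_item_by_item(items, func):
        # Hypothetical recursive fallback: try each item and remember the
        # last real failure instead of silently discarding it.
        results, last_error = {}, None
        for key, item in items:
            try:
                results[key] = func(item)
            except Exception as err:
                last_error = err
        if not results and last_error is not None:
            # Everything failed: surface the meaningful error to the caller
            # instead of returning an empty result.
            raise _ErrorContainer(last_error)
        return results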

@nbonnotte (Contributor Author)

@jreback a test fails because test_rank_methods_series() in tests/test_stats.py calls Series.rank with a non-keyword first argument, which is now mistakenly bound to axis (the first parameter of the new signature, inherited from DataFrame.rank) instead of method (the first parameter of the old Series.rank signature).

So there is a change in the API. Can I alter the test? Is there a warning I should put somewhere?
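
To illustrate the binding change with schematic stand-ins (these are not the real pandas methods), a positional first argument that used to bind to method now binds to axis:

    # Schematic stand-ins for the old and new signatures (not the real methods)
    def rank_old(method='average', ascending=True):          # old Series.rank order
        return 'method', method

    def rank_new(axis=0, method='average', ascending=True):  # new shared order
        return 'axis', axis

    print(rank_old('min'))  # ('method', 'min') -- what the old test relied on
    print(rank_new('min'))  # ('axis', 'min')   -- the same call now binds to axis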


# this method does not make sense when ndim > 2
if self.ndim > 2:
    raise NotImplementedError
Review comment (Contributor):

use a nice message
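
A possible concrete form of the suggested change, mirroring the hunk above (the message wording is illustrative):

    # this method does not make sense when ndim > 2
    if self.ndim > 2:
        raise NotImplementedError("rank is not supported for objects with ndim > 2")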

jreback added the Groupby and Error Reporting labels on Dec 29, 2015.
@nbonnotte (Contributor Author)

@jreback The problem is that, to perform the aggregation, different methods are tried (when one fails, the exception is caught and the next is tried), and as a last resort the aggregation is attempted "item by item", recursively. For each item we thus start over, until we obtain something or there is no item left to work with in the final recursive attempt. If any exception is raised, the current item is discarded.

Currently, when there is no child item left to work with, a ValueError is raised, which is caught like any other error by the previous level in the recursion. The parent item which caused it is thus discarded. In fact, in the test I wrote (test_rank() in pandas/tests/test_groupby.py) all items are discarded and we end up with an empty frame instead of a meaningful error.

Fundamentally, the issue is that we need to be able to tell the difference between a real ValueError and a ValueError that means "sorry, no item left", so that we can avoid returning an empty frame and raise a good error instead.
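
One way to make that distinction explicit, sketched with illustrative names rather than the actual pandas internals, is a dedicated subclass for the "no item left" signal so it cannot be confused with a genuine ValueError:

    class NoItemsLeftError(ValueError):
        """Internal signal: the recursive item-by-item attempt ran out of items."""

    def aggregate_items(items, func):
        if not items:
            raise NoItemsLeftError("no items left to aggregate")
        results = {}
        for key, item in items:
            try:
                results[key] = func(item)   # may recurse into aggregate_items
            except NoItemsLeftError:
                continue                    # expected: discard just this item
            # any other ValueError propagates, so a real failure is not swallowed
        return results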

I'm going to propose another implementation, which I hope will be clearer.


jreback commented Dec 29, 2015

@nbonnotte yes, don't mix 2 issues within a PR like this. too confusing what you are trying to do.

@nbonnotte (Contributor Author)

@jreback Ok, I will do another PR with only the part where I put Series.rank and DataFrame.rank into generic.py, so that we can move on from this.

Note that it will not solve issue #11759, but it will be the first step.


jreback commented Dec 29, 2015

@nbonnotte prob the way to actually solve this is to define .rank as a groupby function which handles this in the correct manner.

@nbonnotte (Contributor Author)

@jreback before defining .rank as a groupby function as you suggested, I'm trying this last approach, which seems to work.

But to be honest, I'm not sure I understand the reason behind each piece of code I touched, and I'm trusting Travis to tell me if I did something wrong. So let me know if I've done something I shouldn't have.

But if it seems OK to you, I'll squash the commits, with a proper commit message.

(Uh, the numpy_dev tests fail, but I can't see if that's related to my modifications)

@nbonnotte (Contributor Author)

All green

                             *args, **kwargs)
        except (AttributeError):
            raise ValueError
        # this can be called recursively, so need to raise a
Review comment (Contributor):

I don't like this at all, what exactly are you trying to do here? why are you not simply defining .rank on the groupby methods and passing thru args?

Review comment (Contributor Author):

If you don't like this proposition, I'll do that... but I wanted to try this last thing first, because I'd feel better having made the general code somewhat better rather than patching it with an ad-hoc solution for .rank.

Because, right now, I feel the code is flawed. Maybe it's just that I am not experienced enough to understand the whole architecture, but isn't it a bit hazardous to use generic exceptions such as ValueError to signal things internally, and then catch every ValueError blindly? Here the result (if I'm not wrong) is that we try to rank each item, every attempt fails, and we catch all the exceptions blindly, because we cannot distinguish the things that fail because they should fail (and therefore need to be discarded) from the things that fail because there is a problem that needs to be signalled to the user. So we end up with an empty dataframe.

So what I did here was just make sure that we would not lose any meaningful error message (hence the ValueError(e) below), and then, when we end up with an empty dataframe, reraise the last error. This is the simplest way I could find, so if you find it weird I'm not going to keep pushing; it just means I still have to learn a lot :)

What is it that you don't like?

BTW, I understand I'm taking your time by trying stuff like this and expecting you to have a look each time, and I appreciate it. So feel free to tell me if you think my approach is taking too much energy for both of us; I'll do a patch for .rank, and maybe open another issue for a more comprehensive solution. But maybe, if I can understand better what it is that seems wrong to you, and if it is a minor thing, I can still fix it? ^^


jreback commented Mar 12, 2016

ok, think we are ready for this. pls rebase/update

@nbonnotte (Contributor Author)

@jreback done (and it's green)


jreback commented Mar 13, 2016

you haven't changed it like I indicated. you can simply add a .rank(...) function to groupby.

@nbonnotte (Contributor Author)

@jreback uh, sorry. Like that?

I still need to include the docstring, though. I guess I can pick it up from somewhere?...


jreback commented Mar 13, 2016

you need to define a .rank(..) method on groupby.

@nbonnotte (Contributor Author)

So that I can put a docstring just for GroupBy.rank? I see.

But I still need to have different SeriesGroupBy.rank, DataFrameGroupBy.rank, PanelGroupBy.rank methods.

I've added a commit, let me know what you think of it.


jreback commented Mar 14, 2016

@nbonnotte this needs to define a .rank function and remove all of the changes you have made. IOW the try: except: stuff.

@nbonnotte (Contributor Author)

I don't understand how to do that without creating inconsistencies.

Currently on master:

In [4]: df = pd.DataFrame({'a':['A1', 'A1', 'A1'], 'b':['B1','B1','B2'], 'c':1})

In [5]: df.set_index('a').groupby('c').rank(method='first')
Out[5]:
Empty DataFrame
Columns: []
Index: []

In [6]: df.set_index('a').groupby('c').mean()
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-6-b948f8cac388> in <module>()
----> 1 df.set_index('a').groupby('c').mean()

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in mean(self)
    933         """
    934         try:
--> 935             return self._cython_agg_general('mean')
    936         except GroupByError:
    937             raise

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in _cython_agg_general(self, how, numeric_only)
   3046     def _cython_agg_general(self, how, numeric_only=True):
   3047         new_items, new_blocks = self._cython_agg_blocks(
-> 3048             how, numeric_only=numeric_only)
   3049         return self._wrap_agged_blocks(new_items, new_blocks)
   3050

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in _cython_agg_blocks(self, how, numeric_only)
   3092
   3093         if len(new_blocks) == 0:
-> 3094             raise DataError('No numeric types to aggregate')
   3095
   3096         return data.items, new_blocks

DataError: No numeric types to aggregate

To be consistent, we'd like to throw an error. But things get tricky when we add another numeric column. Currently:

In [12]: dg = pd.DataFrame({'a': ['A1', 'A1', 'A1'],
                         'b': ['B1', 'B1', 'B2'],
                         'c': 1, 'd':2})

In [13]: dg.set_index('a').groupby('c').mean()
Out[13]:
   d
c
1  2

In [14]: dg.set_index('a').groupby('c').rank(method='first')
Out[14]:
    d
a
A1  1
A1  2
A1  3

But, with a simple GroupBy.rank without any try ... except ... as you suggest, I don't see how to avoid:

In [22]: dg = pd.DataFrame({'a': ['A1', 'A1', 'A1'],
                         'b': ['B1', 'B1', 'B2'],
                         'c': 1, 'd':2})

In [23]: dg.set_index('a').groupby('c').mean()
Out[23]:
   d
c
1  2

In [24]: dg.set_index('a').groupby('c').rank(method='first')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-8fd2be978f65> in <module>()
----> 1 dg.set_index('a').groupby('c').rank(method='first')

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in rank(self, axis, method, numeric_only, na_option, ascending, pct)
   1426                               ascending=ascending, pct=pct)
   1427
-> 1428         return self.apply(curried)
   1429
   1430

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in apply(self, func, *args, **kwargs)
    638         # ignore SettingWithCopy here in case the user mutates
    639         with option_context('mode.chained_assignment', None):
--> 640             return self._python_apply_general(f)
    641
    642     def _python_apply_general(self, f):

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in _python_apply_general(self, f)
    642     def _python_apply_general(self, f):
    643         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 644                                                    self.axis)
    645
    646         return self._wrap_applied_output(keys, values,

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in apply(self, f, data, axis)
   1534             # group might be modified
   1535             group_axes = _get_axes(group)
-> 1536             res = f(group)
   1537             if not _is_indexed_like(res, group_axes):
   1538                 mutated = True

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in f(g)
    634         @wraps(func)
    635         def f(g):
--> 636             return func(g, *args, **kwargs)
    637
    638         # ignore SettingWithCopy here in case the user mutates

/Users/bonnotte/Git/pandas/pandas/core/groupby.py in curried(x)
   1424             return klass.rank(x, axis=0, method=method,
   1425                               numeric_only=numeric_only, na_option=na_option,
-> 1426                               ascending=ascending, pct=pct)
   1427
   1428         return self.apply(curried)

/Users/bonnotte/Git/pandas/pandas/core/generic.py in rank(self, axis, method, numeric_only, na_option, ascending, pct)
   4065         if numeric_only is None:
   4066             try:
-> 4067                 return ranker(self)
   4068             except TypeError:
   4069                 numeric_only = True

/Users/bonnotte/Git/pandas/pandas/core/generic.py in ranker(data)
   4057             ranks = algos.rank(data.values, axis=axis, method=method,
   4058                                ascending=ascending, na_option=na_option,
-> 4059                                pct=pct)
   4060             ranks = self._constructor(ranks, **data._construct_axes_dict())
   4061             return ranks.__finalize__(self)

/Users/bonnotte/Git/pandas/pandas/core/algorithms.py in rank(values, axis, method, na_option, ascending, pct)
    397         f, values = _get_data_algo(values, _rank2d_functions)
    398         ranks = f(values, axis=axis, ties_method=method,
--> 399                   ascending=ascending, na_option=na_option, pct=pct)
    400
    401     return ranks

/Users/bonnotte/Git/pandas/pandas/algos.pyx in pandas.algos.rank_2d_generic (pandas/algos.c:18472)()
    693                         ranks[i, argsorted[i, z]] = j + 1
    694                 elif tiebreak == TIEBREAK_FIRST:
--> 695                     raise ValueError('first not supported for '
    696                                      'non-numeric data')
    697                 elif tiebreak == TIEBREAK_DENSE:

ValueError: first not supported for non-numeric data

What do you suggest I do, precisely?


jreback commented Mar 14, 2016

as I said you need to add a .rank(...) method on groupby that handles the kw args. see how .first() works now

@nbonnotte (Contributor Author)

Now I'm lost. Are you talking about NDFrame.first?


jreback commented Mar 14, 2016

@nbonnotte (Contributor Author)

Ok... I don't really understand how all this works (yet), but I'll try.

So basically .first() is defined with _groupby_function. If I try to do the same thing for rank, since that is not a cython function (is it?) I'm gonna end up with a variation of self.aggregate(lambda x: algos.rank(x, ...)). Right?

Now, I have a difficulty: aggregate seems to expect that rank will yield an aggregated value. But it is not an aggregation function... Clearly I'm not going in the right direction.

Could you be a little more specific, please?


jreback commented Mar 14, 2016

@nbonnotte just trace some other functions thru. .rank is a transformer.

@nbonnotte (Contributor Author)

The other functions are sum, prod, min, max, last, which all give aggregated values. What other functions are you alluding to?


jreback commented Mar 14, 2016

search for transform. eg. cumprod
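
For illustration, "rank is a transformer" means the result is aligned to the original rows rather than reduced to one value per group; with a plain groupby this looks like (column names are just an example, not the PR's code):

    import pandas as pd

    df = pd.DataFrame({'b': [2, 1, 1, 2, 1, 1],
                       'c': [1., 1., 1., 2., 2., 2.]})

    # One value per original row (a transform), not one value per group.
    ranked = df.groupby('c')['b'].transform(lambda s: s.rank(method='first'))
    print(ranked.tolist())  # [3.0, 1.0, 2.0, 3.0, 1.0, 2.0]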

try:
    return self.transform(f)
except ValueError as e:
    obj = self._obj_with_exclusions
Review comment (Contributor):

why are you adding this????

you simply need to handle numeric_only THEN rank.

@nbonnotte (Contributor Author)

@jreback what do you think of this last version?

In any case, there are still a few other things I'd like to test. I'm not sure this is backward compatible, and at least I'd like to know if, how, and why.

    if data.size == 0:
        raise DataError('No numeric types to aggregate')
    data = data.groupby(self.grouper)
else:
Review comment (Contributor):

shouldn't data = data.groupby(self.grouper) happen irrespective of numeric_only?

Review comment (Contributor Author):

What do you mean?

Review comment (Contributor):

this numeric_only flag is already handled at a higher level. There is no reason to duplicate this here.

Review comment (Contributor Author):

I'm not sure I understand what you are saying. If I remove this section, so that the method becomes

    def rank(self, axis=0, method='average', numeric_only=True,
             na_option='keep', ascending=True, pct=False):

        data = self

        def wrapper(values):
            return values.rank(axis=axis, method=method, na_option=na_option,
                               ascending=ascending, pct=pct)

        return data.transform(wrapper)

then I get:

ERROR: test_rank (pandas.tests.test_groupby.TestGroupBy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/nicolas/Projects/pandas/pandas/tests/test_groupby.py", line 3603, in test_rank
    self.assertRaises(DataError, dg.rank, method='first')
  File "/home/nicolas/Projects/pandas/pandas/util/testing.py", line 2300, in assertRaises
    _callable(*args, **kwargs)
  File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 1372, in rank
    return data.transform(wrapper)
  File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 3496, in transform
    return self._transform_general(func, *args, **kwargs)
  File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 3444, in _transform_general
    raise ValueError(msg)
ValueError: transform must return a scalar value for each group

If instead I define the method as:

    def rank(self, axis=0, method='average', numeric_only=True,
             na_option='keep', ascending=True, pct=False):

        if numeric_only:
            data = self._obj_with_exclusions._get_numeric_data()
            if data.size == 0:
                raise DataError('No numeric types to aggregate')

        def wrapper(values):
            return values.rank(axis=axis, method=method, na_option=na_option,
                               ascending=ascending, pct=pct)

        return self.transform(wrapper)

then I get

  File "/home/nicolas/Projects/pandas/pandas/tests/test_groupby.py", line 3612, in test_rank
    result = df.groupby('c').rank(method='first')
  File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 1369, in rank
    return self.transform(wrapper)
  File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 3493, in transform
    return self._transform_general(func, *args, **kwargs)
  File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 3441, in _transform_general
    raise ValueError(msg)
ValueError: transform must return a scalar value for each group

What do you mean when you say the flag is handled at a higher level?

Review comment (Contributor):

exactly what I said. The data is already selected by the time this is called (it's called on each block). So the numeric_only flag is basically useless for this. You can skip things, but you are just creating more complicated code.

Review comment (Contributor Author):

How can I handle the ValueError that I manage to produce in my tests then?

Review comment (Contributor):

no idea, you have to have a simpler soln.

Review comment (Contributor Author):

There is still something I don't understand, but maybe that's because I'm not familiar enough with how pandas works internally.

Correct me if I'm wrong, but from what I've understood the following piece of code should be useless

data = self._obj_with_exclusions._get_numeric_data()
data = data.groupby(self.grouper)

because at that point we should already have numeric values only.

And yet, with

dg = DataFrame({'a': ['A1', 'A1', 'A1'], 'b': ['B1', 'B1', 'B2'], 'c': 1.}).set_index('a').groupby('c')

that I use in the tests, if I add the following prints inside the method

print 'self', self._obj_with_exclusions
print 'data', data._obj_with_exclusions

I get

-------------------- >> begin captured stdout << ---------------------
self      b
a     
A1  B1
A1  B1
A1  B2
data Empty DataFrame
Columns: []
Index: [A1, A1, A1]
--------------------- >> end captured stdout << ----------------------

So now I'm really confused :S


jreback commented Mar 22, 2016

ok looking a lot better! a few comments


jreback commented Apr 4, 2016

can you rebase


codecov-io commented May 5, 2016

Current coverage is 85.32% (diff: 94.11%)

Merging #11918 into master will increase coverage by <.01%

@@             master     #11918   diff @@
==========================================
  Files           141        141          
  Lines         50679      50696    +17   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43240      43256    +16   
- Misses         7439       7440     +1   
  Partials          0          0          

Powered by Codecov. Last update 474fd05...f432b2b

@nbonnotte (Contributor Author)

I've rebased, but I still need to work a bit on this.


jreback commented May 25, 2016

can you rebase / update?


jreback commented May 31, 2016

can you update

@nbonnotte (Contributor Author)

I've just updated. I'll resume working on this soon. :)

BTW, I don't understand the previous coverage report: it says some files that are not in the diff were modified/created, but those are not the ones I've touched...


jreback commented Jun 2, 2016

not sure, we went thru a bunch of iterations of the coverage. just rebase on master and should be ok.

@nbonnotte (Contributor Author)

@jreback I've added more tests, and slightly redesigned the .rank method. I hope this addresses your previous remarks.


jreback commented Jun 19, 2016

you are adding way too much complication here. something like this should work.

    def rank(self, axis=0, method='average', numeric_only=True,
             na_option='keep', ascending=True, pct=False):
        """Compute numerical data ranks (1 through n) along axis.
        """

        return self._selected_obj.rank(axis=axis, method=method, na_option=na_option,
                                       ascending=ascending, pct=pct)

@nbonnotte (Contributor Author)

I don't understand: isn't self._selected_obj the initial dataframe? The one on which .groupby has been called?

If I define

        df = DataFrame({'a': ['A1', 'A1', 'A2', 'A1', 'A1', 'A3'],
                        'b': [2, 1, 1, 2, 1, 1],
                        'c': [1., 1., 1., 2., 2., 2.]})
        df = df.set_index('a')

then df.groupby('c').rank(method='first') should give me

       b
a      
A1  3.0
A1  1.0
A2  2.0
A1  3.0
A1  1.0
A3  2.0

while what you're suggesting gives

       b    c
a           
A1  5.0  1.0
A1  1.0  2.0
A2  2.0  3.0
A1  6.0  4.0
A1  3.0  5.0
A3  4.0  6.0

which is indeed the same as df.rank(method='first')

So maybe I need to do something slightly different, but you'll have to point me more precisely in the right direction...
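
For reference, the per-group result expected above can be reproduced by ranking within each group via transform; this is only a sketch of the intended behaviour, not what the PR implements:

    import pandas as pd

    df = pd.DataFrame({'a': ['A1', 'A1', 'A2', 'A1', 'A1', 'A3'],
                       'b': [2, 1, 1, 2, 1, 1],
                       'c': [1., 1., 1., 2., 2., 2.]}).set_index('a')

    # The grouping column 'c' is excluded; 'b' is ranked within each group of
    # 'c', and the original 'a' index is preserved, matching the table above.
    result = df.groupby('c').transform(lambda x: x.rank(method='first'))
    print(result)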


jreback commented Jun 19, 2016

what I pointed at is very close to the solution. You are jumping thru hoops here. You are using a very high-level API in lower-level code, which is not how the code works. Follow how the other methods work, maybe .tail/.head, which do filtering, and model after those.


def wrapper(values):
    return values.rank(axis=axis, method=method, na_option=na_option,
                       ascending=ascending, pct=pct)
Review comment (Contributor):

I don't think you need to try this first; rather, just act on the numeric_only selection FIRST (e.g. where you have the if numeric_only, that should be it).

Review comment (Contributor Author):

I don't understand what you are suggesting, exactly.

If I do

        if numeric_only:
            data = self._obj_with_exclusions._get_numeric_data()
            if data.size == 0:
                raise DataError('No numeric types to aggregate')
            data = data.groupby(self.grouper)
            return data.transform(wrapper)

        try:
            return self.transform(wrapper)
        except ValueError:
            if not numeric_only and method == 'first':
                raise ValueError('first not supported for non-numeric data')
                # such a ValueError is raised by pandas.algos.rank_2d_generic
                # for regular (non-grouped) dataframes
            raise

then the following test fails:

        df = DataFrame({'a': ['A1', 'A1', 'A1'],
                        'b': [datetime(2002, 2, 2), datetime(2001, 1, 1),
                              datetime(2001, 1, 1)],
                        'c': 1.})
        df = df.set_index('a')
        dg = df.groupby('c')

        result = dg.rank(method='first')

because DataError('No numeric types to aggregate') is raised
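
The datetime case is the crux: datetime values are rankable even though _get_numeric_data() excludes them, so gating on a numeric-only selection rejects valid input. A small standalone illustration, not tied to the PR's code:

    import pandas as pd
    from datetime import datetime

    s = pd.Series([datetime(2002, 2, 2), datetime(2001, 1, 1), datetime(2001, 1, 1)])

    # Ranking datetime data works, including with method='first', even though
    # a datetime column does not count as "numeric".
    print(s.rank(method='first'))  # 3.0, 1.0, 2.0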


jreback commented Sep 9, 2016

can you rebase / update?


jreback commented Nov 22, 2016

I am not sure of the relevance of this PR, as the issue seems OK. At this point, let's re-evaluate the original issue.

jreback closed this on Nov 22, 2016.