BUG in DataFrameGroupBy.rank raising an obscure TypeError #11918
Conversation
@jreback a test fails because of the new behaviour, so there is a change in the API. Can I alter the test? Is there a warning I should put somewhere?
        # this method does not make sense when ndim > 2
        if self.ndim > 2:
            raise NotImplementedError
use a nice message
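A minimal sketch of the kind of message being asked for; the exact wording here is an assumption, not necessarily what was merged into pandas:

```python
# this method does not make sense when ndim > 2
if self.ndim > 2:
    raise NotImplementedError("rank is not implemented for objects with "
                              "more than 2 dimensions")
```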
@jreback The problem is that, to perform the aggregation, different methods are tried (when one fails, the exception is caught and the next is tried), and as a last resort the aggregation is tried "item by item" recursively. For each item, we thus start over, until we obtain something or there is no item left to work with in the final recursive attempt. If any exception is raised, the current item is discarded. Currently, when there is no child item left to work with, the error that caused the failure has already been swallowed. Fundamentally, the issue is that we need to be able to tell the difference between a real error that should be reported to the user and an exception that merely signals that a particular aggregation path does not apply. I'm going to propose another implementation, which I hope will be clearer.
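To make the mechanism easier to picture, here is a stripped-down sketch of that fallback pattern. It is illustrative only: `aggregate_with_fallback`, `funcs` and `items` are made-up names, not pandas internals.

```python
def aggregate_with_fallback(items, funcs):
    """Try each strategy on the whole input; as a last resort, go item by
    item and silently discard the items that fail (illustrative only)."""
    for func in funcs:
        try:
            return func(items)
        except Exception:
            continue              # the failure is swallowed; try the next one
    results = {}
    for name, item in items.items():
        try:
            results[name] = funcs[-1](item)
        except Exception:
            continue              # the failing item is simply discarded
    # if every item fails we land here with an empty result and no trace
    # of what actually went wrong -- the problem described above
    return results
```

Called with, say, a dict of string columns and a numeric-only strategy, this quietly returns an empty result instead of a useful error, which is essentially the obscure behaviour the issue reports.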
@nbonnotte yes, don't mix 2 issues within a PR like this. too confusing what you are trying to do.
@nbonnotte prob the way to actually solve this is to define `.rank` directly on the groupby and pass the arguments through.
@jreback before defining `.rank` on the groupby, I wanted to try another approach. But to be honest, I'm not sure I understand the reason behind each piece of code I touched, and I'm trusting Travis to tell me if I did something wrong. So let me know if I've done something I shouldn't have. But if it seems OK to you, I'll squash the commits, with a proper commit message. (Uh, the numpy_dev tests fail, but I can't see if that's related to my modifications)
All green
                             *args, **kwargs)
        except (AttributeError):
            raise ValueError
        # this can be called recursively, so need to raise a ValueError
I don't like this at all, what exactly are you trying to do here? why are you not simply defining `.rank` on the groupby methods and passing thru args?
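One way to read that suggestion, sketched as a free function rather than a method so it stays self-contained; `groupby_rank` is a made-up name, and the method eventually added to pandas may look different:

```python
import pandas as pd

def groupby_rank(grouped, axis=0, method='average', numeric_only=True,
                 na_option='keep', ascending=True, pct=False):
    # apply DataFrame.rank within each group, passing the arguments
    # straight through
    return grouped.apply(lambda g: g.rank(axis=axis, method=method,
                                          numeric_only=numeric_only,
                                          na_option=na_option,
                                          ascending=ascending, pct=pct))

# usage on a small frame
df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [3.0, 1.0, 2.0]})
print(groupby_rank(df.groupby('key')))
```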
If you don't like this proposition, I'll do that... but I wanted to try this last thing first, because I'd feel better having made the general code somewhat better rather than patching with an ad-hoc solution for `.rank`.

Because, right now, I feel the code is flawed. Maybe it's just that I am not experienced enough to understand the whole architecture, but... isn't it a bit hazardous to use generic exceptions such as `ValueError` to signal stuff internally, and then catch all the `ValueError`s blindly? Here the result (if I'm not wrong) is that we try to rank each item, each one fails, and we catch all the exceptions blindly, because we are not able to distinguish between the things that fail because they should fail (and therefore need to be discarded) and the things that fail because there is a problem that needs to be signalled to the user. So we end up with an empty dataframe.

So what I did here was just to make sure that at least we would not forget any meaningful error message (hence the `ValueError(e)` below), and then, when we end up with an empty dataframe, reraise the last error. This is the simplest way I could find, so if you find it weird I'm not going to keep pushing; it just means I still have to learn a lot :)

What is it that you don't like?

BTW, I understand I'm taking your time by trying stuff like this and expecting you to have a look each time, and I appreciate it. So feel free to tell me if you think my approach is taking too much energy for both of us; I'll do a patch for `.rank`, and maybe start another issue for a more comprehensive solution. But maybe, if I can understand better what it is that seems wrong to you, and if it is a minor thing, I can still fix it? ^^
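The bookkeeping described above ("hence the ValueError(e) below", then re-raising when the result would be empty) can be sketched roughly like this. It is a sketch of the idea only, not the code in the PR; `rank_each` and `do_rank` are made-up names:

```python
def rank_each(items, do_rank):
    """Rank item by item, but remember the last meaningful error and
    re-raise it if nothing could be ranked (illustrative only)."""
    results = {}
    last_error = None
    for name, item in items.items():
        try:
            results[name] = do_rank(item)
        except ValueError as e:
            # wrap the original error so its message is not lost on the way up
            last_error = ValueError(e)
    if not results and last_error is not None:
        # everything failed: re-raise instead of returning an empty result
        raise last_error
    return results
```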
ok, think we are ready for this. pls rebase/update
@jreback done (and it's green)
you haven't changed it like I indicated. you can simply add a `rank` method on the groupby.
@jreback uh, sorry. Like that? I still need to include the docstring, though. I guess I can pick it up from somewhere?...
you need to define a `rank` method on the groupby itself.
So that I can put a docstring just for `rank`? But I still need to handle a few different cases. I've added a commit, let me know what you think of it.
@nbonnotte this needs to define a `rank` method on the groupby object.
I don't understand how to do that without creating inconsistencies. Currently on master we get one behaviour; to be consistent, we'd like to throw an error. But things get tricky when we add another numeric column: the current behaviour changes, and a simple implementation would give yet another result. What do you suggest I do, precisely?
as I said, you need to add a `rank` method.
Now I'm lost. What exactly are you talking about?
Ok... I don't really understand how all this works (yet), but I'll try. Now I have a difficulty; could you be a little more specific, please?
@nbonnotte just trace some other functions thru.
The other functions seem to be defined quite differently.
search for `transform`, for example.
        try:
            return self.transform(f)
        except ValueError as e:
            obj = self._obj_with_exclusions
why are you adding this???? you simply need to handle `numeric_only`, THEN rank.
@jreback what do you think of this last version? In any case, there are still a few other things I'd like to test. I'm not sure this is backward compatible, and I'd at least like to know if, how, and why.
            if data.size == 0:
                raise DataError('No numeric types to aggregate')
            data = data.groupby(self.grouper)
        else:
shouldn't `data = data.groupby(self.grouper)` be done irrespective of `numeric_only`?
What do you mean?
this `numeric_only` flag is already handled at a higher level. There is no reason to duplicate this here.
I'm not sure I understand what you are saying. If I remove this section, so that the method becomes
def rank(self, axis=0, method='average', numeric_only=True,
         na_option='keep', ascending=True, pct=False):
    data = self

    def wrapper(values):
        return values.rank(axis=axis, method=method, na_option=na_option,
                           ascending=ascending, pct=pct)

    return data.transform(wrapper)
then I get:
ERROR: test_rank (pandas.tests.test_groupby.TestGroupBy)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/nicolas/Projects/pandas/pandas/tests/test_groupby.py", line 3603, in test_rank
self.assertRaises(DataError, dg.rank, method='first')
File "/home/nicolas/Projects/pandas/pandas/util/testing.py", line 2300, in assertRaises
_callable(*args, **kwargs)
File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 1372, in rank
return data.transform(wrapper)
File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 3496, in transform
return self._transform_general(func, *args, **kwargs)
File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 3444, in _transform_general
raise ValueError(msg)
ValueError: transform must return a scalar value for each group
If instead I define the method as:
def rank(self, axis=0, method='average', numeric_only=True,
         na_option='keep', ascending=True, pct=False):
    if numeric_only:
        data = self._obj_with_exclusions._get_numeric_data()
        if data.size == 0:
            raise DataError('No numeric types to aggregate')

    def wrapper(values):
        return values.rank(axis=axis, method=method, na_option=na_option,
                           ascending=ascending, pct=pct)

    return self.transform(wrapper)
then I get
File "/home/nicolas/Projects/pandas/pandas/tests/test_groupby.py", line 3612, in test_rank
result = df.groupby('c').rank(method='first')
File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 1369, in rank
return self.transform(wrapper)
File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 3493, in transform
return self._transform_general(func, *args, **kwargs)
File "/home/nicolas/Projects/pandas/pandas/core/groupby.py", line 3441, in _transform_general
raise ValueError(msg)
ValueError: transform must return a scalar value for each group
What do you mean when you say the flag is handled at a higher level?
exactly what I said. The data is already selected by the time this is called (it's called on each block). So the `numeric_only` flag is basically useless for this. you can skip things, but you are just creating more complicated code.
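To illustrate what "already selected at a higher level" means, here is a small self-contained sketch using only public API (the actual machinery operates on internal blocks rather than whole frames, so this is an analogy, not pandas internals):

```python
import numpy as np
import pandas as pd

def per_block_rank(block):
    # by the time the low-level worker runs, the data handed to it is
    # numeric-only, so it needs no numeric_only flag of its own
    return block.rank()

df = pd.DataFrame({'b': ['B1', 'B1', 'B2'], 'c': [3.0, 1.0, 2.0]})

# the "higher level" does the selection...
numeric_part = df.select_dtypes(include=[np.number])

# ...and the worker only ever sees numeric data
print(per_block_rank(numeric_part))
```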
How can I handle the `ValueError` that I manage to produce in my tests, then?
no idea, you have to have a simpler soln.
There is still something I don't understand, but maybe that's because I'm not familiar enough with how pandas works internally.
Correct me if I'm wrong, but from what I've understood, the following piece of code should be useless
data = self._obj_with_exclusions._get_numeric_data()
data = data.groupby(self.grouper)
because at that point we should already have numeric values only.
And yet, with
dg = DataFrame({'a': ['A1', 'A1', 'A1'], 'b': ['B1', 'B1', 'B2'], 'c': 1.}).set_index('a').groupby('c')
that I use in the tests, if I add the following prints inside the method
print 'self', self._obj_with_exclusions
print 'data', data._obj_with_exclusions
I get
-------------------- >> begin captured stdout << ---------------------
self b
a
A1 B1
A1 B1
A1 B2
data Empty DataFrame
Columns: []
Index: [A1, A1, A1]
--------------------- >> end captured stdout << ----------------------
So now I'm really confused :S
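A possible explanation for that output, using the same frame (this is a guess at what is going on; `_obj_with_exclusions` and `_get_numeric_data` are the internal pandas attributes already shown above): the groupby's `_obj_with_exclusions` has already dropped the grouping column 'c', so the only column left is the non-numeric 'b', and asking that frame for its numeric data then yields an empty DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A1', 'A1', 'A1'],
                   'b': ['B1', 'B1', 'B2'],
                   'c': 1.}).set_index('a')
g = df.groupby('c')

left_over = g._obj_with_exclusions        # the grouping column 'c' is excluded
print(left_over.columns)                  # only the object column 'b' is left

numeric = left_over._get_numeric_data()   # 'b' is not numeric...
print(numeric)                            # ...so this is an empty DataFrame
```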
ok looking a lot better! a few comments
can you rebase?
Current coverage is 85.32% (diff: 94.11%)

@@            master    #11918    diff @@
=========================================
  Files          141       141
  Lines        50679     50696     +17
  Methods          0         0
  Messages         0         0
  Branches         0         0
=========================================
+ Hits         43240     43256     +16
- Misses        7439      7440      +1
  Partials         0         0
I've rebased, but I still need to work a bit on this.
can you rebase / update?
can you update?
I've just updated. I'll resume working on this soon. :) BTW, I don't understand the previous coverage report: it says some files that are not in the diff were modified/created, but those are not the ones I've touched...
not sure, we went thru a bunch of iterations of the coverage. just rebase on master and should be ok.
@jreback I've added more tests, and slightly redesigned the implementation.
you are adding way too much complication here. something like this should work:
I don't understand: isn't that more or less what I'm doing? If I define it the way I did, then I get a certain result, while what you're suggesting gives a result which is indeed the same. So maybe I need to do something slightly different in that direction, but you'll have to point me more precisely in the right direction...
what I pointed at is very close to the solution. You are jumping thru hoops here. You are using a very high level API in lower level code, which is not how the code works. Follow how the other methods work, maybe starting from `transform`.
        def wrapper(values):
            return values.rank(axis=axis, method=method, na_option=na_option,
                               ascending=ascending, pct=pct)
I don't think you need to try this first; rather, just act on the `numeric_only` select FIRST (eg. where you have the `if numeric_only`), that should be it.
I don't understand what you are suggesting, exactly.
If I do
if numeric_only:
    data = self._obj_with_exclusions._get_numeric_data()
    if data.size == 0:
        raise DataError('No numeric types to aggregate')
    data = data.groupby(self.grouper)
    return data.transform(wrapper)
try:
    return self.transform(wrapper)
except ValueError:
    if not numeric_only and method == 'first':
        raise ValueError('first not supported for non-numeric data')
    # such a ValueError is raised by pandas.algos.rank_2d_generic
    # for regular (non-grouped) dataframes
    raise
then the following test fails:
df = DataFrame({'a': ['A1', 'A1', 'A1'],
                'b': [datetime(2002, 2, 2), datetime(2001, 1, 1),
                      datetime(2001, 1, 1)],
                'c': 1.})
df = df.set_index('a')
dg = df.groupby('c')
result = dg.rank(method='first')
because `DataError('No numeric types to aggregate')` is raised
can you rebase / update?
I am not sure of the relevance of this PR, as the issue seems OK now. At this point, let's re-evaluate the original issue.
Fix issue #11759

Two things:

1. `Series.rank` and `DataFrame.rank` have been moved to `generic.py` and get the same signature [done with PR CLN: Moving Series.rank and DataFrame.rank to generic.py #11924]
2. Instead of raising a bare `ValueError` when the aggregation fails, and catching it afterwards indistinctly, I propose to raise an `_ErrorContainer` exception containing the last meaningful error to have been raised. Thus, when everything else has failed, we can still recover a meaningful exception and throw it back at the user (a sketch of this idea follows below).
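A rough sketch of what such an exception wrapper could look like; the class body and the `rank_items` helper are illustrative guesses, not the actual code of this PR:

```python
class _ErrorContainer(Exception):
    """Carry the last meaningful error raised during the try-everything
    aggregation loop so it can be surfaced to the user at the end."""

    def __init__(self, original):
        self.original = original
        super(_ErrorContainer, self).__init__(str(original))


def rank_items(items, do_rank):
    """Illustrative only: discard items that fail, but keep track of why."""
    results = {}
    last_error = None
    for name, item in items.items():
        try:
            results[name] = do_rank(item)
        except Exception as exc:
            last_error = exc          # remember why this item was discarded
    if not results and last_error is not None:
        # nothing worked: raise a container that higher-level code can
        # unwrap (err.original) and hand back to the user
        raise _ErrorContainer(last_error)
    return results
```

At the outermost call site, one would catch `_ErrorContainer`, unwrap `err.original`, and raise that instead of the obscure `TypeError` or empty result the user currently gets.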