Skip to content

BUG: groupby.transform doesn't raise on missing arguments #26840

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ghost opened this issue Jun 14, 2019 · 8 comments · Fixed by #52230
Closed

BUG: groupby.transform doesn't raise on missing arguments #26840

ghost opened this issue Jun 14, 2019 · 8 comments · Fixed by #52230
Assignees
Labels
good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions

Comments

@ghost
Copy link

ghost commented Jun 14, 2019

For Example, fillna requires method kwarg to be specified. the frame method
raises an exception if it is missing, but transform silently returns an empty frame

In [22]: df=pd.DataFrame(dict(price=[10,10,20,20,30,30],color=[10,10,20,20,30,30],cost=(100,200,300,400,500,600)))
    ...: g=df.groupby(['price'])
    ...: try:
    ...:     df.fillna()
    ...: except:
    ...:     print("Raised: OK")
    ...: try:
    ...:     g.fillna()
    ...: except:
    ...:     print("Raised: OK")
    ...: else:
    ...:     print("Raised: No, Failed")    
    ...: try:    
    ...:     g.transform('fillna')
    ...: except:    
    ...:     print("Raised: OK")
    ...: else:
    ...:     print("Raised: No, Failed")
Raised: OK
Raised: No, Failed
Raised: No, Failed
@WillAyd
Copy link
Member

WillAyd commented Jun 14, 2019 via email

@mroeschke
Copy link
Member

Looks to work on master now. Could use a test

df=pd.DataFrame(dict(price=[10,10,20,20,30,30],color=[10,10,20,20,30,30],cost=(100,200,300,400,500,600)))
In [4]: g=df.groupby(['price']).transform('fillna')
ValueError: Must specify a fill 'value' or 'method'.

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jul 10, 2021
@michael-michelotti
Copy link

take

@michael-michelotti
Copy link

Hey @mroeschke

I want to try to take this on. This will be my first open source contribution, so I just want to make sure I do this right.

I do not see any specific file devoted to testing It looks to me like a test would best be housed at /pandas/tests/groupby/transform/test_transform.py

I don't see anything already testing this condition, or a similar test that this could be added to. Based on that, it seems like it should be its own test case. My inclination is to write one test function that requires positional arguments, one that requires keyword arguments, and one that requires both. That way, the test is not coupled to the fillna function, or any one function.

What is the thought on adding such test functions as fixtures in /pandas/tests/groupby/conftest.py? They may not be reusable in any way to justify a fixture, though. It seems like it would be easy enough to feed those fixtures into the test case and verify that the error types and messages match those of the test functions, though.

@mroeschke
Copy link
Member

/pandas/tests/groupby/transform/test_transform.py sounds like a good location for the test. I suggest just start writing the testing function and we can review in a pull request whether a fixture is appropriate

@michael-michelotti
Copy link

michael-michelotti commented Jul 21, 2021

@mroeschke Sorry for length of the essay to follow

The Issue Reported Here

The specific issue reported here has to do with using a built-in function referenced with a string. Those functions are defined in pandas.core.groupby.base.transform_kernel_allowlist or pandas.core.groupby.base.cythonized_kernels.

Crucially, the transform method takes totally different routes depending on whether the function it receives is a function or a string. You can see that on lines 1431-1439 in core/groupby/groupby.py

if not isinstance(func, str):
    return self._transform_general(func, *args, **kwargs)

elif func not in base.transform_kernel_allowlist:
    msg = f"'{func}' is not a valid function name for transform(name)"
    raise ValueError(msg)
elif func in base.cythonized_kernels or func in base.transformation_kernels:
    # cythonized transform or canned "agg+broadcast"
    return getattr(self, func)(*args, **kwargs)

If the input is a string like 'fillna', then the corresponding function is pulled from the DataFrameGroupBy class itself and called directly with args and kwargs, and that value is immediately returned. In other words, all validation of the args and kwargs is delegated to fillna, not transform, in this case.

In my opinion, that is the best way for it to be designed. However, it means that any test that tests the use of string inputs into transform will be tightly coupled with the validation functions for whichever function is chosen. In other words, you wouldn't really be testing the transform function, you would be testing the validation functions for fillna in this case. What that would do is enforce the delegation of argument validation going forward, which may be worth something.

Tangentially Related: User Generated Functions

The situation is much stickier if a function is passed into DataFrameGroupBy.transform. As seen above, the transform is delegated to the _transform_general method. I think there might be a couple of problems to address in that method.

Firstly, the method starts by defining two functions which wrap the original function. They are called fast_path and slow_path. They are defined in the _define_paths method as follows:

fast_path = lambda group: func(group, *args, **kwargs)
slow_path = lambda group: group.apply(
    lambda x: func(x, *args, **kwargs), axis=self.axis
)

Possible Problem 1: Original Function Stripped of Metadata

Going into the _define_paths method, the input function is fully intact. Coming out of it, all you have left are two lambda functions. This makes it quite confusing to work with those functions going forward as you step through the code.

It may be beneficial to rewrite these declarations as proper decorated functions. Something like the following (func is defined in the outer scope):

from functools import wraps
def fast_path(group):
    @wraps(func)
    def inner(group, *args, **kwargs):
        func(group, *args, **kwargs)
    return inner

Possible Problem 2: Fast and Slow Paths Accept Different Inputs

As you can see, the fast_path function just injects a group, args and kwargs into the original function. In this case, the group will be a DataFrame, so the first argument into the original function will be a DataFrame.

In the case slow_path, it uses the original function in a DataFrame.apply call. In this case, each input into the original function will be a Series. In other words, the input function needs to be able to handle both a DataFrame and a Series without throwing in error in order to work.

This would be fine with simple operations that can be broadcast across either a DataFrame or a Series. However, I don't think it would support more complex operations. For example, if I wanted to index a given column for every group I transformed, I would be unable to do that since I would get a KeyError when the function received a Series.

It may be beneficial to rewrite the fast_path and slow_path logic in some way to help with that.

Possible Problem 3: THE BIGGEST PROBLEM I FOUND

Once slow_path and fast_path are defined, every group in the groupby operation is iterated over and slow_path and fast_path are called on each group. That happens in the _choose_path method (line 1384 of groupby/generic.py).

The result from slow_path and fast_path are compared to each other. If they are the same, then the "chosen" path should be the fast_path. If something is wrong with the fast_path result, the "chosen" path is the slow_path.

The problem is, _choose_path is run for every group. The very first line of _choose_path sets the "chosen" path back to the slow_path, and runs both slow_path and fast_path again:

def _choose_path(self, fast_path: Callable, slow_path: Callable, group: DataFrame):
    path = slow_path
    res = slow_path(group)

    # if we make it here, test if we can use the fast path
    try:
        res_fast = fast_path(group)
    except AssertionError:
        raise  # pragma: no cover
    except Exception:
        # GH#29631 For user-defined function, we can't predict what may be
        #  raised; see test_transform.test_transform_fastpath_raises
        return path, res

In other words: both paths are currently being run on every group. Every transformation calculation is being done twice. It's possible there was supposed to be logic to only run _choose_path a single time, but it was never implemented.

Back to the Original Question: Testing

Aside from all the issues mentioned above, we may also benefit from testing to verify that user generated functions raise errors properly, even if we don't verify the built-in "string" functions like "fillna".

Right now the _transform_general function handles errors raised by the input function as follows:

  • If the provided function raises a TypeError while evaluating either a DataFrame (fast path) or a Series (slow path), then it assumes that it needs to evaluate the DataFrame series by series, and delegates to SeriesGroupBy.transform

    • Correction: This only happens if the function raises a TypeError from a Series. Any errors from evaluating a DataFrame will be silently ignored.
    • If evaluating every Series with the function returns a TypeError, then a TypeError is actually raised with the message: "Transformation function invalid for data types"
      • We might benefit from a more descriptive error here. Something like: "Provided function is invalid. It raised a TypeError for every Series in the DataFrame groups"
    • If the input function can process some Series without a TypeError, then only the processed Series are kept.
      • This is what causes incompatible Series to be automatically dropped by transform
    • Note: According to a warning, this functionality is deprecated. In the future, any single Series causing a TypeError will raise the TypeError immediately
  • If the provided function raises a ValueError while evaluating either a DataFrame (fast path) or a Series (slow path), then _transform_general raises the error immediately with the message: "transform must return a scalar value for each group"

    • Correction: This only happens if the function raises a ValueError from a Series. Any errors from evaluating a DataFrame will be silently ignored.
    • We might benefit from changing this error to something more expressive, like: `f"Provided function raised a ValueError when evaluating the {name} group."
  • If the provided function raises any error other than TypeError or ValueError, then the error will not be caught, and will be raised immediately

Ultimately, it seems to me like tests enforcing the behavior that I described above (if that is what we desire the behavior to be) could be pretty beneficial. Definitely more beneficial than I think writing tests for the "string" methods would be.

Summary/TL;DR

We could write tests for built-in functions like "fillna", but such a test would be tightly coupled to the "fillna" function, since transform delegates all of the argument validation to fillna.

Built-in functions and user-generated functions take completely different routes, and there are some possible problems I identified with the user-generated route:

  • There are some lambda functions used for wrapping that make stepping through the code confusing.
  • The function passed in is given both a DataFrame and a Series, and it needs to be able to handle both without crashing.
    • Correction: It needs to be able to handle the Series without crashing. If it crashes when dealing with a DataFrame, it just fails silently.
  • The transformation function is run twice on every group.
  • The error handling for user generated functions makes a couple of assumptions that result in possibly obscure error messages that could be made more expressive

Ultimately, tests to enforce the current behavior for user-generated functions could be useful. Possibly more useful than tests for the built-in functions.

Thanks if you read that whole thing!! Let me know what you think.

@michael-michelotti
Copy link

michael-michelotti commented Jul 21, 2021

Couple of related issues:
#41584
#42617

@rhshadrach
Copy link
Member

Problem 3 should no longer exist: #41598

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants