Skip to content

API: Specify the behaviour for operating on empty objects #47959

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
1 task
ntachukwu opened this issue Aug 4, 2022 · 11 comments
Open
1 task

API: Specify the behaviour for operating on empty objects #47959

ntachukwu opened this issue Aug 4, 2022 · 11 comments
Labels
API - Consistency Internal Consistency of API/Behavior Apply Apply, Aggregate, Transform, Map Needs Discussion Requires discussion from core team before further action

Comments

@ntachukwu
Copy link
Contributor

ntachukwu commented Aug 4, 2022

There isn't an issue for empty inputs.

There is a need to specify the behaviour for empty input. Note that “empty” input can mean many things, and even combinations of them:

  • zero rows but non-zero columns
  • zero columns but non-zero rows
  • zero rows and zero columns
  • other: e.g., empty list provided to agg()

As suggested by @shwina in the Pandas Standardisation docs, the idea is to start by writing a bunch of tests across many different Pandas operations and see how many pass across libraries.

Classic example (quiz: what should this do?)

df = pd.DataFrame({"a": [], "b": []})
df.groupby(“a”).agg({})

A starting list of operations where tests are needed would be particularly helpful, I plan to start going through them and adding all the tests.

  • The above example returns a ValueError: No objects to concatenate, is this the expected behaviour?
@attack68
Copy link
Contributor

attack68 commented Aug 4, 2022

I think this concept is worth exploring. I think you might have trouble establishing a consensus view in individual cases (such as this) for what is the appropriate action. For me an analogy is 0/0 in mathematics, which may be undefined or have a some asymptotic limit as a value but it depends.

However, what it might be easier to conclude is that if two similar pandas functions give different results when operating on empty DataFrames, then one of them should really be changed. Which one to which might require discussion, but I think presenting cases like this will probably help to drive forward the issue.

@rhshadrach
Copy link
Member

I believe this issue is meant only for methods that take a UDF (e.g. agg, apply, transform), is that correct? For other methods (e.g. sum, mean, fillna) I think there is a well-defined answer. It's only when given an unknown UDF that there is no "right" answer, but we should certainly strive for a consistent one.

One idea expressed in another issue (I'll have to track down where it is) is to call the UDF with an empty object, returning a default result if it raises. Something very roughly like:

# data is an empty DataFrame
try:
    result = self.call_method(data)
except Exception:
    result = data.copy()
return result

and documenting this behavior for working with empty objects. I'm attracted to the idea because it would allow users to write UDFs as:

def my_udf(x):
    if x.empty:
        return my_default_object
    ...

allowing them to have complete control over the result. Also, if a user does not opt-in to implementing the "empty-object" path, it returns a default which is at least easy to reason about.

For agg we would use .grouper.result_index and for transform we could use .obj.index for a more accurate result. But for apply there is no clear default because it can be used with reducers, transformers, and anything else for that matter.

@rhshadrach rhshadrach added Apply Apply, Aggregate, Transform, Map API - Consistency Internal Consistency of API/Behavior Needs Discussion Requires discussion from core team before further action labels Aug 8, 2022
@ntachukwu
Copy link
Contributor Author

Here's a doc where I am compiling a table that shows some of pandas operations with an example and the result from running the example.

I am starting with group_by but I hope to check most of Pandas operations.
Please leave comments for suggestions and insights ( I will also like to know if this is an appropriate method of going about this issue).

@ntachukwu
Copy link
Contributor Author

The table has been completed for GroupBy methods. Please add comments and suggestions to the doc.
Screenshot 2022-08-24 at 16 58 19

@asmeurer
Copy link

CC @jakirkham @jcrist. I know that Dask uses empty DataFrames for metadata. Are there any Dask specific uses of empty dataframes that should be kept in mind here, or any other comments you have on how empty dataframes should behave?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 5, 2022

@Th3nn3ss thanks a lot for that google doc! That's a useful overview, and I left a few comments.

In general, I think it's certainly useful to start adding tests for all those cases (like you started in #48327). Even if there is some behaviour that we would rather want to change, it's still good to have a test for it (easier to update it later if we change the behaviour), and such PRs could also trigger some discussion on specific corner cases.

One scope related suggestion: can we limit this issue to behaviour of operating on empty DataFrames/Series? (and leave the construction with empty input(eg #48330) for another issue)

@jorisvandenbossche
Copy link
Member

[@rhshadrach] I believe this issue is meant only for methods that take a UDF (e.g. agg, apply, transform), is that correct? For other methods (e.g. sum, mean, fillna) I think there is a well-defined answer. It's only when given an unknown UDF that there is no "right" answer, but we should certainly strive for a consistent one.

It's true that in general it will mostly be for UDFs that it is hard (or impossible) to know what the correct output shape will be if there is no group to try the function on, while for most built-in methods the result will often be well defined. But, for some of our groupby methods, we also internally basically use the UDF path and apply some method, and so those could still give errors / inconsistencies as well.
For example, the take() mentioned in the google doc:

>>> df = pd.DataFrame({'a': [], 'b': []})
>>> df.groupby('a').take([0])
...
IndexError: indices are out-of-bounds

It's certainly debatable what the expected result should be, but the current error comes from calling the take() method on an empty DataFrame to be able to infer how the result should look like when there are no groups. For example in this case, I think it would be sensible to return an empty result (there is no group to take from, so taking the 0th element from that non-existing group also shouldn't raise, and since we know this just filters rows, we know what the result shape should look like)

Of course, we could certainly use your suggestion of checking for empty objects inside the UDF for the internal ones as well.

@jorisvandenbossche jorisvandenbossche changed the title Specify the behaviour for empty input Specify the behaviour for operating on empty objects Sep 5, 2022
@jorisvandenbossche jorisvandenbossche changed the title Specify the behaviour for operating on empty objects API: Specify the behaviour for operating on empty objects Sep 5, 2022
@rhshadrach
Copy link
Member

@jorisvandenbossche

But, for some of our groupby methods, we also internally basically use the UDF path and apply some method, and so those could still give errors / inconsistencies as well.

Agreed - if I had to guess there are likely many bugs or at least undesirable results here. But I'm suggesting that for those operations, we tackle them in a separate issues as the nature of the operation could be taken into consideration in determining the correct result. For a UDF on the other hand, we do not know the nature of the operation, and so we need to decide on a behavior without considering it. It's for this reason why I suggest to make this issue just for UDFs, to focus the scope.

@jorisvandenbossche
Copy link
Member

For future reference, the specific example of sample() with an empty dataframe has been tackled in #48484

For a UDF on the other hand, we do not know the nature of the operation, and so we need to decide on a behavior without considering it. It's for this reason why I suggest to make this issue just for UDFs, to focus the scope.

Maybe we could also open a dedicated issue to discuss this specific issue for groupby with UDFs, and keep this as a general overview issue for working on empty dataframes (my understanding is that it is not meant to be limited to groupby)

@rhshadrach
Copy link
Member

@jorisvandenbossche

Maybe we could also open a dedicated issue to discuss this specific issue for groupby with UDFs, and keep this as a general overview issue for working on empty dataframes (my understanding is that it is not meant to be limited to groupby)

Just to be certain, I've been advocating making this issue about all methods that take UDFs, not just groupby. I personally find "handling of empty DataFrames" to be too wide of scope for discussion on a single issue (but would make sense to me as a tracking issue that links to other issues). However I certainly have no real objection as to its existence if that's what others want to pursue. I will point out that if we were to make an issue on how pandas methods that take UDFs handle empty objects, we now have two places where discussion could take place and that seems undesirable to me.

@rhshadrach
Copy link
Member

rhshadrach commented Jan 14, 2023

One idea (proposed in #50588 (comment) and I think some other places) is to allow users specify a second UDF or object to handle the empty case. This seems more explicit than having users bake special cases into the UDF itself, and is a more explicit change from an API standpoint (adding a new argument rather than just changing behavior). However, I don't think we should ever require such a function, so even if we do add it we're still left to decide the behavior when one isn't provided.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Apply Apply, Aggregate, Transform, Map Needs Discussion Requires discussion from core team before further action
Projects
None yet
5 participants