API: Specify the behaviour for operating on empty objects #47959

ntachukwu · 2022-08-04T07:28:48Z

There isn't an issue for empty inputs.

There is a need to specify the behaviour for empty input. Note that “empty” input can mean many things, and even combinations of them:

zero rows but non-zero columns
zero columns but non-zero rows
zero rows and zero columns
other: e.g., empty list provided to agg()
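For concreteness, here is a minimal sketch of how the first three kinds of empty DataFrame can be constructed. The variable names are illustrative only. All three report `.empty` as True even though their shapes differ, which is part of why "empty" needs to be spelled out.

```python
import pandas as pd

# Zero rows but non-zero columns.
zero_rows = pd.DataFrame({"a": [], "b": []})

# Zero columns but non-zero rows (only an index is given).
zero_cols = pd.DataFrame(index=[0, 1, 2])

# Zero rows and zero columns.
no_rows_no_cols = pd.DataFrame()

for name, frame in [
    ("zero_rows", zero_rows),
    ("zero_cols", zero_cols),
    ("no_rows_no_cols", no_rows_no_cols),
]:
    print(name, frame.shape, frame.empty)
```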

As suggested by @shwina in the Pandas Standardisation docs, the idea is to start by writing a bunch of tests across many different Pandas operations and see how many pass across libraries.

Classic example (quiz: what should this do?)

import pandas as pd

df = pd.DataFrame({"a": [], "b": []})
df.groupby("a").agg({})

A starting list of operations where tests are needed would be particularly helpful. I plan to go through them and add all the tests.

The above example raises ValueError: No objects to concatenate. Is this the expected behaviour?

attack68 · 2022-08-04T15:48:55Z

I think this concept is worth exploring. I think you might have trouble establishing a consensus view in individual cases (such as this) on what the appropriate action is. For me an analogy is 0/0 in mathematics, which may be undefined or have some asymptotic limit as a value, depending on the context.

However, what might be easier to conclude is that if two similar pandas functions give different results when operating on empty DataFrames, then one of them should really be changed. Which one to change, and to what, might require discussion, but I think presenting cases like this will probably help to drive the issue forward.
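A minimal sketch of how such side-by-side comparisons could be collected, assuming a hypothetical `probe` helper (not part of pandas) that records either the result type and shape or the exception type. The operations listed are just examples, and the outcomes are printed rather than asserted because they may differ between pandas versions.

```python
import pandas as pd


def probe(operation, frame):
    """Hypothetical helper: describe what `operation` does on `frame`."""
    try:
        result = operation(frame)
    except Exception as exc:
        return f"raises {type(exc).__name__}"
    shape = getattr(result, "shape", None)
    return f"returns {type(result).__name__} with shape {shape}"


empty = pd.DataFrame({"a": [], "b": []})

# Similar-looking operations whose behaviour on empty input can be compared.
operations = {
    "sum()": lambda d: d.sum(),
    "agg('sum')": lambda d: d.agg("sum"),
    "apply(sum)": lambda d: d.apply(lambda col: col.sum()),
    "take([0])": lambda d: d.take([0]),
}

report = {name: probe(op, empty) for name, op in operations.items()}
for name, outcome in report.items():
    print(f"{name:12} {outcome}")
```

Any pair of rows that describe the same logical operation but report different outcomes would be a candidate for a dedicated issue.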

rhshadrach · 2022-08-08T21:06:38Z

I believe this issue is meant only for methods that take a UDF (e.g. agg, apply, transform), is that correct? For other methods (e.g. sum, mean, fillna) I think there is a well-defined answer. It's only when given an unknown UDF that there is no "right" answer, but we should certainly strive for a consistent one.

One idea expressed in another issue (I'll have to track down where it is) is to call the UDF with an empty object, returning a default result if it raises. Something very roughly like:

# data is an empty DataFrame
try:
    result = self.call_method(data)
except Exception:
    result = data.copy()
return result

and documenting this behavior for working with empty objects. I'm attracted to the idea because it would allow users to write UDFs as:

def my_udf(x):
    if x.empty:
        return my_default_object
    ...

allowing them to have complete control over the result. Also, if a user does not opt-in to implementing the "empty-object" path, it returns a default which is at least easy to reason about.
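A runnable sketch of this idea, under the assumption of a standalone helper `call_udf_on_empty` (a hypothetical name, standing in for whatever pandas would do internally). It calls the UDF with the empty object and falls back to a copy of the input if the UDF raises.

```python
import pandas as pd


def call_udf_on_empty(udf, data):
    """Hypothetical helper: try the UDF on the empty object, else fall back."""
    try:
        return udf(data)
    except Exception:
        # Default result when the UDF has no "empty-object" path.
        return data.copy()


def udf_with_empty_path(x):
    # The user opts in and fully controls the result for empty input.
    if x.empty:
        return pd.Series(dtype="float64")
    return x.sum()


def udf_without_empty_path(x):
    # Positional access raises IndexError when there are no rows.
    return x.iloc[0]


empty = pd.DataFrame({"a": [], "b": []})

opted_in = call_udf_on_empty(udf_with_empty_path, empty)
fallback = call_udf_on_empty(udf_without_empty_path, empty)

print(type(opted_in).__name__, len(opted_in))
print(type(fallback).__name__, fallback.shape)
```

The fallback here is a copy rather than the input itself, so callers can modify the result without touching the original object.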

For agg we would use .grouper.result_index and for transform we could use .obj.index for a more accurate result. But for apply there is no clear default because it can be used with reducers, transformers, and anything else for that matter.

ntachukwu · 2022-08-15T15:27:26Z

Here's a doc where I am compiling a table of pandas operations, each with an example and the result of running that example.

I am starting with groupby, but I hope to cover most pandas operations.
Please leave comments with suggestions and insights (I would also like to know whether this is an appropriate way of approaching this issue).

ntachukwu · 2022-08-24T16:02:47Z

The table has been completed for GroupBy methods. Please add comments and suggestions to the doc.

asmeurer · 2022-08-25T16:01:03Z

CC @jakirkham @jcrist. I know that Dask uses empty DataFrames for metadata. Are there any Dask specific uses of empty dataframes that should be kept in mind here, or any other comments you have on how empty dataframes should behave?

jorisvandenbossche · 2022-09-05T15:36:01Z

@Th3nn3ss thanks a lot for that google doc! That's a useful overview, and I left a few comments.

In general, I think it's certainly useful to start adding tests for all those cases (like you started in #48327). Even if there is some behaviour that we would rather want to change, it's still good to have a test for it (easier to update it later if we change the behaviour), and such PRs could also trigger some discussion on specific corner cases.

One scope-related suggestion: can we limit this issue to the behaviour of operating on empty DataFrames/Series (and leave construction with empty input, e.g. #48330, for another issue)?

jorisvandenbossche · 2022-09-05T15:45:50Z

[@rhshadrach] I believe this issue is meant only for methods that take a UDF (e.g. agg, apply, transform), is that correct? For other methods (e.g. sum, mean, fillna) I think there is a well-defined answer. It's only when given an unknown UDF that there is no "right" answer, but we should certainly strive for a consistent one.

It's true that in general it will mostly be for UDFs that it is hard (or impossible) to know what the correct output shape will be if there is no group to try the function on, while for most built-in methods the result will often be well defined. But, for some of our groupby methods, we also internally basically use the UDF path and apply some method, and so those could still give errors / inconsistencies as well.
For example, the take() mentioned in the google doc:

>>> df = pd.DataFrame({'a': [], 'b': []})
>>> df.groupby('a').take([0])
...
IndexError: indices are out-of-bounds

It's certainly debatable what the expected result should be, but the current error comes from calling the take() method on an empty DataFrame in order to infer what the result should look like when there are no groups. In this case, for example, I think it would be sensible to return an empty result: there is no group to take from, so taking the 0th element from that non-existing group shouldn't raise either, and since we know this just filters rows, we know what the result shape should look like.

Of course, we could certainly use your suggestion of checking for empty objects inside the UDF for the internal ones as well.
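A minimal sketch of such an empty-object check, using plain DataFrame.take() rather than the groupby internals. `take_or_empty` is a hypothetical wrapper for illustration, not an existing pandas function.

```python
import pandas as pd


def take_or_empty(obj, indices):
    """Hypothetical guard: return an empty result instead of raising."""
    if obj.empty:
        # Nothing to take from; keep the columns and dtypes, with zero rows.
        return obj.iloc[0:0]
    return obj.take(indices)


empty = pd.DataFrame({"a": [], "b": []})

# Without the guard, positional take on an empty frame raises.
try:
    empty.take([0])
    unguarded = "no error"
except IndexError as exc:
    unguarded = f"IndexError: {exc}"

guarded = take_or_empty(empty, [0])

print(unguarded)
print(guarded.shape)
```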

rhshadrach · 2022-09-05T16:03:20Z

@jorisvandenbossche

But, for some of our groupby methods, we also internally basically use the UDF path and apply some method, and so those could still give errors / inconsistencies as well.

Agreed - if I had to guess, there are likely many bugs or at least undesirable results here. But I'm suggesting that we tackle those operations in separate issues, as the nature of the operation could be taken into consideration in determining the correct result. For a UDF, on the other hand, we do not know the nature of the operation, and so we need to decide on a behavior without considering it. It's for this reason that I suggest making this issue just for UDFs, to focus the scope.

jorisvandenbossche · 2022-09-14T09:56:39Z

For future reference, the specific example of sample() with an empty dataframe has been tackled in #48484

For a UDF on the other hand, we do not know the nature of the operation, and so we need to decide on a behavior without considering it. It's for this reason why I suggest to make this issue just for UDFs, to focus the scope.

Maybe we could also open a dedicated issue to discuss this specific issue for groupby with UDFs, and keep this as a general overview issue for working on empty dataframes (my understanding is that it is not meant to be limited to groupby)

rhshadrach · 2022-09-14T21:13:30Z

@jorisvandenbossche

Maybe we could also open a dedicated issue to discuss this specific issue for groupby with UDFs, and keep this as a general overview issue for working on empty dataframes (my understanding is that it is not meant to be limited to groupby)

Just to be certain, I've been advocating making this issue about all methods that take UDFs, not just groupby. I personally find "handling of empty DataFrames" to be too wide a scope for discussion on a single issue (but it would make sense to me as a tracking issue that links to other issues). However, I certainly have no real objection to its existence if that's what others want to pursue. I will point out that if we were to make an issue on how pandas methods that take UDFs handle empty objects, we would then have two places where discussion could take place, and that seems undesirable to me.

rhshadrach · 2023-01-14T13:54:59Z

One idea (proposed in #50588 (comment) and I think some other places) is to allow users to specify a second UDF or object to handle the empty case. This seems more explicit than having users bake special cases into the UDF itself, and is a more explicit change from an API standpoint (adding a new argument rather than just changing behavior). However, I don't think we should ever require such a function, so even if we do add it we're still left to decide the behavior when one isn't provided.
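A rough sketch of what that could look like from the user's side. Both the `apply_udf` wrapper and its `on_empty` argument are hypothetical names used for illustration; no such argument exists in pandas.

```python
import pandas as pd


def apply_udf(data, func, on_empty=None):
    """Hypothetical wrapper: `on_empty` is an object or a second UDF."""
    if data.empty and on_empty is not None:
        # A callable is treated as the second UDF; anything else is
        # returned as the result for the empty case.
        return on_empty(data) if callable(on_empty) else on_empty
    # No handler given (or data is not empty): fall through to func,
    # which is where the default behaviour still has to be decided.
    return func(data)


def row_count(frame):
    return len(frame)


empty = pd.DataFrame({"a": [], "b": []})
full = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(apply_udf(empty, row_count, on_empty=-1))
print(apply_udf(empty, row_count, on_empty=lambda d: list(d.columns)))
print(apply_udf(full, row_count, on_empty=-1))
print(apply_udf(empty, row_count))
```

Treating any callable as the second UDF means a callable cannot itself be the default object. A real API would have to decide how to separate the two.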

attack68 mentioned this issue Aug 5, 2022

BUG: Pandas 1.4.x - calling groupBy(...).apply(func) on an empty dataframe invokes func #47985

Closed

rhshadrach added the labels Apply (Apply, Aggregate, Transform, Map), API - Consistency (Internal Consistency of API/Behavior), and Needs Discussion (Requires discussion from core team before further action) on Aug 8, 2022

mroeschke mentioned this issue Aug 26, 2022

Inconsistent index between plain DataFrame and read_sql DataFrame #48193

Closed

5 tasks

jorisvandenbossche changed the title from "Specify the behaviour for empty input" to "Specify the behaviour for operating on empty objects" on Sep 5, 2022

jorisvandenbossche changed the title from "Specify the behaviour for operating on empty objects" to "API: Specify the behaviour for operating on empty objects" on Sep 5, 2022

ntachukwu mentioned this issue Sep 16, 2022

BUG: Inconsistency when empty inputs are passed to agg for groupby #48581

Open

3 tasks

rhshadrach mentioned this issue Dec 16, 2022

BUG: assign with apply over axis=1 sometimes fails when the DataFrame has zero rows #50244

Open

3 tasks

This was referenced Jan 5, 2023

BUG: Inconsistent return type of df.apply on empty df #50588

Closed

BUG: Inconsistent result dtypes for aggregation functions (e.g. sum, mean, .apply(np.sum), ...) #50628

Open

rhshadrach mentioned this issue Nov 3, 2023

BUG: DataFrame.apply with empty dataframe applies the function #49200

Closed

3 tasks

rhshadrach mentioned this issue Nov 26, 2024

BUG: DataFrame.agg() on empty dataframe returns unexpected result #60424

Closed

3 tasks
