Skip to content

REGR: FutureWarning issued and empty DataFrame returned where no numeric types to aggregate #43501

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
simonjayhawkins opened this issue Sep 10, 2021 · 14 comments
Closed
3 tasks done
Labels
Bug Groupby Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply Regression Functionality that used to work in a prior pandas version Warnings Warnings that appear or should be added to pandas

Comments

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Sep 10, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
frame = pd.DataFrame({"a": np.random.randint(0, 5, 50), "b": ["foo", "bar"] * 25})
frame[["b"]].groupby(frame["a"]).mean()

Issue Description

the code sample was raising DataError: No numeric types to aggregate in 1.2.5 but in 1.3.x this now issues a warning that it will raise in the future. (and also appears that the stacklevel is incorrect)

on master

>>> frame[["b"]].groupby(frame["a"]).mean()
sys:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.mean is deprecated. In a future version, a TypeError will be raised. Before calling .mean, select only columns which should be valid for the function.
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

The change to a empty dataframe was in #41706 with the warning being added in #43154

cc @jbrockmendel @Dr-Irv

Expected Behavior

same as 1.2.5

>>> frame[["b"]].groupby(frame["a"]).mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simon/miniconda3/envs/pandas-1.2.5/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1496, in mean
    return self._cython_agg_general(
  File "/home/simon/miniconda3/envs/pandas-1.2.5/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 1015, in _cython_agg_general
    agg_mgr = self._cython_agg_blocks(
  File "/home/simon/miniconda3/envs/pandas-1.2.5/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 1121, in _cython_agg_blocks
    raise DataError("No numeric types to aggregate")
pandas.core.base.DataError: No numeric types to aggregate

Installed Versions

pd.show_versions()

@simonjayhawkins simonjayhawkins added Bug Groupby Regression Functionality that used to work in a prior pandas version Warnings Warnings that appear or should be added to pandas Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply labels Sep 10, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3.3 milestone Sep 10, 2021
@simonjayhawkins simonjayhawkins changed the title REGR: FutureWarning issued and empty DataFRa REGR: FutureWarning issued and empty DataFrame returned where no numeric types to aggregate Sep 10, 2021
@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 10, 2021

@simonjayhawkins It's not clear to me how you're suggesting this should be changed. Are you saying that the FutureWarning should be removed and the old behavior (raising a DataError) restored for 1.3.x? Are you also saying that we need to decide what will happen in 1.4 and beyond?

@simonjayhawkins
Copy link
Member Author

If it was raising DataError in 1.2.5 and we want to raise TypeError in the future, we could probably make that change without a warning (and a change in behavior to return an empty DataFrame in the interim)

I think that changing the exception type may not be an api change. changing the behavior is.

so for 1.3.x we could either restore the DataError (or maybe change to TypeError)

@jbrockmendel
Copy link
Member

and also appears that the stacklevel is incorrect

there's now a find_stack_level that can avoid these headaches

IIRC this was the best option available for ensuring internal consistency until the deprecation is enforced. e.g. we want frame[["b"]].groupby(frame["a"]).mean() to match frame[["b"]].groupby(frame["a"]).agg("mean") and frame[["b"]].groupby(frame["a"]).apply(np.mean)

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 10, 2021

Here's a table contrasting the behaviors using the examples @jbrockmendel listed between 1.2.5 and 1.3.2, and master:

Code 1.2.5 Result 1.3.2 Result 1.4.0.dev0+647.g0072fa8cb5 Result
frame[["b"]].groupby(frame["a"]).mean() DataError exception Empty DataFrame FutureWarning and Empty DataFrame
frame[["b"]].groupby(frame["a"]).agg("mean") DataError exception Empty DataFrame FutureWarning and Empty DataFrame
frame[["b"]].groupby(frame["a"]).apply(np.mean) Empty DataFrame FutureWarning and Empty DataFrame FutureWarning and Empty DataFrame
frame[["b"]].groupby(frame["a"]).sum() DataFrame with values Empty DataFrame DataFrame with values
frame[["b"]].groupby(frame["a"]).agg("sum") DataFrame with values Empty DataFrame DataFrame with values
frame[["b"]].groupby(frame["a"]).apply(np.sum) DataFrame with values DataFrame with values DataFrame with values

With the fix in #43154 , we now have restored the 1.2.x behavior in the case of sum . I think the DataError was correct in 1.2.x for mean, but then we have to figure out what to do in the case of apply(np.mean), because returning an empty DF in that case could be considered a bug (i.e., we should have raised). I'm not sure how easy that would be to fix. OTOH, putting in FutureWarning for apply(np.mean) makes sense, but maybe for the .mean() and .agg("mean") cases, we should restore the DataError behavior and not put in a warning.

Results above produced with following program:

long python code. click to expand
import pandas as pd
import numpy as np

frame = pd.DataFrame({"a": np.random.randint(0, 5, 50), "b": ["foo", "bar"] * 25})

print("Using mean")


print("calling mean directly")
try:
    print(frame[["b"]].groupby(frame["a"]).mean())
except Exception as e:
    print("exception raised ", str(e))

print("calling agg('mean')")
try:
    print(frame[["b"]].groupby(frame["a"]).agg("mean"))
except Exception as e:
    print("exception raised ", str(e))

print("calling apply(np.mean)")
try:
    print(frame[["b"]].groupby(frame["a"]).apply(np.mean))
except Exception as e:
    print("exception raised ", str(e))

print("calling sum directly")
try:
    print(frame[["b"]].groupby(frame["a"]).sum())
except Exception as e:
    print("exception raised ", str(e))

print("calling agg('sum')")
try:
    print(frame[["b"]].groupby(frame["a"]).agg("sum"))
except Exception as e:
    print("exception raised ", str(e))

print("calling apply(np.sum)")
try:
    print(frame[["b"]].groupby(frame["a"]).apply(np.sum))
except Exception as e:
    print("exception raised ", str(e))

@jbrockmendel
Copy link
Member

I think the DataError was correct in 1.2.x for mean, but then we have to figure out what to do in the case of apply(np.mean), because returning an empty DF in that case could be considered a bug (i.e., we should have raised)

The apply-that-is-an-agg part of this seems like something to ask @rhshadrach about.

@rhshadrach
Copy link
Member

rhshadrach commented Sep 10, 2021

The relevant parts of the algorithm works in two steps:

  • Subset the columns to numeric only
  • Operate on the resulting DataFrame

To separate these, consider changing the example in the question to:

frame = pd.DataFrame({"a": np.random.randint(0, 5, 50), "b": ["foo", "bar"] * 25})
res = frame[[]].groupby(frame["a"]).mean()
print(res)

This raises on 1.2.x, but not on master. I believe raising in this case is a bug - it is inconsistent with other reductions (e.g. sum, min, max, prod). Making it so this op doesn't raise from 1.2 -> 1.3 seems appropriate to me.

Now consider the example:

frame = pd.DataFrame({"a": np.random.randint(0, 5, 50), "b": ["foo", "bar"] * 25})
res = frame[["b"]].groupby(frame["a"]).mean(numeric_only=True)
print(res)

If we agree that not raising on an empty frame is correct, then it seems to me to be reasonable to have the same behavior when numeric_only subsets the frame to become empty. In other words, I think this shouldn't raise as well.

Finally, back to the original example which is the same as above but numeric_only is left unspecified, it did raise in 1.2.x, it no longer raises but will raise a TypeError in the future. This is one case where I could see reverting to raising instead, but also think it would be okay to leave this behavior as is.

I think the DataError was correct in 1.2.x for mean, but then we have to figure out what to do in the case of apply(np.mean), because returning an empty DF in that case could be considered a bug (i.e., we should have raised)

I think we currently treat .apply(np.mean) et al the same as any other UDF - i.e. as a black box (correct me if this is wrong!). With this, I don't think any special consideration should be given that we don't give to other UDFs. In other words, we should not go out of our way to ensure apply(np.mean) is consistent with .mean() or .agg('mean'). With this last statement - I most certainly think there is room for disagreement. That is, I think there is an argument to make that numpy functions are special and should be given special consideration. My opinion is that, at least right now, the extra logic and maintenance burden is not worth it.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 11, 2021

@rshadrach so I've taken the table I made above and added a "Proposal" column, which I think is what you are suggesting above.

Code 1.2.5 Result 1.3.2 Result 1.4.0.dev0+647.g0072fa8cb5 Result Proposal
frame[["b"]].groupby(frame["a"]).mean() DataError exception Empty DataFrame FutureWarning and Empty DataFrame Empty DataFrame
frame[["b"]].groupby(frame["a"]).agg("mean") DataError exception Empty DataFrame FutureWarning and Empty DataFrame Empty DataFrame
frame[["b"]].groupby(frame["a"]).apply(np.mean) Empty DataFrame FutureWarning and Empty DataFrame FutureWarning and Empty DataFrame Empty DataFrame
frame[["b"]].groupby(frame["a"]).sum() DataFrame with values Empty DataFrame DataFrame with values DataFrame with values
frame[["b"]].groupby(frame["a"]).agg("sum") DataFrame with values Empty DataFrame DataFrame with values DataFrame with values
frame[["b"]].groupby(frame["a"]).apply(np.sum) DataFrame with values DataFrame with values DataFrame with values DataFrame with values

That would replicate the 1.2.5 behavior except where we were raising the DataError, replacing that with an empty DataFrame. It also seems that this would be the easiest fix to make??

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
@rhshadrach
Copy link
Member

@Dr-Irv - Is "Proposal" what we consider to be 2.0 behavior? In such a case, the top two lines of Proposal should raise a TypeError. For the third line, it should be whatever error numpy raises I think (assuming the first sentence of the last paragraph in #43501 (comment) is correct).

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 14, 2021

@Dr-Irv - Is "Proposal" what we consider to be 2.0 behavior? In such a case, the top two lines of Proposal should raise a TypeError.

@rhshadrach I think I was proposing 1.3.4 (or 1.4) behavior, based on when you wrote this:

his raises on 1.2.x, but not on master. I believe raising in this case is a bug - it is inconsistent with other reductions (e.g. sum, min, max, prod). Making it so this op doesn't raise from 1.2 -> 1.3 seems appropriate to me.

@rhshadrach
Copy link
Member

@Dr-Irv - the paragraph you quoted was considering the example

frame[[]].groupby(frame["a"]).mean()

For the top two lines, these will raise a TypeError in pandas 2.0, and so need to have the FutureWarning in 1.3.x/1.4.

@simonjayhawkins
Copy link
Member Author

changing milestone to 1.3.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021
@simonjayhawkins
Copy link
Member Author

Finally, back to the original example which is the same as above but numeric_only is left unspecified, it did raise in 1.2.x, it no longer raises but will raise a TypeError in the future. This is one case where I could see reverting to raising instead, but also think it would be okay to leave this behavior as is.

There is also the issue of the FutureWarning being added in #43154 which was released in 1.3.3 violating the version policy https://pandas.pydata.org/pandas-docs/dev/development/policies.html

We will not introduce new deprecations in patch releases.

If we leave as is, the issue probably becomes irrelevant after 1.3.5 and can be closed as no action.

If we want to revert to raising instead, it is only appropriate if done before 1.3.5? (assuming no further releases on 1.3.x)

@jreback
Copy link
Contributor

jreback commented Nov 27, 2021

the future warning was shown because of a 1.2.x regression

-1 on any further changes in 1.3.x

@simonjayhawkins
Copy link
Member Author

sure. closing as no action.

@simonjayhawkins simonjayhawkins removed this from the 1.3.5 milestone Nov 27, 2021
@simonjayhawkins simonjayhawkins added this to the No action milestone Nov 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply Regression Functionality that used to work in a prior pandas version Warnings Warnings that appear or should be added to pandas
Projects
None yet
Development

No branches or pull requests

5 participants