Skip to content

ERR: Shorten traceback in groupby _cython_agg_general #52992

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jun 19, 2023

Conversation

rhshadrach
Copy link
Member

The code

df = pd.DataFrame({"a": [1, 2, 3], "b": 1, "c": "aaaa"})
df.groupby("a").mean(numeric_only=False)

now raises

Traceback (most recent call last):
  File "/home/richard/scratch.py", line 59, in <module>
    df.groupby("a").mean(numeric_only=False)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1971, in mean
    result = self._cython_agg_general(
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1651, in _cython_agg_general
    new_mgr = data.grouped_reduce(array_func)
  File "/home/richard/dev/pandas/pandas/core/internals/managers.py", line 1497, in grouped_reduce
    applied = sb.apply(func)
  File "/home/richard/dev/pandas/pandas/core/internals/blocks.py", line 328, in apply
    result = func(self.values, **kwargs)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1648, in array_func
    result = self._agg_py_fallback(values, ndim=data.ndim, alt=alt)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1604, in _agg_py_fallback
    res_values = self.grouper.agg_series(ser, alt, preserve_dtype=True)
  File "/home/richard/dev/pandas/pandas/core/groupby/ops.py", line 841, in agg_series
    result = self._aggregate_series_pure_python(obj, func)
  File "/home/richard/dev/pandas/pandas/core/groupby/ops.py", line 862, in _aggregate_series_pure_python
    res = func(group)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1973, in <lambda>
    alt=lambda x: Series(x).mean(numeric_only=numeric_only),
  File "/home/richard/dev/pandas/pandas/core/series.py", line 5954, in mean
    return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  File "/home/richard/dev/pandas/pandas/core/generic.py", line 11436, in mean
    return self._stat_function(
  File "/home/richard/dev/pandas/pandas/core/generic.py", line 11393, in _stat_function
    return self._reduce(
  File "/home/richard/dev/pandas/pandas/core/series.py", line 5862, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 148, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 404, in new_func
    result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 720, in nanmean
    the_sum = _ensure_numeric(the_sum)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 1686, in _ensure_numeric
    raise TypeError(f"Could not convert string '{x}' to numeric")
TypeError: Could not convert string 'aaaa' to numeric

@rhshadrach rhshadrach requested a review from phofl April 29, 2023 00:11
@rhshadrach rhshadrach added Groupby Error Reporting Incorrect or improved errors from pandas Clean and removed Clean labels Apr 29, 2023
@rhshadrach rhshadrach changed the title CLN: Shorten traceback in groupby _cython_agg_general ERR: Shorten traceback in groupby _cython_agg_general Apr 29, 2023
@rhshadrach rhshadrach added this to the 2.1 milestone Apr 29, 2023
@phofl
Copy link
Member

phofl commented Apr 30, 2023

What's your opinion on raising

Traceback (most recent call last):
  File "/Users/patrick/PycharmProjects/pandas/pandas/core/groupby/groupby.py", line 1643, in array_func
    result = self.grouper._cython_operation(
  File "/Users/patrick/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 806, in _cython_operation
    return cy_op.cython_operation(
  File "/Users/patrick/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 531, in cython_operation
    return self._cython_op_ndim_compat(
  File "/Users/patrick/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 335, in _cython_op_ndim_compat
    return self._call_cython_op(
  File "/Users/patrick/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 392, in _call_cython_op
    func = self._get_cython_function(self.kind, self.how, values.dtype, is_numeric)
  File "/Users/patrick/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 195, in _get_cython_function
    raise NotImplementedError(
NotImplementedError: function is not implemented for this dtype: [how->mean,dtype->object]

instead? This seems more deliberate than an internal bug like th other does to me

@rhshadrach
Copy link
Member Author

rhshadrach commented Apr 30, 2023

I think that message would be misleading, e.g.

df = pd.DataFrame({"a": [1, 2, 3], "b": 1, "c": 1}, dtype=object)
print(df.groupby("a").mean(numeric_only=False))
#      b    c
# a          
# 1  1.0  1.0
# 2  1.0  1.0
# 3  1.0  1.0

mean on dtype object is implemented, it just takes the Python fallback path.

This seems more deliberate than an internal bug like th other does to me

Not sure what you mean by internal bug here.

@phofl
Copy link
Member

phofl commented May 1, 2023

If I get an error message like the above, my first instinct would be that the maintainers of the library made a mistake, not that my computation is wrong.

Wdyt about an error message like: Operation foo is not supported for this type?

@rhshadrach
Copy link
Member Author

rhshadrach commented May 2, 2023

TypeError: Could not convert string 'aaaa' to numeric

My reaction to this is "of course you can't convert, I shouldn't be trying an operations that tries to convert the string to numeric".

Wdyt about an error message like: Operation foo is not supported for this type?

What is "type" referring to here? It seems to me users could interpret this as object dtype which is technically not accurate.

@phofl
Copy link
Member

phofl commented May 2, 2023

Yeah this is the tricky part, specifying what exactly is wrong.

I think I illustrated my example in a bad way. It's hard to see which operation was wrong with this error. For example:

df = pd.DataFrame(
    {
        "a": 1,
        "int": [1, 2],
        "str": "a"
    }
)

df.groupby("a").agg({"int": "sum", "float": "str"})

No reference to column/operation and dtype isn't great I think

@rhshadrach
Copy link
Member Author

rhshadrach commented May 2, 2023

I think your last line should be:

df.groupby("a").agg({"int": "sum", "str": "mean"})

Agreed this should be improved. I'd like to handle it in a separate PR since it's a different path. I think it can be handled here, and improve this case as well.

@rhshadrach
Copy link
Member Author

@phofl - how does something like this look:

Traceback (most recent call last):
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1598, in _agg_py_fallback
    res_values = self.grouper.agg_series(ser, alt, preserve_dtype=True)
  File "/home/richard/dev/pandas/pandas/core/groupby/ops.py", line 847, in agg_series
    result = self._aggregate_series_pure_python(obj, func)
  File "/home/richard/dev/pandas/pandas/core/groupby/ops.py", line 868, in _aggregate_series_pure_python
    res = func(group)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1971, in <lambda>
    alt=lambda x: Series(x).mean(numeric_only=numeric_only),
  File "/home/richard/dev/pandas/pandas/core/series.py", line 6030, in mean
    return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  File "/home/richard/dev/pandas/pandas/core/generic.py", line 11510, in mean
    return self._stat_function(
  File "/home/richard/dev/pandas/pandas/core/generic.py", line 11467, in _stat_function
    return self._reduce(
  File "/home/richard/dev/pandas/pandas/core/series.py", line 5938, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 147, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 402, in new_func
    result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 718, in nanmean
    the_sum = _ensure_numeric(the_sum)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 1687, in _ensure_numeric
    raise TypeError(f"Could not convert string '{x}' to numeric")
TypeError: Could not convert string 'aaaa' to numeric

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/richard/.config/JetBrains/PyCharmCE2022.2/scratches/scratch.py", line 46, in <module>
    df.groupby("a").mean(numeric_only=False)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1969, in mean
    result = self._cython_agg_general(
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1649, in _cython_agg_general
    new_mgr = data.grouped_reduce(array_func)
  File "/home/richard/dev/pandas/pandas/core/internals/managers.py", line 1511, in grouped_reduce
    applied = sb.apply(func)
  File "/home/richard/dev/pandas/pandas/core/internals/blocks.py", line 330, in apply
    result = func(self.values, **kwargs)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1646, in array_func
    result = self._agg_py_fallback(how, values, ndim=data.ndim, alt=alt)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1601, in _agg_py_fallback
    raise NotImplementedError(msg) from err
NotImplementedError: function is not implemented for [how->mean,dtype->object]

Process finished with exit code 1

The negatives in my opinion are that it's back to being rather long, and it seems to me the user could think it means the function is not implemented for object dtype (which can be incorrect). However, it specifies the method of aggregation, the dtype, and the value on which it failed which can help users understand where it's coming from.

@phofl
Copy link
Member

phofl commented May 17, 2023

I'd prefer that one. In general, I am not a fan that this works sometimes for object, but this is a different discussion. This message is much clearer based on what I'd expect

@rhshadrach rhshadrach marked this pull request as draft May 24, 2023 11:14
@rhshadrach rhshadrach marked this pull request as ready for review May 27, 2023 13:01
@rhshadrach
Copy link
Member Author

@phofl - I reworked the raising here. We now raise the same type of exception as main; I think this is better than raising a NotImplementedError in general, and has the added benefit of being less likely to break user code.

@rhshadrach
Copy link
Member Author

@phofl - friendly ping.

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sorry I was travelling the last 3 weeks

msg = r"Cannot convert \['one' 'three' 'two'\] to numeric"
if agg_function in ("std", "sem"):
klass = ValueError
msg = "could not convert string to float: 'one'"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the difference intended here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we checked multiple possibilities for the message, of which this was one. By separating out the ValueError from the TypeError cases, we only have to check the message associated with the ValuError.

At least, I think that's what you're asking about.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant that std and see have different errors

Copy link
Member Author

@rhshadrach rhshadrach Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see! I do think there is something to improve here. If the algorithm actually tries to convert the value to a numeric one and fails, that to me is a ValueError. If, on the other hand, it sees that the object is a string and then doesn't even make an attempt to convert it, that would be a TypeError.

It would be good if we made this consistent by either (a) always trying to convert or (b) never trying to convert. However, even without this consistency, currently mean et al don't try to make the conversion and still raise "Could not convert":

df = DataFrame({'a': [1, 1, 2], 'b': list('456')})
gb = df.groupby('a')
gb.mean()
# TypeError: Could not convert string '45' to numeric

This should be improved, but separate from this PR.

@phofl phofl merged commit d06f2d3 into pandas-dev:main Jun 19, 2023
@phofl
Copy link
Member

phofl commented Jun 19, 2023

thx @rhshadrach

@rhshadrach rhshadrach deleted the groupby_mean_stack branch June 19, 2023 20:47
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
@IcedCereal
Copy link

How would you be able to solve this?

import pandas as pd
df = pd.read_csv('salaries_by_college_major.csv')
df.head()
df.shape
df['Starting Median Salary']
df.loc[43]
print(df['Mid-Career Median Salary'].max())
print(f"Index for the max mid career salary: {df['Mid-Career Median Salary'].idxmax()}")
df['Undergraduate Major'][8]
print(df["Starting Median Salary"].min())
df['Undergraduate Major'].loc[df['Starting Median Salary'].idxmin()]
df.loc[df['Mid-Career Median Salary'].idxmin()]
df['Mid-Career 90th Percentile Salary'] - df['Mid-Career 10th Percentile Salary']
spread_col = df['Mid-Career 90th Percentile Salary'] - df['Mid-Career 10th Percentile Salary']
df.insert(1, 'Spread', spread_col)
df.head()
low_risk = df.sort_values('Spread')
low_risk[['Undergraduate Major', 'Spread']].head()
highest_potential = df.sort_values('Mid-Career 90th Percentile Salary', ascending=False)
highest_potential[['Undergraduate Major', 'Mid-Career 90th Percentile Salary']].head()

highest_spread = df.sort_values('Spread', ascending=False)
highest_spread[['Undergraduate Major', 'Spread']].head()
df.groupby('Group').mean()
pd.options.display.float_format = '{:,.2f}'.format

ERR: TypeError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/groupby/groupby.py in _agg_py_fallback(self, how, values, ndim, alt)
1873 try:
-> 1874 res_values = self.grouper.agg_series(ser, alt, preserve_dtype=True)
1875 except Exception as err:

15 frames
TypeError: Cannot convert ['Accounting' 'Agriculture' 'Architecture' 'Business Management'
'Construction' 'Economics' 'Finance' 'Forestry'
'Health Care Administration' 'Hospitality & Tourism' 'Marketing'
'Nursing'] to numeric

The above exception was the direct cause of the following exception:

TypeError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/groupby/groupby.py in _agg_py_fallback(self, how, values, ndim, alt)
1876 msg = f"agg function failed [how->{how},dtype->{ser.dtype}]"
1877 # preserve the kind of exception that raised
-> 1878 raise type(err)(msg) from err
1879
1880 if ser.dtype == object:

TypeError: agg function failed [how->median,dtype->object]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ERR: GroupBy.foo raises confusing error message
3 participants