ERR: Shorten traceback in groupby _cython_agg_general #52992

rhshadrach · 2023-04-29T00:11:05Z

closes ERR: GroupBy.foo raises confusing error message #52930 (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

The code

df = pd.DataFrame({"a": [1, 2, 3], "b": 1, "c": "aaaa"})
df.groupby("a").mean(numeric_only=False)

now raises

Traceback (most recent call last):
  File "/home/richard/scratch.py", line 59, in <module>
    df.groupby("a").mean(numeric_only=False)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1971, in mean
    result = self._cython_agg_general(
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1651, in _cython_agg_general
    new_mgr = data.grouped_reduce(array_func)
  File "/home/richard/dev/pandas/pandas/core/internals/managers.py", line 1497, in grouped_reduce
    applied = sb.apply(func)
  File "/home/richard/dev/pandas/pandas/core/internals/blocks.py", line 328, in apply
    result = func(self.values, **kwargs)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1648, in array_func
    result = self._agg_py_fallback(values, ndim=data.ndim, alt=alt)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1604, in _agg_py_fallback
    res_values = self.grouper.agg_series(ser, alt, preserve_dtype=True)
  File "/home/richard/dev/pandas/pandas/core/groupby/ops.py", line 841, in agg_series
    result = self._aggregate_series_pure_python(obj, func)
  File "/home/richard/dev/pandas/pandas/core/groupby/ops.py", line 862, in _aggregate_series_pure_python
    res = func(group)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1973, in <lambda>
    alt=lambda x: Series(x).mean(numeric_only=numeric_only),
  File "/home/richard/dev/pandas/pandas/core/series.py", line 5954, in mean
    return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  File "/home/richard/dev/pandas/pandas/core/generic.py", line 11436, in mean
    return self._stat_function(
  File "/home/richard/dev/pandas/pandas/core/generic.py", line 11393, in _stat_function
    return self._reduce(
  File "/home/richard/dev/pandas/pandas/core/series.py", line 5862, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 148, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 404, in new_func
    result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 720, in nanmean
    the_sum = _ensure_numeric(the_sum)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 1686, in _ensure_numeric
    raise TypeError(f"Could not convert string '{x}' to numeric")
TypeError: Could not convert string 'aaaa' to numeric

…agg_general

phofl · 2023-04-30T10:02:25Z

What's your opinion on raising

Traceback (most recent call last):
  File "/Users/patrick/PycharmProjects/pandas/pandas/core/groupby/groupby.py", line 1643, in array_func
    result = self.grouper._cython_operation(
  File "/Users/patrick/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 806, in _cython_operation
    return cy_op.cython_operation(
  File "/Users/patrick/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 531, in cython_operation
    return self._cython_op_ndim_compat(
  File "/Users/patrick/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 335, in _cython_op_ndim_compat
    return self._call_cython_op(
  File "/Users/patrick/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 392, in _call_cython_op
    func = self._get_cython_function(self.kind, self.how, values.dtype, is_numeric)
  File "/Users/patrick/PycharmProjects/pandas/pandas/core/groupby/ops.py", line 195, in _get_cython_function
    raise NotImplementedError(
NotImplementedError: function is not implemented for this dtype: [how->mean,dtype->object]

instead? This seems more deliberate than an internal bug like th other does to me

rhshadrach · 2023-04-30T12:21:00Z

I think that message would be misleading, e.g.

df = pd.DataFrame({"a": [1, 2, 3], "b": 1, "c": 1}, dtype=object)
print(df.groupby("a").mean(numeric_only=False))
#      b    c
# a          
# 1  1.0  1.0
# 2  1.0  1.0
# 3  1.0  1.0

mean on dtype object is implemented, it just takes the Python fallback path.

This seems more deliberate than an internal bug like th other does to me

Not sure what you mean by internal bug here.

phofl · 2023-05-01T21:46:25Z

If I get an error message like the above, my first instinct would be that the maintainers of the library made a mistake, not that my computation is wrong.

Wdyt about an error message like: Operation foo is not supported for this type?

rhshadrach · 2023-05-02T01:04:13Z

TypeError: Could not convert string 'aaaa' to numeric

My reaction to this is "of course you can't convert, I shouldn't be trying an operations that tries to convert the string to numeric".

Wdyt about an error message like: Operation foo is not supported for this type?

What is "type" referring to here? It seems to me users could interpret this as object dtype which is technically not accurate.

phofl · 2023-05-02T08:13:19Z

Yeah this is the tricky part, specifying what exactly is wrong.

I think I illustrated my example in a bad way. It's hard to see which operation was wrong with this error. For example:

df = pd.DataFrame(
    {
        "a": 1,
        "int": [1, 2],
        "str": "a"
    }
)

df.groupby("a").agg({"int": "sum", "float": "str"})

No reference to column/operation and dtype isn't great I think

…pby_mean_stack

rhshadrach · 2023-05-02T21:54:53Z

I think your last line should be:

df.groupby("a").agg({"int": "sum", "str": "mean"})

Agreed this should be improved. ~~I'd like to handle it in a separate PR since it's a different path.~~ I think it can be handled here, and improve this case as well.

…pby_mean_stack

rhshadrach · 2023-05-17T22:15:40Z

@phofl - how does something like this look:

Traceback (most recent call last):
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1598, in _agg_py_fallback
    res_values = self.grouper.agg_series(ser, alt, preserve_dtype=True)
  File "/home/richard/dev/pandas/pandas/core/groupby/ops.py", line 847, in agg_series
    result = self._aggregate_series_pure_python(obj, func)
  File "/home/richard/dev/pandas/pandas/core/groupby/ops.py", line 868, in _aggregate_series_pure_python
    res = func(group)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1971, in <lambda>
    alt=lambda x: Series(x).mean(numeric_only=numeric_only),
  File "/home/richard/dev/pandas/pandas/core/series.py", line 6030, in mean
    return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  File "/home/richard/dev/pandas/pandas/core/generic.py", line 11510, in mean
    return self._stat_function(
  File "/home/richard/dev/pandas/pandas/core/generic.py", line 11467, in _stat_function
    return self._reduce(
  File "/home/richard/dev/pandas/pandas/core/series.py", line 5938, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 147, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 402, in new_func
    result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 718, in nanmean
    the_sum = _ensure_numeric(the_sum)
  File "/home/richard/dev/pandas/pandas/core/nanops.py", line 1687, in _ensure_numeric
    raise TypeError(f"Could not convert string '{x}' to numeric")
TypeError: Could not convert string 'aaaa' to numeric

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/richard/.config/JetBrains/PyCharmCE2022.2/scratches/scratch.py", line 46, in <module>
    df.groupby("a").mean(numeric_only=False)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1969, in mean
    result = self._cython_agg_general(
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1649, in _cython_agg_general
    new_mgr = data.grouped_reduce(array_func)
  File "/home/richard/dev/pandas/pandas/core/internals/managers.py", line 1511, in grouped_reduce
    applied = sb.apply(func)
  File "/home/richard/dev/pandas/pandas/core/internals/blocks.py", line 330, in apply
    result = func(self.values, **kwargs)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1646, in array_func
    result = self._agg_py_fallback(how, values, ndim=data.ndim, alt=alt)
  File "/home/richard/dev/pandas/pandas/core/groupby/groupby.py", line 1601, in _agg_py_fallback
    raise NotImplementedError(msg) from err
NotImplementedError: function is not implemented for [how->mean,dtype->object]

Process finished with exit code 1

The negatives in my opinion are that it's back to being rather long, and it seems to me the user could think it means the function is not implemented for object dtype (which can be incorrect). However, it specifies the method of aggregation, the dtype, and the value on which it failed which can help users understand where it's coming from.

phofl · 2023-05-17T22:17:23Z

I'd prefer that one. In general, I am not a fan that this works sometimes for object, but this is a different discussion. This message is much clearer based on what I'd expect

…pby_mean_stack � Conflicts: � doc/source/whatsnew/v2.1.0.rst

…pby_mean_stack

rhshadrach · 2023-05-27T13:03:58Z

@phofl - I reworked the raising here. We now raise the same type of exception as main; I think this is better than raising a NotImplementedError in general, and has the added benefit of being less likely to break user code.

rhshadrach · 2023-06-15T21:00:03Z

@phofl - friendly ping.

phofl

lgtm

Sorry I was travelling the last 3 weeks

phofl · 2023-06-16T09:27:32Z

pandas/tests/groupby/test_groupby.py

-            msg = r"Cannot convert \['one' 'three' 'two'\] to numeric"
+        if agg_function in ("std", "sem"):
+            klass = ValueError
+            msg = "could not convert string to float: 'one'"


Is the difference intended here?

Previously we checked multiple possibilities for the message, of which this was one. By separating out the ValueError from the TypeError cases, we only have to check the message associated with the ValuError.

At least, I think that's what you're asking about.

Sorry, I meant that std and see have different errors

Ah, I see! I do think there is something to improve here. If the algorithm actually tries to convert the value to a numeric one and fails, that to me is a ValueError. If, on the other hand, it sees that the object is a string and then doesn't even make an attempt to convert it, that would be a TypeError.

It would be good if we made this consistent by either (a) always trying to convert or (b) never trying to convert. However, even without this consistency, currently mean et al don't try to make the conversion and still raise "Could not convert":

df = DataFrame({'a': [1, 1, 2], 'b': list('456')}) gb = df.groupby('a') gb.mean() # TypeError: Could not convert string '45' to numeric

This should be improved, but separate from this PR.

phofl · 2023-06-19T20:47:16Z

thx @rhshadrach

IcedCereal · 2024-07-30T09:12:01Z

How would you be able to solve this?

import pandas as pd
df = pd.read_csv('salaries_by_college_major.csv')
df.head()
df.shape
df['Starting Median Salary']
df.loc[43]
print(df['Mid-Career Median Salary'].max())
print(f"Index for the max mid career salary: {df['Mid-Career Median Salary'].idxmax()}")
df['Undergraduate Major'][8]
print(df["Starting Median Salary"].min())
df['Undergraduate Major'].loc[df['Starting Median Salary'].idxmin()]
df.loc[df['Mid-Career Median Salary'].idxmin()]
df['Mid-Career 90th Percentile Salary'] - df['Mid-Career 10th Percentile Salary']
spread_col = df['Mid-Career 90th Percentile Salary'] - df['Mid-Career 10th Percentile Salary']
df.insert(1, 'Spread', spread_col)
df.head()
low_risk = df.sort_values('Spread')
low_risk[['Undergraduate Major', 'Spread']].head()
highest_potential = df.sort_values('Mid-Career 90th Percentile Salary', ascending=False)
highest_potential[['Undergraduate Major', 'Mid-Career 90th Percentile Salary']].head()

highest_spread = df.sort_values('Spread', ascending=False)
highest_spread[['Undergraduate Major', 'Spread']].head()
df.groupby('Group').mean()
pd.options.display.float_format = '{:,.2f}'.format

ERR: TypeError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/groupby/groupby.py in _agg_py_fallback(self, how, values, ndim, alt)
1873 try:
-> 1874 res_values = self.grouper.agg_series(ser, alt, preserve_dtype=True)
1875 except Exception as err:

15 frames
TypeError: Cannot convert ['Accounting' 'Agriculture' 'Architecture' 'Business Management'
'Construction' 'Economics' 'Finance' 'Forestry'
'Health Care Administration' 'Hospitality & Tourism' 'Marketing'
'Nursing'] to numeric

The above exception was the direct cause of the following exception:

TypeError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/groupby/groupby.py in _agg_py_fallback(self, how, values, ndim, alt)
1876 msg = f"agg function failed [how->{how},dtype->{ser.dtype}]"
1877 # preserve the kind of exception that raised
-> 1878 raise type(err)(msg) from err
1879
1880 if ser.dtype == object:

TypeError: agg function failed [how->median,dtype->object]

CLN: Shorten traceback when Python fallback fails in groupby _cython_…

655cd67

…agg_general

rhshadrach requested a review from phofl April 29, 2023 00:11

rhshadrach added Groupby Error Reporting Incorrect or improved errors from pandas Clean and removed Clean labels Apr 29, 2023

rhshadrach changed the title ~~CLN: Shorten traceback in groupby _cython_agg_general~~ ERR: Shorten traceback in groupby _cython_agg_general Apr 29, 2023

rhshadrach added this to the 2.1 milestone Apr 29, 2023

remove comment

5c9c2c7

Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…

8833e05

…pby_mean_stack

Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…

869ee75

…pby_mean_stack

rhshadrach added 3 commits May 17, 2023 18:53

Add information to exception

4deac82

Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…

1e5ab87

…pby_mean_stack � Conflicts: � doc/source/whatsnew/v2.1.0.rst

Rework

252b0ca

rhshadrach marked this pull request as draft May 24, 2023 11:14

rhshadrach added 2 commits May 27, 2023 09:01

cleanups

572b59f

Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…

146f9cb

…pby_mean_stack

rhshadrach marked this pull request as ready for review May 27, 2023 13:01

rhshadrach and others added 3 commits May 27, 2023 11:48

Different message on Python 3.9

a093278

Merge branch 'main' into groupby_mean_stack

afcd759

Merge branch 'main' into groupby_mean_stack

d6510bd

phofl approved these changes Jun 16, 2023

View reviewed changes

phofl approved these changes Jun 19, 2023

View reviewed changes

phofl merged commit d06f2d3 into pandas-dev:main Jun 19, 2023

rhshadrach deleted the groupby_mean_stack branch June 19, 2023 20:47

Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023

ERR: Shorten traceback in groupby _cython_agg_general (pandas-dev#52992)

06fc003

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ERR: Shorten traceback in groupby _cython_agg_general #52992

ERR: Shorten traceback in groupby _cython_agg_general #52992

rhshadrach commented Apr 29, 2023

phofl commented Apr 30, 2023

rhshadrach commented Apr 30, 2023 •

edited

Loading

phofl commented May 1, 2023

rhshadrach commented May 2, 2023 •

edited

Loading

phofl commented May 2, 2023

rhshadrach commented May 2, 2023 •

edited

Loading

rhshadrach commented May 17, 2023

phofl commented May 17, 2023

rhshadrach commented May 27, 2023

rhshadrach commented Jun 15, 2023

phofl left a comment

phofl Jun 16, 2023

rhshadrach Jun 19, 2023

phofl Jun 19, 2023

rhshadrach Jun 19, 2023 •

edited

Loading

phofl commented Jun 19, 2023

IcedCereal commented Jul 30, 2024

ERR: Shorten traceback in groupby _cython_agg_general #52992

ERR: Shorten traceback in groupby _cython_agg_general #52992

Conversation

rhshadrach commented Apr 29, 2023

phofl commented Apr 30, 2023

rhshadrach commented Apr 30, 2023 • edited Loading

phofl commented May 1, 2023

rhshadrach commented May 2, 2023 • edited Loading

phofl commented May 2, 2023

rhshadrach commented May 2, 2023 • edited Loading

rhshadrach commented May 17, 2023

phofl commented May 17, 2023

rhshadrach commented May 27, 2023

rhshadrach commented Jun 15, 2023

phofl left a comment

Choose a reason for hiding this comment

phofl Jun 16, 2023

Choose a reason for hiding this comment

rhshadrach Jun 19, 2023

Choose a reason for hiding this comment

phofl Jun 19, 2023

Choose a reason for hiding this comment

rhshadrach Jun 19, 2023 • edited Loading

Choose a reason for hiding this comment

phofl commented Jun 19, 2023

IcedCereal commented Jul 30, 2024

rhshadrach commented Apr 30, 2023 •

edited

Loading

rhshadrach commented May 2, 2023 •

edited

Loading

rhshadrach commented May 2, 2023 •

edited

Loading

rhshadrach Jun 19, 2023 •

edited

Loading