BUG: sum() and mean() return wrong values when applied to integers as strings #44008

Closed
3 tasks done
ogahide opened this issue Oct 13, 2021 · 5 comments · Fixed by #52281
Labels
API - Consistency (Internal Consistency of API/Behavior), Bug, Reduction Operations (sum, mean, min, max, etc.), Strings (String extension data type and string data)

Comments

@ogahide

ogahide commented Oct 13, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
ser = pd.Series(['1', '2'])
ser.sum() # this returns 12 instead of 3
ser.mean() # this returns 6.0 instead of 1.5

Issue Description

The sum() and mean() methods of Series and DataFrame return wrong values when applied to integers stored as strings. The failure is silent, which makes it dangerous: the user is left with plausible-looking but erroneous results.

Expected Behavior

>>> import pandas as pd
>>> ser = pd.Series(['1', '2'])
>>> ser.sum()
TypeError: ...

OR

>>> ser.sum()
3

Installed Versions

INSTALLED VERSIONS

commit : 73c6825
python : 3.8.10.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : ja_JP.UTF-8
LOCALE : ja_JP.UTF-8

pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.3
setuptools : 58.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.4.4
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.20
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

@ogahide added the Bug and Needs Triage labels Oct 13, 2021
@ogahide changed the title from "BUG: mean() returns a wrong value when applied to integers as strings" to "BUG: sum() and mean() return wrong values when applied to integers as strings" Oct 13, 2021
@CloseChoice
Member

CloseChoice commented Oct 13, 2021

Why is this unexpected? This is exactly the standard behaviour in Python:

'1' + '2'
Out[2]: '12'

Note that the output of ser.sum() is '12' not 12.

When you apply mean() to the Series, we first compute a sum and then divide it by the number of summed elements. Somewhere in between, a cast from string to number happens. That is potentially unexpected behaviour.

We might need to adapt the documentation, since it states:
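The sequence described in this comment (sum, cast, divide) can be reproduced in plain Python. The snippet below is only an emulation of those steps under that description, not pandas' actual implementation, and the variable names are made up for illustration.

```python
# Pure-Python emulation of the steps described above; this is NOT pandas' code.
values = ["1", "2"]

# Step 1: "summing" strings with + concatenates them.
total = values[0]
for v in values[1:]:
    total = total + v            # "1" + "2" -> "12"

# Step 2: the concatenated string is cast to a number.
as_number = float(total)         # 12.0

# Step 3: the number is divided by the element count.
emulated_mean = as_number / len(values)

print(repr(total), emulated_mean)  # '12' 6.0
```

This matches the values in the report: the sum is the string '12', and the mean is 12 / 2 = 6.0 rather than 1.5.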
This is equivalent to the method ``numpy.sum``.

Which is not correct, see:

np.sum(['1', '2'])
Traceback (most recent call last):
 ...
TypeError: cannot perform reduce with flexible type

Edit: Do we want to change the behaviour of Series.mean() and Series.sum() to throw an error in this case? I wouldn't vote for that, but the behaviour of Series.mean() is somewhat strange since it casts implicitly while Series.sum() doesn't.

@CloseChoice added the Docs and API - Consistency labels and removed the Needs Triage and Bug labels Oct 13, 2021
@phofl
Member

phofl commented Oct 13, 2021

sum is as expected, since it behaves the same for strings like "a" and "b", but mean is off. Concatenating the strings, then casting to int and dividing by the number of occurrences is definitely not what I would have expected since using characters raises

@ogahide
Author

ogahide commented Oct 13, 2021

Thank you for your comments.

The built-in function sum() and numpy.sum() raise an error when the arguments are strings, so I expected the same from pandas. The current behaviour may be acceptable, but it looks counterintuitive to me.

>>> sum(['1', '2'])
TypeError: unsupported operand type(s) for +: 'int' and 'str'
>>> import numpy as np
>>> arr = np.array(['1', '2'])
>>> np.sum(arr)  # arr.sum() returns the same error
TypeError: cannot perform reduce with flexible type

Another problem is that Pandas displays an integer and its string exactly the same (you cannot tell whether 123 is an integer or a string from its appearance) and potentially blinds the user to the pitfall.
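One way to guard against this pitfall, offered here as an illustration rather than as the maintainers' recommendation, is to check the dtype and convert explicitly before reducing. `pandas.api.types.is_numeric_dtype` and `pd.to_numeric` are existing pandas functions; the variable names are made up for the example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

ser = pd.Series(["1", "2"])

# Printing the Series shows 1 and 2 without quotes, but the data is not numeric.
looks_numeric = is_numeric_dtype(ser)    # False for string data
first_type = type(ser.iloc[0])           # <class 'str'>

# Convert explicitly; by default pd.to_numeric raises on values it cannot parse.
numeric = pd.to_numeric(ser)
total = numeric.sum()
average = numeric.mean()

print(looks_numeric, first_type.__name__, total, average)
```

After the explicit conversion, the sum is 3 and the mean is 1.5, which is what the reporter expected.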

@phofl added the Bug label and removed the Docs label Oct 13, 2021
@mroeschke added the Reduction Operations and Strings labels Oct 16, 2021
@shubham11941140
Contributor

To solve the issue, should I change the behaviour of pandas to match numpy's?

Or should the sum of strings result in concatenation, and the mean of strings raise an error, because dividing strings has no meaning?
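The second option in this comment can be sketched in plain Python. The helpers `concat_or_add` and `strict_mean` are hypothetical names invented for this illustration; they are not pandas functions and do not reflect how the fix was eventually implemented.

```python
from numbers import Number


def concat_or_add(values):
    """Sum-like reduction built on +, so strings are concatenated."""
    it = iter(values)
    total = next(it)
    for v in it:
        total = total + v
    return total


def strict_mean(values):
    """Mean that refuses non-numeric input instead of casting it."""
    values = list(values)
    if not values:
        raise ValueError("mean of an empty sequence is undefined")
    for v in values:
        if not isinstance(v, Number):
            raise TypeError(f"cannot take the mean of {type(v).__name__} values")
    return sum(values) / len(values)


print(concat_or_add(["1", "2"]))   # 12 (the string '12')
print(strict_mean([1, 2]))         # 1.5
```

Under this policy, `strict_mean(["1", "2"])` raises a TypeError instead of returning 6.0.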

@asishm
Contributor

asishm commented Oct 24, 2021

> sum is as expected, since it behaves the same for strings like "a" and "b" but mean is off.

I would say the sum behaviour should not be expected either.

  1. str.cat exists for precisely this
  2. the equivalent np.array(['a', 'b']).sum() fails
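Plain Python takes a similar stance to the two points above: the built-in sum() refuses strings and points to an explicit concatenation method instead. The snippet below shows this using only the standard library; the variable names are illustrative.

```python
digits = ["1", "2"]

default_start_error = None
str_start_error = None

# With the default start value 0, sum() attempts 0 + "1" and fails.
try:
    sum(digits)
except TypeError as exc:
    default_start_error = str(exc)

# Even with a string start value, sum() rejects strings and suggests str.join.
try:
    sum(digits, "")
except TypeError as exc:
    str_start_error = str(exc)

# Explicit concatenation is the supported route.
joined = "".join(digits)

print(default_start_error)
print(str_start_error)
print(joined)  # 12
```

Here `"".join` plays the role that str.cat plays in pandas: concatenation is available, but only when asked for explicitly.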

@jreback added this to the Contributions Welcome milestone Dec 23, 2021
@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
7 participants