Dataframe.mean with numeric_only=False results in error with strings #26927

smangham · 2019-06-18T15:28:57Z

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({
    'A': [0, 1, 2], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6]
})
df.mean(axis=0, numeric_only=False, skipna=False)

Problem description

Rather than outputting NaN for a non-numeric column of strings when taking the mean, it throws TypeError: could not convert string 'abc' to float.

Expected Output

I would expect this to output a series with values [1, NaN, 5].

My work-around is currently df.apply(pd.to_numeric, args=['coerce']).mean(axis=0, skipna=False), which outputs the expected result.
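A minimal sketch of the work-around above, spelled with the `errors` keyword of `pd.to_numeric` instead of a positional `args` list. The variable names `coerced` and `means` are only illustrative, and the behaviour is assumed for reasonably recent pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [0, 1, 2], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6]
})

# Convert every column to numeric; values that cannot be parsed become NaN.
coerced = df.apply(pd.to_numeric, errors='coerce')

# With skipna=False, a column that contains NaN yields NaN as its mean.
means = coerced.mean(axis=0, skipna=False)
print(means)
```

The string column `B` is coerced to all-NaN, so its mean is NaN while `A` and `C` keep their ordinary means.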

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-51-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.16.0
scipy: 1.2.1
pyarrow: 0.12.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.2.16
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


TomAugspurger · 2019-06-18T15:59:19Z

I'm not sure I agree with NaN being the expected output for the mean of string columns.

smangham · 2019-06-18T16:27:22Z

Fair enough, though it seems to me within the context of numeric_only=False it would be a reasonable response, as the average of a non-numeric column is not (usually) a number.

I was expecting it to mostly be useful for making sure the output series length matched the column index length, taking non-numeric columns into account.

Is it the case that numeric_only is instead really intended for custom types where you've defined addition and float conversion yourself?

TomAugspurger · 2019-06-18T16:31:16Z

> Fair enough, though it seems to me within the context of numeric_only=False it would be a reasonable response, as the average of a non-numeric column is not (usually) a number.

I interpret numeric_only as an option to exclude non-numeric columns, so that the aggfunc isn't called on them. I don't think it would be appropriate for `np.mean(string_series)`, called via `DataFrame.mean(numeric_only=False)`, to differ. IMO an exception is appropriate for both.
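The interpretation above can be illustrated with a short sketch. It assumes a recent pandas version, where the string column raises `TypeError`; the exact message text varies between releases, so it is printed but not relied on. The names `numeric_means` and `raised` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0, 1, 2], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6]
})

# numeric_only=True excludes the string column before the reduction runs.
numeric_means = df.mean(numeric_only=True)
print(list(numeric_means.index))  # ['A', 'C']

# numeric_only=False hands every column to the reduction,
# so the string column causes an exception.
try:
    df.mean(numeric_only=False)
    raised = False
except TypeError as err:
    raised = True
    print(f"TypeError: {err}")
```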

> I was expecting it to mostly be useful for making sure the output series length matched the column index length, taking non-numeric columns into account.

I'd recommend reindexing afterwards.
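One possible reading of the reindexing suggestion, as a sketch: compute the mean over numeric columns only, then reindex against the original columns so that excluded columns reappear as NaN. The variable name `aligned` is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [0, 1, 2], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6]
})

# Reduce over the numeric columns, then align the result with all columns.
# Columns that were excluded from the reduction are filled with NaN.
aligned = df.mean(numeric_only=True).reindex(df.columns)
print(aligned)
```

The result has one entry per original column, which addresses the length-matching concern without asking `mean` to handle strings.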

> Is it the case that numeric_only is instead really intended for custom types where you've defined addition and float conversion yourself?

It's a bit hazy, but something like that. Extension arrays, for example, indicate whether they should be considered numeric.


smangham · 2019-06-25T17:39:18Z

Would elaborating on the documentation to make the behaviour of numeric_only=False more explicit (i.e. non-numeric dtypes that do not support both the + and / operators will result in an exception) be a reasonable response?

jreback · 2019-06-25T17:41:15Z

yes and/or doc-string additions

sergeny · 2019-08-09T00:33:41Z

We have found a much bigger problem:

pd.Series(['1', '2', '3', '4']).mean() returns 308.5 instead of 2.5 (pandas 0.23.4).
This is because it divides 1234 by 4.
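The arithmetic described above can be reproduced in plain Python. This is only a sketch of the reported effect (string "addition" is concatenation), not pandas' actual code path; the variable names are illustrative.

```python
values = ['1', '2', '3', '4']

# Summing strings with + concatenates them instead of adding numbers.
concatenated = ''.join(values)            # '1234'

# Converting the concatenated string to float and dividing by the count
# gives the surprising result from the report.
buggy_mean = float(concatenated) / len(values)

# Converting each element first gives the intended mean.
correct_mean = sum(float(v) for v in values) / len(values)

print(buggy_mean, correct_mean)  # 308.5 2.5
```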

TomAugspurger · 2019-08-13T20:22:22Z

@sergeny that sounds like a different issue.

jsignell · 2021-06-24T14:03:39Z

It seems to me that a good behavior would be to just include a bit more information on the error. Maybe something like:

TypeError: Could not convert ['abc'] to numeric. Select only valid columns before calling the reduction or drop nuisance columns with 'numeric_only=True'.
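A user-side sketch of that idea, assuming a hypothetical helper named `mean_with_hint` (not part of pandas) that wraps `DataFrame.mean` and re-raises with a more informative message. It lists the non-numeric column names rather than the offending values.

```python
import pandas as pd


def mean_with_hint(df, **kwargs):
    """Hypothetical wrapper: DataFrame.mean with a more helpful TypeError."""
    try:
        return df.mean(numeric_only=False, **kwargs)
    except TypeError as err:
        # Collect the columns that are not of a numeric dtype.
        non_numeric = df.select_dtypes(exclude='number').columns.tolist()
        raise TypeError(
            f"Could not convert columns {non_numeric} to numeric. "
            "Select only valid columns before calling the reduction or "
            "drop nuisance columns with 'numeric_only=True'."
        ) from err


df = pd.DataFrame({
    'A': [0, 1, 2], 'B': ['a', 'b', 'c'], 'C': [4, 5, 6]
})

try:
    mean_with_hint(df)
    message = None
except TypeError as err:
    message = str(err)

print(message)
```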

ES208 · 2022-08-14T14:02:09Z

Change the axis from 0 to 1 and you will get some output.

However, I ran the code below:
import numpy as np
import pandas as pd
a = [1, "Name", np.nan]
b = [3, 75, 0]
c = [6, 80, 90]
df = pd.DataFrame({'A': a, 'B': b, 'C': c})
df["row_mean"] = df.mean(axis=1, numeric_only=True)

Output:

A | B | C | row_mean
1 | 3 | 6 | 4.5 --> expected: 3.3333 (column A ignored)
Name | 75 | 80 | 77.5 --> correct
NaN | 0 | 90 | 45.0 --> correct

In the above case, the computation ignored column A completely, because the string "Name" makes the whole column object dtype.
So be careful when using numeric_only=True on your DataFrame.
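A sketch of one way to get the row means that comment expected, assuming it is acceptable to treat unparseable entries in column A as missing. Coercing A with `pd.to_numeric` turns it into a numeric column, so it is no longer dropped; the names `coerced` and `row_mean` are only illustrative.

```python
import numpy as np
import pandas as pd

a = [1, "Name", np.nan]
b = [3, 75, 0]
c = [6, 80, 90]
df = pd.DataFrame({'A': a, 'B': b, 'C': c})

# Column A mixes numbers and a string, so pandas stores it as object dtype
# and numeric_only=True would skip the whole column.
print(df.dtypes)

# Coerce A to numeric: "Name" becomes NaN, the other entries stay numbers.
coerced = df.assign(A=pd.to_numeric(df['A'], errors='coerce'))

# Row-wise mean; NaN entries are skipped by default (skipna=True).
row_mean = coerced.mean(axis=1)
print(row_mean)
```

The first row now averages all three values, while the second and third rows still average only B and C.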

miraculixx · 2023-10-25T18:41:06Z

this is now the default behavior since #49915. I'm not sure it should be since mean() fails on non-numeric values.

rhshadrach · 2023-10-26T00:42:20Z

@miraculixx - I'm not sure what you mean by "this is now the default behavior". Can you clarify?

phofl · 2024-03-18T02:12:48Z

Yeah, let's close this. mean shouldn't work on strings and definitely shouldn't return NaN for them; it should fail loudly.

TomAugspurger added the Docs and Dtype Conversions (unexpected or buggy dtype conversions) labels Aug 13, 2019

TomAugspurger added this to the Contributions Welcome milestone Aug 13, 2019

MJafarMashhadi mentioned this issue Aug 2, 2020

BUG: Unexpected/undocumented behaviour of sum and mean aggregations on object dtypes #35512

Closed

3 tasks

jbrockmendel added the Reduction Operations (sum, mean, min, max, etc.) and Nuisance Columns (identifying/dropping nuisance columns in reductions, groupby.add, DataFrame.apply) labels Sep 21, 2020

kinshukdua mentioned this issue Oct 21, 2021

BUG: make mean() raise an exception for strings #44131

Closed

9 tasks

jreback modified the milestones: Contributions Welcome, 1.4 Nov 13, 2021

jreback modified the milestones: 1.4, Contributions Welcome Dec 23, 2021

hammerdirt-analyst mentioned this issue Mar 5, 2022

Germination.ipynb - 3rd code block thorerismann/flora-buckets#40

Closed

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

rhshadrach added the Closing Candidate May be closeable, needs more eyeballs label Oct 26, 2023

phofl closed this as completed Mar 18, 2024
