Dataframe.mean with numeric_only=False results in error with strings #26927
Comments
I'm not sure I agree with NaN being the expected output for the mean of string columns. |
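Fair enough, though it seems to me within the context of numeric_only=False it would be a reasonable response, as the average of a non-numeric column is not (usually) a number. I was expecting it to mostly be useful for making sure the output series length matched the column index length, taking non-numeric columns into account. Is it the case that numeric_only is instead really intended for custom types where you've defined addition and float conversion yourself? |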
On Tue, Jun 18, 2019 at 11:27 AM Sam Mangham ***@***.***> wrote:

> Fair enough, though it seems to me within the context of
> numeric_only=False it would be a reasonable response, as the average of a
> non-numeric column is not (usually) a number.

I interpret numeric_only as an option to exclude non-numeric columns, so
that the aggfunc isn't called on them. I don't think it would be
appropriate for `np.mean(string_series)`,
called via `DataFrame.mean(numeric_only=False)`, to differ. IMO an
exception is appropriate for both.

> I was expecting it to mostly be useful for making sure the output series
> length matched the column index length, taking non-numeric columns into
> account.

I'd recommend reindexing afterwards.

> Is it the case that numeric_only is instead really intended for custom
> types where you've defined addition and float conversion yourself?

It's a bit hazy, but something like that. Extension arrays, for example,
indicate whether they should be considered numeric.
|
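The reindexing suggestion in the reply above can be sketched as follows. The frame here is hypothetical (the original code sample did not survive in this thread), so the column names and values are assumptions chosen to match the expected output `[1, NaN, 5]`.

```python
import pandas as pd

# Hypothetical frame; the issue's original code sample is not shown here.
df = pd.DataFrame({"A": [1, 1, 1], "B": ["abc", "def", "ghi"], "C": [5, 5, 5]})

# Reduce only the numeric columns, so the mean is never computed on strings.
means = df.mean(numeric_only=True)

# Reindex against the original columns: excluded columns come back as NaN,
# so the result has one entry per column of df.
means = means.reindex(df.columns)
print(means)
```

With this approach the result length matches the column index length, while the reduction itself still only ever sees numeric data.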
Would elaborating on the documentation to make the behaviour of |
yes and/or doc-string additions |
We have found a much bigger problem:
|
@sergeny that sounds like a different issue. |
It seems to me that a good behavior would be to just include a bit more information on the error. Maybe something like: TypeError: Could not convert ['abc'] to numeric. Select only valid columns before calling the reduction or drop nuisance columns with 'numeric_only=True'. |
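As a rough illustration of "select only valid columns before calling the reduction", one option is `DataFrame.select_dtypes`. This is a sketch on a hypothetical frame (the names and values are assumptions, not taken from the issue), not necessarily what the maintainers had in mind.

```python
import pandas as pd

# Hypothetical frame with one string column between two numeric ones.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["abc", "def", "ghi"], "C": [4, 5, 6]})

# Keep only numeric columns before reducing, so no string ever reaches mean().
numeric = df.select_dtypes(include="number")

col_means = numeric.mean()        # one value per numeric column
row_means = numeric.mean(axis=1)  # one value per row, over numeric columns only
print(col_means)
print(row_means)
```

Selecting columns explicitly makes it visible which columns take part in the reduction, for both `axis=0` and `axis=1`.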
Change the axis from 0 to 1 and you will get some output. However, I ran the code below: output: A | B | C | row_mean. In the above case, the computation ignored column A completely. |
This is now the default behavior since #49915. I'm not sure it should be, since mean() fails on non-numeric values. |
@miraculixx - I'm not sure what you mean by "this is now the default behavior". Can you clarify? |
Yeah, let's close this. mean shouldn't work on strings and definitely shouldn't return NaN for them; it should rather fail loudly. |
Code Sample, a copy-pastable example if possible
Problem description
Instead of outputting a NaN for a non-numeric column of strings when trying to take the mean, it instead throws TypeError: could not convert string 'abc' to float.
Expected Output
I would expect this to output a series with values [1, NaN, 5].
My work-around is currently df.apply(pd.to_numeric, args=['coerce']).mean(axis=0, skipna=False), which outputs the expected result.
Output of pd.show_versions()
pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.16.0
scipy: 1.2.1
pyarrow: 0.12.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.2.16
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
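For reference, the work-around from the problem description can be sketched on a small frame. The frame is hypothetical, since the original code sample is not shown, and the sketch passes `errors="coerce"` by keyword, which should be equivalent to the positional `args=['coerce']` used in the issue.

```python
import pandas as pd

# Hypothetical frame; values are assumptions chosen to give [1, NaN, 5].
df = pd.DataFrame({"A": [1, 1, 1], "B": ["abc", "def", "ghi"], "C": [5, 5, 5]})

# Coerce every column to numeric; strings that cannot be parsed become NaN.
coerced = df.apply(pd.to_numeric, errors="coerce")

# skipna=False keeps NaN in the result for the all-NaN (originally string) column.
result = coerced.mean(axis=0, skipna=False)
print(result)
```

Note that coercion silently turns every unparseable value into NaN, so a mostly numeric column with one stray string would also report NaN under `skipna=False`.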