numeric_only inconsistency with pandas Series #10480

Wilfred · 2015-07-01T09:32:27Z

In [1]: import pandas as pd

In [2]: pd.Series([1,2,3]).sum(numeric_only=False)
Out[2]: 6

In [3]: pd.Series([1,2,3]).sum(numeric_only=True)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-3-2c46bd289e26> in <module>()
----> 1 pd.Series([1,2,3]).sum(numeric_only=True)

/users/is/whughes/pyenvs/research/lib/python2.7/site-packages/pandas-0.16.2_ahl1-py2.7-linux-x86_64.egg/pandas/core/generic.pyc in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
   4253                                               skipna=skipna)
   4254                 return self._reduce(f, name, axis=axis,
-> 4255                                     skipna=skipna, numeric_only=numeric_only)
   4256             stat_func.__name__ = name
   4257             return stat_func

/users/is/whughes/pyenvs/research/lib/python2.7/site-packages/pandas-0.16.2_ahl1-py2.7-linux-x86_64.egg/pandas/core/series.pyc in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   2081             if numeric_only:
   2082                 raise NotImplementedError(
-> 2083                     'Series.{0} does not implement numeric_only.'.format(name))
   2084             return op(delegate, skipna=skipna, **kwds)
   2085 

NotImplementedError: Series.sum does not implement numeric_only.

Wilfred · 2015-07-01T09:36:55Z

The docstring suggests this is a legitimate argument:

Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA
level : int or level name, default None
        If the axis is a MultiIndex (hierarchical), count along a
        particular level, collapsing into a scalar
numeric_only : boolean, default None
    Include only float, int, boolean data. If None, will attempt to use
    everything, then use only numeric data

Returns
-------
sum : scalar or Series (if level specified)

However, strangely, there's an explicit test that this throws an exception: https://github.com/pydata/pandas/blob/054821dc90ded4263edf7c8d5b333c1d65ff53a4/pandas/tests/test_series.py#L2724

jreback · 2015-07-01T10:38:01Z

This is just for compat, as it's a general parameter that matters for DataFrames (and the function is auto-generated). If you can find a way to not expose it without jumping through hoops, that would be OK.

Wilfred · 2015-07-01T12:48:57Z

OK, so numeric_only is accepted by Series.sum simply for compatibility with DataFrame.sum. You're proposing we find a way to hide this specific parameter in the docstring.

Have I understood correctly?

jreback · 2015-07-01T13:35:12Z

Ok

smlewis · 2016-09-27T20:28:39Z

I'll freely admit I'm a pandas novice, but I ran headlong into what I think was this bug just now. I wanted numeric_only with Series.mean rather than sum; I assume that falls under this issue as well. The documentation says this option exists but the code says it doesn't. pandas version 0.18.1, documentation from a matching-version manual (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html) (although obviously that link may age out).

chris-b1 · 2016-09-27T20:46:33Z

@smlewis - can you show an example of some data where you needed this and what you expected to happen? Note that the implemented use case is for selecting numeric columns, like

df = pd.DataFrame({'a': [2,3,4], 'b': pd.timedelta_range('1s', periods=3)})

df
Out[63]: 
   a               b
0  2 0 days 00:00:01
1  3 1 days 00:00:01
2  4 2 days 00:00:01

df.mean()
Out[65]: 
a                  3
b    1 days 00:00:01
dtype: object

df.mean(numeric_only=True)
Out[64]: 
a    3.0
dtype: float64

smlewis · 2016-09-27T21:23:58Z

The input file for my dataframe was constructed in a stupid way (by me...): several similar data sources were concatenated so I could process their averages all at once instead of running the script N times. The concatenation meant that each group had its header repeated (except the first, which I'd edited manually to properly name the column; that column was a mangling of the source filename inserted at concatenation time). So you get a data set like this:

source  score 
alpha   2 
alpha   3 
alpha   2 
beta    score 
beta    9 
beta    8 
beta    7 
gamma   score 
gamma   4 
gamma   4 
gamma   1

This snippet:

import pandas as pd

all_scores = pd.read_csv("scores_for_averaging.csv", delim_whitespace=True)

experiments = all_scores['source'].unique()

for each in experiments:
    exp_slice = all_scores.loc[all_scores['source'] == each]
    #print each, exp_slice['score'].mean(numeric_only=True) #fails: NotImplementedError: Series.mean does not implement numeric_only.
    #print each, exp_slice['score'].mean() #fails: TypeError: Could not convert score987 to numeric

failed because mean() couldn't accept numeric_only to throw out the spurious extra header line for beta, gamma, etc. I just reprocessed my input to not have the header line repeated and then it worked fine. I guess the problem is that the documentation and the code don't match?

chris-b1 · 2016-09-27T21:30:19Z

Thanks, just curious what the expected use was. Yes, the documentation/method should be updated to match, just tricky to actually do in this case (PR welcome!).

FYI, for a conversion like this (assuming you actually do have a valid mixed type object array), the function you likely want is to_numeric

pd.to_numeric(exp_slice['score'], errors='coerce').mean()
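A minimal sketch of that suggestion applied to the data layout described earlier in the thread. The inline sample text, the `score_num` column name, and the use of `groupby` instead of the explicit loop are assumptions made for illustration, not part of the original report:

```python
import io

import pandas as pd

# Hypothetical inline copy of the concatenated file, including the
# repeated "score" header rows for beta and gamma.
raw = """source score
alpha 2
alpha 3
alpha 2
beta score
beta 9
beta 8
beta 7
gamma score
gamma 4
gamma 4
gamma 1
"""

all_scores = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# The stray header rows keep the column from being parsed as numbers;
# errors="coerce" turns anything unparseable into NaN.
all_scores["score_num"] = pd.to_numeric(all_scores["score"], errors="coerce")

# mean() skips NaN by default, so the repeated headers drop out.
means = all_scores.groupby("source")["score_num"].mean()
print(means)
```

Coercing once up front keeps the per-group code free of any special handling for the spurious rows.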

jreback · 2016-09-27T23:58:52Z

I suppose this could be better documented, but the arg is there for consistency with DataFrame. It really doesn't do anything, as a Series is a single-dtyped object. Either you get all elements or none (even if mixed). We don't deeply introspect mixed (or object) things.
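The point about dtypes can be illustrated with a short sketch. The sample values and variable names below are made up for illustration; the calls used (`Series.dtype`, `pd.api.types.is_numeric_dtype`, `DataFrame.select_dtypes`) are standard pandas APIs:

```python
import pandas as pd

# A Series has exactly one dtype. Mixing numbers and strings gives a
# single non-numeric dtype for the whole Series, not per-element types.
mixed = pd.Series([1, 2.2, True, "ignore_me"])
mixed_is_numeric = pd.api.types.is_numeric_dtype(mixed.dtype)
print(mixed.dtype, mixed_is_numeric)

# In a DataFrame each column has its own dtype, so "numeric only"
# has a natural meaning: keep the columns whose dtype is numeric.
df = pd.DataFrame({"a": [2, 3, 4], "b": ["x", "y", "z"]})
numeric_cols = list(df.select_dtypes(include="number").columns)
print(numeric_cols)
```

This is why a column-level filter is cheap for a DataFrame, while the same request on a Series would require inspecting every element.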

smlewis · 2016-09-28T14:41:17Z

Thank you!

sergei-bondarenko · 2018-01-01T12:16:33Z

@jreback Why did you close this pull request? This is still not in documentation.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html

smcinerney · 2022-06-27T20:20:39Z

Honestly, I don't understand why .sum(numeric_only=True) on a Series of mixed types is hard. This looks like an outright functional bug and also (less important) several doc bugs, since the note that this still doesn't work on Series is extremely well hidden.

Quoting jreback: "It really doesn't do anything as a Series is a single dtyped object... We don't deeply introspect mixed (or object) things."

>>> pd.Series([1, 2.2, True, 'ignore_me']).sum(numeric_only=True)

NotImplementedError: Series.sum does not implement numeric_only.

1b) We can't work around it by coercing to float/int with astype and ignoring errors:

>>> pd.Series([1, 2.2, True, 'ignore_me']).astype('float', 'ignore')
ValueError: could not convert string to float: 'ignore_me'

1c) ...astype() doesn't have errors='coerce' or downcast, like pd.to_numeric() does:

WORKAROUND:

pd.to_numeric(pd.Series([1, 2.2, True, 'ignore_me']), errors='coerce').sum()
4.2
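For comparison, here is a rough sketch of what element-wise filtering might look like. The helper `sum_numeric_elements` is hypothetical and is not a pandas API; it is only meant to show how such filtering differs from the `pd.to_numeric` workaround, which also parses numeric-looking strings:

```python
import numbers

import pandas as pd


def sum_numeric_elements(series):
    """Hypothetical helper: sum only the elements that are numbers.

    bool counts as a number here because it is a subclass of int.
    """
    is_number = series.map(lambda v: isinstance(v, numbers.Number))
    return series[is_number].astype(float).sum()


# The extra "5" string is an assumption added to show the difference.
mixed = pd.Series([1, 2.2, True, "ignore_me", "5"])

# Keeps 1, 2.2 and True; drops both strings.
total = sum_numeric_elements(mixed)

# Coerces "5" to a number as well, and turns "ignore_me" into NaN.
coerced_total = pd.to_numeric(mixed, errors="coerce").sum()

print(total, coerced_total)
```

Which behavior is wanted depends on whether numeric-looking strings should count as data, which is part of why "numeric only" is ambiguous for an object Series.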

jreback added API Design Compat pandas objects compatability with Numpy or Python functions labels Jul 1, 2015

jreback changed the title ~~numeric_only inconsistency with pandas Series~~ numeric_only inconsistency with pandas Series Jul 1, 2015

jreback added the Docs label Sep 27, 2016

chris-b1 mentioned this issue Sep 28, 2016: DOC: expand doc for numeric_only #14309 (closed, 4 tasks)

jreback added this to the 0.19.0 milestone Sep 28, 2016

jreback closed this as completed in c084bc1 Sep 28, 2016
