numeric_only inconsistency with pandas Series #10480

Closed
Wilfred opened this issue Jul 1, 2015 · 12 comments
Labels
API Design Compat pandas objects compatability with Numpy or Python functions Docs

@Wilfred
Contributor

Wilfred commented Jul 1, 2015

In [1]: import pandas as pd

In [2]: pd.Series([1,2,3]).sum(numeric_only=False)
Out[2]: 6

In [3]: pd.Series([1,2,3]).sum(numeric_only=True)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-3-2c46bd289e26> in <module>()
----> 1 pd.Series([1,2,3]).sum(numeric_only=True)

/users/is/whughes/pyenvs/research/lib/python2.7/site-packages/pandas-0.16.2_ahl1-py2.7-linux-x86_64.egg/pandas/core/generic.pyc in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
   4253                                               skipna=skipna)
   4254                 return self._reduce(f, name, axis=axis,
-> 4255                                     skipna=skipna, numeric_only=numeric_only)
   4256             stat_func.__name__ = name
   4257             return stat_func

/users/is/whughes/pyenvs/research/lib/python2.7/site-packages/pandas-0.16.2_ahl1-py2.7-linux-x86_64.egg/pandas/core/series.pyc in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   2081             if numeric_only:
   2082                 raise NotImplementedError(
-> 2083                     'Series.{0} does not implement numeric_only.'.format(name))
   2084             return op(delegate, skipna=skipna, **kwds)
   2085 

NotImplementedError: Series.sum does not implement numeric_only.
@Wilfred
Contributor Author

Wilfred commented Jul 1, 2015

The docstring suggests this is a legitimate argument:

Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA
level : int or level name, default None
        If the axis is a MultiIndex (hierarchical), count along a
        particular level, collapsing into a scalar
numeric_only : boolean, default None
    Include only float, int, boolean data. If None, will attempt to use
    everything, then use only numeric data

Returns
-------
sum : scalar or Series (if level specified)

However, strangely, there's an explicit test that this throws an exception: https://github.com/pydata/pandas/blob/054821dc90ded4263edf7c8d5b333c1d65ff53a4/pandas/tests/test_series.py#L2724

@jreback
Contributor

jreback commented Jul 1, 2015

This is just for compat, as it's a general parameter that matters for DataFrames (and the function is auto-generated). If you can find a way to not expose it without jumping through hoops, that would be OK.

@jreback added API Design Compat pandas objects compatability with Numpy or Python functions labels Jul 1, 2015
@jreback changed the title numeric_only inconsistency with pandas Series numeric_only inconsistency with pandas Series Jul 1, 2015
@Wilfred
Contributor Author

Wilfred commented Jul 1, 2015

OK, so numeric_only is accepted by Series.sum simply for compatibility with DataFrame.sum. You're proposing we find a way to hide this specific parameter in the docstring.

Have I understood correctly?

@jreback
Contributor

jreback commented Jul 1, 2015

Ok
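
The idea agreed on above (keep accepting numeric_only for compatibility, but stop advertising it in the Series docstring) could be sketched roughly as follows. This is not how pandas builds its docstrings; `make_stat_doc`, `_BASE_DOC` and `_NUMERIC_ONLY_DOC` are hypothetical names used only to illustrate a docstring template with an optional parameter section.

```python
# Hypothetical sketch: a shared docstring template where the
# numeric_only section is included only for objects that support it.
_BASE_DOC = """Return the {name} of the values for the requested axis

Parameters
----------
skipna : boolean, default True
    Exclude NA/null values.
{extra}
Returns
-------
{name} : scalar or Series
"""

_NUMERIC_ONLY_DOC = """numeric_only : boolean, default None
    Include only float, int, boolean data.
"""


def make_stat_doc(name, document_numeric_only):
    """Build a docstring, optionally documenting numeric_only (assumed helper)."""
    extra = _NUMERIC_ONLY_DOC if document_numeric_only else ""
    return _BASE_DOC.format(name=name, extra=extra)


# DataFrame-style docstring mentions numeric_only; Series-style one does not.
frame_doc = make_stat_doc("sum", document_numeric_only=True)
series_doc = make_stat_doc("sum", document_numeric_only=False)
print(series_doc)
```

With a template like this, the generated method could still accept the argument for signature compatibility while the Series documentation stays silent about it.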

@smlewis

smlewis commented Sep 27, 2016

I'll freely admit I'm a pandas novice, but I ran headlong into what I think was this bug just now. I wanted numeric_only with Series.mean rather than sum; I assume that falls under this issue as well. The documentation says this option exists but the code says it doesn't. pandas version 0.18.1, documentation from a matching-version manual (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html) (although obviously that link may age out).

@chris-b1
Contributor

@smlewis - can you show an example of some data where you needed this and what you expected to happen? Note that the implemented use case is for selecting numeric columns, like

df = pd.DataFrame({'a': [2,3,4], 'b': pd.timedelta_range('1s', periods=3)})

df
Out[63]: 
   a               b
0  2 0 days 00:00:01
1  3 1 days 00:00:01
2  4 2 days 00:00:01

df.mean()
Out[65]: 
a                  3
b    1 days 00:00:01
dtype: object

df.mean(numeric_only=True)
Out[64]: 
a    3.0
dtype: float64

@smlewis

smlewis commented Sep 27, 2016

The input file for my dataframe was constructed in a stupid way (by me...): several similar data sources were concatenated so I could process their averages all at once instead of running the script N times. The concatenation meant that each group had its header repeated (except the first, which I'd edited manually to properly name the column; that column was a mangling of the source filename inserted at concatenation time). So you get a data set like this:

source  score 
alpha   2 
alpha   3 
alpha   2 
beta    score 
beta    9 
beta    8 
beta    7 
gamma   score 
gamma   4 
gamma   4 
gamma   1 

This snippet:

import pandas as pd

all_scores = pd.read_csv("scores_for_averaging.csv", delim_whitespace=True)

experiments = all_scores['source'].unique()

for each in experiments:
    exp_slice = all_scores.loc[all_scores['source'] == each]
    #print each, exp_slice['score'].mean(numeric_only=True) #fails: NotImplementedError: Series.mean does not implement numeric_only.
    #print each, exp_slice['score'].mean() #fails: TypeError: Could not convert score987 to numeric

failed because mean() couldn't accept numeric_only to throw out the spurious extra header line for beta, gamma, etc. I just reprocessed my input to not have the header line repeated and then it worked fine. I guess the problem is that the documentation and the code don't match?

@chris-b1
Contributor

chris-b1 commented Sep 27, 2016

Thanks, just curious what the expected use was. Yes, the documentation/method should be updated to match, just tricky to actually do in this case (PR welcome!).

FYI, for a conversion like this (assuming you actually do have a valid mixed type object array), the function you likely want is to_numeric

pd.to_numeric(exp_slice['score'], errors='coerce').mean()
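
A self-contained sketch of that suggestion, applied to the sample data posted earlier in the thread. The data is inlined through `io.StringIO` instead of read from scores_for_averaging.csv, and the per-source loop is replaced with a `groupby`; both are choices made here for illustration, not part of the original script.

```python
import io

import pandas as pd

# Sample data from the thread, with the repeated "score" header rows.
RAW = """source score
alpha 2
alpha 3
alpha 2
beta score
beta 9
beta 8
beta 7
gamma score
gamma 4
gamma 4
gamma 1
"""

all_scores = pd.read_csv(io.StringIO(RAW), sep=r"\s+")

# The stray header rows make the column non-numeric, so coerce it:
# anything that cannot be parsed as a number becomes NaN.
numeric_scores = pd.to_numeric(all_scores["score"], errors="coerce")

# mean() skips NaN by default, so the spurious rows drop out.
means = numeric_scores.groupby(all_scores["source"]).mean()
print(means)
```

Note that `errors="coerce"` silently discards every unparseable value, so it is worth checking how many NaNs were produced before trusting the averages.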

@jreback added the Docs label Sep 27, 2016
@jreback
Contributor

jreback commented Sep 27, 2016

I suppose this could be better documented, but the arg is there for consistency with DataFrame. It really doesn't do anything as a Series is a single dtyped object. Either you get all elements or None (even if mixed). We don't deeply introspect mixed (or object) things.
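
To illustrate the distinction drawn above: a Series carries a single dtype, so a mixed Series is simply object dtype, whereas numeric_only on a DataFrame selects whole columns by their dtype. A minimal sketch; the sample values and variable names are made up here, and `select_dtypes` is shown only as a roughly comparable way to pick numeric columns.

```python
import pandas as pd

# A mixed Series has one dtype for all elements: object.
mixed = pd.Series([1, 2.2, True, "ignore_me"])
print(mixed.dtype)  # object

# On a DataFrame, numeric_only works per column, based on column dtype.
df = pd.DataFrame({"a": [2, 3, 4], "b": ["x", "y", "z"]})
numeric_cols = df.select_dtypes(include="number")

print(numeric_cols.mean())
print(df.mean(numeric_only=True))
```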

@jreback added this to the 0.19.0 milestone Sep 28, 2016
@smlewis

smlewis commented Sep 28, 2016

Thank you!

@sergei-bondarenko

@jreback Why did you close this pull request? This is still not in documentation.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html

@smcinerney

smcinerney commented Jun 27, 2022

  1. Honestly, I don't understand why .sum(numeric_only=True) on a Series of mixed types is hard. This seems like both an outright functional bug and (less importantly) several doc bugs, since the note that this still doesn't work on Series is extremely hidden.

Quoting @jreback: "It really doesn't do anything as a Series is a single dtyped object... We don't deeply introspect mixed (or object) things."

>>> pd.Series([1, 2.2, True, 'ignore_me']).sum(numeric_only=True)

NotImplementedError: Series.sum does not implement numeric_only.

1b) We can't work around this by coercing to float/int with astype and ignoring errors:

>>> pd.Series([1, 2.2, True, 'ignore_me']).astype('float', 'ignore')
ValueError: could not convert string to float: 'ignore_me'

1c) astype() has no errors='coerce' or downcast option, unlike pd.to_numeric():

WORKAROUND:

pd.to_numeric(pd.Series([1, 2.2, True, 'ignore_me']), errors='coerce').sum()
4.2
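
For comparison, an element-wise filter is another way to get the "ignore non-numeric values" behaviour asked for above. This is only a sketch: `sum_numeric_elements` is a hypothetical helper, not a pandas API, and it treats bool as numeric because Python's `bool` is a subclass of `int`.

```python
import numbers

import pandas as pd


def sum_numeric_elements(series):
    """Sum only the elements that are Python/NumPy numbers (assumed helper)."""
    # Build a boolean mask by inspecting each element individually.
    is_number = series.map(lambda v: isinstance(v, numbers.Number))
    # Cast the surviving values to float so the sum uses a numeric dtype.
    return series[is_number].astype(float).sum()


mixed = pd.Series([1, 2.2, True, "ignore_me"])
total = sum_numeric_elements(mixed)
print(total)
```

Unlike `pd.to_numeric(..., errors='coerce')`, this version does not parse numeric-looking strings such as "3": they are dropped rather than converted.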
