
COMPAT: sum/prod on all nan will remain nan regardless of bottleneck install #17630

Merged (1 commit) on Oct 10, 2017

Conversation

@jreback (Contributor) commented Sep 22, 2017

xref #15507
closes #9422

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations labels Sep 22, 2017
@jreback jreback added this to the 0.21.0 milestone Sep 22, 2017
@codecov (bot) commented Sep 22, 2017

Codecov Report

Merging #17630 into master will decrease coverage by 0.02%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #17630      +/-   ##
==========================================
- Coverage   91.19%   91.17%   -0.03%     
==========================================
  Files         163      163              
  Lines       49652    49651       -1     
==========================================
- Hits        45282    45269      -13     
- Misses       4370     4382      +12
Flag Coverage Δ
#multiple 88.96% <100%> (-0.01%) ⬇️
#single 40.18% <23.07%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/util/testing.py 100% <ø> (ø) ⬆️
pandas/core/nanops.py 96.67% <100%> (-0.99%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️
pandas/core/generic.py 92.04% <0%> (+0.05%) ⬆️
pandas/core/series.py 95.02% <0%> (+0.09%) ⬆️



@jreback (Contributor, Author) commented Sep 29, 2017

cc @shoyer

after all of the conversation in #9422, dead silence here.

@jreback (Contributor, Author) commented Oct 1, 2017

any objections to this, given the discussion in #9422?

@jorisvandenbossche @shoyer

@jorisvandenbossche (Member) left a comment

What is the performance impact of not using bottleneck? From a quick test it seems quite a bit faster than our own implementation.
But I suppose checking for the case of all NaNs (to then not use bottleneck) will defeat this performance gain?
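For what it's worth, the all-NaN check under discussion can be sketched in a few lines. This is a hypothetical standalone helper illustrating the semantics the PR standardises, not the actual `pandas/core/nanops.py` code (which also handles axes, dtypes, and bottleneck dispatch):

```python
import numpy as np

def nansum_propagate(values):
    """Sum `values` ignoring NaNs, but return NaN when there is
    nothing to sum (empty or all-NaN input).

    Hypothetical sketch of the behavior this PR makes consistent.
    """
    mask = np.isnan(values)
    if mask.all():              # empty arrays also hit this branch
        return np.nan
    return values[~mask].sum()  # sum only the non-NaN entries

nansum_propagate(np.array([1.0, np.nan, 3.0]))  # -> 4.0
nansum_propagate(np.array([np.nan, np.nan]))    # -> nan
```

The extra `mask.all()` pass is a second O(n) scan on top of the reduction itself, which is exactly the performance concern raised above.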

@@ -12,6 +12,7 @@ Highlights include:
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.
- The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames is now consistent without regards to `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed, see :ref:`here <whatsnew_0210.api_breaking.bottleneck>`
Member:

".. is now consistent without regards to bottleneck is installed" does not sound correct to me.
Maybe something like ".. is now consistent and does no longer depend on whether bottleneck is installed" ?

(the same for similar occurrence more below)

df = DataFrame({'A': [1, 2, 3],
'B': [1., np.nan, 3.]})
result = df.clip(1, 2)
expected = DataFrame({'A': [1, 2, 2],
expected = DataFrame({'A': [1, 2, 2.],
Member:

is this related ?

Contributor Author:

no, I think it was un-xfailing a test and that's all it needed, IIRC.

@jreback (Contributor, Author) commented Oct 1, 2017

What is the performance impact of not using bottleneck? From a quick test it seems quite a bit faster than our own implementation.
But I suppose checking for the case of all NaNs (to then not use bottleneck) will defeat this performance gain?

there is some benefit; it can be up to 2x faster (in ad-hoc tests). But again, we do more, so it's not entirely fair (e.g. handling proper dtypes, infs and things like that). But for some straightforward stuff it's fine.

@jorisvandenbossche (Member):

there is some benefit, can be up to 2x faster (in ad-hoc tests). But again, we do more so its not entirely fair (e.g. handle proper dtypes, infs and things like that)

That is true, but depending on the dtype this extra 'work' is not needed. E.g. we could special-case integers, as no NaN handling can ever be needed there. Small benchmark:

In [24]: df = pd.DataFrame(np.random.randint(0, 1000, (10000, 10)))

In [27]: pd.options.compute.use_bottleneck = True

In [28]: %timeit df.sum()
241 µs ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [29]: pd.options.compute.use_bottleneck = False

In [30]: %timeit df.sum()
845 µs ± 4.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So it is a considerable difference.

@jreback (Contributor, Author) commented Oct 2, 2017

that's a whopping 3us. If you want to do a PR, go ahead.

@jorisvandenbossche (Member) commented Oct 2, 2017

I can easily make that number bigger by making a bigger frame :-) it's the relative number that is relevant.
But to the point: I am not familiar with the nanops.py code, but isn't it just a matter of adding an is_integer_dtype check here? https://github.com/jreback/pandas/blob/3c91148b1f5b48fa97e475c63a5adfeb429fbdf6/pandas/core/nanops.py#L166 (there is already a check for object and datetime dtype)
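The suggested dispatch could look like the following sketch (a hypothetical `fast_sum` helper, not the actual nanops code): integer dtypes cannot represent NaN, so both the mask and the all-NaN check can be skipped for them.

```python
import numpy as np

def fast_sum(values):
    """Sketch of the proposed special case: integer arrays cannot
    hold NaN, so take the plain fast path; float arrays keep the
    NaN masking and all-NaN handling. Hypothetical helper only."""
    if np.issubdtype(values.dtype, np.integer):
        return values.sum()     # no NaN possible, no mask needed
    mask = np.isnan(values)
    if mask.all():
        return np.nan
    return values[~mask].sum()
```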

@chris-b1 (Contributor) commented Oct 2, 2017

Not sure it's worth the implementation complexity, but we could optimistically use bottleneck, and only if the identity is returned (0 or 1) do a check for all-NaN.
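That idea could be sketched as follows; `np.nansum` stands in for `bottleneck.nansum` here (both return the identity, 0, on empty or all-NaN input), and the expensive all-NaN scan only runs in the rare case where the fast result equals the identity:

```python
import numpy as np

def optimistic_sum(values):
    """Sketch of the 'optimistic' strategy: run the fast kernel
    unconditionally, and only when it returns the identity (0 for
    sum) pay for the all-NaN check. Hypothetical helper; np.nansum
    models bottleneck's behavior of returning 0 on all-NaN input."""
    result = np.nansum(values)
    if result == 0 and np.isnan(values).all():
        return np.nan
    return result
```

A sum that is legitimately zero (e.g. `[1.0, -1.0]`) triggers one extra scan but still returns `0.0`, so correctness is preserved while the common case pays nothing extra.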

@jreback (Contributor, Author) commented Oct 3, 2017

note that by always using our own routine we also avoid some buggy bottleneck versions on Windows (#15507), though this might be fixed in later versions

@jreback (Contributor, Author) commented Oct 6, 2017

updated with more documentation. @jorisvandenbossche @shoyer @TomAugspurger

@jreback (Contributor, Author) commented Oct 9, 2017

@jorisvandenbossche @TomAugspurger any more comments?

@TomAugspurger (Contributor) left a comment

Just doc comments.

This behavior is now standard as of v0.21.0; previously sum/prod would give different
results if the ``bottleneck`` package was installed. See the :ref:`here <whatsnew_0210.api_breaking.bottleneck>`.

If summing a ``DataFrame``, a ``Series`` of all-``NaN``.
Contributor:

typo: DataFrame or Series

Contributor:

And maybe say what the behavior is: "of all-NaN, the return value is NaN."

@@ -12,6 +12,7 @@ Highlights include:
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.
- The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames is now consistent and no longer depends on whether `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed, see :ref:`here <whatsnew_0210.api_breaking.bottleneck>`
Contributor:

Maybe state that it always returns NaN.

The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames is now consistent and no longer depends on
whether `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed. (:issue:`9422`, :issue:`15507`).

This now will *always* preserve information. You will get back a ``NaN``, indicating missing values in that Series,
Contributor:

"indicating all missing values"

whether `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed. (:issue:`9422`, :issue:`15507`).

This now will *always* preserve information. You will get back a ``NaN``, indicating missing values in that Series,
or if summing a ``DataFrame``, a ``Series`` of all-``NaN``. See the :ref:`docs <missing_data.numeric_sum>`.
Contributor:

I don't follow what these two lines are saying. I think I'd phrase it like

For empty or all-missing ``Series`` or columns of a ``DataFrame``, these operations now return ``NaN``. See the :ref:...

@shoyer (Member) left a comment

Can we also adjust the docstring for sum/prod?

Currently, it is:

skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

Maybe this could become:

skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result
    will be NA.
Actually, can you confirm the behavior for summing an empty series with `skipna=False`? e.g., `pd.Series([]).sum(skipna=False)`?


.. warning::

These behaviors differ from the default in ``numpy`` which does not generally propagate NaNs
Member:

Rather than "which does not generally propagate NaNs" I would say "where an empty sum returns zero"
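For comparison, the NumPy behavior the warning contrasts with can be shown in two lines:

```python
import numpy as np

# NumPy's default: an empty sum is zero ...
print(np.array([]).sum())             # 0.0
# ... and nansum treats an all-NaN array the same as an empty one
print(np.nansum([np.nan, np.nan]))    # 0.0
```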

@jreback (Contributor, Author) commented Oct 9, 2017

updated

@shoyer (Member) commented Oct 9, 2017

@jreback I had one comment in my last review that I think got lost with bad markdown formatting:

Can we confirm the behavior for summing an empty series with skipna=False? e.g., pd.Series([]).sum(skipna=False)?

This should be NaN, not 0 like NumPy.

@jreback (Contributor, Author) commented Oct 10, 2017

0.20.3

# with bottleneck
In [1]: pd.Series([]).sum(skipna=True)
Out[1]: 0

In [2]: pd.Series([]).sum(skipna=False)
Out[2]: 0

In [3]: pd.Series([np.nan]).sum(skipna=True)
Out[3]: 0.0

In [4]: pd.Series([np.nan]).sum(skipna=False)
Out[4]: nan

# no bottleneck
In [1]: pd.Series([]).sum(skipna=True)
Out[1]: 0

In [2]: pd.Series([]).sum(skipna=False)
Out[2]: 0

In [3]: pd.Series([np.nan]).sum(skipna=True)
Out[3]: nan

In [4]: pd.Series([np.nan]).sum(skipna=False)
Out[4]: nan

PR

In [1]: pd.Series([]).sum(skipna=True)
Out[1]: nan

In [2]: pd.Series([]).sum(skipna=False)
Out[2]: nan

In [3]: pd.Series([np.nan]).sum(skipna=True)
Out[3]: nan

In [4]: pd.Series([np.nan]).sum(skipna=False)
Out[4]: nan

@shoyer (Member) commented Oct 10, 2017 via email

@jreback (Contributor, Author) commented Oct 10, 2017

k, merging on green.

@jreback jreback merged commit d12a7a0 into pandas-dev:master Oct 10, 2017
@jorisvandenbossche (Member):

@jreback Thanks a lot for pushing for this!

Closes: API: sum of Series of all NaN should return 0 or NaN? (#9422)