
COMPAT: sum/prod on all nan will remain nan regardless of bottleneck install #17630

Merged (1 commit) on Oct 10, 2017

Conversation

@jreback (Contributor) commented Sep 22, 2017

xref #15507
closes #9422

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations labels Sep 22, 2017
@jreback jreback added this to the 0.21.0 milestone Sep 22, 2017
@codecov (bot) commented Sep 22, 2017

Codecov Report

Merging #17630 into master will decrease coverage by 0.02%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #17630      +/-   ##
==========================================
- Coverage   91.19%   91.17%   -0.03%     
==========================================
  Files         163      163              
  Lines       49652    49651       -1     
==========================================
- Hits        45282    45269      -13     
- Misses       4370     4382      +12
Flag Coverage Δ
#multiple 88.96% <100%> (-0.01%) ⬇️
#single 40.18% <23.07%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/util/testing.py 100% <ø> (ø) ⬆️
pandas/core/nanops.py 96.67% <100%> (-0.99%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️
pandas/core/generic.py 92.04% <0%> (+0.05%) ⬆️
pandas/core/series.py 95.02% <0%> (+0.09%) ⬆️



@jreback (Contributor, Author) commented Sep 29, 2017

cc @shoyer

after all of the conversation in #9422, dead silence here.

@jreback (Contributor, Author) commented Oct 1, 2017

any objections to this, given the discussion in #9422?

@jorisvandenbossche @shoyer

@jorisvandenbossche (Member) left a comment

What is the performance impact of not using bottleneck? From a quick test it seems quite a bit faster than our own implementation.
But I suppose checking for the case of all NaNs (to then not use bottleneck) will defeat this performance gain?
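For what it's worth, the all-NaN check under discussion can be sketched in a few lines. This is a hypothetical standalone helper illustrating the semantics the PR standardises, not the actual `pandas/core/nanops.py` code (which also handles axes, dtypes, and bottleneck dispatch):

```python
import numpy as np

def nansum_propagate(values):
    """Sum `values` ignoring NaNs, but return NaN when there is
    nothing to sum (empty or all-NaN input).

    Hypothetical sketch of the behavior this PR makes consistent.
    """
    mask = np.isnan(values)
    if mask.all():              # empty arrays also hit this branch
        return np.nan
    return values[~mask].sum()  # sum only the non-NaN entries

nansum_propagate(np.array([1.0, np.nan, 3.0]))  # -> 4.0
nansum_propagate(np.array([np.nan, np.nan]))    # -> nan
```

The extra `mask.all()` pass is a second O(n) scan on top of the reduction itself, which is exactly the performance concern raised above.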

@@ -12,6 +12,7 @@ Highlights include:
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.
- The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames is now consistent without regards to `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed, see :ref:`here <whatsnew_0210.api_breaking.bottleneck>`
Member:

".. is now consistent without regards to bottleneck is installed" does not sound correct to me.
Maybe something like ".. is now consistent and does no longer depend on whether bottleneck is installed" ?

(the same for similar occurrence more below)

df = DataFrame({'A': [1, 2, 3],
'B': [1., np.nan, 3.]})
result = df.clip(1, 2)
expected = DataFrame({'A': [1, 2, 2],
expected = DataFrame({'A': [1, 2, 2.],
Member:

is this related ?

Contributor Author:

no, I think it was un-xfailing a test and that's all it needed, IIRC.

@jreback (Contributor, Author) commented Oct 1, 2017

What is the performance impact of not using bottleneck? From a quick test it seems quite a bit faster than our own implementation.
But I suppose checking for the case of all NaNs (to then not use bottleneck) will defeat this performance gain?

there is some benefit; it can be up to 2x faster (in ad-hoc tests). But again, we do more, so it's not entirely fair (e.g. handling proper dtypes, infs and things like that). But for some straightforward stuff it's fine.

@jorisvandenbossche (Member):

there is some benefit, can be up to 2x faster (in ad-hoc tests). But again, we do more so its not entirely fair (e.g. handle proper dtypes, infs and things like that)

That is true, but depending on the dtype this extra 'work' is not needed. E.g. we could special-case integers, as no NaN handling can ever be needed there. Small benchmark:

In [24]: df = pd.DataFrame(np.random.randint(0, 1000, (10000, 10)))

In [27]: pd.options.compute.use_bottleneck = True

In [28]: %timeit df.sum()
241 µs ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [29]: pd.options.compute.use_bottleneck = False

In [30]: %timeit df.sum()
845 µs ± 4.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So it is a considerable difference.

@jreback (Contributor, Author) commented Oct 2, 2017

that's a whopping 3us. If you want to do a PR, go ahead.

@jorisvandenbossche (Member) commented Oct 2, 2017

I can easily make that number bigger by making a bigger frame :-) it's the relative number that is relevant.
But to the point: I am not familiar with the nanops.py code, but isn't it just a matter of adding an is_integer_dtype check here? https://github.com/jreback/pandas/blob/3c91148b1f5b48fa97e475c63a5adfeb429fbdf6/pandas/core/nanops.py#L166 (there is already a check for object and datetime dtype)
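The suggested dispatch could look like the following sketch (a hypothetical `fast_sum` helper, not the actual nanops code): integer dtypes cannot represent NaN, so both the mask and the all-NaN check can be skipped for them.

```python
import numpy as np

def fast_sum(values):
    """Sketch of the proposed special case: integer arrays cannot
    hold NaN, so take the plain fast path; float arrays keep the
    NaN masking and all-NaN handling. Hypothetical helper only."""
    if np.issubdtype(values.dtype, np.integer):
        return values.sum()     # no NaN possible, no mask needed
    mask = np.isnan(values)
    if mask.all():
        return np.nan
    return values[~mask].sum()
```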

@chris-b1 (Contributor) commented Oct 2, 2017

Not sure it's worth the implementation complexity, but we could optimistically use bottleneck, and only if the identity is returned (0 or 1) do a check for all-NaN.
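That idea could be sketched as follows; `np.nansum` stands in for `bottleneck.nansum` here (both return the identity, 0, on empty or all-NaN input), and the expensive all-NaN scan only runs in the rare case where the fast result equals the identity:

```python
import numpy as np

def optimistic_sum(values):
    """Sketch of the 'optimistic' strategy: run the fast kernel
    unconditionally, and only when it returns the identity (0 for
    sum) pay for the all-NaN check. Hypothetical helper; np.nansum
    models bottleneck's behavior of returning 0 on all-NaN input."""
    result = np.nansum(values)
    if result == 0 and np.isnan(values).all():
        return np.nan
    return result
```

A sum that is legitimately zero (e.g. `[1.0, -1.0]`) triggers one extra scan but still returns `0.0`, so correctness is preserved while the common case pays nothing extra.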

@jreback (Contributor, Author) commented Oct 3, 2017

note that by always using our own routine we also avoid some buggy bottleneck versions on Windows (#15507), though this might be fixed in later versions

@jreback (Contributor, Author) commented Oct 6, 2017

updated with more documentation. @jorisvandenbossche @shoyer @TomAugspurger

@jreback (Contributor, Author) commented Oct 9, 2017

@jorisvandenbossche @TomAugspurger any more comments?

@TomAugspurger (Contributor) left a comment

Just doc comments.

This behavior is now standard as of v0.21.0; previously sum/prod would give different
results if the ``bottleneck`` package was installed. See the :ref:`here <whatsnew_0210.api_breaking.bottleneck>`.

If summing a ``DataFrame``, a ``Series`` of all-``NaN``.
Contributor:

typo: DataFrame or Series

Contributor:

And maybe say what the behavior is: "of all-NaN, the return value is NaN."

@@ -12,6 +12,7 @@ Highlights include:
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.
- The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames is now consistent and no longer depends on whether `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed, see :ref:`here <whatsnew_0210.api_breaking.bottleneck>`
Contributor:

Maybe state that it always returns NaN.

The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames is now consistent and no longer depends on
whether `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed. (:issue:`9422`, :issue:`15507`).

This now will *always* preserve information. You will get back a ``NaN``, indicating missing values in that Series,
Contributor:

"indicating all missing values"

whether `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed. (:issue:`9422`, :issue:`15507`).

This now will *always* preserve information. You will get back a ``NaN``, indicating missing values in that Series,
or if summing a ``DataFrame``, a ``Series`` of all-``NaN``. See the :ref:`docs <missing_data.numeric_sum>`.
Contributor:

I don't follow what these two lines are saying. I think I'd phrase it like

For empty or all-missing ``Series`` or columns of a ``DataFrame``, these operations now return ``NaN``. See the :ref:...

@shoyer (Member) left a comment

Can we also adjust the docstring for sum/prod?

Currently, it is:

skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA

Maybe this could become:

skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA or empty, the result
    will be NA.
Actually, can you confirm the behavior for summing an empty series with `skipna=False`? e.g., `pd.Series([]).sum(skipna=False)`?


.. warning::

These behaviors differ from the default in ``numpy`` which does not generally propagate NaNs
Member:

Rather than "which does not generally propagate NaNs" I would say "where an empty sum returns zero"
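For comparison, the NumPy behavior the warning contrasts with can be shown in two lines:

```python
import numpy as np

# NumPy's default: an empty sum is zero ...
print(np.array([]).sum())             # 0.0
# ... and nansum treats an all-NaN array the same as an empty one
print(np.nansum([np.nan, np.nan]))    # 0.0
```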

@jreback (Contributor, Author) commented Oct 9, 2017

updated

@shoyer (Member) commented Oct 9, 2017

@jreback I had one comment in my last review that I think got lost with bad markdown formatting:

Can we confirm the behavior for summing an empty series with skipna=False? e.g., pd.Series([]).sum(skipna=False)?

This should be NaN, not 0 like NumPy.

@jreback (Contributor, Author) commented Oct 10, 2017

0.20.3

# with bottleneck
In [1]: pd.Series([]).sum(skipna=True)
Out[1]: 0

In [2]: pd.Series([]).sum(skipna=False)
Out[2]: 0

In [3]: pd.Series([np.nan]).sum(skipna=True)
Out[3]: 0.0

In [4]: pd.Series([np.nan]).sum(skipna=False)
Out[4]: nan

# no bottleneck
In [1]: pd.Series([]).sum(skipna=True)
Out[1]: 0

In [2]: pd.Series([]).sum(skipna=False)
Out[2]: 0

In [3]: pd.Series([np.nan]).sum(skipna=True)
Out[3]: nan

In [4]: pd.Series([np.nan]).sum(skipna=False)
Out[4]: nan

PR

In [1]: pd.Series([]).sum(skipna=True)
Out[1]: nan

In [2]: pd.Series([]).sum(skipna=False)
Out[2]: nan

In [3]: pd.Series([np.nan]).sum(skipna=True)
Out[3]: nan

In [4]: pd.Series([np.nan]).sum(skipna=False)
Out[4]: nan

@shoyer (Member) commented Oct 10, 2017 via email

@jreback (Contributor, Author) commented Oct 10, 2017

k, merging on green.

@jreback jreback merged commit d12a7a0 into pandas-dev:master Oct 10, 2017
@jorisvandenbossche (Member):

@jreback Thanks a lot for pushing for this!

Closes: API: sum of Series of all NaN should return 0 or NaN? (#9422)