-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
COMPAT: sum/prod on all nan will remain nan regardless of bottleneck install #17630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #17630 +/- ##
==========================================
- Coverage 91.19% 91.17% -0.03%
==========================================
Files 163 163
Lines 49652 49651 -1
==========================================
- Hits 45282 45269 -13
- Misses 4370 4382 +12
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17630 +/- ##
==========================================
- Coverage 91.24% 91.21% -0.03%
==========================================
Files 163 163
Lines 49994 49993 -1
==========================================
- Hits 45615 45602 -13
- Misses 4379 4391 +12
Continue to review full report at Codecov.
|
6ae6091
to
bb33485
Compare
any objections to context given discussion in #9422 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the performance impact of not using bottleneck? From a quick test it seems quite a bit faster than our own implementation.
But I suppose checking for the case of all NaNs (to then not use bottleneck) will defeat this performance gain?
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -12,6 +12,7 @@ Highlights include: | |||
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. | |||
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying | |||
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`. | |||
- The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames is now consistent without regards to `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed, see :ref:`here <whatsnew_0210.api_breaking.bottleneck>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
".. is now consistent without regards to bottleneck is installed" does not sound correct to me.
Maybe something like ".. is now consistent and does no longer depend on whether bottleneck is installed" ?
(the same for similar occurrence more below)
df = DataFrame({'A': [1, 2, 3], | ||
'B': [1., np.nan, 3.]}) | ||
result = df.clip(1, 2) | ||
expected = DataFrame({'A': [1, 2, 2], | ||
expected = DataFrame({'A': [1, 2, 2.], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this related ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no I think was un-xfailing a test and that's all it needed IIRC.
there is some benefit, can be up to 2x faster (in ad-hoc tests). But again, we do more so its not entirely fair (e.g. handle proper dtypes, infs and things like that). But for some straightforward stuff its fine. |
That is true, but depending on the dtype this extra 'work' is not needed. Eg we could special case integers? As there can never be NaN handling needed. Small benchmark:
So it is a considerable difference. |
thats a whopping 3us. If you want to do a PR, go ahead. |
I can easily make that number bigger by making a bigger frame :-) it's the relative number that is relevant. |
Not sure it's worth the complexity in implementation, but could optimistically use bottleneck, and only if the identity is returned (0 or 1) do a check for all |
note by always using our own routine we also avoid some buggy versions on windows (#15507) though might be fixed in later versions |
update with more documentation. @jorisvandenbossche @shoyer @TomAugspurger |
@jorisvandenbossche @TomAugspurger any more comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just doc comments.
doc/source/missing_data.rst
Outdated
This behavior is now standard as of v0.21.0; previously sum/prod would give different | ||
results if the ``bottleneck`` package was installed. See the :ref:`here <whatsnew_0210.api_breaking.bottleneck>`. | ||
|
||
If summing a ``DataFrame``, a ``Series`` of all-``NaN``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: DataFrame
or Series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And maybe say what the behavior is. "of all-NaN
, the return value is NaN
.
@@ -12,6 +12,7 @@ Highlights include: | |||
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. | |||
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying | |||
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`. | |||
- The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames is now consistent and no longer depends on whether `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed, see :ref:`here <whatsnew_0210.api_breaking.bottleneck>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe state that it always returns NaN
.
doc/source/whatsnew/v0.21.0.txt
Outdated
The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames is now consistent and no longer depends on | ||
whether `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed. (:issue:`9422`, :issue:`15507`). | ||
|
||
This now will *always* preserve information. You will get back a ``NaN``, indicating missing values in that Series, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"indicating all missing values"
doc/source/whatsnew/v0.21.0.txt
Outdated
whether `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed. (:issue:`9422`, :issue:`15507`). | ||
|
||
This now will *always* preserve information. You will get back a ``NaN``, indicating missing values in that Series, | ||
or if summing a ``DataFrame``, a ``Series`` of all-``NaN``. See the :ref:`docs <missing_data.numeric_sum>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow what these two lines are saying. I think I'd phrase it like
For empty or all-missing ``Series`` or columns of a ``DataFrame``, these operations now return ``NaN``. See the :ref:...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also adjust the docstring for sum
/prod
?
Currently, it is:
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA
Maybe this could become:
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA or empty, the result
will be NA.
``
Actually, can confirm the behavior for summing an empty series with `skipna=False`? e.g., `pd.Series([]).sum(skipna=False)`?
doc/source/missing_data.rst
Outdated
|
||
.. warning:: | ||
|
||
These behaviors differ from the default in ``numpy`` which does not generally propagate NaNs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than "which does not generally propagate NaNs" I would say "where an empty sum returns zero"
updated |
@jreback I had one comment in my last review that I think got lost with bad markdown formatting:
This should be NaN, not 0 like NumPy. |
0.20.3
PR
|
Thanks, that looks right!
…On Tue, Oct 10, 2017 at 3:16 AM Jeff Reback ***@***.***> wrote:
0.20.3
# with bottleneck
In [1]: pd.Series([]).sum(skipna=True)
Out[1]: 0
In [2]: pd.Series([]).sum(skipna=False)
Out[2]: 0
In [3]: pd.Series([np.nan]).sum(skipna=True)
Out[3]: 0.0
In [4]: pd.Series([np.nan]).sum(skipna=False)
Out[4]: nan
# no bottleneck
In [1]: pd.Series([]).sum(skipna=True)
Out[1]: 0
In [2]: pd.Series([]).sum(skipna=False)
Out[2]: 0
In [3]: pd.Series([np.nan]).sum(skipna=True)
Out[3]: nan
In [4]: pd.Series([np.nan]).sum(skipna=False)
Out[4]: nan
PR
In [1]: pd.Series([]).sum(skipna=True)
Out[1]: nan
In [2]: pd.Series([]).sum(skipna=False)
Out[2]: nan
In [3]: pd.Series([np.nan]).sum(skipna=True)
Out[3]: nan
In [4]: pd.Series([np.nan]).sum(skipna=False)
Out[4]: nan
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#17630 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1vzpxhQ4Zz589NUueOhyHGKt7gMMks5squGHgaJpZM4Pg12h>
.
|
k, merging on green. |
@jreback Thanks a lot for pushing for this! |
xref #15507
closes #9422