From 7c5f8c7b821178633575a893ff1691521fe0a7ed Mon Sep 17 00:00:00 2001
From: richard
Date: Thu, 4 May 2023 22:23:26 -0400
Subject: [PATCH 1/3] PDEP-11: Change default of dropna to False

---
 web/pandas/pdeps/0011-dropna-default.md | 78 +++++++++++++++++++++++++
 1 file changed, 78 insertions(+)
 create mode 100644 web/pandas/pdeps/0011-dropna-default.md

diff --git a/web/pandas/pdeps/0011-dropna-default.md b/web/pandas/pdeps/0011-dropna-default.md
new file mode 100644
index 0000000000000..83954166d388c
--- /dev/null
+++ b/web/pandas/pdeps/0011-dropna-default.md
@@ -0,0 +1,78 @@
+# PDEP-11: dropna default in pandas
+
+- Created: 4 May 2023
+- Status: Under discussion
+- Discussion: [PR ??](https://github.com/pandas-dev/pandas/pull/??)
+- Authors: [Richard Shadrach](https://github.com/rhshadrach)
+- Revision: 1
+
+## Abstract
+
+Throughout pandas, almost all of the methods that have a `dropna` argument default
+to `True`. Being the default, this can cause NA values to be silently dropped.
+This PDEP proposes to deprecate the current default value of `True` and change it
+to `False` in the next major release of pandas.
+
+## Motivation and Scope
+
+Upon seeing the output for a Series `ser`:
+
+```python
+print(ser.value_counts())
+
+1 3
+2 1
+dtype: Int64
+```
+
+users may be surprised that the Series can contain NA values. By then operating
+on data under the assumption NA values are not present, erroroneous results can
+arise. The same issue can occur with `groupby`, which can also be used to produce
+detailed summary statistics of data. We think it is not unreasonable that an
+experienced pandas user seeing the code
+
+    df[["a", "b"]].groupby("a").sum()
+
+would describe this operation as something like the following.
+
+> For each unique value in column `a`, compute the sum of corresponding values
+> in column `b` and return the results in a DataFrame indexed by the unique
+> values of `a`.
+
+This is correct, except that NA values in the column `a` will be dropped from
+the computation. That pandas is taking this additional step in the computation
+is not apparent from the code, and can surprise users.
+
+## Detailed Description
+
+We propose to deprecate the current default of `dropna` and change it to
+`False` across all applicable methods. The following methods have a dropna
+argument, those marked with a `*` already default to `False`.
+
+```python
+Series.groupby
+Series.mode
+Series.nunique
+*Series.to_hdf
+Series.value_counts
+DataFrame.groupby
+DataFrame.mode
+DataFrame.nunique
+DataFrame.pivot_table
+DataFrame.stack
+*DataFrame.to_hdf
+DataFrame.value_counts
+SeriesGroupBy.nunique
+SeriesGroupBy.value_counts
+DataFrameGroupBy.nunique
+DataFrameGroupBy.value_counts
+```
+
+## Timeline
+
+If accepted, the current `dropna` default would be deprecated as part of pandas
+2.x and this deprecation would be enforced in pandas 3.0.
+
+## PDEP History
+
+- 4 May 2023: Initial draft

From e45bfeb321d0e61c7c44592fbe3611cae7ba341f Mon Sep 17 00:00:00 2001
From: richard
Date: Thu, 4 May 2023 22:29:24 -0400
Subject: [PATCH 2/3] PR #, fixups

---
 web/pandas/pdeps/0011-dropna-default.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/web/pandas/pdeps/0011-dropna-default.md b/web/pandas/pdeps/0011-dropna-default.md
index 83954166d388c..d3afb1a852101 100644
--- a/web/pandas/pdeps/0011-dropna-default.md
+++ b/web/pandas/pdeps/0011-dropna-default.md
@@ -2,7 +2,7 @@
 
 - Created: 4 May 2023
 - Status: Under discussion
-- Discussion: [PR ??](https://github.com/pandas-dev/pandas/pull/??)
+- Discussion: [PR #53094](https://github.com/pandas-dev/pandas/pull/53094)
 - Authors: [Richard Shadrach](https://github.com/rhshadrach)
 - Revision: 1
 
@@ -53,14 +53,14 @@ argument, those marked with a `*` already default to `False`.
 Series.groupby
 Series.mode
 Series.nunique
-*Series.to_hdf
+Series.to_hdf*
 Series.value_counts
 DataFrame.groupby
 DataFrame.mode
 DataFrame.nunique
 DataFrame.pivot_table
 DataFrame.stack
-*DataFrame.to_hdf
+DataFrame.to_hdf*
 DataFrame.value_counts
 SeriesGroupBy.nunique
 SeriesGroupBy.value_counts

From e1194a55e378e315d4c7468bd4cacc67049db3d8 Mon Sep 17 00:00:00 2001
From: Richard Shadrach
Date: Tue, 9 May 2023 16:56:04 -0400
Subject: [PATCH 3/3] Update from feedback

---
 web/pandas/pdeps/0011-dropna-default.md | 40 +++++++++++++++++++++----
 1 file changed, 35 insertions(+), 5 deletions(-)

diff --git a/web/pandas/pdeps/0011-dropna-default.md b/web/pandas/pdeps/0011-dropna-default.md
index d3afb1a852101..30cb9f00cc319 100644
--- a/web/pandas/pdeps/0011-dropna-default.md
+++ b/web/pandas/pdeps/0011-dropna-default.md
@@ -25,7 +25,8 @@ print(ser.value_counts())
 dtype: Int64
 ```
 
-users may be surprised that the Series can contain NA values. By then operating
+users may be surprised that the Series can contain NA values, as is argued in
+[#21890](https://github.com/pandas-dev/pandas/issues/21890). By then operating
 on data under the assumption NA values are not present, erroroneous results can
 arise. The same issue can occur with `groupby`, which can also be used to produce
 detailed summary statistics of data. We think it is not unreasonable that an
@@ -43,11 +44,35 @@ This is correct, except that NA values in the column `a` will be dropped from
 the computation. That pandas is taking this additional step in the computation
 is not apparent from the code, and can surprise users.
 
+###
+
+### Keeping the default `skipna=True`
+
+Many reduction methods, such as `sum`, `mean`, and `var`, have a `skipna` argument.
+In such operations, setting `skipna=False` would make the output of any operation
+NA if a single NA value is encountered.
+
+```python
+df = pd.DataFrame({'a': [1, np.nan], 'b': [2, np.nan]})
+print(df.sum(skipna=False))
+# a NaN
+# b NaN
+# dtype: float64
+```
+
+This makes `skipna=False` an undesirable default. In the methods with `dropna`, this phenomenon does not occur. By defaulting to `dropna=False` in these
+methods, the results when NA values are encountered do not obscure the results of non-NA values.
+
+### Possible deprecation of `dropna`
+
+This PDEP takes no position on whether some methods with a `dropna` argument should have said argument deprecated.
+However, if such a deprecation is to be pursued, then we believe that the final behavior should
+be that of `dropna=False` across any of the methods listed below. With this, a necessary first step
+in the deprecation process would be to change the default value to `dropna=False`.
+
 ## Detailed Description
 
-We propose to deprecate the current default of `dropna` and change it to
-`False` across all applicable methods. The following methods have a dropna
-argument, those marked with a `*` already default to `False`.
+The following methods have a dropna argument, those marked with a `*` already default to `False`.
 
 ```python
 Series.groupby
@@ -68,10 +93,15 @@ DataFrameGroupBy.nunique
 DataFrameGroupBy.value_counts
 ```
 
+We propose to deprecate the current default of `dropna` and change it to
+`False` across all methods listed above.
+
 ## Timeline
 
 If accepted, the current `dropna` default would be deprecated as part of pandas
-2.x and this deprecation would be enforced in pandas 3.0.
+2.x and this deprecation would be enforced in pandas 3.0. In pandas 2.x, `FutureWarning` messages would
+be emitted on any calls to these methods where the value of `dropna` is unspecified and
+an NA value is present.
 
 ## PDEP History
 
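Note for readers of this series: the behavior the PDEP describes can be reproduced with a small sketch like the one below. It is not part of the patch. The data and the variable names (`default_counts`, `explicit_sum`, and so on) are illustrative assumptions, and the sketch only uses the existing `dropna` keyword of `Series.value_counts` and `DataFrame.groupby` as it behaves under the current `dropna=True` default.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a nullable-integer Series holding one NA value.
ser = pd.Series([1, 1, 1, 2, pd.NA], dtype="Int64")

# Current default (dropna=True): the NA entry is silently left out.
default_counts = ser.value_counts()
# Proposed default, requested explicitly: NA is counted as its own entry.
explicit_counts = ser.value_counts(dropna=False)

# Hypothetical frame whose grouping column "a" contains an NA key.
df = pd.DataFrame({"a": [1.0, 1.0, np.nan], "b": [10, 20, 30]})

# Current default: the row with the NA key does not contribute to any group.
default_sum = df.groupby("a")["b"].sum()
# With dropna=False the NA key forms its own group.
explicit_sum = df.groupby("a", dropna=False)["b"].sum()

print(len(default_counts), len(explicit_counts))
print(len(default_sum), len(explicit_sum))
```

Passing `dropna` explicitly, as in the second call of each pair, is also how a caller would pin the behavior during the deprecation period described in the Timeline section, since the proposed warning applies only when `dropna` is unspecified.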