Skip to content

PDEP-11: Change default of dropna to False #53094

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
108 changes: 108 additions & 0 deletions web/pandas/pdeps/0011-dropna-default.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
# PDEP-11: dropna default in pandas

- Created: 4 May 2023
- Status: Under discussion
- Discussion: [PR #53094](https://github.com/pandas-dev/pandas/pull/53094)
- Authors: [Richard Shadrach](https://github.com/rhshadrach)
- Revision: 1

## Abstract

Throughout pandas, almost all of the methods that have a `dropna` argument default
to `True`. Being the default, this can cause NA values to be silently dropped.
This PDEP proposes to deprecate the current default value of `True` and change it
to `False` in the next major release of pandas.

## Motivation and Scope

Upon seeing the output for a Series `ser`:

```python
print(ser.value_counts())

1 3
2 1
dtype: Int64
```

users may be surprised that the Series can contain NA values, as is argued in
[#21890](https://github.com/pandas-dev/pandas/issues/21890). By then operating
on data under the assumption NA values are not present, erroroneous results can
arise. The same issue can occur with `groupby`, which can also be used to produce
detailed summary statistics of data. We think it is not unreasonable that an
experienced pandas user seeing the code

df[["a", "b"]].groupby("a").sum()

would describe this operation as something like the following.

> For each unique value in column `a`, compute the sum of corresponding values
> in column `b` and return the results in a DataFrame indexed by the unique
> values of `a`.

This is correct, except that NA values in the column `a` will be dropped from
the computation. That pandas is taking this additional step in the computation
is not apparent from the code, and can surprise users.

###

### Keeping the default `skipna=True`

Many reductions methods, such as `sum`, `mean`, and `var`, have a `skipna` argument.
In such operations, setting `skipna=False` would make the output of any operation
NA if a single NA value is encountered.

```python
df = pd.DataFrame({'a': [1, np.nan], 'b': [2, np.nan]})
print(df.sum(skipna=False))
# a NaN
# b NaN
# dtype: float64
```

This makes `skipna=False` an undesirable default. In the methods with `dropna`, this phenomena does not occur. By defaulting to `dropna=False` in these
methods, the results when NA values are encountered do not obscure the results of non-NA values.

### Possible deprecation of `dropna`

This PDEP takes no position on whether some methods with a `dropna` argument should have said argument deprecated.
However, if such a deprecation is to be pursued, then we believe that the final behavior should
be that of `dropna=False` across any of the methods listed below. With this, a necessary first step
in the deprecation process would be to change the default value to `dropna=False`.

## Detailed Description

The following methods have a dropna argument, those marked with a `*` already default to `False`.

```python
Series.groupby
Series.mode
Series.nunique
Series.to_hdf*
Series.value_counts
DataFrame.groupby
DataFrame.mode
DataFrame.nunique
DataFrame.pivot_table
DataFrame.stack
DataFrame.to_hdf*
DataFrame.value_counts
SeriesGroupBy.nunique
SeriesGroupBy.value_counts
DataFrameGroupBy.nunique
DataFrameGroupBy.value_counts
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might be missing a couple functions here.
This is the complete list (made using the keyword inspector script I posted on slack a while back).

<class 'pandas.core.arrays.categorical.Categorical'>.value_counts
<class 'pandas.core.indexes.category.CategoricalIndex'>.nunique
<class 'pandas.core.indexes.category.CategoricalIndex'>.value_counts
<class 'pandas.core.frame.DataFrame'>.groupby
<class 'pandas.core.frame.DataFrame'>.mode
<class 'pandas.core.frame.DataFrame'>.nunique
<class 'pandas.core.frame.DataFrame'>.pivot_table
<class 'pandas.core.frame.DataFrame'>.stack
<class 'pandas.core.frame.DataFrame'>.to_hdf
<class 'pandas.core.frame.DataFrame'>.value_counts
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>.nunique
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>.value_counts
<class 'pandas.io.pytables.HDFStore'>.append
<class 'pandas.io.pytables.HDFStore'>.append_to_multiple
<class 'pandas.io.pytables.HDFStore'>.put
<class 'pandas.core.indexes.base.Index'>.nunique
<class 'pandas.core.indexes.base.Index'>.value_counts
<class 'pandas.core.indexes.interval.IntervalIndex'>.nunique
<class 'pandas.core.indexes.interval.IntervalIndex'>.value_counts
<class 'pandas.core.indexes.multi.MultiIndex'>.nunique
<class 'pandas.core.indexes.multi.MultiIndex'>.value_counts
<class 'pandas.core.indexes.period.PeriodIndex'>.nunique
<class 'pandas.core.indexes.period.PeriodIndex'>.value_counts
<class 'pandas.core.indexes.range.RangeIndex'>.nunique
<class 'pandas.core.indexes.range.RangeIndex'>.value_counts
<class 'pandas.core.series.Series'>.groupby
<class 'pandas.core.series.Series'>.mode
<class 'pandas.core.series.Series'>.nunique
<class 'pandas.core.series.Series'>.to_hdf
<class 'pandas.core.series.Series'>.value_counts
<class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>.nunique
<class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>.value_counts
crosstab
lreshape
pivot_table
value_counts

I think the missing ones are
crosstab, lreshape, HDFStore.put|append|append_to_multiple.


We propose to deprecate the current default of `dropna` and change it to
`False` across all methods listed above.

## Timeline

If accepted, the current `dropna` default would be deprecated as part of pandas
2.x and this deprecation would be enforced in pandas 3.0. In pandas 2.x, `FutureWarning` messages would
be emitted on any calls to these methods where the value of `dropna` is unspecified and
an NA value is present.

## PDEP History

- 4 May 2023: Initial draft