ENH: Enable fillna(value=None) #58085

Merged
merged 6 commits into from
Apr 15, 2024
21 changes: 21 additions & 0 deletions doc/source/user_guide/missing_data.rst
@@ -386,6 +386,27 @@ Replace NA with a scalar value
df
df.fillna(0)

When the data has object dtype, you can control what type of NA values are present.

.. ipython:: python

df = pd.DataFrame({"a": [pd.NA, np.nan, None]}, dtype=object)
df
df.fillna(None)
df.fillna(np.nan)
df.fillna(pd.NA)

However, when the dtype is not object, these will all be replaced with the proper NA value for the dtype.

.. ipython:: python

data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")}
df = pd.DataFrame(data)
df
df.fillna(None)
df.fillna(np.nan)
df.fillna(pd.NA)

Fill gaps forward or backward

.. ipython:: python
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -37,6 +37,7 @@ Other enhancements
- Users can globally disable any ``PerformanceWarning`` by setting the option ``mode.performance_warnings`` to ``False`` (:issue:`56920`)
- :meth:`Styler.format_index_names` can now be used to format the index and column names (:issue:`48936` and :issue:`47489`)
- :meth:`DataFrame.cummin`, :meth:`DataFrame.cummax`, :meth:`DataFrame.cumprod` and :meth:`DataFrame.cumsum` methods now have a ``numeric_only`` parameter (:issue:`53072`)
- :meth:`DataFrame.fillna` and :meth:`Series.fillna` can now accept ``value=None``; for non-object dtype the corresponding NA value will be used (:issue:`57723`)
Contributor

The other change here is that you must specify a value for `value`, whereas before it defaulted to `None`.

Member Author

@rhshadrach rhshadrach Apr 2, 2024

Previously, the default value of None would also raise. In other words, both before and after this PR you have to specify value.

Contributor

> Previously, the default value of None would also raise. In other words, both before and after this PR you have to specify value.

That was true if you didn't specify method. But with 2.2:

>>> import pandas as pd
>>> import numpy as np
>>> s=pd.Series([1,np.nan,2])
>>> s.fillna(method="ffill")
<stdin>:1: FutureWarning: Series.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
0    1.0
1    1.0
2    2.0
dtype: float64
>>>

So I think it is worth saying that value is now required.

Member Author

@rhshadrach rhshadrach Apr 13, 2024

But method is no longer an argument. In 2.x, you either had to specify value or method, otherwise we'd raise. Any code that does not raise a warning in 2.2 is already specifying the value argument. Any code that raises a warning in 2.2 will raise a TypeError in 3.0.

Contributor

So don't we have to say somewhere that we are enforcing the deprecation that method is no longer an argument?

Member Author

- Removed deprecated keyword ``method`` on :meth:`Series.fillna`, :meth:`DataFrame.fillna` (:issue:`57760`)

Contributor

Great. All good then
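
The migration described in this thread can be sketched as follows. This is a minimal illustration, not part of the PR: it assumes a pandas version that provides `Series.ffill` and `Series.bfill`, the replacements named in the 2.2 deprecation warning quoted above. The sample data is made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

# Forward fill: stands in for the removed fillna(method="ffill").
forward = s.ffill()

# Backward fill: stands in for the removed fillna(method="bfill").
backward = s.bfill()

print(forward.tolist())
print(backward.tolist())
```

Code written this way does not pass `method` to `fillna` at all, so it should be unaffected by the keyword's removal.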


.. ---------------------------------------------------------------------------
.. _whatsnew_300.notable_bug_fixes:
5 changes: 2 additions & 3 deletions pandas/core/arrays/_mixins.py
@@ -328,7 +328,7 @@ def _pad_or_backfill(
return new_values

@doc(ExtensionArray.fillna)
- def fillna(self, value=None, limit: int | None = None, copy: bool = True) -> Self:
+ def fillna(self, value, limit: int | None = None, copy: bool = True) -> Self:
mask = self.isna()
# error: Argument 2 to "check_value_size" has incompatible type
# "ExtensionArray"; expected "ndarray"
@@ -347,8 +347,7 @@ def fillna(self, value=None, limit: int | None = None, copy: bool = True) -> Sel
new_values[mask] = value
else:
# We validate the fill_value even if there is nothing to fill
- if value is not None:
- self._validate_setitem_value(value)
+ self._validate_setitem_value(value)

if not copy:
new_values = self[:]
2 changes: 1 addition & 1 deletion pandas/core/arrays/arrow/array.py
@@ -1077,7 +1077,7 @@ def _pad_or_backfill(
@doc(ExtensionArray.fillna)
def fillna(
self,
- value: object | ArrayLike | None = None,
+ value: object | ArrayLike,
limit: int | None = None,
copy: bool = True,
) -> Self:
2 changes: 1 addition & 1 deletion pandas/core/arrays/interval.py
@@ -892,7 +892,7 @@ def max(self, *, axis: AxisInt | None = None, skipna: bool = True) -> IntervalOr
indexer = obj.argsort()[-1]
return obj[indexer]

- def fillna(self, value=None, limit: int | None = None, copy: bool = True) -> Self:
+ def fillna(self, value, limit: int | None = None, copy: bool = True) -> Self:
"""
Fill NA/NaN values using the specified method.

2 changes: 1 addition & 1 deletion pandas/core/arrays/masked.py
@@ -236,7 +236,7 @@ def _pad_or_backfill(
return new_values

@doc(ExtensionArray.fillna)
- def fillna(self, value=None, limit: int | None = None, copy: bool = True) -> Self:
+ def fillna(self, value, limit: int | None = None, copy: bool = True) -> Self:
mask = self._mask

value = missing.check_value_size(value, mask, len(self))
4 changes: 1 addition & 3 deletions pandas/core/arrays/sparse/array.py
@@ -706,7 +706,7 @@ def isna(self) -> Self: # type: ignore[override]

def fillna(
self,
- value=None,
+ value,
limit: int | None = None,
copy: bool = True,
) -> Self:
@@ -736,8 +736,6 @@ def fillna(
When ``self.fill_value`` is not NA, the result dtype will be
``self.dtype``. Again, this preserves the amount of memory used.
"""
- if value is None:
- raise ValueError("Must specify 'value'.")
new_values = np.where(isna(self.sp_values), value, self.sp_values)

if self._null_fill_value:
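
The effect of dropping the `None` default can be sketched without pandas. The two functions below are hypothetical stand-ins, not the real `SparseArray.fillna`; they only mirror the old and new signatures shown in this diff, and `outcome` is a helper invented for the illustration.

```python
def fillna_old(value=None, limit=None):
    """Hypothetical old-style signature: None doubles as 'not provided'."""
    if value is None:
        raise ValueError("Must specify 'value'.")
    return value


def fillna_new(value, limit=None):
    """Hypothetical new-style signature: value is required, None is legal."""
    return value


def outcome(func, *args):
    """Return the repr of the result, or the name of the exception raised."""
    try:
        return repr(func(*args))
    except (TypeError, ValueError) as exc:
        return type(exc).__name__


old_missing = outcome(fillna_old)        # no argument at all
old_none = outcome(fillna_old, None)     # explicit None
new_missing = outcome(fillna_new)        # no argument at all
new_none = outcome(fillna_new, None)     # explicit None

print(old_missing, old_none, new_missing, new_none)
```

With the old signature, an explicit `None` cannot be told apart from a missing argument, so both calls end in `ValueError`. With the new one, Python itself raises `TypeError` for the missing argument, and `None` reaches the function body as a real fill value.
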
179 changes: 88 additions & 91 deletions pandas/core/generic.py
@@ -6752,7 +6752,7 @@ def _pad_or_backfill(
@overload
def fillna(
self,
- value: Hashable | Mapping | Series | DataFrame = ...,
+ value: Hashable | Mapping | Series | DataFrame,
*,
axis: Axis | None = ...,
inplace: Literal[False] = ...,
@@ -6762,7 +6762,7 @@ def fillna(
@overload
def fillna(
self,
- value: Hashable | Mapping | Series | DataFrame = ...,
+ value: Hashable | Mapping | Series | DataFrame,
*,
axis: Axis | None = ...,
inplace: Literal[True],
@@ -6772,7 +6772,7 @@ def fillna(
@overload
def fillna(
self,
- value: Hashable | Mapping | Series | DataFrame = ...,
+ value: Hashable | Mapping | Series | DataFrame,
*,
axis: Axis | None = ...,
inplace: bool = ...,
@@ -6786,7 +6786,7 @@ def fillna(
)
def fillna(
self,
- value: Hashable | Mapping | Series | DataFrame | None = None,
+ value: Hashable | Mapping | Series | DataFrame,
*,
axis: Axis | None = None,
inplace: bool = False,
@@ -6827,6 +6827,12 @@ def fillna(
reindex : Conform object to new index.
asfreq : Convert TimeSeries to specified frequency.

Notes
-----
For non-object dtype, ``value=None`` will use the NA value of the dtype.
See more details in the :ref:`Filling missing data<missing_data.fillna>`
section.

Examples
Contributor

Maybe you should add an example with value=None?

Member Author

@rhshadrach rhshadrach Apr 2, 2024

I think this is a good suggestion, but what do you think about doing it in https://pandas.pydata.org/docs/user_guide/missing_data.html#filling-missing-data instead of the docstring here? Can then link to this in the docstring.

Contributor

> I think this is a good suggestion, but what do you think about doing it in https://pandas.pydata.org/docs/user_guide/missing_data.html#filling-missing-data instead of the docstring here? Can then link to this in the docstring.

That's fine, although I think the behavior of None with respect to how it is treated as a missing value isn't necessarily well explained on that page, so maybe that should be done as well.

Member Author

Added docs:

image

image
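
A short example of the behaviour discussed in this thread, written so that it also runs on releases that predate this PR. It assumes that older versions reject `value=None` with a `ValueError` or `TypeError`; the helper name `fill_with_none` and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd


def fill_with_none(obj):
    """Return obj.fillna(None), or None if this pandas rejects value=None."""
    try:
        return obj.fillna(None)
    except (TypeError, ValueError):
        # Assumed to be the pre-3.0 path, where None meant "no value given".
        return None


floats = pd.Series([1.0, np.nan, 2.0])
result = fill_with_none(floats)

if result is None:
    print("this pandas version does not accept value=None")
else:
    # For a float column, None is mapped to the dtype's own NA value (NaN),
    # so the missing entry stays missing.
    print(result.isna().tolist())
```

On a version that includes this change, the float column should come back with its missing entry still missing, which matches the new docstring note that `value=None` uses the NA value of the dtype.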

--------
>>> df = pd.DataFrame(
@@ -6909,101 +6915,92 @@ def fillna(
axis = 0
axis = self._get_axis_number(axis)

if value is None:
raise ValueError("Must specify a fill 'value'.")
else:
if self.ndim == 1:
if isinstance(value, (dict, ABCSeries)):
if not len(value):
# test_fillna_nonscalar
if inplace:
return None
return self.copy(deep=False)
from pandas import Series

value = Series(value)
value = value.reindex(self.index)
value = value._values
elif not is_list_like(value):
pass
else:
raise TypeError(
'"value" parameter must be a scalar, dict '
"or Series, but you passed a "
f'"{type(value).__name__}"'
)
if self.ndim == 1:
if isinstance(value, (dict, ABCSeries)):
if not len(value):
# test_fillna_nonscalar
if inplace:
return None
return self.copy(deep=False)
from pandas import Series

value = Series(value)
value = value.reindex(self.index)
value = value._values
elif not is_list_like(value):
pass
else:
raise TypeError(
'"value" parameter must be a scalar, dict '
"or Series, but you passed a "
f'"{type(value).__name__}"'
)

new_data = self._mgr.fillna(value=value, limit=limit, inplace=inplace)
new_data = self._mgr.fillna(value=value, limit=limit, inplace=inplace)

elif isinstance(value, (dict, ABCSeries)):
if axis == 1:
raise NotImplementedError(
"Currently only can fill "
"with dict/Series column "
"by column"
)
result = self if inplace else self.copy(deep=False)
for k, v in value.items():
if k not in result:
continue
elif isinstance(value, (dict, ABCSeries)):
if axis == 1:
raise NotImplementedError(
"Currently only can fill with dict/Series column by column"
)
result = self if inplace else self.copy(deep=False)
for k, v in value.items():
if k not in result:
continue

res_k = result[k].fillna(v, limit=limit)
res_k = result[k].fillna(v, limit=limit)

if not inplace:
result[k] = res_k
if not inplace:
result[k] = res_k
else:
# We can write into our existing column(s) iff dtype
# was preserved.
if isinstance(res_k, ABCSeries):
# i.e. 'k' only shows up once in self.columns
if res_k.dtype == result[k].dtype:
result.loc[:, k] = res_k
else:
# Different dtype -> no way to do inplace.
result[k] = res_k
else:
# We can write into our existing column(s) iff dtype
# was preserved.
if isinstance(res_k, ABCSeries):
# i.e. 'k' only shows up once in self.columns
if res_k.dtype == result[k].dtype:
result.loc[:, k] = res_k
# see test_fillna_dict_inplace_nonunique_columns
locs = result.columns.get_loc(k)
if isinstance(locs, slice):
locs = np.arange(self.shape[1])[locs]
elif isinstance(locs, np.ndarray) and locs.dtype.kind == "b":
locs = locs.nonzero()[0]
elif not (
isinstance(locs, np.ndarray) and locs.dtype.kind == "i"
):
# Should never be reached, but let's cover our bases
raise NotImplementedError(
"Unexpected get_loc result, please report a bug at "
"https://github.com/pandas-dev/pandas"
)

for i, loc in enumerate(locs):
res_loc = res_k.iloc[:, i]
target = self.iloc[:, loc]

if res_loc.dtype == target.dtype:
result.iloc[:, loc] = res_loc
else:
# Different dtype -> no way to do inplace.
result[k] = res_k
else:
# see test_fillna_dict_inplace_nonunique_columns
locs = result.columns.get_loc(k)
if isinstance(locs, slice):
locs = np.arange(self.shape[1])[locs]
elif (
isinstance(locs, np.ndarray) and locs.dtype.kind == "b"
):
locs = locs.nonzero()[0]
elif not (
isinstance(locs, np.ndarray) and locs.dtype.kind == "i"
):
# Should never be reached, but let's cover our bases
raise NotImplementedError(
"Unexpected get_loc result, please report a bug at "
"https://github.com/pandas-dev/pandas"
)

for i, loc in enumerate(locs):
res_loc = res_k.iloc[:, i]
target = self.iloc[:, loc]

if res_loc.dtype == target.dtype:
result.iloc[:, loc] = res_loc
else:
result.isetitem(loc, res_loc)
if inplace:
return self._update_inplace(result)
else:
return result
result.isetitem(loc, res_loc)
if inplace:
return self._update_inplace(result)
else:
return result

elif not is_list_like(value):
if axis == 1:
result = self.T.fillna(value=value, limit=limit).T
new_data = result._mgr
else:
new_data = self._mgr.fillna(
value=value, limit=limit, inplace=inplace
)
elif isinstance(value, ABCDataFrame) and self.ndim == 2:
new_data = self.where(self.notna(), value)._mgr
elif not is_list_like(value):
if axis == 1:
result = self.T.fillna(value=value, limit=limit).T
new_data = result._mgr
else:
raise ValueError(f"invalid fill value with a {type(value)}")
new_data = self._mgr.fillna(value=value, limit=limit, inplace=inplace)
elif isinstance(value, ABCDataFrame) and self.ndim == 2:
new_data = self.where(self.notna(), value)._mgr
else:
raise ValueError(f"invalid fill value with a {type(value)}")

result = self._constructor_from_mgr(new_data, axes=new_data.axes)
if inplace:
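
How the branches in this hunk treat different kinds of `value` can be illustrated with a small example. It is a sketch that uses only public `fillna` calls; the sample data and variable names are made up, and the exact error text for the list case is deliberately not shown because it may differ between versions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan], index=["x", "y", "z"])

# A dict (or Series) is aligned to the index, so labels it does not
# mention ("z" here) are left as NA.
by_label = s.fillna({"y": 10.0})

# A list is neither a scalar nor a dict/Series, so the call is rejected.
list_rejected = False
try:
    s.fillna([0.0, 0.0, 0.0])
except TypeError:
    list_rejected = True

df = pd.DataFrame({"a": [np.nan, 1.0], "b": [np.nan, 2.0]})

# For a DataFrame, dict keys are column names; "b" is not listed and
# keeps its NA.
by_column = df.fillna({"a": 0.0})

print(by_label.tolist())
print(list_rejected)
print(by_column)
```
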
2 changes: 1 addition & 1 deletion pandas/core/indexes/base.py
@@ -2543,7 +2543,7 @@ def notna(self) -> npt.NDArray[np.bool_]:

notnull = notna

- def fillna(self, value=None):
+ def fillna(self, value):
"""
Fill NA/NaN values with the specified value.

2 changes: 1 addition & 1 deletion pandas/core/indexes/multi.py
@@ -1675,7 +1675,7 @@ def duplicated(self, keep: DropKeep = "first") -> npt.NDArray[np.bool_]:
# (previously declared in base class "IndexOpsMixin")
_duplicated = duplicated # type: ignore[misc]

- def fillna(self, value=None, downcast=None):
+ def fillna(self, value, downcast=None):
"""
fillna is not implemented for MultiIndex
"""
6 changes: 6 additions & 0 deletions pandas/tests/extension/base/missing.py
@@ -68,6 +68,12 @@ def test_fillna_scalar(self, data_missing):
expected = data_missing.fillna(valid)
tm.assert_extension_array_equal(result, expected)

def test_fillna_with_none(self, data_missing):
# GH#57723
result = data_missing.fillna(None)
expected = data_missing
tm.assert_extension_array_equal(result, expected)

def test_fillna_limit_pad(self, data_missing):
arr = data_missing.take([1, 0, 0, 0, 1])
result = pd.Series(arr).ffill(limit=2)
8 changes: 8 additions & 0 deletions pandas/tests/extension/decimal/test_decimal.py
@@ -144,6 +144,14 @@ def test_fillna_series(self, data_missing):
):
super().test_fillna_series(data_missing)

def test_fillna_with_none(self, data_missing):
# GH#57723
# EAs that don't have special logic for None will raise, unlike pandas'
# which interpret None as the NA value for the dtype.
msg = "conversion from NoneType to Decimal is not supported"
with pytest.raises(TypeError, match=msg):
super().test_fillna_with_none(data_missing)
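
The message matched above is the one raised by Python's standard `decimal` module when it is handed `None`. A minimal sketch, assuming the test `DecimalArray` ends up passing the fill value to `decimal.Decimal` (the diff does not show that code path, and `to_decimal` is a hypothetical stand-in):

```python
from decimal import Decimal


def to_decimal(value):
    """Hypothetical stand-in for an extension array's fill-value conversion."""
    return Decimal(value)


message = None
try:
    to_decimal(None)
except TypeError as exc:
    # Decimal has no conversion rule for None, so construction fails.
    message = str(exc)

print(message)
```

An extension array that wants `fillna(None)` to mean "use my NA value" would need to handle `None` before a conversion like this one runs.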

@pytest.mark.parametrize("dropna", [True, False])
def test_value_counts(self, all_data, dropna):
all_data = all_data[:10]
7 changes: 7 additions & 0 deletions pandas/tests/extension/json/test_json.py
@@ -149,6 +149,13 @@ def test_fillna_frame(self):
"""We treat dictionaries as a mapping in fillna, not a scalar."""
super().test_fillna_frame()

def test_fillna_with_none(self, data_missing):
# GH#57723
# EAs that don't have special logic for None will raise, unlike pandas'
# which interpret None as the NA value for the dtype.
with pytest.raises(AssertionError):
super().test_fillna_with_none(data_missing)

@pytest.mark.parametrize(
"limit_area, input_ilocs, expected_ilocs",
[