{{ header }}

.. _missing_data:

*************************
Working with missing data
*************************

Values considered "missing"
===========================

pandas uses different sentinel values to represent a missing value (also referred to as NA) depending on the data type.

numpy.nan for NumPy data types. The disadvantage of using NumPy data types is that the original data type will be coerced to np.float64 or object.

.. ipython:: python

   pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
   pd.Series([True, False], dtype=np.bool_).reindex([0, 1, 2])

:class:`NaT` for NumPy np.datetime64, np.timedelta64, and :class:`PeriodDtype`. For typing applications, use :class:`api.typing.NaTType`.

.. ipython:: python

   pd.Series([1, 2], dtype=np.dtype("timedelta64[ns]")).reindex([0, 1, 2])
   pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])
   pd.Series(["2020", "2020"], dtype=pd.PeriodDtype("D")).reindex([0, 1, 2])

:class:`NA` for :class:`StringDtype`, :class:`Int64Dtype` (and other bit widths), :class:`Float64Dtype` (and other bit widths), :class:`BooleanDtype` and :class:`ArrowDtype`. These types will maintain the original data type of the data. For typing applications, use :class:`api.typing.NAType`.

.. ipython:: python

   pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])
   pd.Series([True, False], dtype="boolean[pyarrow]").reindex([0, 1, 2])

To detect these missing values, use the :func:`isna` or :func:`notna` methods.

.. ipython:: python

   ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])
   ser
   pd.isna(ser)


.. note::

   :func:`isna` or :func:`notna` will also consider None a missing value.

.. ipython:: python

   ser = pd.Series([1, None], dtype=object)
   ser
   pd.isna(ser)

.. warning::

   Equality comparisons between np.nan, :class:`NaT`, and :class:`NA` do not act like None:

.. ipython:: python

   None == None  # noqa: E711
   np.nan == np.nan
   pd.NaT == pd.NaT
   pd.NA == pd.NA

Therefore, an equality comparison between a :class:`DataFrame` or :class:`Series` with one of these missing values does not provide the same information as :func:`isna` or :func:`notna`.

.. ipython:: python

   ser = pd.Series([True, None], dtype="boolean[pyarrow]")
   ser == pd.NA
   pd.isna(ser)

:class:`NA` semantics
=====================

.. warning::

   Experimental: the behaviour of :class:`NA` can still change without warning.

Starting from pandas 1.0, an experimental :class:`NA` value (singleton) is available to represent scalar missing values. The goal of :class:`NA` is to provide a "missing" indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).

For example, a :class:`Series` with the nullable integer dtype uses :class:`NA` for its missing values:

.. ipython:: python

    s = pd.Series([1, 2, None], dtype="Int64")
    s
    s[2]
    s[2] is pd.NA

Currently, pandas does not use these :class:`NA`-aware data types by default in a :class:`DataFrame` or :class:`Series`, so you need to specify the dtype explicitly. An easy way to convert to these dtypes is explained in the :ref:`conversion section <missing_data.NA.conversion>`.

Propagation in arithmetic and comparison operations
---------------------------------------------------

In general, missing values propagate in operations involving :class:`NA`. When one of the operands is unknown, the outcome of the operation is also unknown.

For example, :class:`NA` propagates in arithmetic operations, similarly to np.nan:

.. ipython:: python

   pd.NA + 1
   "a" * pd.NA

There are a few special cases when the result is known, even when one of the operands is NA.

.. ipython:: python

   pd.NA ** 0
   1 ** pd.NA

In equality and comparison operations, :class:`NA` also propagates. This deviates from the behaviour of np.nan, where comparisons with np.nan always return False.

.. ipython:: python

   pd.NA == 1
   pd.NA == pd.NA
   pd.NA < 2.5
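
For comparison, the same checks with np.nan simply return False rather than propagating a missing result:

.. ipython:: python

   np.nan == 1
   np.nan < 2.5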

To check if a value is equal to :class:`NA`, use :func:`isna`:

.. ipython:: python

   pd.isna(pd.NA)


.. note::

   Reductions (such as the mean or the minimum) are an exception to this basic propagation rule: pandas defaults to skipping missing values in them. See the :ref:`calculation section <missing_data.calculations>` for more.
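
For instance, a brief illustration of this default skipping behaviour with a nullable integer :class:`Series`:

.. ipython:: python

   s = pd.Series([1, 2, None], dtype="Int64")
   s.sum()
   s.sum(skipna=False)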

Logical operations
------------------

For logical operations, :class:`NA` follows the rules of three-valued logic (also known as Kleene logic, similar to R, SQL and Julia): missing values propagate only when this is logically required.

For example, for the logical "or" operation (|), if one of the operands is True, we already know the result will be True regardless of the other value (so regardless of whether the missing value is actually True or False). In this case, :class:`NA` does not propagate:

.. ipython:: python

   True | False
   True | pd.NA
   pd.NA | True

On the other hand, if one of the operands is False, the result depends on the value of the other operand. Therefore, in this case :class:`NA` propagates:

.. ipython:: python

   False | True
   False | False
   False | pd.NA

The behaviour of the logical "and" operation (&) can be derived using similar logic (where now :class:`NA` will not propagate if one of the operands is already False):

.. ipython:: python

   False & True
   False & False
   False & pd.NA

.. ipython:: python

   True & True
   True & False
   True & pd.NA


NA in a boolean context
-----------------------

Since the actual value of an NA is unknown, converting NA to a boolean value is ambiguous.

.. ipython:: python
   :okexcept:

   bool(pd.NA)

This also means that :class:`NA` cannot be used in a context where it is evaluated as a boolean, such as if condition: ... where condition can potentially be :class:`NA`. In such cases, :func:`isna` can be used to check for :class:`NA`, or the condition being :class:`NA` can be avoided, for example by filling missing values beforehand (see the sketch below).
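
A minimal sketch of guarding a condition that may be :class:`NA` (the condition and outcome names are only for illustration):

.. ipython:: python

   condition = pd.NA
   pd.isna(condition)
   outcome = "missing" if pd.isna(condition) else bool(condition)
   outcome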

A similar situation occurs when using :class:`Series` or :class:`DataFrame` objects in if statements, see :ref:`gotchas.truth`.

NumPy ufuncs
------------

:attr:`pandas.NA` implements NumPy's __array_ufunc__ protocol. Most ufuncs work with NA, and generally return NA:

.. ipython:: python

   np.log(pd.NA)
   np.add(pd.NA, 1)

.. warning::

   Currently, ufuncs involving an ndarray and NA will return an object-dtype ndarray filled with NA values.

.. ipython:: python

   a = np.array([1, 2, 3])
   np.greater(a, pd.NA)

The return type here may change to return a different array type in the future.

See :ref:`dsintro.numpy_interop` for more on ufuncs.

.. _missing_data.NA.conversion:

Conversion
----------

If you have a :class:`DataFrame` or :class:`Series` using np.nan, :meth:`DataFrame.convert_dtypes` and :meth:`Series.convert_dtypes`, respectively, will convert your data to use the nullable data types supporting :class:`NA`, such as :class:`Int64Dtype` or :class:`ArrowDtype`. This is especially helpful after reading in data sets from IO methods where data types were inferred.

In this example, the data types of both columns are converted to nullable dtypes supporting :class:`NA`.

.. ipython:: python

   import io
   data = io.StringIO("a,b\n,True\n2,")
   df = pd.read_csv(data)
   df.dtypes
   df_conv = df.convert_dtypes()
   df_conv
   df_conv.dtypes

Inserting missing data
======================

You can insert missing values by simply assigning to a :class:`Series` or :class:`DataFrame`. The missing value sentinel used will be chosen based on the dtype.

.. ipython:: python

   ser = pd.Series([1., 2., 3.])
   ser.loc[0] = None
   ser

   ser = pd.Series([pd.Timestamp("2021"), pd.Timestamp("2021")])
   ser.iloc[0] = np.nan
   ser

   ser = pd.Series([True, False], dtype="boolean[pyarrow]")
   ser.iloc[0] = None
   ser

For object types, pandas will use the value given:

.. ipython:: python

   s = pd.Series(["a", "b", "c"], dtype=object)
   s.loc[0] = None
   s.loc[1] = np.nan
   s

.. _missing_data.calculations:

Calculations with missing data
==============================

Missing values propagate through arithmetic operations between pandas objects.

.. ipython:: python

   ser1 = pd.Series([np.nan, np.nan, 2, 3])
   ser2 = pd.Series([np.nan, 1, np.nan, 4])
   ser1
   ser2
   ser1 + ser2

The descriptive statistics and computational methods discussed in the :ref:`data structure overview <basics.stats>` (and listed :ref:`here <api.series.stats>` and :ref:`here <api.dataframe.stats>`) all account for missing data.

When summing data, NA values or empty data will be treated as zero.

.. ipython:: python

   pd.Series([np.nan]).sum()
   pd.Series([], dtype="float64").sum()

When taking the product, NA values or empty data will be treated as 1.

.. ipython:: python

   pd.Series([np.nan]).prod()
   pd.Series([], dtype="float64").prod()

Cumulative methods like :meth:`~DataFrame.cumsum` and :meth:`~DataFrame.cumprod` ignore NA values by default, but preserve them in the result. This behavior can be changed with the skipna argument:

.. ipython:: python

   ser = pd.Series([1, np.nan, 3, np.nan])
   ser
   ser.cumsum()
   ser.cumsum(skipna=False)

Dropping missing data
=====================

:meth:`~DataFrame.dropna` drops rows or columns with missing data.

.. ipython:: python

   df = pd.DataFrame([[np.nan, 1, 2], [1, 2, np.nan], [1, 2, 3]])
   df
   df.dropna()
   df.dropna(axis=1)

   ser = pd.Series([1, pd.NA], dtype="int64[pyarrow]")
   ser.dropna()

Filling missing data
====================

Filling by value
----------------

:meth:`~DataFrame.fillna` replaces NA values with non-NA data.

Replace NA with a scalar value

.. ipython:: python

   data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")}
   df = pd.DataFrame(data)
   df
   df.fillna(0)

When the data has object dtype, you can control what type of NA values are present.

.. ipython:: python

   df = pd.DataFrame({"a": [pd.NA, np.nan, None]}, dtype=object)
   df
   df.fillna(None)
   df.fillna(np.nan)
   df.fillna(pd.NA)

However, when the dtype is not object, these will all be replaced with the proper NA value for the dtype.

.. ipython:: python

   data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")}
   df = pd.DataFrame(data)
   df
   df.fillna(None)
   df.fillna(np.nan)
   df.fillna(pd.NA)

Fill gaps forward or backward

.. ipython:: python

   df.ffill()
   df.bfill()

Limit the number of NA values filled

.. ipython:: python

   df.ffill(limit=1)

NA values can be replaced with the corresponding values from a :class:`Series` or :class:`DataFrame` where the index and columns align between the original object and the filled object.

.. ipython:: python

   dff = pd.DataFrame(np.arange(30, dtype=np.float64).reshape(10, 3), columns=list("ABC"))
   dff.iloc[3:5, 0] = np.nan
   dff.iloc[4:6, 1] = np.nan
   dff.iloc[5:8, 2] = np.nan
   dff
   dff.fillna(dff.mean())

.. note::

   :meth:`DataFrame.where` can also be used to fill NA values, giving the same result as above.

.. ipython:: python

   dff.where(pd.notna(dff), dff.mean(), axis="columns")

Interpolation
-------------

:meth:`DataFrame.interpolate` and :meth:`Series.interpolate` fill NA values using various interpolation methods.

.. ipython:: python

   df = pd.DataFrame(
       {
           "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
           "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
       }
   )
   df
   df.interpolate()

   idx = pd.date_range("2020-01-01", periods=10, freq="D")
   data = np.random.default_rng(2).integers(0, 10, 10).astype(np.float64)
   ts = pd.Series(data, index=idx)
   ts.iloc[[1, 2, 5, 6, 9]] = np.nan

   ts
   @savefig series_before_interpolate.png
   ts.plot()

.. ipython:: python

   ts.interpolate()
   @savefig series_interpolate.png
   ts.interpolate().plot()

Interpolation relative to a :class:`Timestamp` in the :class:`DatetimeIndex` is available by setting method="time":

.. ipython:: python

   ts2 = ts.iloc[[0, 1, 3, 7, 9]]
   ts2
   ts2.interpolate()
   ts2.interpolate(method="time")

For a floating-point index, use method='values':

.. ipython:: python

   idx = [0.0, 1.0, 10.0]
   ser = pd.Series([0.0, np.nan, 10.0], idx)
   ser
   ser.interpolate()
   ser.interpolate(method="values")

If you have scipy installed, you can pass the name of a 1-d interpolation routine to method, as specified in the scipy interpolation documentation and reference guide. The appropriate interpolation method will depend on the data type.

.. tip::

   If you are dealing with a time series that is growing at an increasing rate, use method='barycentric'.

   If you have values approximating a cumulative distribution function, use method='pchip'.

   To fill missing values with the goal of smooth plotting, use method='akima'.

.. ipython:: python

   df = pd.DataFrame(
      {
         "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
         "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
      }
   )
   df
   df.interpolate(method="barycentric")
   df.interpolate(method="pchip")
   df.interpolate(method="akima")

When interpolating via a polynomial or spline approximation, you must also specify the degree or order of the approximation:

.. ipython:: python

   df.interpolate(method="spline", order=2)
   df.interpolate(method="polynomial", order=2)

Comparing several methods.

.. ipython:: python

   np.random.seed(2)

   ser = pd.Series(np.arange(1, 10.1, 0.25) ** 2 + np.random.randn(37))
   missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])
   ser.iloc[missing] = np.nan
   methods = ["linear", "quadratic", "cubic"]

   df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})
   @savefig compare_interpolations.png
   df.plot()

Interpolate at new observations by expanding the index with :meth:`Series.reindex`:

.. ipython:: python

   ser = pd.Series(np.sort(np.random.uniform(size=100)))

   # interpolate at new_index
   new_index = ser.index.union(pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75]))
   interp_s = ser.reindex(new_index).interpolate(method="pchip")
   interp_s.loc[49:51]

Interpolation limits
~~~~~~~~~~~~~~~~~~~~

:meth:`~DataFrame.interpolate` accepts a limit keyword argument to cap the number of consecutive NaN values filled since the last valid observation:

.. ipython:: python

   ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])
   ser
   ser.interpolate()
   ser.interpolate(limit=1)

By default, NaN values are filled in a forward direction. Use the limit_direction parameter to fill backward or from both directions.

.. ipython:: python

   ser.interpolate(limit=1, limit_direction="backward")
   ser.interpolate(limit=1, limit_direction="both")
   ser.interpolate(limit_direction="both")

By default, NaN values are filled whether they are surrounded by existing valid values or outside existing valid values. The limit_area parameter restricts filling to either inside or outside values.

.. ipython:: python

   # fill one consecutive inside value in both directions
   ser.interpolate(limit_direction="both", limit_area="inside", limit=1)

   # fill all consecutive outside values backward
   ser.interpolate(limit_direction="backward", limit_area="outside")

   # fill all consecutive outside values in both directions
   ser.interpolate(limit_direction="both", limit_area="outside")

Replacing values
----------------

:meth:`Series.replace` and :meth:`DataFrame.replace` can be used similarly to :meth:`Series.fillna` and :meth:`DataFrame.fillna` to replace or insert missing values.

.. ipython:: python

   df = pd.DataFrame(np.eye(3))
   df
   df_missing = df.replace(0, np.nan)
   df_missing
   df_filled = df_missing.replace(np.nan, 2)
   df_filled

Replacing more than one value is possible by passing a list.

.. ipython:: python

   df_filled.replace([1, 44], [2, 28])

Replacing using a mapping dict.

.. ipython:: python

   df_filled.replace({1: 44, 2: 28})

Regular expression replacement
------------------------------

.. note::

   Python strings prefixed with the r character such as r'hello world' are "raw" strings. They have different semantics regarding backslashes than strings without this prefix. Backslashes in a raw string are kept as literal backslashes (equivalent to an escaped backslash in a regular string), e.g., r'\n' == '\\n'.
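
As a quick illustration in plain Python (not pandas-specific):

.. ipython:: python

   r"\s*\.\s*" == "\\s*\\.\\s*"
   len(r"\n"), len("\n")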

Replace the '.' with NaN

.. ipython:: python

   d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]}
   df = pd.DataFrame(d)
   df.replace(".", np.nan)

Replace the '.' with NaN, using a regular expression that removes surrounding whitespace:

.. ipython:: python

   df.replace(r"\s*\.\s*", np.nan, regex=True)

Replace with a list of regexes.

.. ipython:: python

   df.replace([r"\.", r"(a)"], ["dot", r"\1stuff"], regex=True)

Replace with a regex in a mapping dict.

.. ipython:: python

   df.replace({"b": r"\s*\.\s*"}, {"b": np.nan}, regex=True)

Pass nested dictionaries of regular expressions that use the regex keyword.

.. ipython:: python

   df.replace({"b": {"b": r""}}, regex=True)
   df.replace(regex={"b": {r"\s*\.\s*": np.nan}})
   df.replace({"b": r"\s*(\.)\s*"}, {"b": r"\1ty"}, regex=True)

Pass a list of regular expressions that will replace matches with a scalar.

.. ipython:: python

   df.replace([r"\s*\.\s*", r"a|b"], "placeholder", regex=True)

All of the regular expression examples can also be passed via the regex argument instead of the to_replace argument. In this case the value argument must be passed explicitly by name, or regex must be a nested dictionary.

.. ipython:: python

   df.replace(regex=[r"\s*\.\s*", r"a|b"], value="placeholder")

.. note::

   A regular expression object from re.compile is a valid input as well.
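
For instance, a minimal sketch reusing the df defined above (the pattern shown is only illustrative):

.. ipython:: python

   import re
   df.replace(regex=re.compile(r"\s*\.\s*"), value=np.nan)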