diff --git a/doc/source/gotchas.rst b/doc/source/gotchas.rst
index 2b42eebf762e1..3d89fe171a343 100644
--- a/doc/source/gotchas.rst
+++ b/doc/source/gotchas.rst
@@ -215,8 +215,28 @@ arrays. For example:
    s2.dtype
 
 This trade-off is made largely for memory and performance reasons, and also so
-that the resulting ``Series`` continues to be "numeric". One possibility is to
-use ``dtype=object`` arrays instead.
+that the resulting ``Series`` continues to be "numeric".
+
+If you need to represent integers with possibly missing values, use one of
+the nullable-integer extension dtypes provided by pandas:
+
+* :class:`Int8Dtype`
+* :class:`Int16Dtype`
+* :class:`Int32Dtype`
+* :class:`Int64Dtype`
+
+.. ipython:: python
+
+   s_int = pd.Series([1, 2, 3, 4, 5], index=list('abcde'),
+                     dtype=pd.Int64Dtype())
+   s_int
+   s_int.dtype
+
+   s2_int = s_int.reindex(['a', 'b', 'c', 'f', 'u'])
+   s2_int
+   s2_int.dtype
+
+See :ref:`integer_na` for more.
 
 ``NA`` type promotions
 ~~~~~~~~~~~~~~~~~~~~~~
diff --git a/doc/source/index.rst.template b/doc/source/index.rst.template
index 7fe26d68c6fd0..145acbc55b85f 100644
--- a/doc/source/index.rst.template
+++ b/doc/source/index.rst.template
@@ -143,6 +143,7 @@ See the package overview for more detail about what's in the library.
     timeseries
     timedeltas
     categorical
+    integer_na
     visualization
     style
     io
diff --git a/doc/source/integer_na.rst b/doc/source/integer_na.rst
new file mode 100644
index 0000000000000..befcf7016f155
--- /dev/null
+++ b/doc/source/integer_na.rst
@@ -0,0 +1,101 @@
+.. currentmodule:: pandas
+
+{{ header }}
+
+ .. _integer_na:
+
+**************************
+Nullable Integer Data Type
+**************************
+
+.. versionadded:: 0.24.0
+
+In :ref:`missing_data`, we saw that pandas primarily uses ``NaN`` to represent
+missing data. Because ``NaN`` is a float, this forces an array of integers with
+any missing values to become floating point. In some cases, this may not matter
+much. But if your integer column is, say, an identifier, casting to float can
+be problematic. Some integers cannot even be represented as floating point
+numbers.
+
+Pandas can represent integer data with possibly missing values using
+:class:`arrays.IntegerArray`. This is an :ref:`extension type <extending.extension-types>`
+implemented within pandas. It is not the default dtype for integers, and will not be inferred;
+you must explicitly pass the dtype into :meth:`array` or :class:`Series`:
+
+.. ipython:: python
+
+   arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
+   arr
+
+Or the string alias ``"Int64"`` (note the capital ``"I"``, to differentiate from
+NumPy's ``'int64'`` dtype):
+
+.. ipython:: python
+
+   pd.array([1, 2, np.nan], dtype="Int64")
+
+This array can be stored in a :class:`DataFrame` or :class:`Series` like any
+NumPy array.
+
+.. ipython:: python
+
+   pd.Series(arr)
+
+You can also pass the list-like object to the :class:`Series` constructor
+with the dtype.
+
+.. ipython:: python
+
+   s = pd.Series([1, 2, np.nan], dtype="Int64")
+   s
+
+By default (if you don't specify ``dtype``), NumPy is used, and you'll end
+up with a ``float64`` dtype Series:
+
+.. ipython:: python
+
+   pd.Series([1, 2, np.nan])
+
+Operations involving an integer array will behave similarly to NumPy arrays.
+Missing values will be propagated, and the data will be coerced to another
+dtype if needed.
+
+.. ipython:: python
+
+   # arithmetic
+   s + 1
+
+   # comparison
+   s == 1
+
+   # indexing
+   s.iloc[1:3]
+
+   # operate with other dtypes
+   s + s.iloc[1:3].astype('Int8')
+
+   # coerce when needed
+   s + 0.01
+
+These dtypes can operate as part of a ``DataFrame``.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})
+   df
+   df.dtypes
+
+
+These dtypes can be merged, reshaped, and cast.
+
+.. ipython:: python
+
+   pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
+   df['A'].astype(float)
+
+Reduction and groupby operations such as 'sum' work as well.
+
+.. ipython:: python
+
+   df.sum()
+   df.groupby('B').A.sum()
diff --git a/doc/source/missing_data.rst b/doc/source/missing_data.rst
index 6a089decde3f5..a9234b83c78ab 100644
--- a/doc/source/missing_data.rst
+++ b/doc/source/missing_data.rst
@@ -19,32 +19,6 @@ pandas.
 
 See the :ref:`cookbook` for some advanced strategies.
 
-Missing data basics
--------------------
-
-When / why does data become missing?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Some might quibble over our usage of *missing*. By "missing" we simply mean
-**NA** ("not available") or "not present for whatever reason". Many data sets simply arrive with
-missing data, either because it exists and was not collected or it never
-existed. For example, in a collection of financial time series, some of the time
-series might start on different dates. Thus, values prior to the start date
-would generally be marked as missing.
-
-In pandas, one of the most common ways that missing data is **introduced** into
-a data set is by reindexing. For example:
-
-.. ipython:: python
-
-   df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
-                     columns=['one', 'two', 'three'])
-   df['four'] = 'bar'
-   df['five'] = df['one'] > 0
-   df
-   df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
-   df2
-
 Values considered "missing"
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -62,6 +36,16 @@ arise and we wish to also consider that "missing" or "not available" or "NA".
 
 .. _missing.isna:
 
+.. ipython:: python
+
+   df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
+                     columns=['one', 'two', 'three'])
+   df['four'] = 'bar'
+   df['five'] = df['one'] > 0
+   df
+   df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
+   df2
+
 To make detecting missing values easier (and across different array dtypes),
 pandas provides the :func:`isna` and
 :func:`notna` functions, which are also methods on
@@ -90,6 +74,23 @@ Series and DataFrame objects:
 
    df2['one'] == np.nan
 
+Integer Dtypes and Missing Data
+-------------------------------
+
+Because ``NaN`` is a float, a column of integers with even one missing value
+is cast to floating-point dtype (see :ref:`gotchas.intna` for more). Pandas
+provides a nullable integer array, which can be used by explicitly requesting
+the dtype:
+
+.. ipython:: python
+
+   pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
+
+Alternatively, the string alias ``dtype='Int64'`` (note the capital ``"I"``) can be
+used.
+
+See :ref:`integer_na` for more.
+
 Datetimes
 ---------
 
@@ -751,3 +752,19 @@ However, these can be filled in using :meth:`~DataFrame.fillna` and it will work
 
    reindexed[crit.fillna(False)]
    reindexed[crit.fillna(True)]
+
+Pandas provides a nullable integer dtype, but you must explicitly request it
+when creating the series or column. Notice that we use a capital "I" in
+``dtype="Int64"``.
+
+.. ipython:: python
+
+   s = pd.Series([-2, -1, 0, 1, 2], index=[0, 2, 4, 6, 7],
+                 dtype="Int64")
+   s > 0
+   (s > 0).dtype
+   crit = (s > 0).reindex(list(range(8)))
+   crit
+   crit.dtype
+
+See :ref:`integer_na` for more.
diff --git a/doc/source/whatsnew/v0.24.0.rst b/doc/source/whatsnew/v0.24.0.rst
index 1734666efe669..c0860ba7993bf 100644
--- a/doc/source/whatsnew/v0.24.0.rst
+++ b/doc/source/whatsnew/v0.24.0.rst
@@ -159,7 +159,9 @@ Reduction and groupby operations such as 'sum' work.
 
 .. warning::
 
-   The Integer NA support currently uses the captilized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date.
+   The Integer NA support currently uses the capitalized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date.
+
+See :ref:`integer_na` for more.
 
 .. _whatsnew_0240.enhancements.array:
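
The statement in ``integer_na.rst`` that some integers cannot be represented as floating point numbers can be checked without pandas. The following is a minimal sketch in plain Python, assuming IEEE 754 double-precision floats; the ``identifier`` name is illustrative and does not appear in the patch.

```python
# Hypothetical identifier just above 2**53, the point past which a
# double-precision float can no longer represent every integer exactly.
identifier = 2**53 + 1

# Casting to float rounds to the nearest representable value.
as_float = float(identifier)
round_tripped = int(as_float)

print(identifier)                   # 9007199254740993
print(round_tripped)                # 9007199254740992
print(identifier == round_tripped)  # False
```

This silent off-by-one is why a float cast is risky for identifier columns, and why an integer-backed dtype such as ``Int64`` is the safer choice when values may be missing.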