From 06f8568b56495c6c6487322ab501fff3354a8c12 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Sat, 10 Nov 2018 07:26:19 -0600 Subject: [PATCH 1/6] wip --- doc/source/index.rst.template | 1 + doc/source/integer_na.rst | 82 +++++++++++++++++++++++++++++++++++ 2 files changed, 83 insertions(+) create mode 100644 doc/source/integer_na.rst diff --git a/doc/source/index.rst.template b/doc/source/index.rst.template index d2b88e794e51e..07cd992d74e18 100644 --- a/doc/source/index.rst.template +++ b/doc/source/index.rst.template @@ -139,6 +139,7 @@ See the package overview for more detail about what's in the library. timeseries timedeltas categorical + integer_na visualization style io diff --git a/doc/source/integer_na.rst b/doc/source/integer_na.rst new file mode 100644 index 0000000000000..a5f28c6b19b97 --- /dev/null +++ b/doc/source/integer_na.rst @@ -0,0 +1,82 @@ +.. currentmodule:: pandas + +.. ipython:: python + :suppress: + + import numpy as np + import pandas as pd + + .. _integer_na: + +******************************** +Integer Data with Missing Values +******************************** + +.. versionadded:: 0.24.0 + +In :ref:`missing_data`, we say that pandas primarily uses ``NaN`` to represent +missing data. Because ``NaN`` is a float, this forces an array of integers with +any missing values to become floating point. In some cases, this may not matter +much. But if your integer column is, say, and identifier, casting to float can +lead to bad outcomes. + +Pandas can represent integer data with missing values with the +:class:`arrays.IntegerArray` array. This is an :ref:`extension types ` +implemented within pandas. It is not the default dtype and will not be inferred, +you must explicitly create an :class:`api.extensions.IntegerArray` using :func:`integer_array`. + +.. ipython:: python + + arr = integer_array([1, 2, np.nan]) + arr + +This array can be stored in a :class:`DataFrame` or :class:`Series` like any +NumPy array. + +.. 
ipython:: python + + pd.Series(arr) + +Alternatively, you can instruct pandas to treat an array-like as an +:class:`api.extensions.IntegerArray` by specifying a dtype with a capital "I". + +.. ipython:: python + + s = pd.Series([1, 2, np.nan], dtype="Int64") + s + +Note that by default (if you don't specify `dtype`), NumPy is used, and you'll end +up with a ``float64`` dtype Series: + +.. ipython:: python + + pd.Series([1, 2, np.nan]) + + +Operations involving an integer array will behave similar to NumPy arrays. +Missing values will be propagated, and and the data will be coerced to another +dtype if needed. + +.. ipython:: python + + # arithmetic + s + 1 + + # comparison + s == 1 + + # indexing + s.iloc[1:3] + + # operate with other dtypes + s + s.iloc[1:3].astype('Int8') + + # coerce when needed + s + 0.01 + +Reduction and groupby operations such as 'sum' work as well. + +.. ipython:: python + + df.sum() + df.groupby('B').A.sum() From 9e505a89c391068966e34eb968d2f8e37811cb9c Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Sat, 10 Nov 2018 14:56:23 -0600 Subject: [PATCH 2/6] DOC: Integer NA Closes https://github.com/pandas-dev/pandas/issues/22003 --- doc/source/basics.rst | 5 ++++- doc/source/gotchas.rst | 13 +++++++++++-- doc/source/integer_na.rst | 31 ++++++++++++++++++------------- doc/source/missing_data.rst | 16 ++++++++++++++++ doc/source/whatsnew/v0.24.0.txt | 4 +++- 5 files changed, 52 insertions(+), 17 deletions(-) diff --git a/doc/source/basics.rst b/doc/source/basics.rst index d19fcedf4e766..4d69f6dcd4fd1 100644 --- a/doc/source/basics.rst +++ b/doc/source/basics.rst @@ -84,7 +84,10 @@ unlike the axis labels, cannot be assigned to. When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only - floats and integers, the resulting array will be of float dtype. 
+ floats and integers, the resulting array will be of float dtype. If you + need to store an integer column with missing data, use one of the "Int" + dtypes (``"Int8"``, ``"Int16"``, ``"Int32"``, ``"Int64"``). See + :ref:`integer_na` for more. .. _basics.accelerate: diff --git a/doc/source/gotchas.rst b/doc/source/gotchas.rst index 0eb2a4eed8581..f858f0af00104 100644 --- a/doc/source/gotchas.rst +++ b/doc/source/gotchas.rst @@ -228,8 +228,17 @@ arrays. For example: s2.dtype This trade-off is made largely for memory and performance reasons, and also so -that the resulting ``Series`` continues to be "numeric". One possibility is to -use ``dtype=object`` arrays instead. +that the resulting ``Series`` continues to be "numeric". + +If you need to represent integers with possibly missing values, use one of +the ``"Int"`` dtypes provided by pandas + +* :class:`Int8Dtype` +* :class:`Int16Dtype` +* :class:`Int32Dtype` +* :class:`Int64Dtype` + +See :ref:`integer_na` for more. ``NA`` type promotions ~~~~~~~~~~~~~~~~~~~~~~ diff --git a/doc/source/integer_na.rst b/doc/source/integer_na.rst index a5f28c6b19b97..1b2340e627647 100644 --- a/doc/source/integer_na.rst +++ b/doc/source/integer_na.rst @@ -8,27 +8,33 @@ .. _integer_na: -******************************** -Integer Data with Missing Values -******************************** +************************** +Nullable Integer Data Type +************************** .. versionadded:: 0.24.0 -In :ref:`missing_data`, we say that pandas primarily uses ``NaN`` to represent +In :ref:`missing_data`, we saw that pandas primarily uses ``NaN`` to represent missing data. Because ``NaN`` is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, and identifier, casting to float can -lead to bad outcomes. +be problematic. Pandas can represent integer data with missing values with the :class:`arrays.IntegerArray` array. 
This is an :ref:`extension types ` -implemented within pandas. It is not the default dtype and will not be inferred, -you must explicitly create an :class:`api.extensions.IntegerArray` using :func:`integer_array`. +implemented within pandas. It is not the default dtype for integers, and will not be inferred; +you must explicitly pass the dtype into the :meth:`array` or :class:`Series` method: .. ipython:: python - arr = integer_array([1, 2, np.nan]) - arr + pd.array([1, 2, np.nan], dtype=pd.Int64Dtype()) + +Or the string alias "Int64" (note the capital ``"I"``, to differentiate from +NumPy's ``'int64'`` dtype: + +.. ipython:: python + + pd.array([1, 2, np.nan], dtype="Int64") This array can be stored in a :class:`DataFrame` or :class:`Series` like any NumPy array. @@ -37,22 +43,21 @@ NumPy array. pd.Series(arr) -Alternatively, you can instruct pandas to treat an array-like as an -:class:`api.extensions.IntegerArray` by specifying a dtype with a capital "I". +You can also pass the list-like object to the :class:`Series` constructor +with the dtype. .. ipython:: python s = pd.Series([1, 2, np.nan], dtype="Int64") s -Note that by default (if you don't specify `dtype`), NumPy is used, and you'll end +By default (if you don't specify ``dtype``), NumPy is used, and you'll end up with a ``float64`` dtype Series: .. ipython:: python pd.Series([1, 2, np.nan]) - Operations involving an integer array will behave similar to NumPy arrays. Missing values will be propagated, and and the data will be coerced to another dtype if needed. 
diff --git a/doc/source/missing_data.rst b/doc/source/missing_data.rst index 4864637691607..b3323eb131317 100644 --- a/doc/source/missing_data.rst +++ b/doc/source/missing_data.rst @@ -760,3 +760,19 @@ However, these can be filled in using :meth:`~DataFrame.fillna` and it will work reindexed[crit.fillna(False)] reindexed[crit.fillna(True)] + +Pandas provides a nullable integer dtype, but you must explicitly request it +when creating the series or column. Notice that we use a capital "I" in +the ``dtype="Int64"``. + +.. ipython:: python + + s = pd.Series([1, -2, 3, -4, 5], index=[0, 2, 4, 6, 7], + dtype="Int64") + s > 0 + (s > 0).dtype + crit = (s > 0).reindex(list(range(8))) + crit + crit.dtype + +See :ref:`integer_na` for more. diff --git a/doc/source/whatsnew/v0.24.0.txt b/doc/source/whatsnew/v0.24.0.txt index eb20d5368ef15..bdbd31860c71c 100644 --- a/doc/source/whatsnew/v0.24.0.txt +++ b/doc/source/whatsnew/v0.24.0.txt @@ -99,7 +99,9 @@ Reduction and groupby operations such as 'sum' work. .. warning:: - The Integer NA support currently uses the captilized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date. + The Integer NA support currently uses the capitalized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date. + +See :ref:`integer_na` for more. .. _whatsnew_0240.enhancements.read_html: From 4ae4f8d03d8e1917585b3722f6920867958bbb7f Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Sat, 10 Nov 2018 15:11:28 -0600 Subject: [PATCH 3/6] subsection --- doc/source/missing_data.rst | 72 ++++++------------------------------- 1 file changed, 11 insertions(+), 61 deletions(-) diff --git a/doc/source/missing_data.rst b/doc/source/missing_data.rst index b3323eb131317..0ba0eb342f379 100644 --- a/doc/source/missing_data.rst +++ b/doc/source/missing_data.rst @@ -29,76 +29,26 @@ pandas. See the :ref:`cookbook` for some advanced strategies. 
-Missing data basics -------------------- +Integer Dtypes and Missing Data +------------------------------- -When / why does data become missing? -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Some might quibble over our usage of *missing*. By "missing" we simply mean -**NA** ("not available") or "not present for whatever reason". Many data sets simply arrive with -missing data, either because it exists and was not collected or it never -existed. For example, in a collection of financial time series, some of the time -series might start on different dates. Thus, values prior to the start date -would generally be marked as missing. - -In pandas, one of the most common ways that missing data is **introduced** into -a data set is by reindexing. For example: +Because ``NaN`` is a float, a column of integers with even one missing value +is cast to floating-point dtype (see :ref:`gotchas.intna` for more). Pandas +provides a nullable integer array, which can be used by explicitly requesting +the dtype: .. ipython:: python - df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], - columns=['one', 'two', 'three']) - df['four'] = 'bar' - df['five'] = df['one'] > 0 - df - df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']) - df2 - -Values considered "missing" -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -As data comes in many shapes and forms, pandas aims to be flexible with regard -to handling missing data. While ``NaN`` is the default missing value marker for -reasons of computational speed and convenience, we need to be able to easily -detect this value with data of different types: floating point, integer, -boolean, and general object. In many cases, however, the Python ``None`` will -arise and we wish to also consider that "missing" or "not available" or "NA". - -.. note:: - - If you want to consider ``inf`` and ``-inf`` to be "NA" in computations, - you can set ``pandas.options.mode.use_inf_as_na = True``. - -.. 
_missing.isna: + pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype()) -To make detecting missing values easier (and across different array dtypes), -pandas provides the :func:`isna` and -:func:`notna` functions, which are also methods on -Series and DataFrame objects: +Alternatively, the string alias ``'Int64'`` (note the capital ``"I"``) can be +used: .. ipython:: python - df2['one'] - pd.isna(df2['one']) - df2['four'].notna() - df2.isna() + pd.Series([1, 2, np.nan, 4], dtype="Int64") -.. warning:: - - One has to be mindful that in Python (and NumPy), the ``nan's`` don't compare equal, but ``None's`` **do**. - Note that pandas/NumPy uses the fact that ``np.nan != np.nan``, and treats ``None`` like ``np.nan``. - - .. ipython:: python - - None == None - np.nan == np.nan - - So as compared to above, a scalar equality comparison versus a ``None/np.nan`` doesn't provide useful information. - - .. ipython:: python - - df2['one'] == np.nan +See :ref:`integer_na` for more. Datetimes --------- From 8d1d026a93fd4acccf1a22af3c1878b5d683b376 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Sat, 8 Dec 2018 14:26:29 -0600 Subject: [PATCH 4/6] update --- doc/source/basics.rst | 5 +---- doc/source/gotchas.rst | 11 +++++++++++ doc/source/integer_na.rst | 13 +++++++------ 3 files changed, 19 insertions(+), 10 deletions(-) diff --git a/doc/source/basics.rst b/doc/source/basics.rst index 1b7661a54d865..25e2c8cd1ff9a 100644 --- a/doc/source/basics.rst +++ b/doc/source/basics.rst @@ -114,10 +114,7 @@ unlike the axis labels, cannot be assigned to. When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only - floats and integers, the resulting array will be of float dtype. If you - need to store an integer column with missing data, use one of the "Int" - dtypes (``"Int8"``, ``"Int16"``, ``"Int32"``, ``"Int64"``). 
See - :ref:`integer_na` for more. + floats and integers, the resulting array will be of float dtype. In the past, pandas recommended :attr:`Series.values` or :attr:`DataFrame.values` for extracting the data from a Series or DataFrame. You'll still find references diff --git a/doc/source/gotchas.rst b/doc/source/gotchas.rst index ebb0f1849d6a2..880e462a749c7 100644 --- a/doc/source/gotchas.rst +++ b/doc/source/gotchas.rst @@ -234,6 +234,17 @@ the ``"Int"`` dtypes provided by pandas * :class:`Int32Dtype` * :class:`Int64Dtype` +.. ipython:: python + + s_int = pd.Series([1, 2, 3, 4, 5], index=list('abcde'), + dtype=pd.Int64()) + s_int + s_int.dtype + + s2_int = s_int.reindex(['a', 'b', 'c', 'f', 'u']) + s2_int + s2_int.dtype + See :ref:`integer_na` for more. ``NA`` type promotions diff --git a/doc/source/integer_na.rst b/doc/source/integer_na.rst index 1b2340e627647..aed7c898fe086 100644 --- a/doc/source/integer_na.rst +++ b/doc/source/integer_na.rst @@ -17,19 +17,20 @@ Nullable Integer Data Type In :ref:`missing_data`, we saw that pandas primarily uses ``NaN`` to represent missing data. Because ``NaN`` is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter -much. But if your integer column is, say, and identifier, casting to float can -be problematic. +much. But if your integer column is, say, an identifier, casting to float can +be problematic. Some integers cannot even be represented as floating point +numbers. -Pandas can represent integer data with missing values with the -:class:`arrays.IntegerArray` array. This is an :ref:`extension types ` +Pandas can represent integer data with possibly missing values using +:class:`arrays.IntegerArray`. This is an :ref:`extension types ` implemented within pandas. 
It is not the default dtype for integers, and will not be inferred; -you must explicitly pass the dtype into the :meth:`array` or :class:`Series` method: +you must explicitly pass the dtype into :meth:`array` or :class:`Series`: .. ipython:: python pd.array([1, 2, np.nan], dtype=pd.Int64Dtype()) -Or the string alias "Int64" (note the capital ``"I"``, to differentiate from +Or the string alias ``"Int64"`` (note the capital ``"I"``, to differentiate from NumPy's ``'int64'`` dtype: .. ipython:: python From 51c4353f39f31f08757b68a0d29101bf156486f5 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Thu, 20 Dec 2018 14:14:14 -0600 Subject: [PATCH 5/6] fixup --- doc/source/integer_na.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/doc/source/integer_na.rst b/doc/source/integer_na.rst index 383bddb4ba9cc..d8f637ce59daf 100644 --- a/doc/source/integer_na.rst +++ b/doc/source/integer_na.rst @@ -24,7 +24,8 @@ you must explicitly pass the dtype into :meth:`array` or :class:`Series`: .. ipython:: python - pd.array([1, 2, np.nan], dtype=pd.Int64Dtype()) + arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype()) + arr Or the string alias ``"Int64"`` (note the capital ``"I"``, to differentiate from NumPy's ``'int64'`` dtype: From 0c9995f3ba1ab50b82eaa2e6c08219e7a4069134 Mon Sep 17 00:00:00 2001 From: Jeff Reback Date: Mon, 31 Dec 2018 19:03:58 -0500 Subject: [PATCH 6/6] add back construction for docs --- doc/source/integer_na.rst | 16 ++++++++++++++++ doc/source/missing_data.rst | 10 ++++++++++ 2 files changed, 26 insertions(+) diff --git a/doc/source/integer_na.rst b/doc/source/integer_na.rst index d8f637ce59daf..befcf7016f155 100644 --- a/doc/source/integer_na.rst +++ b/doc/source/integer_na.rst @@ -77,6 +77,22 @@ dtype if needed. # coerce when needed s + 0.01 +These dtypes can operate as part of of ``DataFrame``. + +.. 
ipython:: python + + df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')}) + df + df.dtypes + + +These dtypes can be merged, reshaped, and cast. + +.. ipython:: python + + pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes + df['A'].astype(float) + Reduction and groupby operations such as 'sum' work as well. .. ipython:: python diff --git a/doc/source/missing_data.rst b/doc/source/missing_data.rst index 30256c5f6cfa2..a9234b83c78ab 100644 --- a/doc/source/missing_data.rst +++ b/doc/source/missing_data.rst @@ -36,6 +36,16 @@ arise and we wish to also consider that "missing" or "not available" or "NA". .. _missing.isna: +.. ipython:: python + + df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], + columns=['one', 'two', 'three']) + df['four'] = 'bar' + df['five'] = df['one'] > 0 + df + df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']) + df2 + To make detecting missing values easier (and across different array dtypes), pandas provides the :func:`isna` and :func:`notna` functions, which are also methods on