Integer NA docs #23617
Changes from 3 commits
06f8568
9e505a8
4ae4f8d
a6a7ba7
8d1d026
15a7b65
51c4353
0ef696d
d2a624d
0c9995f
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,87 @@ | ||||||
.. currentmodule:: pandas | ||||||
|
||||||
.. ipython:: python | ||||||
:suppress: | ||||||
|
||||||
import numpy as np | ||||||
import pandas as pd | ||||||
|
||||||
|
||||||
.. _integer_na: | ||||||
|
||||||
************************** | ||||||
Nullable Integer Data Type | ||||||
************************** | ||||||
|
||||||
.. versionadded:: 0.24.0 | ||||||
|
||||||
In :ref:`missing_data`, we saw that pandas primarily uses ``NaN`` to represent | ||||||
missing data. Because ``NaN`` is a float, this forces an array of integers with | ||||||
any missing values to become floating point. In some cases, this may not matter | ||||||
much. But if your integer column is, say, an identifier, casting to float can | ||||||
be problematic. | ||||||
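A minimal sketch of why the cast can matter for identifiers. The value ``big_id`` is a hypothetical identifier chosen to sit just beyond the range in which ``float64`` represents every integer exactly; it is not taken from the pandas documentation.

```python
import numpy as np
import pandas as pd

# Hypothetical identifier: float64 represents integers exactly only up to 2**53.
big_id = 2**53 + 1

# Converting to float rounds the identifier to a neighbouring value.
lost = float(big_id) == float(2**53)

# A single missing value forces the whole Series to float64.
ids = pd.Series([big_id, np.nan])

# Converting back to int does not recover the original identifier.
round_trip = int(ids.iloc[0])
```

Here ``round_trip`` differs from ``big_id``, so two distinct identifiers could collide after the cast.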
Pretty niche case, but could also make reference to integers not representable in float64 space, e.g. |
||||||
|
||||||
Pandas can represent integer data with possibly missing values using | ||||||
:class:`arrays.IntegerArray`. This is an :ref:`extension type <extending.extension-types>` | ||||||
implemented within pandas. It is not the default dtype for integers and will not be inferred; | ||||||
you must explicitly pass the dtype to :func:`array` or the :class:`Series` constructor: | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
pd.array([1, 2, np.nan], dtype=pd.Int64Dtype()) | ||||||
|
||||||
|
||||||
Or use the string alias ``"Int64"`` (note the capital ``"I"``, to differentiate it from | ||||||
NumPy's ``'int64'`` dtype): | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
pd.array([1, 2, np.nan], dtype="Int64") | ||||||
|
||||||
This array can be stored in a :class:`DataFrame` or :class:`Series` like any | ||||||
NumPy array. | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
arr = pd.array([1, 2, np.nan], dtype="Int64") | ||||||
pd.Series(arr) | ||||||
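The same array can also be placed in a :class:`DataFrame` column. This is a small sketch; the column names ``ids`` and ``label`` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Nullable integer array with one missing entry.
arr = pd.array([1, 2, np.nan], dtype="Int64")

# Hypothetical column names; the array keeps its nullable integer dtype.
df = pd.DataFrame({"ids": arr, "label": ["a", "b", "c"]})
dtypes = df.dtypes
```

Inspecting ``dtypes`` should show that ``ids`` is stored with the nullable integer dtype rather than being cast to float.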
|
||||||
You can also pass the list-like object to the :class:`Series` constructor | ||||||
with the dtype. | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
s = pd.Series([1, 2, np.nan], dtype="Int64") | ||||||
s | ||||||
|
||||||
By default (if you don't specify ``dtype``), NumPy is used, and you'll end | ||||||
up with a ``float64`` dtype Series: | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
pd.Series([1, 2, np.nan]) | ||||||
|
||||||
Operations involving an integer array behave similarly to those on NumPy arrays. | ||||||
Missing values are propagated, and the data is coerced to another | ||||||
dtype if needed. | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
# arithmetic | ||||||
s + 1 | ||||||
|
||||||
# comparison | ||||||
s == 1 | ||||||
|
||||||
# indexing | ||||||
s.iloc[1:3] | ||||||
|
||||||
# operate with other dtypes | ||||||
s + s.iloc[1:3].astype('Int8') | ||||||
|
||||||
# coerce when needed | ||||||
s + 0.01 | ||||||
|
||||||
Reduction and groupby operations such as ``sum`` work as well. | ||||||
|
||||||
.. ipython:: python | ||||||
|
||||||
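# build a small frame from the nullable Series defined above | ||||||
df = pd.DataFrame({'A': s, 'B': [1, 1, 3]}) | ||||||
df.sum() | ||||||
df.groupby('B').A.sum() |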
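As a self-contained sketch of the reductions above, the frame below is an assumed example (the column values are illustrative). Reductions skip missing values by default, so the missing entry does not affect the totals.

```python
import numpy as np
import pandas as pd

# Nullable integer Series with one missing value.
s = pd.Series([1, 2, np.nan], dtype="Int64")

# Illustrative frame: 'A' holds the nullable integers, 'B' is a grouping key.
df = pd.DataFrame({"A": s, "B": [1, 1, 3]})

# Missing values are skipped by default in reductions.
total = s.sum()

# Per-group sums of the nullable column.
per_group = df.groupby("B")["A"].sum()
```

The exact dtype of the reduction results may differ between pandas versions, so it is worth checking ``per_group.dtype`` in your environment.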
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,76 +29,26 @@ pandas. | |
|
||
See the :ref:`cookbook<cookbook.missing_data>` for some advanced strategies. | ||
|
||
Missing data basics | ||
------------------- | ||
Integer Dtypes and Missing Data | ||
------------------------------- | ||
|
||
When / why does data become missing? | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Some might quibble over our usage of *missing*. By "missing" we simply mean | ||
**NA** ("not available") or "not present for whatever reason". Many data sets simply arrive with | ||
missing data, either because it exists and was not collected or it never | ||
existed. For example, in a collection of financial time series, some of the time | ||
series might start on different dates. Thus, values prior to the start date | ||
would generally be marked as missing. | ||
|
||
In pandas, one of the most common ways that missing data is **introduced** into | ||
a data set is by reindexing. For example: | ||
Because ``NaN`` is a float, a column of integers with even one missing value | ||
is cast to floating-point dtype (see :ref:`gotchas.intna` for more). Pandas | ||
provides a nullable integer array, which can be used by explicitly requesting | ||
the dtype: | ||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], | ||
columns=['one', 'two', 'three']) | ||
df['four'] = 'bar' | ||
df['five'] = df['one'] > 0 | ||
df | ||
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']) | ||
df2 | ||
|
||
Values considered "missing" | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
As data comes in many shapes and forms, pandas aims to be flexible with regard | ||
to handling missing data. While ``NaN`` is the default missing value marker for | ||
reasons of computational speed and convenience, we need to be able to easily | ||
detect this value with data of different types: floating point, integer, | ||
boolean, and general object. In many cases, however, the Python ``None`` will | ||
arise and we wish to also consider that "missing" or "not available" or "NA". | ||
pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype()) | ||
|
||
.. note:: | ||
|
||
If you want to consider ``inf`` and ``-inf`` to be "NA" in computations, | ||
you can set ``pandas.options.mode.use_inf_as_na = True``. | ||
|
||
.. _missing.isna: | ||
|
||
To make detecting missing values easier (and across different array dtypes), | ||
pandas provides the :func:`isna` and | ||
:func:`notna` functions, which are also methods on | ||
Series and DataFrame objects: | ||
Alternatively, the string alias ``'Int64'`` (note the capital ``"I"``) can be | ||
used: | ||
|
||
.. ipython:: python | ||
|
||
df2['one'] | ||
pd.isna(df2['one']) | ||
df2['four'].notna() | ||
df2.isna() | ||
|
||
.. warning:: | ||
|
||
One has to be mindful that in Python (and NumPy), the ``nan's`` don't compare equal, but ``None's`` **do**. | ||
Note that pandas/NumPy uses the fact that ``np.nan != np.nan``, and treats ``None`` like ``np.nan``. | ||
|
||
.. ipython:: python | ||
|
||
None == None | ||
np.nan == np.nan | ||
|
||
So as compared to above, a scalar equality comparison versus a ``None/np.nan`` doesn't provide useful information. | ||
|
||
.. ipython:: python | ||
pd.Series([1, 2, np.nan, 4], dtype="Int64") | ||
|
||
df2['one'] == np.nan | ||
See :ref:`integer_na` for more. | ||
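Missing entries in a nullable integer Series can be detected and filled with the usual missing-data methods. The following is a minimal sketch; the fill value ``0`` is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Nullable integer Series with one missing entry.
s = pd.Series([1, 2, np.nan, 4], dtype="Int64")

# Boolean mask marking the missing position.
mask = s.isna()
n_missing = int(mask.sum())

# Fill the gap with an arbitrary integer; the dtype stays a nullable integer.
filled = s.fillna(0)
```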
|
||
Datetimes | ||
--------- | ||
|
@@ -760,3 +710,19 @@ However, these can be filled in using :meth:`~DataFrame.fillna` and it will work | |
|
||
reindexed[crit.fillna(False)] | ||
reindexed[crit.fillna(True)] | ||
|
||
Pandas provides a nullable integer dtype, but you must explicitly request it | ||
same, if correct |
||
when creating the series or column. Notice that we use a capital "I" in | ||
the ``dtype="Int64"``. | ||
|
||
.. ipython:: python | ||
|
||
# integer data: non-integral floats cannot be cast to "Int64" | ||||||
s = pd.Series([-2, -1, 0, 1, 2], index=[0, 2, 4, 6, 7], | ||||||
dtype="Int64") | ||||||
s > 0 | ||
(s > 0).dtype | ||
crit = (s > 0).reindex(list(range(8))) | ||
crit | ||
crit.dtype | ||
|
||
See :ref:`integer_na` for more. |
not important, but the `.. currentmodule:: pandas` is also included in the `{{ header }}` (we only kept it in the `api.rst` because the autosummaries need it before the header is rendered)