
Commit a2cec79

jorisvandenbossche authored and proost committed
ENH: add NA scalar for missing value indicator, use in StringArray. (pandas-dev#29597)
1 parent 93f38af commit a2cec79

16 files changed, +530 -40 lines changed

doc/source/user_guide/missing_data.rst

+143 -6
@@ -12,10 +12,10 @@ pandas.
 .. note::
 
     The choice of using ``NaN`` internally to denote missing data was largely
-    for simplicity and performance reasons. It differs from the MaskedArray
-    approach of, for example, :mod:`scikits.timeseries`. We are hopeful that
-    NumPy will soon be able to provide a native NA type solution (similar to R)
-    performant enough to be used in pandas.
+    for simplicity and performance reasons.
+    Starting from pandas 1.0, some optional data types start experimenting
+    with a native ``NA`` scalar using a mask-based approach. See
+    :ref:`here <missing_data.NA>` for more.
 
 See the :ref:`cookbook<cookbook.missing_data>` for some advanced strategies.
 
@@ -110,7 +110,7 @@ pandas objects provide compatibility between ``NaT`` and ``NaN``.
 .. _missing.inserting:
 
 Inserting missing data
-----------------------
+~~~~~~~~~~~~~~~~~~~~~~
 
 You can insert missing values by simply assigning to containers. The
 actual missing value used will be chosen based on the dtype.
@@ -135,9 +135,10 @@ For object containers, pandas will use the value given:
    s.loc[1] = np.nan
    s
 
+.. _missing_data.calculations:
 
 Calculations with missing data
-------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Missing values propagate naturally through arithmetic operations between pandas
 objects.
@@ -771,3 +772,139 @@ the ``dtype="Int64"``.
    s
 
 See :ref:`integer_na` for more.
+
+
+.. _missing_data.NA:
+
+Experimental ``NA`` scalar to denote missing values
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. warning::
+
+   Experimental: the behaviour of ``pd.NA`` can still change without warning.
+
+.. versionadded:: 1.0.0
+
+Starting from pandas 1.0, an experimental ``pd.NA`` value (singleton) is
+available to represent scalar missing values. At this moment, it is used in
+the nullable :doc:`integer <integer_na>`, boolean and
+:ref:`dedicated string <text.types>` data types as the missing value indicator.
+
+The goal of ``pd.NA`` is to provide a "missing" indicator that can be used
+consistently across data types (instead of ``np.nan``, ``None`` or ``pd.NaT``
+depending on the data type).
+
+For example, when having missing values in a Series with the nullable integer
+dtype, it will use ``pd.NA``:
+
+.. ipython:: python
+
+   s = pd.Series([1, 2, None], dtype="Int64")
+   s
+   s[2]
+   s[2] is pd.NA
+
+Currently, pandas does not yet use those data types by default (when creating
+a DataFrame or Series, or when reading in data), so you need to specify
+the dtype explicitly.
+
+Propagation in arithmetic and comparison operations
+---------------------------------------------------
+
+In general, missing values *propagate* in operations involving ``pd.NA``. When
+one of the operands is unknown, the outcome of the operation is also unknown.
+
+For example, ``pd.NA`` propagates in arithmetic operations, similarly to
+``np.nan``:
+
+.. ipython:: python
+
+   pd.NA + 1
+   "a" * pd.NA
+
+In equality and comparison operations, ``pd.NA`` also propagates. This deviates
+from the behaviour of ``np.nan``, where comparisons with ``np.nan`` always
+return ``False``.
+
+.. ipython:: python
+
+   pd.NA == 1
+   pd.NA == pd.NA
+   pd.NA < 2.5
+
+To check if a value is equal to ``pd.NA``, the :func:`isna` function can be
+used:
+
+.. ipython:: python
+
+   pd.isna(pd.NA)
+
+*Reductions* (such as the mean or the minimum) are an exception to this basic
+propagation rule: there, pandas defaults to skipping missing values. See
+:ref:`above <missing_data.calculations>` for more.
+
+Logical operations
+------------------
+
+For logical operations, ``pd.NA`` follows the rules of the
+`three-valued logic <https://en.wikipedia.org/wiki/Three-valued_logic>`__ (or
+*Kleene logic*, similarly to R, SQL and Julia). Under this logic, missing
+values propagate only when that is logically required.
+
+For example, for the logical "or" operation (``|``), if one of the operands
+is ``True``, we already know the result will be ``True``, regardless of the
+other value (whether the missing value would be ``True`` or ``False``).
+In this case, ``pd.NA`` does not propagate:
+
+.. ipython:: python
+
+   True | False
+   True | pd.NA
+   pd.NA | True
+
+On the other hand, if one of the operands is ``False``, the result depends
+on the value of the other operand. Therefore, in this case ``pd.NA``
+propagates:
+
+.. ipython:: python
+
+   False | True
+   False | False
+   False | pd.NA
+
+The behaviour of the logical "and" operation (``&``) can be derived using
+similar logic (where now ``pd.NA`` will not propagate if one of the operands
+is already ``False``):
+
+.. ipython:: python
+
+   False & True
+   False & False
+   False & pd.NA
+
+.. ipython:: python
+
+   True & True
+   True & False
+   True & pd.NA
+
+
+``NA`` in a boolean context
+---------------------------
+
+Since the actual value of an NA is unknown, it is ambiguous to convert NA
+to a boolean value. The following raises an error:
+
+.. ipython:: python
+   :okexcept:
+
+   bool(pd.NA)
+
+This also means that ``pd.NA`` cannot be used in a context where it is
+evaluated to a boolean, such as ``if condition: ...`` where ``condition`` can
+potentially be ``pd.NA``. In such cases, :func:`isna` can be used to check
+for ``pd.NA`` or ``condition`` being ``pd.NA`` can be avoided, for example by
+filling missing values beforehand.
+
+A similar situation occurs when using Series or DataFrame objects in ``if``
+statements, see :ref:`gotchas.truth`.
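
The behaviour documented in this file can be checked with a short script. The following is a minimal sketch, assuming pandas 1.0 or later is installed and imported as `pd`. It relies only on names that appear in the diff above (`pd.NA`, `pd.isna`, the `Int64` dtype); the helper `describe` is hypothetical and exists only for this illustration.

```python
import pandas as pd

# Nullable integer Series: the missing entry is stored as pd.NA.
s = pd.Series([1, 2, None], dtype="Int64")
assert s[2] is pd.NA

# Arithmetic and comparison operations propagate the missing value.
assert (pd.NA + 1) is pd.NA
assert (pd.NA == 1) is pd.NA

# Kleene logic: NA propagates only when the result is undetermined.
assert (pd.NA | True) is True
assert (pd.NA & False) is False
assert (False | pd.NA) is pd.NA
assert (True & pd.NA) is pd.NA


def describe(condition):
    """Hypothetical helper: branch safely on a value that may be pd.NA."""
    # bool(pd.NA) raises TypeError, so test for NA before using `if`.
    if pd.isna(condition):
        return "unknown"
    return "yes" if condition else "no"


results = [describe(True), describe(False), describe(pd.NA)]
print(results)
```

Checking with `pd.isna` before the `if` is one of the two options the text names; filling missing values beforehand is the other.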

doc/source/whatsnew/v1.0.0.rst

+44
@@ -102,6 +102,50 @@ String accessor methods returning integers will return a value with :class:`Int6
 We recommend explicitly using the ``string`` data type when working with strings.
 See :ref:`text.types` for more.
 
+.. _whatsnew_100.NA:
+
+Experimental ``NA`` scalar to denote missing values
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A new ``pd.NA`` value (singleton) is introduced to represent scalar missing
+values. Up to now, ``np.nan`` is used for this for float data, ``np.nan`` or
+``None`` for object-dtype data and ``pd.NaT`` for datetime-like data. The
+goal of ``pd.NA`` is to provide a "missing" indicator that can be used
+consistently across data types. For now, the nullable integer and boolean
+data types and the new string data type make use of ``pd.NA`` (:issue:`28095`).
+
+.. warning::
+
+   Experimental: the behaviour of ``pd.NA`` can still change without warning.
+
+For example, creating a Series using the nullable integer dtype:
+
+.. ipython:: python
+
+   s = pd.Series([1, 2, None], dtype="Int64")
+   s
+   s[2]
+
+Compared to ``np.nan``, ``pd.NA`` behaves differently in certain operations.
+In addition to arithmetic operations, ``pd.NA`` also propagates as "missing"
+or "unknown" in comparison operations:
+
+.. ipython:: python
+
+   np.nan > 1
+   pd.NA > 1
+
+For logical operations, ``pd.NA`` follows the rules of the
+`three-valued logic <https://en.wikipedia.org/wiki/Three-valued_logic>`__ (or
+*Kleene logic*). For example:
+
+.. ipython:: python
+
+   pd.NA | True
+
+For more, see the :ref:`NA section <missing_data.NA>` in the user guide on
+missing data.
+
 .. _whatsnew_100.boolean:
 
 Boolean data type with missing values support
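
To make the contrast with ``np.nan`` concrete without depending on pandas, here is a pure-Python sketch. `math.nan` stands in for ``np.nan`` (both are IEEE 754 NaN floats). The `Missing` class is a hypothetical stand-in for ``pd.NA`` written for this note and is not pandas' implementation. It covers only comparisons and the "or" operator.

```python
import math


class Missing:
    """Hypothetical stand-in for pd.NA (illustration only)."""

    def __repr__(self):
        return "<NA>"

    def _propagate(self, other):
        # Any comparison with an unknown value is itself unknown.
        return self

    __eq__ = __ne__ = __lt__ = __le__ = __gt__ = __ge__ = _propagate
    # Defining __eq__ would otherwise disable hashing; keep the default.
    __hash__ = object.__hash__

    def __or__(self, other):
        if other is True:
            return True  # True | anything is True
        if other is False or isinstance(other, Missing):
            return self  # result depends on the unknown value
        return NotImplemented

    __ror__ = __or__


NA = Missing()

print(math.nan > 1)  # False: NaN comparisons collapse to False
print(NA > 1)        # <NA>: the comparison stays unknown
print(NA | True)     # True: Kleene "or" does not need the unknown operand
```

Returning the sentinel from comparisons, rather than `False`, is what lets a caller tell "known to be false" apart from "unknown".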

pandas/__init__.py

+1
@@ -70,6 +70,7 @@
     StringDtype,
     BooleanDtype,
     # missing
+    NA,
     isna,
     isnull,
     notna,

pandas/_libs/lib.pyx

+3 -2
@@ -57,7 +57,7 @@ from pandas._libs.tslibs.timedeltas cimport convert_to_timedelta64
 from pandas._libs.tslibs.timezones cimport get_timezone, tz_compare
 
 from pandas._libs.missing cimport (
-    checknull, isnaobj, is_null_datetime64, is_null_timedelta64, is_null_period
+    checknull, isnaobj, is_null_datetime64, is_null_timedelta64, is_null_period, C_NA
 )
 
 
@@ -160,6 +160,7 @@ def is_scalar(val: object) -> bool:
             or PyTime_Check(val)
             # We differ from numpy, which claims that None is not scalar;
             # see np.isscalar
+            or val is C_NA
             or val is None
             or isinstance(val, (Fraction, Number))
             or util.is_period_object(val)
@@ -1494,7 +1495,7 @@ cdef class Validator:
             f'must define is_value_typed')
 
     cdef bint is_valid_null(self, object value) except -1:
-        return value is None or value is C_NA or util.is_nan(value)
+        return value is None or value is C_NA or util.is_nan(value)
 
     cdef bint is_array_typed(self) except -1:
         return False

pandas/_libs/missing.pxd

+5
@@ -9,3 +9,8 @@ cpdef ndarray[uint8_t] isnaobj(ndarray arr)
 cdef bint is_null_datetime64(v)
 cdef bint is_null_timedelta64(v)
 cdef bint is_null_period(v)
+
+cdef class C_NAType:
+    pass
+
+cdef C_NAType C_NA
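
The declaration above exposes a single module-level `C_NA` object that other Cython modules compare against with `is`. For such identity checks to be reliable, there should be only one instance. The sketch below shows one common way to enforce that in plain Python. The class `NAType`, its `__new__` logic and the error message are assumptions made for illustration, since this file does not show how pandas constructs the object.

```python
class NAType:
    """Hypothetical singleton: every construction returns the same object."""

    _instance = None

    def __new__(cls):
        # Create the instance once, then hand back the cached object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __bool__(self):
        # The truth value of an unknown is ambiguous, so refuse to guess.
        raise TypeError("boolean value of NA is ambiguous")

    def __repr__(self):
        return "<NA>"


NA = NAType()
print(NAType() is NA)  # True
```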
