From 3a16f7b1ae24f37b89bc92ecb8a01b2bb5d45d62 Mon Sep 17 00:00:00 2001
From: Joris Van den Bossche
Date: Fri, 10 Jul 2020 11:13:03 +0200
Subject: [PATCH 1/3] ROADMAP: add consistent missing values for all dtypes to the roadmap

---
 doc/source/development/roadmap.rst | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index d331491d02883..6ddc0dc007d7c 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -53,6 +53,32 @@ need to implement certain operations expected by pandas users (for example
 the algorithm used in, ``Series.str.upper``). That work may be done outside of
 pandas.
 
+Consistent missing value handling
+---------------------------------
+
+Currently, pandas has varying missing data interfaces depending on the data
+type: pandas uses ``np.nan`` as missing value indicator in floating point data,
+``np.nan`` or ``None`` in object dtype data (eg strings, or booleans with
+missing values are cast to object), ``pd.NaT`` in datetimelike data. For
+categorical or interval data, they return ``np.nan`` on access even when the
+categories or intervals are datetime-like. Integer data cannot store missing
+data or are cast to float.
+
+Long term, we want to introduce consistent missing value handling accross the
+different data types: all data types should support missing values and with the
+same behaviour.
+
+To this end, a new experimental ``pd.NA`` scalar to be used as missing value
+indicator has already been added in pandas 1.0 (and used in the experimental
+nullable dtypes). Further work is needed to integrate this with other data
+types, and to provide a path forward to make this the default in a future
+version of pandas.
+
+This has been discussed at
+`github #28095 `__ (and
+linked issues), and described in more detail in this
+`design doc `__.
+
 Apache Arrow interoperability
 -----------------------------
 
From ee62bd07fd40c343dc0b9387fc4ec48ae5b306f5 Mon Sep 17 00:00:00 2001
From: Joris Van den Bossche
Date: Fri, 7 Aug 2020 13:19:22 +0200
Subject: [PATCH 2/3] add notion of different semantics

---
 doc/source/development/roadmap.rst | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index 6ddc0dc007d7c..33a4f547cb931 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -62,15 +62,17 @@ type: pandas uses ``np.nan`` as missing value indicator in floating point data,
 missing values are cast to object), ``pd.NaT`` in datetimelike data. For
 categorical or interval data, they return ``np.nan`` on access even when the
 categories or intervals are datetime-like. Integer data cannot store missing
-data or are cast to float.
+data or are cast to float. In addition, ``NaN`` has different semantics as
+"nulls" in many other data tools.
 
 Long term, we want to introduce consistent missing value handling accross the
 different data types: all data types should support missing values and with the
 same behaviour.
 
-To this end, a new experimental ``pd.NA`` scalar to be used as missing value
-indicator has already been added in pandas 1.0 (and used in the experimental
-nullable dtypes). Further work is needed to integrate this with other data
+To this end, a new experimental ``pd.NA`` scalar that can be used as missing
+value indicator and with a behaviour that deviates from ``np.nan`` has already
+been added in pandas 1.0 (and used in the experimental nullable dtypes). Further
+work and research is needed to integrate these new semantics with other data
 types, and to provide a path forward to make this the default in a future
 version of pandas.
 
From 7bcb4e66a9dc77e937716604c66bb62a3bafe3eb Mon Sep 17 00:00:00 2001
From: Joris Van den Bossche
Date: Fri, 7 Aug 2020 20:57:34 +0200
Subject: [PATCH 3/3] update with Tom's suggestions

---
 doc/source/development/roadmap.rst | 36 ++++++++++++++----------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index 33a4f547cb931..efee21b5889ed 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -56,25 +56,23 @@ pandas.
 Consistent missing value handling
 ---------------------------------
 
-Currently, pandas has varying missing data interfaces depending on the data
-type: pandas uses ``np.nan`` as missing value indicator in floating point data,
-``np.nan`` or ``None`` in object dtype data (eg strings, or booleans with
-missing values are cast to object), ``pd.NaT`` in datetimelike data. For
-categorical or interval data, they return ``np.nan`` on access even when the
-categories or intervals are datetime-like. Integer data cannot store missing
-data or are cast to float. In addition, ``NaN`` has different semantics as
-"nulls" in many other data tools.
-
-Long term, we want to introduce consistent missing value handling accross the
-different data types: all data types should support missing values and with the
-same behaviour.
-
-To this end, a new experimental ``pd.NA`` scalar that can be used as missing
-value indicator and with a behaviour that deviates from ``np.nan`` has already
-been added in pandas 1.0 (and used in the experimental nullable dtypes). Further
-work and research is needed to integrate these new semantics with other data
-types, and to provide a path forward to make this the default in a future
-version of pandas.
+Currently, pandas handles missing data differently for different data types. We
+use different types to indicate that a value is missing (``np.nan`` for
+floating-point data, ``np.nan`` or ``None`` for object-dtype data -- typically
+strings or booleans -- with missing values, and ``pd.NaT`` for datetimelike
+data). Integer data cannot store missing data or are cast to float. In addition,
+pandas 1.0 introduced a new missing value sentinel, ``pd.NA``, which is being
+used for the experimental nullable integer, boolean, and string data types.
+
+These different missing values have different behaviors in user-facing
+operations. Specifically, we introduced different semantics for the nullable
+data types for certain operations (e.g. propagating in comparison operations
+instead of comparing as False).
+
+Long term, we want to introduce consistent missing data handling for all data
+types. This includes consistent behavior in all operations (indexing, arithmetic
+operations, comparisons, etc.). We want to eventually make the new semantics the
+default.
 
 This has been discussed at
 `github #28095 `__ (and