From 07dff6a8eac7a060161a83d3b24deeac4eb8524a Mon Sep 17 00:00:00 2001
From: Patrick Hoefler
Date: Mon, 18 Dec 2023 01:03:28 +0100
Subject: [PATCH 01/10] DOC: Add whatsnew illustrating upcoming changes

---
 doc/source/whatsnew/v2.2.0.rst | 79 ++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst
index 0c4fb6d3d1164..3a921fde7a902 100644
--- a/doc/source/whatsnew/v2.2.0.rst
+++ b/doc/source/whatsnew/v2.2.0.rst
@@ -9,6 +9,85 @@ including other versions of pandas.
 {{ header }}
 
 .. ---------------------------------------------------------------------------
+
+.. _whatsnew_220.upccoming_changes:
+
+Upcoming changes in pandas 3.0
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+pandas 3.0 will bring two bigger changes to the default behavior of pandas.
+
+Copy-on-Write
+^^^^^^^^^^^^^
+
+The currently optional mode Copy-on-Write will be enabled by default in pandas 3.0. There
+won't be an option to keep the current behavior enabled. The new behavioral semantics are
+explained in the :ref:`user guide about Copy-on-Write <copy_on_write>`.
+
+The new behavior can be enabled since pandas 2.0 with the following option:
+
+.. code-block:: ipython
+
+    pd.options.mode.copy_on_write = True
+
+This change brings different changes in behavior in how pandas operates with respect to
+copies and views. Some of these changes allow a clear deprecation, like the changes in
+chained assignment. Other changes are more subtle and thus, the warnings are hidden behind
+an option that can be enabled in pandas 2.2.
+
+.. code-block:: ipython
+
+    pd.options.mode.copy_on_write = "warn"
+
+This mode will warn in many different scenarios that aren't actually relevant to
+most queries. We recommend exploring this mode, but it is not necessary to get rid
+of all of these warnings.
+
+Arrow-backed strings by default
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Historically, pandas represented string columns with NumPy object data type. This
+representation has numerous problems, including slow performance and a large memory
+footprint. This will change in pandas 3.0. pandas will start inferring string columns
+as Arrow backed strings, which represents strings contiguous in memory. This brings
+a huge performance and memory improvement.
+
+Old behavior:
+
+.. code-block:: ipython
+
+    In [1]: ser = pd.Series(["a", "b"])
+    Out[1]:
+    0    a
+    1    b
+    dtype: object
+
+New behavior:
+
+    In [1]: ser = pd.Series(["a", "b"])
+    Out[1]:
+    0    a
+    1    b
+    dtype: string
+
+The string data type that is used in these scenarios will mostly behave as NumPy
+object would, this includes missing value semantics and general operations on these
+columns.
+
+This change include a few additional changes across the API:
+
+- Specifying ``dtype="string"`` creates a dtype that is backed by Python strings
+  which are stored in a NumPy array. This will change in pandas 3.0, this dtype
+  will create an Arrow backed string column.
+- The column names and the Index will also be backed by Arrow strings.
+- PyArrow will become a required dependency with pandas 3.0 to accommodate this change.
+
+This future dtype inference logic can be enabled with:
+
+.. code-block:: ipython
+
+    pd.options.future.infer_string = True
+
 .. _whatsnew_220.enhancements:
 
 Enhancements

From db4b8b7e41c409cf22a53e78e34aa7345ee4eb2a Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Thu, 21 Dec 2023 22:09:58 +0100
Subject: [PATCH 02/10] Update doc/source/whatsnew/v2.2.0.rst

Co-authored-by: Joris Van den Bossche
---
 doc/source/whatsnew/v2.2.0.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst
index 3a921fde7a902..5d7aaa4b8cbe8 100644
--- a/doc/source/whatsnew/v2.2.0.rst
+++ b/doc/source/whatsnew/v2.2.0.rst
@@ -10,7 +10,7 @@ including other versions of pandas.
 
 .. ---------------------------------------------------------------------------
 
-.. _whatsnew_220.upccoming_changes:
+.. _whatsnew_220.upcoming_changes:
 
 Upcoming changes in pandas 3.0
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From fc99e84ee5a50be0c949fc80fa563f93fbf988ea Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Thu, 21 Dec 2023 22:10:22 +0100
Subject: [PATCH 03/10] Update doc/source/whatsnew/v2.2.0.rst

Co-authored-by: Joris Van den Bossche
---
 doc/source/whatsnew/v2.2.0.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst
index 5d7aaa4b8cbe8..e5d871a6b07a4 100644
--- a/doc/source/whatsnew/v2.2.0.rst
+++ b/doc/source/whatsnew/v2.2.0.rst
@@ -43,7 +43,7 @@ This mode will warn in many different scenarios that aren't actually relevant to
 most queries. We recommend exploring this mode, but it is not necessary to get rid
 of all of these warnings.
 
-Arrow-backed strings by default
+Dedicated string data type (backed by Arrow) by default
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Historically, pandas represented string columns with NumPy object data type. This

From bad8a1cbd682de717619b7f46e07c501411beb52 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Thu, 21 Dec 2023 22:10:29 +0100
Subject: [PATCH 04/10] Update doc/source/whatsnew/v2.2.0.rst

Co-authored-by: Joris Van den Bossche
---
 doc/source/whatsnew/v2.2.0.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst
index e5d871a6b07a4..364701801b904 100644
--- a/doc/source/whatsnew/v2.2.0.rst
+++ b/doc/source/whatsnew/v2.2.0.rst
@@ -49,7 +49,7 @@ Dedicated string data type (backed by Arrow) by default
 Historically, pandas represented string columns with NumPy object data type. This
 representation has numerous problems, including slow performance and a large memory
 footprint. This will change in pandas 3.0. pandas will start inferring string columns
-as Arrow backed strings, which represents strings contiguous in memory. This brings
+as a new ``string`` data type, backed by Arrow, which represents strings contiguous in memory. This brings
 a huge performance and memory improvement.
 
 Old behavior:

From 7749f953dac3d4fbe91a558807d35f1acbd4f4af Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Thu, 21 Dec 2023 22:10:37 +0100
Subject: [PATCH 05/10] Update doc/source/whatsnew/v2.2.0.rst

Co-authored-by: Joris Van den Bossche
---
 doc/source/whatsnew/v2.2.0.rst | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst
index 364701801b904..4ce1eed0541bf 100644
--- a/doc/source/whatsnew/v2.2.0.rst
+++ b/doc/source/whatsnew/v2.2.0.rst
@@ -64,6 +64,9 @@ Old behavior:
 
 New behavior:
 
+
+.. code-block:: ipython
+
     In [1]: ser = pd.Series(["a", "b"])
     Out[1]:
     0    a

From 378b0cb9e11551bb2021cb865419ffb762588d13 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Thu, 21 Dec 2023 22:10:43 +0100
Subject: [PATCH 06/10] Update doc/source/whatsnew/v2.2.0.rst

Co-authored-by: Joris Van den Bossche
---
 doc/source/whatsnew/v2.2.0.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst
index 4ce1eed0541bf..f8aa9f8a0aab4 100644
--- a/doc/source/whatsnew/v2.2.0.rst
+++ b/doc/source/whatsnew/v2.2.0.rst
@@ -74,7 +74,7 @@ New behavior:
     dtype: string
 
 The string data type that is used in these scenarios will mostly behave as NumPy
-object would, this includes missing value semantics and general operations on these
+object would, including missing value semantics and general operations on these
 columns.
 
 This change include a few additional changes across the API:

From 667754bad95db7f802a181d30312405ede521d50 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Thu, 21 Dec 2023 22:10:48 +0100
Subject: [PATCH 07/10] Update doc/source/whatsnew/v2.2.0.rst

Co-authored-by: Joris Van den Bossche
---
 doc/source/whatsnew/v2.2.0.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst
index f8aa9f8a0aab4..01236c3305d7d 100644
--- a/doc/source/whatsnew/v2.2.0.rst
+++ b/doc/source/whatsnew/v2.2.0.rst
@@ -77,7 +77,7 @@ The string data type that is used in these scenarios will mostly behave as NumPy
 object would, including missing value semantics and general operations on these
 columns.
 
-This change include a few additional changes across the API:
+This change includes a few additional changes across the API:
 
 - Specifying ``dtype="string"`` creates a dtype that is backed by Python strings
   which are stored in a NumPy array. This will change in pandas 3.0, this dtype

From bebd2d57ac2086762d890216c90ff03877f92e7f Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Thu, 21 Dec 2023 22:10:52 +0100
Subject: [PATCH 08/10] Update doc/source/whatsnew/v2.2.0.rst

Co-authored-by: Joris Van den Bossche
---
 doc/source/whatsnew/v2.2.0.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst
index 01236c3305d7d..1de4be89ad08c 100644
--- a/doc/source/whatsnew/v2.2.0.rst
+++ b/doc/source/whatsnew/v2.2.0.rst
@@ -79,7 +79,7 @@ columns.
 
 This change includes a few additional changes across the API:
 
-- Specifying ``dtype="string"`` creates a dtype that is backed by Python strings
+- Currently, specifying ``dtype="string"`` creates a dtype that is backed by Python strings
   which are stored in a NumPy array. This will change in pandas 3.0, this dtype
   will create an Arrow backed string column.
 - The column names and the Index will also be backed by Arrow strings.

From 640fa364a26dd86273e93e2c71c63a10c5a1ecb2 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Thu, 21 Dec 2023 22:13:19 +0100
Subject: [PATCH 09/10] Add link

---
 doc/source/user_guide/copy_on_write.rst | 2 ++
 doc/source/whatsnew/v2.2.0.rst          | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index bc233f4323e2a..050c3901c3420 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -52,6 +52,8 @@ it explicitly disallows this. With CoW enabled, ``df`` is unchanged:
 The following sections will explain what this means and how it impacts existing
 applications.
 
+.. _copy_on_write.migration_guide:
+
 Migrating to Copy-on-Write
 --------------------------
 
diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst
index 1de4be89ad08c..35a893d77019f 100644
--- a/doc/source/whatsnew/v2.2.0.rst
+++ b/doc/source/whatsnew/v2.2.0.rst
@@ -41,7 +41,8 @@ an option that can be enabled in pandas 2.2.
 
 This mode will warn in many different scenarios that aren't actually relevant to
 most queries. We recommend exploring this mode, but it is not necessary to get rid
-of all of these warnings.
+of all of these warnings. The :ref:`migration guide <copy_on_write.migration_guide>`
+explains the upgrade process in more detail.
 
 Dedicated string data type (backed by Arrow) by default
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

From dc8acbf4dc0c1569eab06d2255dbb0117b92393a Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Thu, 21 Dec 2023 22:45:29 +0100
Subject: [PATCH 10/10] Fixup

---
 doc/source/whatsnew/v2.2.0.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst
index 35a893d77019f..884365ede8f32 100644
--- a/doc/source/whatsnew/v2.2.0.rst
+++ b/doc/source/whatsnew/v2.2.0.rst
@@ -45,7 +45,7 @@ of all of these warnings. The :ref:`migration guide