From d7c2b55845b60e0e1e62c67591732630a2c0ec79 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Sat, 18 Nov 2023 00:26:26 +0100
Subject: [PATCH 1/6] DOC: Add note to user guide that SettingWithCopyWarning won't be necessary anymore

---
 doc/source/user_guide/indexing.rst | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst
index 7b839d62ddde9..f720ca7eff9e6 100644
--- a/doc/source/user_guide/indexing.rst
+++ b/doc/source/user_guide/indexing.rst
@@ -1727,6 +1727,16 @@ You can assign a custom index to the ``index`` attribute:
 Returning a view versus a copy
 ------------------------------
 
+.. warning::
+
+    `Copy-on-Write
+    `__.
+    will become the new default in pandas 3.0. This means than chained indexing will
+    never work. As a consequence, the ``SettingWithCopyWarning`` won't be necessary
+    anymore.
+    See `this section `__
+    for more context.
+
 When setting values in a pandas object, care must be taken to avoid what is called
 ``chained indexing``. Here is an example.
 
@@ -1765,6 +1775,16 @@ faster, and allows one to index *both* axes if so desired.
 Why does assignment fail when using chained indexing?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+.. warning::
+
+    `Copy-on-Write
+    `__.
+    will become the new default in pandas 3.0. This means than chained indexing will
+    never work. As a consequence, the ``SettingWithCopyWarning`` won't be necessary
+    anymore.
+    See `this section `__
+    for more context.
+
 The problem in the previous section is just a performance issue. What's up with
 the ``SettingWithCopy`` warning? We don't **usually** throw warnings around when
 you do something that might cost a few extra milliseconds!
@@ -1821,6 +1841,16 @@ Yikes!
 Evaluation order matters
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
+.. warning::
+
+    `Copy-on-Write
+    `__.
+    will become the new default in pandas 3.0.
This means than chained indexing will
+    never work. As a consequence, the ``SettingWithCopyWarning`` won't be necessary
+    anymore.
+    See `this section `__
+    for more context.
+
 When you use chained indexing, the order and type of the indexing operation
 partially determine whether the result is a slice into the original object, or
 a copy of the slice.

From 8ea42c4e6a8ced38f1f44a9968cdbd1c7a502da9 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Sat, 18 Nov 2023 00:29:10 +0100
Subject: [PATCH 2/6] Update

---
 doc/source/user_guide/indexing.rst | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst
index f720ca7eff9e6..9a57a13f45173 100644
--- a/doc/source/user_guide/indexing.rst
+++ b/doc/source/user_guide/indexing.rst
@@ -1736,6 +1736,13 @@ Returning a view versus a copy
     anymore.
     See `this section `__
     for more context.
+    We recommend turning Copy-on-Write on to leverage the improvements with
+
+    ```
+    pd.options.mode.copy_on_write = True
+    ```
+
+    even before pandas 3.0 is available.
 
 When setting values in a pandas object, care must be taken to avoid what is called
 ``chained indexing``. Here is an example.
@@ -1784,6 +1791,13 @@ Why does assignment fail when using chained indexing?
     anymore.
     See `this section `__
     for more context.
+    We recommend turning Copy-on-Write on to leverage the improvements with
+
+    ```
+    pd.options.mode.copy_on_write = True
+    ```
+
+    even before pandas 3.0 is available.
 
 The problem in the previous section is just a performance issue. What's up with
 the ``SettingWithCopy`` warning? We don't **usually** throw warnings around when
@@ -1850,6 +1864,13 @@ Evaluation order matters
     anymore.
     See `this section `__
     for more context.
+    We recommend turning Copy-on-Write on to leverage the improvements with
+
+    ```
+    pd.options.mode.copy_on_write = True
+    ```
+
+    even before pandas 3.0 is available.
 
 When you use chained indexing, the order and type of the indexing operation
 partially determine whether the result is a slice into the original object, or

From 8cbac01a6c6225e28a704ba88676873e8c6a1809 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Sat, 18 Nov 2023 00:38:52 +0100
Subject: [PATCH 3/6] Fix links

---
 doc/source/user_guide/copy_on_write.rst | 10 ++++++++++
 doc/source/user_guide/indexing.rst      | 15 ++++++---------
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index d0c57b56585db..57f1949fd0c8e 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -6,6 +6,12 @@
 Copy-on-Write (CoW)
 *******************
 
+.. note::
+
+    Copy-on-Write will become the default in pandas 3.0 We recommend
+    :ref:`turning it on now `
+    to benefit from all improvements.
+
 Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the
 optimizations that become possible through CoW are implemented and supported. All possible
 optimizations are supported starting from pandas 2.1.
@@ -123,6 +129,8 @@ CoW triggers a copy when ``df`` is changed to avoid mutating ``view`` as well:
     df
     view
 
+.. _copy_on_write_chained_assignment:
+
 Chained Assignment
 ------------------
 
@@ -238,6 +246,8 @@ and :meth:`DataFrame.rename`.
 These methods return views when Copy-on-Write is enabled, which provides a significant
 performance improvement compared to the regular execution.
 
+.. _copy_on_write_enabling:
+
 How to enable CoW
 -----------------
 
diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst
index 9a57a13f45173..4954ee1538697 100644
--- a/doc/source/user_guide/indexing.rst
+++ b/doc/source/user_guide/indexing.rst
@@ -1729,12 +1729,11 @@ Returning a view versus a copy
 
 .. warning::
 
-    `Copy-on-Write
-    `__.
+    :ref:`Copy-on-Write `
     will become the new default in pandas 3.0. This means than chained indexing will
     never work. As a consequence, the ``SettingWithCopyWarning`` won't be necessary
     anymore.
-    See `this section `__
+    See :ref:`this section `
     for more context.
     We recommend turning Copy-on-Write on to leverage the improvements with
 
@@ -1784,12 +1783,11 @@ Why does assignment fail when using chained indexing?
 
 .. warning::
 
-    `Copy-on-Write
-    `__.
+    :ref:`Copy-on-Write `
     will become the new default in pandas 3.0. This means than chained indexing will
     never work. As a consequence, the ``SettingWithCopyWarning`` won't be necessary
     anymore.
-    See `this section `__
+    See :ref:`this section `
     for more context.
     We recommend turning Copy-on-Write on to leverage the improvements with
 
@@ -1857,12 +1855,11 @@ Evaluation order matters
 
 .. warning::
 
-    `Copy-on-Write
-    `__.
+    :ref:`Copy-on-Write `
     will become the new default in pandas 3.0. This means than chained indexing will
     never work. As a consequence, the ``SettingWithCopyWarning`` won't be necessary
     anymore.
-    See `this section `__
+    See :ref:`this section `
     for more context.
     We recommend turning Copy-on-Write on to leverage the improvements with
 

From 124cb9ff48578330d73bef527204af68f939bf64 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Sat, 18 Nov 2023 13:34:31 +0100
Subject: [PATCH 4/6] PERF: Improve conversion in read_csv when string option is set

---
 pandas/io/parsers/arrow_parser_wrapper.py | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/pandas/io/parsers/arrow_parser_wrapper.py b/pandas/io/parsers/arrow_parser_wrapper.py
index a1d69deb6a21e..5d33126593faf 100644
--- a/pandas/io/parsers/arrow_parser_wrapper.py
+++ b/pandas/io/parsers/arrow_parser_wrapper.py
@@ -267,7 +267,15 @@ def read(self) -> DataFrame:
             dtype_mapping[pa.null()] = pd.Int64Dtype()
             frame = table.to_pandas(types_mapper=dtype_mapping.get)
         elif using_pyarrow_string_dtype():
-            frame = table.to_pandas(types_mapper=arrow_string_types_mapper())
+
+            def types_mapper(dtype):
+                dtype_dict = self.kwds["dtype"]
+                if dtype_dict is not None and dtype_dict.get(dtype, None) is not None:
+                    return dtype_dict.get(dtype)
+                return arrow_string_types_mapper()(dtype)
+
+            frame = table.to_pandas(types_mapper=types_mapper)
+
         else:
             if isinstance(self.kwds.get("dtype"), dict):
                 frame = table.to_pandas(types_mapper=self.kwds["dtype"].get)

From 54240e5e63d4b8adf9f72ea5c684c7dbebf14764 Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Mon, 20 Nov 2023 15:24:19 +0100
Subject: [PATCH 5/6] Update doc/source/user_guide/copy_on_write.rst

Co-authored-by: Joris Van den Bossche
---
 doc/source/user_guide/copy_on_write.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/user_guide/copy_on_write.rst b/doc/source/user_guide/copy_on_write.rst
index 57f1949fd0c8e..fc6f62ec2a4bb 100644
--- a/doc/source/user_guide/copy_on_write.rst
+++ b/doc/source/user_guide/copy_on_write.rst
@@ -8,7 +8,7 @@ Copy-on-Write (CoW)
 
 ..
note::
 
-    Copy-on-Write will become the default in pandas 3.0 We recommend
+    Copy-on-Write will become the default in pandas 3.0. We recommend
     :ref:`turning it on now `
     to benefit from all improvements.
 

From 3fa472b072feb358bf9eb01c1ff9ffaae778f35e Mon Sep 17 00:00:00 2001
From: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Date: Mon, 20 Nov 2023 15:25:06 +0100
Subject: [PATCH 6/6] Revert "PERF: Improve conversion in read_csv when string option is set"

This reverts commit 124cb9ff48578330d73bef527204af68f939bf64.
---
 pandas/io/parsers/arrow_parser_wrapper.py | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/pandas/io/parsers/arrow_parser_wrapper.py b/pandas/io/parsers/arrow_parser_wrapper.py
index 5d33126593faf..a1d69deb6a21e 100644
--- a/pandas/io/parsers/arrow_parser_wrapper.py
+++ b/pandas/io/parsers/arrow_parser_wrapper.py
@@ -267,15 +267,7 @@ def read(self) -> DataFrame:
             dtype_mapping[pa.null()] = pd.Int64Dtype()
             frame = table.to_pandas(types_mapper=dtype_mapping.get)
         elif using_pyarrow_string_dtype():
-
-            def types_mapper(dtype):
-                dtype_dict = self.kwds["dtype"]
-                if dtype_dict is not None and dtype_dict.get(dtype, None) is not None:
-                    return dtype_dict.get(dtype)
-                return arrow_string_types_mapper()(dtype)
-
-            frame = table.to_pandas(types_mapper=types_mapper)
-
+            frame = table.to_pandas(types_mapper=arrow_string_types_mapper())
         else:
             if isinstance(self.kwds.get("dtype"), dict):
                 frame = table.to_pandas(types_mapper=self.kwds["dtype"].get)