@@ -9,6 +9,89 @@ including other versions of pandas.
{{ header }}

.. ---------------------------------------------------------------------------
+
+ .. _whatsnew_220.upcoming_changes:
+
+ Upcoming changes in pandas 3.0
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ pandas 3.0 will bring two major changes to the default behavior of pandas.
+
+ Copy-on-Write
+ ^^^^^^^^^^^^^
+
+ The currently optional Copy-on-Write mode will be enabled by default in pandas 3.0, and
+ there won't be an option to keep the current behavior. The new behavioral semantics are
+ explained in the :ref:`user guide about Copy-on-Write <copy_on_write>`.
+
+ The new behavior has been available since pandas 2.0 and can be enabled with the
+ following option:
+
+ .. code-block:: ipython
+
+     pd.options.mode.copy_on_write = True
+
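As a rough illustration of what enabling the option means in practice, the sketch below modifies a Series selected from a DataFrame and checks that the parent is left alone. The frame `df` and its column names are invented for this example, and it assumes a pandas version that provides the `mode.copy_on_write` option.

```python
import pandas as pd

# Opt in to Copy-on-Write (assumed to be available in the installed pandas).
pd.options.mode.copy_on_write = True

# Hypothetical example data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Under Copy-on-Write, a selected column behaves like a copy of the parent.
subset = df["a"]

# Writing to `subset` triggers a copy, so `df` is not modified.
subset.iloc[0] = 100

print(df["a"].tolist())   # values of the parent column
print(subset.tolist())    # values of the modified selection
```

With Copy-on-Write enabled, only `subset` should show the new value; the first entry of `df["a"]` stays as it was.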
+ This change affects how pandas operates with respect to copies and views. Some of
+ these changes allow a clear deprecation, like the changes in chained assignment.
+ Other changes are more subtle, so their warnings are hidden behind an option that
+ can be enabled in pandas 2.2:
+
+ .. code-block:: ipython
+
+     pd.options.mode.copy_on_write = "warn"
+
+ This mode will warn in many different scenarios that aren't actually relevant to
+ most queries. We recommend exploring this mode, but it is not necessary to get rid
+ of all of these warnings. The :ref:`migration guide <copy_on_write.migration_guide>`
+ explains the upgrade process in more detail.
+
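The chained-assignment change mentioned above can be sketched as follows. This is an illustrative example with made-up data, assuming Copy-on-Write is enabled through the option shown earlier: the chained form writes to a temporary object and leaves the DataFrame unchanged, while a single `.loc` call updates it.

```python
import warnings

import pandas as pd

pd.options.mode.copy_on_write = True

# Hypothetical example data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained assignment: df["a"] is a temporary object, so under Copy-on-Write
# the write never reaches `df`. pandas warns about this pattern; warnings are
# silenced here only to keep the example output short.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df["a"][df["b"] > 4] = 0
after_chained = df["a"].tolist()

# A single .loc call selects rows and column at once and updates `df` itself.
df.loc[df["b"] > 4, "a"] = 0
after_loc = df["a"].tolist()

print(after_chained)
print(after_loc)
```

Rewriting chained assignments as one `.loc` (or `.iloc`) call is usually the simplest way to make such code independent of the Copy-on-Write setting.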
+ Dedicated string data type (backed by Arrow) by default
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ Historically, pandas represented string columns with the NumPy object data type. This
+ representation has numerous problems, including slow performance and a large memory
+ footprint. This will change in pandas 3.0: pandas will start inferring string columns
+ as a new ``string`` data type, backed by Arrow, which stores strings contiguously in
+ memory. This brings a large performance and memory improvement.
+
+ Old behavior:
+
+ .. code-block:: ipython
+
+     In [1]: pd.Series(["a", "b"])
+     Out[1]:
+     0    a
+     1    b
+     dtype: object
+
+ New behavior:
+
+ .. code-block:: ipython
+
+     In [1]: pd.Series(["a", "b"])
+     Out[1]:
+     0    a
+     1    b
+     dtype: string
+
+ The string data type that is used in these scenarios will mostly behave as NumPy
+ object would, including missing value semantics and general operations on these
+ columns.
+
+ This change includes a few additional changes across the API:
+
+ - Currently, specifying ``dtype="string"`` creates a dtype that is backed by Python strings
+   which are stored in a NumPy array. This will change in pandas 3.0: this dtype
+   will create an Arrow-backed string column.
+ - The column names and the Index will also be backed by Arrow strings.
+ - PyArrow will become a required dependency with pandas 3.0 to accommodate this change.
+
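To see which backing store an explicit ``dtype="string"`` currently uses, the dtype's `storage` attribute can be inspected. The following is a minimal sketch with made-up values; the reported storage depends on the installed pandas version and on whether PyArrow is available.

```python
import pandas as pd

# Hypothetical example data with one missing value.
ser = pd.Series(["a", "b", None], dtype="string")

print(ser.dtype)            # name of the extension dtype
print(ser.dtype.storage)    # backing store, e.g. "python" or "pyarrow"
print(ser.isna().tolist())  # the missing entry is reported as NA
```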
+ This future dtype inference logic can be enabled with:
+
+ .. code-block:: ipython
+
+     pd.options.future.infer_string = True
+
.. _whatsnew_220.enhancements:

Enhancements