## Abstract

The suggestion is that setitem-like operations would not change a ``Series``' dtype
(nor that of a ``DataFrame``'s column).

Current behaviour:

```python
In [11]: ser[2] = "2000-01-04x"  # typo - but pandas does not error, it upcasts to object
```

The scope of this PDEP is limited to setitem-like operations on Series (and DataFrame columns).
54
- For example, starting with
54
+ For example, starting with:
55
55
``` python
56
56
df = DataFrame({" a" : [1 , 2 , np.nan], " b" : [4 , 5 , 6 ]})
57
57
ser = df[" a" ].copy()
58
58
```
59
59
then the following would all raise:

* setitem-like operations:

  - ``ser.fillna('foo', inplace=True)``
  - ``ser.where(ser.isna(), 'foo', inplace=True)``
  - ``ser.fillna('foo', inplace=False)``
  - ``ser.where(ser.isna(), 'foo', inplace=False)``

* setitem indexing operations (where ``indexer`` could be a slice, a mask,
  a single value, a list or array of values, or any other allowed indexer):

  - ``ser.iloc[indexer] = 'foo'``
  - ``ser.loc[indexer] = 'foo'``
  - ``df.iloc[indexer, 0] = 'foo'``
  - ``df.loc[indexer, 'a'] = 'foo'``
  - ``ser[indexer] = 'foo'``

It may be desirable to expand the top list to ``Series.replace`` and ``Series.update``,
but to keep the scope of the PDEP down, they are excluded for now.

Examples of operations which would not raise are:

- ``ser.diff()``
- ``pd.concat([ser, ser.astype(object)])``
- ``ser.mean()``
- ``ser[0] = 3``  # same dtype
- ``ser[0] = 3.``  # 3.0 is a 'round' float and so compatible with 'int64' dtype
- ``df['a'] = pd.date_range(datetime(2020, 1, 1), periods=3)``
- ``df.index.intersection(ser.index)``

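
As a minimal illustration (not part of the PDEP text itself): these operations either
return a new object or set a value compatible with the existing dtype, so the dtype of
``ser`` is never changed in place. For example, ``ser.diff()`` produces a new float64
``Series`` and leaves the original untouched:

```python
import pandas as pd

ser = pd.Series([1, 2, 3], dtype="int64")
out = ser.diff()      # returns a new Series; differencing introduces NaN, hence float64
print(out.dtype)      # float64
print(ser.dtype)      # int64 - the original Series is unchanged
```
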
## Detailed description

Concretely, the suggestion is:

- If a ``Series`` is of a given dtype, then a ``setitem``-like operation should not change its dtype.
- If a ``setitem``-like operation would previously have changed a ``Series``' dtype, it would now raise.

For a start, this would involve:

1. changing ``Block.setitem`` such that it does not have an ``except`` block in:

   ```python
   value = extract_array(value, extract_numpy=True)
   try:
       casted = np_can_hold_element(values.dtype, value)
   except LossySetitemError:
       # current dtype cannot store value, coerce to common dtype
       nb = self.coerce_to_target_dtype(value)
       return nb.setitem(indexer, value)
   else:
   ```

2. making a similar change in:

   - ``Block.where``
   - ``Block.putmask``
   - ``EABackedBlock.setitem``
   - ``EABackedBlock.where``
   - ``EABackedBlock.putmask``

The above would already require several hundreds of tests to be adjusted. Note that once
implementation starts, the list of locations to change may turn out to be slightly
different.

The user may have intended the ``Series`` to be numeric (without much regard for
``int`` vs ``float``) - ``'int64'`` is just what pandas happened to choose
when constructing it.
Possible options could be:

1. Only accept round floats (e.g. ``1.0``) and raise on anything else (e.g. ``1.01``).
2. Convert the float value to ``int`` before setting it (i.e. silently round all float values).
3. Limit "banning upcasting" to when the upcasted dtype is ``object`` (i.e. preserve current behavior of upcasting the int64 Series to float64).

Let us compare with what other libraries do:

- ``numpy``: option 2
- ``cudf``: option 2
- ``polars``: option 2
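
NumPy's option-2 behaviour can be observed directly: assigning a float into an
integer array silently casts (truncates) the value to the array's dtype:

```python
import numpy as np

arr = np.array([1, 2, 3], dtype="int64")
arr[0] = 1.5          # option 2: the value is silently truncated to fit int64
print(arr[0])         # 1
print(arr.dtype)      # int64
```
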

If the objective of this PDEP is to prevent bugs, then this is also not desirable:
someone might set ``1.5`` and later be surprised to learn that they actually set ``1``.

There are several downsides to option ``3``:

- It would be inconsistent with the nullable dtypes' behaviour.
- It would also add complexity to the codebase and to tests.
- It would be hard to teach, as instead of being able to teach a simple rule,
  there would be a rule with exceptions.
- There would be a risk of loss of precision and/or overflow.
- It opens the door to other exceptions, such as not upcasting ``'int8'`` to ``'int16'``.

184
Option `` 1 `` is the maximally safe one in terms of protecting users from bugs, being
176
185
consistent with the current behaviour of nullable dtypes, and in being simple to teach.

To keep this proposal focused, it is intentionally excluded from the scope of this PDEP.

**A**: The current behavior would be to upcast to ``int32``. So under this PDEP,
it would instead raise.

**Q: What happens when setting ``16.000000000000001`` in an ``int8`` Series?**

**A**: As far as Python is concerned, ``16.000000000000001`` and ``16.0`` are the
same number. So, it would be inserted as ``16`` and the dtype would not change
(just like what happens now, there would be no change here).

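
This can be checked directly in Python: ``16.000000000000001`` is closer to ``16.0``
than half the spacing between adjacent 64-bit floats near 16, so the literal rounds
to exactly ``16.0``:

```python
# The two literals denote the same float64 value.
print(16.000000000000001 == 16.0)  # True
print(16.000000000000001)          # 16.0
```
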
217
- ** Q: What if I want `` 1.0000000001 `` to be inserted as `` 1.0 `` in an ` 'int8' ` Series?**
226
+ ** Q: What if I want `` 1.0000000001 `` to be inserted as `` 1.0 `` in an `` int8 `` Series?**
227
+
228
+ ** A** : You may want to define your own helper function, such as:
229
+
230
+ ``` python
231
+ def maybe_convert_to_int (x : int | float , tolerance : float ):
232
+ if np.abs(x - round (x)) < tolerance:
233
+ return round (x)
234
+ return x
235
+ ```
236
+
237
+ which you could adapt according to your needs.
218
238
219
- ** A** : You may want to define your own helper function, such as
220
- ``` python
221
- >> > def maybe_convert_to_int (x : int | float , tolerance : float ):
222
- if np.abs(x - round (x)) < tolerance:
223
- return round (x)
224
- return x
225
- ```
226
- which you could adapt according to your needs.
227
239
## Timeline