Commit 9688402

DOC: Simplify gotchas.rst (pandas-dev#54415)
1 parent 76ba161 commit 9688402

1 file changed: +39 -53 lines changed

doc/source/user_guide/gotchas.rst (39 additions, 53 deletions)
@@ -13,7 +13,7 @@ DataFrame memory usage
 The memory usage of a :class:`DataFrame` (including the index) is shown when calling
 the :meth:`~DataFrame.info`. A configuration option, ``display.memory_usage``
 (see :ref:`the list of options <options.available>`), specifies if the
-:class:`DataFrame` memory usage will be displayed when invoking the ``df.info()``
+:class:`DataFrame` memory usage will be displayed when invoking the :meth:`~DataFrame.info`
 method.
 
 For example, the memory usage of the :class:`DataFrame` below is shown
@@ -50,13 +50,13 @@ as it can be expensive to do this deeper introspection.
     df.info(memory_usage="deep")
 
 By default the display option is set to ``True`` but can be explicitly
-overridden by passing the ``memory_usage`` argument when invoking ``df.info()``.
+overridden by passing the ``memory_usage`` argument when invoking :meth:`~DataFrame.info`.
 
 The memory usage of each column can be found by calling the
 :meth:`~DataFrame.memory_usage` method. This returns a :class:`Series` with an index
 represented by column names and memory usage of each column shown in bytes. For
 the :class:`DataFrame` above, the memory usage of each column and the total memory
-usage can be found with the ``memory_usage`` method:
+usage can be found with the :meth:`~DataFrame.memory_usage` method:
 
 .. ipython:: python
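As context for the hunk above, the two memory-usage APIs it cross-references can be exercised with a short snippet (an illustrative sketch, not part of the diff; exact byte counts vary by platform and index type):

```python
import numpy as np
import pandas as pd

# A small DataFrame with a couple of NumPy-backed columns.
df = pd.DataFrame({
    "ints": np.arange(5, dtype="int64"),
    "floats": np.linspace(0.0, 1.0, 5),
})

# Per-column memory usage in bytes; index=True adds an "Index" entry.
usage = df.memory_usage(index=True)
total = usage.sum()

# info() reports the same total; memory_usage="deep" also introspects
# the contents of object-dtype columns instead of estimating them.
df.info(memory_usage="deep")
```

The `deep` option matters mainly for `object` columns, where the shallow estimate only counts pointer storage.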
@@ -164,7 +164,8 @@ Mutating with User Defined Function (UDF) methods
 -------------------------------------------------
 
 This section applies to pandas methods that take a UDF. In particular, the methods
-``.apply``, ``.aggregate``, ``.transform``, and ``.filter``.
+:meth:`DataFrame.apply`, :meth:`DataFrame.aggregate`, :meth:`DataFrame.transform`, and
+:meth:`DataFrame.filter`.
 
 It is a general rule in programming that one should not mutate a container
 while it is being iterated over. Mutation will invalidate the iterator,
@@ -192,16 +193,14 @@ the :class:`DataFrame`, unexpected behavior can arise.
 Here is a similar example with :meth:`DataFrame.apply`:
 
 .. ipython:: python
+    :okexcept:
 
     def f(s):
        s.pop("a")
        return s
 
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
-    try:
-        df.apply(f, axis="columns")
-    except Exception as err:
-        print(repr(err))
+    df.apply(f, axis="columns")
 
 To resolve this issue, one can make a copy so that the mutation does
 not apply to the container being iterated over.
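The copy-based fix referenced at the end of this hunk can be sketched as follows (illustrative only; the mutating variant is shown as a comment because whether it raises depends on the pandas version, which is exactly why the diff adds ``:okexcept:``):

```python
import pandas as pd

def f(s):
    # Work on a copy so the container being iterated over is not mutated.
    # Without the copy, `s.pop("a")` would mutate each row Series handed
    # to the UDF, which can raise or silently corrupt results.
    s = s.copy()
    s.pop("a")
    return s

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
result = df.apply(f, axis="columns")
```

With the copy, `result` keeps only column ``b`` and the original `df` is left untouched.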
@@ -229,29 +228,41 @@ not apply to the container being iterated over.
     df = pd.DataFrame({"a": [1, 2, 3], 'b': [4, 5, 6]})
     df.apply(f, axis="columns")
 
-``NaN``, Integer ``NA`` values and ``NA`` type promotions
----------------------------------------------------------
+Missing value representation for NumPy types
+--------------------------------------------
 
-Choice of ``NA`` representation
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+``np.nan`` as the ``NA`` representation for NumPy types
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 For lack of ``NA`` (missing) support from the ground up in NumPy and Python in
-general, we were given the difficult choice between either:
+general, ``NA`` could have been represented with:
 
 * A *masked array* solution: an array of data and an array of boolean values
   indicating whether a value is there or is missing.
 * Using a special sentinel value, bit pattern, or set of sentinel values to
   denote ``NA`` across the dtypes.
 
-For many reasons we chose the latter. After years of production use it has
-proven, at least in my opinion, to be the best decision given the state of
-affairs in NumPy and Python in general. The special value ``NaN``
-(Not-A-Number) is used everywhere as the ``NA`` value, and there are API
-functions :meth:`DataFrame.isna` and :meth:`DataFrame.notna` which can be used across the dtypes to
-detect NA values.
+The special value ``np.nan`` (Not-A-Number) was chosen as the ``NA`` value for NumPy types, and there are API
+functions like :meth:`DataFrame.isna` and :meth:`DataFrame.notna` which can be used across the dtypes to
+detect NA values. However, this choice has the downside of coercing missing integer data to float types, as
+shown in :ref:`gotchas.intna`.
+
+``NA`` type promotions for NumPy types
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When introducing NAs into an existing :class:`Series` or :class:`DataFrame` via
+:meth:`~Series.reindex` or some other means, boolean and integer types will be
+promoted to a different dtype in order to store the NAs. The promotions are
+summarized in this table:
 
-However, it comes with it a couple of trade-offs which I most certainly have
-not ignored.
+.. csv-table::
+    :header: "Typeclass","Promotion dtype for storing NAs"
+    :widths: 40,60
+
+    ``floating``, no change
+    ``object``, no change
+    ``integer``, cast to ``float64``
+    ``boolean``, cast to ``object``
 
 .. _gotchas.intna:
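The promotion table added in the hunk above can be verified with a short snippet (illustrative, not part of the diff): reindexing introduces NAs, which forces the documented dtype promotions.

```python
import pandas as pd

s_int = pd.Series([1, 2, 3], index=list("abc"))           # int64
s_bool = pd.Series([True, False, True], index=list("abc"))  # bool

# Reindexing with a label that has no value introduces an NA,
# triggering the promotions from the table: integer -> float64,
# boolean -> object.
s_int2 = s_int.reindex(list("abcd"))
s_bool2 = s_bool.reindex(list("abcd"))

print(s_int2.dtype)   # float64
print(s_bool2.dtype)  # object
```

Float and object columns are unchanged by the same operation, since ``np.nan`` already fits in those dtypes.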

@@ -276,12 +287,13 @@ This trade-off is made largely for memory and performance reasons, and also so
 that the resulting :class:`Series` continues to be "numeric".
 
 If you need to represent integers with possibly missing values, use one of
-the nullable-integer extension dtypes provided by pandas
+the nullable-integer extension dtypes provided by pandas or pyarrow
 
 * :class:`Int8Dtype`
 * :class:`Int16Dtype`
 * :class:`Int32Dtype`
 * :class:`Int64Dtype`
+* :class:`ArrowDtype`
 
 .. ipython:: python
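For reference, the nullable-integer dtypes listed in this hunk keep missing values without the float coercion (a minimal sketch alongside the diff; the pyarrow-backed variant in the next hunk works the same way but requires ``pyarrow`` installed):

```python
import pandas as pd

# The "Int64" extension dtype (capital I) stores pd.NA directly,
# so the values stay integers instead of being cast to float64.
s_int = pd.Series([1, 2, None], dtype="Int64")
print(s_int.dtype)            # Int64
print(s_int.isna().tolist())  # [False, False, True]
```

Compare this with the default behavior, where ``pd.Series([1, 2, None])`` yields a ``float64`` Series.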
@@ -293,28 +305,10 @@ the nullable-integer extension dtypes provided by pandas
     s2_int
     s2_int.dtype
 
-See :ref:`integer_na` for more.
-
-``NA`` type promotions
-~~~~~~~~~~~~~~~~~~~~~~
-
-When introducing NAs into an existing :class:`Series` or :class:`DataFrame` via
-:meth:`~Series.reindex` or some other means, boolean and integer types will be
-promoted to a different dtype in order to store the NAs. The promotions are
-summarized in this table:
-
-.. csv-table::
-    :header: "Typeclass","Promotion dtype for storing NAs"
-    :widths: 40,60
-
-    ``floating``, no change
-    ``object``, no change
-    ``integer``, cast to ``float64``
-    ``boolean``, cast to ``object``
+    s_int_pa = pd.Series([1, 2, None], dtype="int64[pyarrow]")
+    s_int_pa
 
-While this may seem like a heavy trade-off, I have found very few cases where
-this is an issue in practice i.e. storing values greater than 2**53. Some
-explanation for the motivation is in the next section.
+See :ref:`integer_na` and :ref:`pyarrow` for more.
 
 Why not make NumPy like R?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -342,16 +336,8 @@ each type to be used as the missing value. While doing this with the full NumPy
 type hierarchy would be possible, it would be a more substantial trade-off
 (especially for the 8- and 16-bit data types) and implementation undertaking.
 
-An alternate approach is that of using masked arrays. A masked array is an
-array of data with an associated boolean *mask* denoting whether each value
-should be considered ``NA`` or not. I am personally not in love with this
-approach as I feel that overall it places a fairly heavy burden on the user and
-the library implementer. Additionally, it exacts a fairly high performance cost
-when working with numerical data compared with the simple approach of using
-``NaN``. Thus, I have chosen the Pythonic "practicality beats purity" approach
-and traded integer ``NA`` capability for a much simpler approach of using a
-special value in float and object arrays to denote ``NA``, and promoting
-integer arrays to floating when NAs must be introduced.
+However, R ``NA`` semantics are now available by using masked NumPy types such as :class:`Int64Dtype`
+or PyArrow types (:class:`ArrowDtype`).
 
 Differences with NumPy
