Commit 5ceeb43

DOC: Improve gotchas.rst (#45375)
1 parent 5acb14b commit 5ceeb43

4 files changed: +51 -49 lines changed

doc/source/user_guide/gotchas.rst

+39 -48
@@ -10,13 +10,13 @@ Frequently Asked Questions (FAQ)

 DataFrame memory usage
 ----------------------
-The memory usage of a ``DataFrame`` (including the index) is shown when calling
+The memory usage of a :class:`DataFrame` (including the index) is shown when calling
 the :meth:`~DataFrame.info`. A configuration option, ``display.memory_usage``
 (see :ref:`the list of options <options.available>`), specifies if the
-``DataFrame``'s memory usage will be displayed when invoking the ``df.info()``
+:class:`DataFrame` memory usage will be displayed when invoking the ``df.info()``
 method.

-For example, the memory usage of the ``DataFrame`` below is shown
+For example, the memory usage of the :class:`DataFrame` below is shown
 when calling :meth:`~DataFrame.info`:

 .. ipython:: python
@@ -53,9 +53,9 @@ By default the display option is set to ``True`` but can be explicitly
 overridden by passing the ``memory_usage`` argument when invoking ``df.info()``.

 The memory usage of each column can be found by calling the
-:meth:`~DataFrame.memory_usage` method. This returns a ``Series`` with an index
+:meth:`~DataFrame.memory_usage` method. This returns a :class:`Series` with an index
 represented by column names and memory usage of each column shown in bytes. For
-the ``DataFrame`` above, the memory usage of each column and the total memory
+the :class:`DataFrame` above, the memory usage of each column and the total memory
 usage can be found with the ``memory_usage`` method:

 .. ipython:: python
@@ -65,8 +65,8 @@ usage can be found with the ``memory_usage`` method:
     # total memory usage of dataframe
     df.memory_usage().sum()

-By default the memory usage of the ``DataFrame``'s index is shown in the
-returned ``Series``, the memory usage of the index can be suppressed by passing
+By default the memory usage of the :class:`DataFrame` index is shown in the
+returned :class:`Series`, the memory usage of the index can be suppressed by passing
 the ``index=False`` argument:

 .. ipython:: python
@@ -75,7 +75,7 @@ the ``index=False`` argument:

 The memory usage displayed by the :meth:`~DataFrame.info` method utilizes the
 :meth:`~DataFrame.memory_usage` method to determine the memory usage of a
-``DataFrame`` while also formatting the output in human-readable units (base-2
+:class:`DataFrame` while also formatting the output in human-readable units (base-2
 representation; i.e. 1KB = 1024 bytes).

 See also :ref:`Categorical Memory Usage <categorical.memory>`.
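
The ``memory_usage`` options discussed in the hunks above can be tried out with a small sketch. The frame and its column names are hypothetical and are not the example from the guide; only ``index`` and ``deep`` are parameters named in this commit.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration; not the frame used in the guide.
df = pd.DataFrame(
    {
        "ints": np.arange(1000, dtype="int64"),
        "floats": np.zeros(1000, dtype="float64"),
        "strings": ["pandas"] * 1000,
    }
)

# Series indexed by column name, values in bytes, index excluded.
per_column = df.memory_usage(index=False)

# Default call: the same Series plus an "Index" entry.
with_index = df.memory_usage()

# deep=True inspects the actual objects instead of estimating from dtype.
deep = df.memory_usage(index=False, deep=True)

print(per_column)
print(with_index.sum() - per_column.sum())  # bytes attributed to the index
print(deep["strings"] >= per_column["strings"])
```

The ``int64`` column holds 1000 values of 8 bytes each, so its entry is 8000 bytes whether or not ``deep`` is set.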
@@ -98,32 +98,28 @@ of the following code should be:
 Should it be ``True`` because it's not zero-length, or ``False`` because there
 are ``False`` values? It is unclear, so instead, pandas raises a ``ValueError``:

-.. code-block:: python
+.. ipython:: python
+    :okexcept:

-    >>> if pd.Series([False, True, False]):
-    ...     print("I was true")
-    Traceback
-        ...
-    ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
+    if pd.Series([False, True, False]):
+        print("I was true")

-You need to explicitly choose what you want to do with the ``DataFrame``, e.g.
+You need to explicitly choose what you want to do with the :class:`DataFrame`, e.g.
 use :meth:`~DataFrame.any`, :meth:`~DataFrame.all` or :meth:`~DataFrame.empty`.
 Alternatively, you might want to compare if the pandas object is ``None``:

-.. code-block:: python
+.. ipython:: python

-    >>> if pd.Series([False, True, False]) is not None:
-    ...     print("I was not None")
-    I was not None
+    if pd.Series([False, True, False]) is not None:
+        print("I was not None")


 Below is how to check if any of the values are ``True``:

-.. code-block:: python
+.. ipython:: python

-    >>> if pd.Series([False, True, False]).any():
-    ...     print("I am any")
-    I am any
+    if pd.Series([False, True, False]).any():
+        print("I am any")

 To evaluate single-element pandas objects in a boolean context, use the method
 :meth:`~DataFrame.bool`:
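
As a rough sketch of the explicit choices named earlier in this hunk (``any``, ``all`` and ``empty``), the following shows the ``ValueError`` being raised and then each alternative side by side. It does not demonstrate the ``bool`` method referenced in the last context line; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([False, True, False])

# Using the Series directly in a boolean context raises ValueError.
try:
    if s:
        print("I was true")
except ValueError as err:
    print(f"ValueError: {err}")

# Explicit alternatives, each answering a different question.
has_any_true = bool(s.any())  # is at least one value True?
all_true = bool(s.all())      # are all values True?
is_empty = s.empty            # does the Series have zero elements?

print(has_any_true, all_true, is_empty)
```

Note that ``empty`` is accessed as an attribute rather than called, unlike ``any()`` and ``all()``.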
@@ -138,27 +134,21 @@ To evaluate single-element pandas objects in a boolean context, use the method
 Bitwise boolean
 ~~~~~~~~~~~~~~~

-Bitwise boolean operators like ``==`` and ``!=`` return a boolean ``Series``,
-which is almost always what you want anyways.
+Bitwise boolean operators like ``==`` and ``!=`` return a boolean :class:`Series`
+which performs an element-wise comparison when compared to a scalar.

-.. code-block:: python
+.. ipython:: python

-    >>> s = pd.Series(range(5))
-    >>> s == 4
-    0    False
-    1    False
-    2    False
-    3    False
-    4     True
-    dtype: bool
+    s = pd.Series(range(5))
+    s == 4

 See :ref:`boolean comparisons<basics.compare>` for more examples.

 Using the ``in`` operator
 ~~~~~~~~~~~~~~~~~~~~~~~~~

-Using the Python ``in`` operator on a ``Series`` tests for membership in the
-index, not membership among the values.
+Using the Python ``in`` operator on a :class:`Series` tests for membership in the
+**index**, not membership among the values.

 .. ipython:: python

@@ -167,15 +157,15 @@ index, not membership among the values.
     'b' in s

 If this behavior is surprising, keep in mind that using ``in`` on a Python
-dictionary tests keys, not values, and ``Series`` are dict-like.
+dictionary tests keys, not values, and :class:`Series` are dict-like.
 To test for membership in the values, use the method :meth:`~pandas.Series.isin`:

 .. ipython:: python

     s.isin([2])
     s.isin([2]).any()

-For ``DataFrames``, likewise, ``in`` applies to the column axis,
+For :class:`DataFrame`, likewise, ``in`` applies to the column axis,
 testing for membership in the list of column names.

 .. _gotchas.udf-mutation:
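
The ``in`` behavior described in the hunk above, including the :class:`DataFrame` case that has no example there, can be illustrated with a short sketch. The Series and DataFrame below are hypothetical stand-ins, since the objects defined in the guide fall outside this diff.

```python
import pandas as pd

# Hypothetical Series with string labels and integer values.
s = pd.Series(range(5), index=list("abcde"))

value_in_series = 2 in s          # checks the index labels, so 2 is not found
label_in_series = "b" in s        # "b" is an index label
value_in_values = 2 in s.values   # checks the underlying values
value_via_isin = bool(s.isin([2]).any())

# Hypothetical DataFrame: `in` checks the column names.
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
column_in_frame = "x" in df
value_in_frame = 1 in df          # 1 is a cell value, not a column name

print(value_in_series, label_in_series, value_in_values, value_via_isin)
print(column_in_frame, value_in_frame)
```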
@@ -206,8 +196,8 @@ causing unexpected behavior. Consider the example:
 One probably would have expected that the result would be ``[1, 3, 5]``.
 When using a pandas method that takes a UDF, internally pandas is often
 iterating over the
-``DataFrame`` or other pandas object. Therefore, if the UDF mutates (changes)
-the ``DataFrame``, unexpected behavior can arise.
+:class:`DataFrame` or other pandas object. Therefore, if the UDF mutates (changes)
+the :class:`DataFrame`, unexpected behavior can arise.

 Here is a similar example with :meth:`DataFrame.apply`:

@@ -267,7 +257,7 @@ For many reasons we chose the latter. After years of production use it has
 proven, at least in my opinion, to be the best decision given the state of
 affairs in NumPy and Python in general. The special value ``NaN``
 (Not-A-Number) is used everywhere as the ``NA`` value, and there are API
-functions ``isna`` and ``notna`` which can be used across the dtypes to
+functions :meth:`DataFrame.isna` and :meth:`DataFrame.notna` which can be used across the dtypes to
 detect NA values.

 However, it comes with it a couple of trade-offs which I most certainly have
@@ -293,7 +283,7 @@ arrays. For example:
     s2.dtype

 This trade-off is made largely for memory and performance reasons, and also so
-that the resulting ``Series`` continues to be "numeric".
+that the resulting :class:`Series` continues to be "numeric".

 If you need to represent integers with possibly missing values, use one of
 the nullable-integer extension dtypes provided by pandas
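
The integer-to-float promotion and the nullable-integer alternative mentioned in the hunk above might look like the following. This is a minimal sketch with hypothetical labels; ``"Int64"`` is used as one example of a nullable-integer extension dtype.

```python
import pandas as pd

# Hypothetical integer Series; the labels are illustrative only.
s = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(s.dtype)

# Reindexing with a label that does not exist introduces a missing value,
# so the default integer dtype is promoted to float64.
promoted = s.reindex(["a", "b", "z"])
print(promoted.dtype)

# With a nullable-integer dtype the values stay integers and the gap is
# marked as missing instead of forcing a float conversion.
nullable = s.astype("Int64").reindex(["a", "b", "z"])
print(nullable.dtype)
print(nullable.isna().tolist())
```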
@@ -318,7 +308,7 @@ See :ref:`integer_na` for more.
 ``NA`` type promotions
 ~~~~~~~~~~~~~~~~~~~~~~

-When introducing NAs into an existing ``Series`` or ``DataFrame`` via
+When introducing NAs into an existing :class:`Series` or :class:`DataFrame` via
 :meth:`~Series.reindex` or some other means, boolean and integer types will be
 promoted to a different dtype in order to store the NAs. The promotions are
 summarized in this table:
@@ -376,18 +366,19 @@ integer arrays to floating when NAs must be introduced.

 Differences with NumPy
 ----------------------
-For ``Series`` and ``DataFrame`` objects, :meth:`~DataFrame.var` normalizes by
+For :class:`Series` and :class:`DataFrame` objects, :meth:`~DataFrame.var` normalizes by
 ``N-1`` to produce unbiased estimates of the sample variance, while NumPy's
-``var`` normalizes by N, which measures the variance of the sample. Note that
+:meth:`numpy.var` normalizes by N, which measures the variance of the sample. Note that
 :meth:`~DataFrame.cov` normalizes by ``N-1`` in both pandas and NumPy.

+.. _gotchas.thread-safety:

 Thread-safety
 -------------

-As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to
+pandas is not 100% thread safe. The known issues relate to
 the :meth:`~DataFrame.copy` method. If you are doing a lot of copying of
-``DataFrame`` objects shared among threads, we recommend holding locks inside
+:class:`DataFrame` objects shared among threads, we recommend holding locks inside
 the threads where the data copying occurs.

 See `this link <https://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe>`__
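
The ``N-1`` versus ``N`` normalization noted under "Differences with NumPy" in the hunk above can be checked numerically. The sample values below are arbitrary; ``ddof`` is the parameter both libraries expose for choosing the divisor.

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]  # arbitrary sample for illustration
s = pd.Series(values)

pandas_var = s.var()        # divides by N - 1 (ddof=1 by default)
numpy_var = np.var(values)  # divides by N (ddof=0 by default)

print(pandas_var, numpy_var)

# Either library reproduces the other when ddof is set explicitly.
print(np.isclose(s.var(ddof=0), numpy_var))
print(np.isclose(np.var(values, ddof=1), pandas_var))
```

For this sample the squared deviations sum to 5, so the two estimates are 5/3 and 5/4 respectively.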
@@ -406,7 +397,7 @@ symptom of this issue is an error like::

 To deal
 with this issue you should convert the underlying NumPy array to the native
-system byte order *before* passing it to ``Series`` or ``DataFrame``
+system byte order *before* passing it to :class:`Series` or :class:`DataFrame`
 constructors using something similar to the following:

 .. ipython:: python

pandas/core/frame.py

+5
@@ -3219,6 +3219,11 @@ def memory_usage(self, index: bool = True, deep: bool = False) -> Series:
             many repeated values.
         DataFrame.info : Concise summary of a DataFrame.

+        Notes
+        -----
+        See the :ref:`Frequently Asked Questions <df-memory-usage>` for more
+        details.
+
         Examples
         --------
         >>> dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']

pandas/core/generic.py

+4
@@ -5939,6 +5939,10 @@ def copy(self: NDFrameT, deep: bool_t = True) -> NDFrameT:
         immutable, the underlying data can be safely shared and a copy
         is not needed.

+        Since pandas is not thread safe, see the
+        :ref:`gotchas <gotchas.thread-safety>` when copying in a threading
+        environment.
+
         Examples
         --------
         >>> s = pd.Series([1, 2], index=["a", "b"])
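
One way to follow the thread-safety recommendation that this docstring note points to is to guard each copy of a shared frame with a lock. The sketch below is an assumption about how that might be arranged, not code from pandas; ``shared``, ``copy_lock`` and ``worker`` are hypothetical names.

```python
import threading

import pandas as pd

# Hypothetical DataFrame shared among several threads.
shared = pd.DataFrame({"a": range(100), "b": range(100)})
copy_lock = threading.Lock()  # hypothetical lock guarding copies of `shared`
results = []


def worker():
    # Hold the lock only while the copy is made, then work on the
    # thread-private copy without holding it.
    with copy_lock:
        local = shared.copy()
    local["a"] = local["a"] * 2
    results.append(int(local["a"].sum()))


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

Keeping the locked region limited to the ``copy`` call means the threads serialize only on the step the guide identifies as problematic.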

pandas/io/formats/info.py

+3 -1
@@ -280,7 +280,9 @@
         made based in column dtype and number of rows assuming values
         consume the same memory amount for corresponding dtypes. With deep
         memory introspection, a real memory usage calculation is performed
-        at the cost of computational resources.
+        at the cost of computational resources. See the
+        :ref:`Frequently Asked Questions <df-memory-usage>` for more
+        details.
     {show_counts_sub}{null_counts_sub}

     Returns
