@@ -13,7 +13,7 @@ DataFrame memory usage
The memory usage of a :class:`DataFrame` (including the index) is shown when calling
the :meth:`~DataFrame.info`. A configuration option, ``display.memory_usage``
(see :ref:`the list of options <options.available>`), specifies if the
- :class:`DataFrame` memory usage will be displayed when invoking the ``df.info()``
+ :class:`DataFrame` memory usage will be displayed when invoking the :meth:`~DataFrame.info`
method.

For example, the memory usage of the :class:`DataFrame` below is shown
@@ -50,13 +50,13 @@ as it can be expensive to do this deeper introspection.
    df.info(memory_usage="deep")

By default the display option is set to ``True`` but can be explicitly
- overridden by passing the ``memory_usage`` argument when invoking ``df.info()``.
+ overridden by passing the ``memory_usage`` argument when invoking :meth:`~DataFrame.info`.
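
For instance, the memory-usage readout can be suppressed for a single call by passing
``memory_usage=False`` (a minimal sketch, assuming the ``df`` constructed earlier on this
page is still in scope):

.. ipython:: python

    # assumes the ``df`` from the example above is still defined
    df.info(memory_usage=False)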

The memory usage of each column can be found by calling the
:meth:`~DataFrame.memory_usage` method. This returns a :class:`Series` with an index
represented by column names and memory usage of each column shown in bytes. For
the :class:`DataFrame` above, the memory usage of each column and the total memory
- usage can be found with the ``memory_usage`` method:
+ usage can be found with the :meth:`~DataFrame.memory_usage` method:

.. ipython:: python

@@ -164,7 +164,8 @@ Mutating with User Defined Function (UDF) methods
-------------------------------------------------

This section applies to pandas methods that take a UDF. In particular, the methods
- ``.apply``, ``.aggregate``, ``.transform``, and ``.filter``.
+ :meth:`DataFrame.apply`, :meth:`DataFrame.aggregate`, :meth:`DataFrame.transform`, and
+ :meth:`DataFrame.filter`.

It is a general rule in programming that one should not mutate a container
while it is being iterated over. Mutation will invalidate the iterator,
@@ -192,16 +193,14 @@ the :class:`DataFrame`, unexpected behavior can arise.
Here is a similar example with :meth:`DataFrame.apply`:

.. ipython:: python
+     :okexcept:

    def f(s):
        s.pop("a")
        return s

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
-     try:
-         df.apply(f, axis="columns")
-     except Exception as err:
-         print(repr(err))
+     df.apply(f, axis="columns")

To resolve this issue, one can make a copy so that the mutation does
not apply to the container being iterated over.
@@ -229,29 +228,41 @@ not apply to the container being iterated over.
    df = pd.DataFrame({"a": [1, 2, 3], 'b': [4, 5, 6]})
    df.apply(f, axis="columns")

- ``NaN``, Integer ``NA`` values and ``NA`` type promotions
- ---------------------------------------------------------
+ Missing value representation for NumPy types
+ --------------------------------------------

- Choice of ``NA`` representation
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ ``np.nan`` as the ``NA`` representation for NumPy types
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For lack of ``NA`` (missing) support from the ground up in NumPy and Python in
- general, we were given the difficult choice between either:
+ general, ``NA`` could have been represented with:

* A *masked array* solution: an array of data and an array of boolean values
  indicating whether a value is there or is missing.
* Using a special sentinel value, bit pattern, or set of sentinel values to
  denote ``NA`` across the dtypes.

- For many reasons we chose the latter. After years of production use it has
- proven, at least in my opinion, to be the best decision given the state of
- affairs in NumPy and Python in general. The special value ``NaN``
- (Not-A-Number) is used everywhere as the ``NA`` value, and there are API
- functions :meth:`DataFrame.isna` and :meth:`DataFrame.notna` which can be used across the dtypes to
- detect NA values.
+ The special value ``np.nan`` (Not-A-Number) was chosen as the ``NA`` value for NumPy types, and there are API
+ functions like :meth:`DataFrame.isna` and :meth:`DataFrame.notna` which can be used across the dtypes to
+ detect NA values. However, this choice has the downside of coercing integer data with missing
+ values to float types, as shown in :ref:`gotchas.intna`.
+
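A minimal illustration of those detection methods (assuming the usual ``import numpy as np``
and ``import pandas as pd`` setup used throughout this page):

.. ipython:: python

    # np.nan is reported as missing by isna/notna
    s = pd.Series([1.0, np.nan, 3.0])
    s.isna()
    s.notna()
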
+ ``NA`` type promotions for NumPy types
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ When introducing NAs into an existing :class:`Series` or :class:`DataFrame` via
+ :meth:`~Series.reindex` or some other means, boolean and integer types will be
+ promoted to a different dtype in order to store the NAs. The promotions are
+ summarized in this table and in the short example that follows:

- However, it comes with it a couple of trade-offs which I most certainly have
- not ignored.
+ .. csv-table::
+     :header: "Typeclass","Promotion dtype for storing NAs"
+     :widths: 40,60
+
+     ``floating``, no change
+     ``object``, no change
+     ``integer``, cast to ``float64``
+     ``boolean``, cast to ``object``
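
A minimal sketch of these promotions in action, using :meth:`~Series.reindex` to introduce
the missing values (assuming the standard ``import pandas as pd`` setup):

.. ipython:: python

    # integers are promoted to float64 once a missing value appears
    s = pd.Series([1, 2, 3], dtype="int64")
    s.reindex([0, 1, 2, 3]).dtype
    # booleans are promoted to object
    b = pd.Series([True, False])
    b.reindex([0, 1, 2]).dtype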

.. _gotchas.intna:

@@ -276,12 +287,13 @@ This trade-off is made largely for memory and performance reasons, and also so
that the resulting :class:`Series` continues to be "numeric".

If you need to represent integers with possibly missing values, use one of
- the nullable-integer extension dtypes provided by pandas
+ the nullable-integer extension dtypes provided by pandas or pyarrow

* :class:`Int8Dtype`
* :class:`Int16Dtype`
* :class:`Int32Dtype`
* :class:`Int64Dtype`
+ * :class:`ArrowDtype`

.. ipython:: python

@@ -293,28 +305,10 @@ the nullable-integer extension dtypes provided by pandas
    s2_int
    s2_int.dtype

- See :ref:`integer_na` for more.
-
- ``NA`` type promotions
- ~~~~~~~~~~~~~~~~~~~~~~
-
- When introducing NAs into an existing :class:`Series` or :class:`DataFrame` via
- :meth:`~Series.reindex` or some other means, boolean and integer types will be
- promoted to a different dtype in order to store the NAs. The promotions are
- summarized in this table:
-
- .. csv-table::
-     :header: "Typeclass","Promotion dtype for storing NAs"
-     :widths: 40,60
-
-     ``floating``, no change
-     ``object``, no change
-     ``integer``, cast to ``float64``
-     ``boolean``, cast to ``object``
+     s_int_pa = pd.Series([1, 2, None], dtype="int64[pyarrow]")
+     s_int_pa

- While this may seem like a heavy trade-off, I have found very few cases where
- this is an issue in practice i.e. storing values greater than 2**53. Some
- explanation for the motivation is in the next section.
+ See :ref:`integer_na` and :ref:`pyarrow` for more.

Why not make NumPy like R?
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -342,16 +336,8 @@ each type to be used as the missing value. While doing this with the full NumPy
type hierarchy would be possible, it would be a more substantial trade-off
(especially for the 8- and 16-bit data types) and implementation undertaking.

- An alternate approach is that of using masked arrays. A masked array is an
- array of data with an associated boolean *mask* denoting whether each value
- should be considered ``NA`` or not. I am personally not in love with this
- approach as I feel that overall it places a fairly heavy burden on the user and
- the library implementer. Additionally, it exacts a fairly high performance cost
- when working with numerical data compared with the simple approach of using
- ``NaN``. Thus, I have chosen the Pythonic "practicality beats purity" approach
- and traded integer ``NA`` capability for a much simpler approach of using a
- special value in float and object arrays to denote ``NA``, and promoting
- integer arrays to floating when NAs must be introduced.
+ However, R ``NA`` semantics are now available by using masked NumPy-backed extension types
+ such as :class:`Int64Dtype` or PyArrow-backed types (:class:`ArrowDtype`).
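
For example, a nullable ``Int64`` column keeps its integer dtype and propagates ``pd.NA``
through arithmetic (a minimal sketch; the PyArrow-backed equivalent uses
``dtype="int64[pyarrow]"`` as shown earlier):

.. ipython:: python

    # missing values are stored as pd.NA without casting to float
    s = pd.Series([1, None, 3], dtype="Int64")
    s
    s + 1
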
Differences with NumPy