@@ -27,13 +27,94 @@ general, we were given the difficult choice between either
- Using a special sentinel value, bit pattern, or set of sentinel values to
  denote ``NA`` across the dtypes
+ For many reasons we chose the latter. After years of production use it has
+ proven, at least in my opinion, to be the best decision given the state of
+ affairs in NumPy and Python in general. The special value ``NaN``
+ (Not-A-Number) is used everywhere as the ``NA`` value, and there are API
+ functions ``isnull`` and ``notnull`` which can be used across the dtypes to
+ detect NA values.
+
+ However, it comes with a couple of trade-offs which I most certainly have
+ not ignored.
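As a sketch of how these detection functions behave uniformly across dtypes (shown here via the top-level ``pandas.isnull``/``pandas.notnull`` functions in a current pandas):

```python
import numpy as np
import pandas as pd

# isnull/notnull work uniformly whether NA is a float NaN or an
# object-dtype None
s_float = pd.Series([1.0, np.nan, 3.0])
s_object = pd.Series(['a', None, 'c'])

print(pd.isnull(s_float).tolist())   # [False, True, False]
print(pd.notnull(s_object).tolist())  # [True, False, True]
```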
Support for integer ``NA``
~~~~~~~~~~~~~~~~~~~~~~~~~~
+ In the absence of high performance ``NA`` support being built into NumPy from
+ the ground up, the primary casualty is the ability to represent NAs in integer
+ arrays. For example:
+
+ .. ipython:: python
+
+    s = Series([1, 2, 3, 4, 5], index=list('abcde'))
+    s
+    s.dtype
+
+    s2 = s.reindex(['a', 'b', 'c', 'f', 'u'])
+    s2
+    s2.dtype
+
+ This trade-off is made largely for memory and performance reasons, and also so
+ that the resulting Series continues to be "numeric". One possibility is to use
+ ``dtype=object`` arrays instead.
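The ``dtype=object`` alternative mentioned above can be sketched as follows; the boxed values keep their integer type when NAs are introduced, at the cost of speed and memory (an illustration of the trade-off, not a recommendation):

```python
import pandas as pd

# An object-dtype Series stores Python ints directly, so introducing
# NaN via reindex does not force the remaining values to float
s = pd.Series([1, 2, 3], index=list('abc'), dtype=object)
s2 = s.reindex(['a', 'b', 'z'])  # 'z' is missing -> NaN introduced

print(s2.dtype)       # object -- dtype is unchanged
print(type(s2['a']))  # int, not numpy.float64
```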
``NA`` type promotions
~~~~~~~~~~~~~~~~~~~~~~
+ When introducing NAs into an existing Series or DataFrame via ``reindex`` or
+ some other means, boolean and integer types will be promoted to a different
+ dtype in order to store the NAs. These are summarized by this table:
+
+ .. csv-table::
+    :header: "Typeclass","Promotion dtype for storing NAs"
+    :widths: 40,60
+
+    ``floating``, no change
+    ``object``, no change
+    ``integer``, cast to ``float64``
+    ``boolean``, cast to ``object``
+
+ While this may seem like a heavy trade-off, in practice I have found very few
+ cases where it is actually an issue. The motivation is explained further in
+ the next section.
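The promotions in the table above can be observed directly with ``reindex`` (a sketch using a current pandas):

```python
import pandas as pd

s_int = pd.Series([1, 2, 3], index=list('abc'))
s_bool = pd.Series([True, False, True], index=list('abc'))

# Reindexing past the original labels introduces NAs, triggering
# the dtype promotions from the table
print(s_int.reindex(['a', 'b', 'z']).dtype)   # int64  -> float64
print(s_bool.reindex(['a', 'b', 'z']).dtype)  # bool   -> object
```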
+
+ Why not make NumPy like R?
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ Many people have suggested that NumPy should simply emulate the ``NA`` support
+ present in the more domain-specific statistical programming language `R
+ <http://r-project.org>`__. Part of the reason is the NumPy type hierarchy:
+
+ .. csv-table::
+    :header: "Typeclass","Dtypes"
+    :widths: 30,70
+    :delim: |
+
+    ``numpy.floating`` | ``float16, float32, float64, float128``
+    ``numpy.integer`` | ``int8, int16, int32, int64``
+    ``numpy.unsignedinteger`` | ``uint8, uint16, uint32, uint64``
+    ``numpy.object_`` | ``object_``
+    ``numpy.bool_`` | ``bool_``
+    ``numpy.character`` | ``string_, unicode_``
+
+ The R language, by contrast, only has a handful of built-in data types:
+ ``integer``, ``numeric`` (floating-point), ``character``, and
+ ``boolean``. ``NA`` types are implemented by reserving special bit patterns for
+ each type to be used as the missing value. While doing this with the full NumPy
+ type hierarchy would be possible, it would be a more substantial trade-off
+ (especially for the 8- and 16-bit data types) and implementation undertaking.
+
+ An alternate approach is that of using masked arrays. A masked array is an
+ array of data with an associated boolean *mask* denoting whether each value
+ should be considered ``NA`` or not. I am personally not in love with this
+ approach as I feel that overall it places a fairly heavy burden on the user and
+ the library implementer. Additionally, it exacts a fairly high performance cost
+ when working with numerical data compared with the simple approach of using
+ ``NaN``. Thus, I have chosen the Pythonic "practicality beats purity" approach
+ and traded integer ``NA`` capability for a much simpler approach of using a
+ special value in float and object arrays to denote ``NA``, and promoting
+ integer arrays to floating when NAs must be introduced.
+
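NumPy itself ships a masked-array module, ``numpy.ma``, which illustrates the approach described above (a sketch for comparison; pandas does not use this machinery internally):

```python
import numpy as np

# A masked array pairs integer data with a boolean mask; masked slots
# act like NA without promoting the array to float
data = np.ma.masked_array([1, 2, 3, 4],
                          mask=[False, True, False, False])

print(data.dtype.kind)  # 'i' -- the integer dtype is preserved
print(data.sum())       # 8  -- masked values are skipped in reductions
```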
Integer indexing
----------------
@@ -71,7 +152,8 @@ index can be somewhat complicated. For example, the following does not work:
    s.ix['c':'e'+1]
A very common use case is to limit a time series to start and end at two
- specific dates. To enable this, we made the design design to make label-based slicing include both endpoints:
+ specific dates. To enable this, we made the design decision to make label-based
+ slicing include both endpoints:

.. ipython:: python
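In modern pandas the same endpoint-inclusive behavior is exposed through ``.loc`` (``.ix`` has since been removed); a sketch of the contrast with positional slicing:

```python
import pandas as pd

s = pd.Series(range(5), index=list('abcde'))

# Label-based slicing includes BOTH endpoints...
print(s.loc['b':'d'].index.tolist())  # ['b', 'c', 'd']

# ...while positional slicing is half-open, as in plain Python
print(s.iloc[1:3].index.tolist())     # ['b', 'c']
```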