
Commit 99e2eec

DOC: more on gotchas, review integer indexing API changes, GH #627
1 parent 63a1239 commit 99e2eec

2 files changed: +86, -1 lines changed

RELEASE.rst (+3)
@@ -232,6 +232,8 @@ pandas 0.7.0
   (GH #666)
 - Use sort kind in Series.sort / argsort (GH #668)
 - Fix DataFrame operations on non-scalar, non-pandas objects (GH #672)
+- Don't convert DataFrame column to integer type when passing integer to
+  __setitem__ (GH #669)

 Thanks
 ------
@@ -261,6 +263,7 @@ Thanks
 - Dieter Vandenbussche
 - Texas P.
 - Pinxing Ye
+- ... and everyone I forgot

 pandas 0.6.1
 ============

doc/source/gotchas.rst (+83, -1)
@@ -27,13 +27,94 @@ general, we were given the difficult choice between either
 - Using a special sentinel value, bit pattern, or set of sentinel values to
   denote ``NA`` across the dtypes

+For many reasons we chose the latter. After years of production use it has
+proven, at least in my opinion, to be the best decision given the state of
+affairs in NumPy and Python in general. The special value ``NaN``
+(Not-A-Number) is used everywhere as the ``NA`` value, and there are API
+functions ``isnull`` and ``notnull`` which can be used across the dtypes to
+detect NA values.
+
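
As an editorial aside (a minimal sketch, not part of the commit, assuming the
top-level pandas namespace), the two functions in action:

.. ipython:: python

   import numpy as np
   from pandas import Series, isnull, notnull

   s = Series([1.5, np.nan, 3.0, np.nan])
   isnull(s)      # True wherever the value is the NaN sentinel
   notnull(s)     # the element-wise complement
   s[notnull(s)]  # boolean indexing drops the NA values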
+However, it comes with a couple of trade-offs which I most certainly have
+not ignored.

 Support for integer ``NA``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

+In the absence of high performance ``NA`` support being built into NumPy from
+the ground up, the primary casualty is the ability to represent NAs in integer
+arrays. For example:
+
+.. ipython:: python
+
+   s = Series([1, 2, 3, 4, 5], index=list('abcde'))
+   s
+   s.dtype
+
+   s2 = s.reindex(['a', 'b', 'c', 'f', 'u'])
+   s2
+   s2.dtype
+
+This trade-off is made largely for memory and performance reasons, and also so
+that the resulting Series continues to be "numeric". One possibility is to use
+``dtype=object`` arrays instead.
+
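
To sketch that alternative (an editorial example, not part of the commit): with
``dtype=object`` the original integers survive the reindexing, at the cost of
the Series no longer having a numeric dtype:

.. ipython:: python

   s_obj = Series([1, 2, 3, 4, 5], index=list('abcde'), dtype=object)
   s_obj.reindex(['a', 'b', 'c', 'f', 'u'])        # integers kept, NaN inserted
   s_obj.reindex(['a', 'b', 'c', 'f', 'u']).dtype  # stays object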
 ``NA`` type promotions
 ~~~~~~~~~~~~~~~~~~~~~~

+When introducing NAs into an existing Series or DataFrame via ``reindex`` or
+some other means, boolean and integer types will be promoted to a different
+dtype in order to store the NAs. These are summarized by this table:
+
+.. csv-table::
+    :header: "Typeclass","Promotion dtype for storing NAs"
+    :widths: 40,60
+
+    ``floating``, no change
+    ``object``, no change
+    ``integer``, cast to ``float64``
+    ``boolean``, cast to ``object``
+
+While this may seem like a heavy trade-off, I have found very few cases where
+this is an issue in practice. Some explanation for the motivation is given in
+the next section.
+
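
The promotions in the table can be checked directly; a brief editorial sketch
(again using ``reindex`` to introduce NAs):

.. ipython:: python

   from pandas import Series

   Series([1, 2, 3]).reindex(range(4)).dtype      # integer -> float64
   Series([True, False]).reindex(range(3)).dtype  # boolean -> object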
+Why not make NumPy like R?
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Many people have suggested that NumPy should simply emulate the ``NA`` support
+present in the more domain-specific statistical programming language `R
+<http://r-project.org>`__. Part of the reason is the NumPy type hierarchy:
+
+.. csv-table::
+    :header: "Typeclass","Dtypes"
+    :widths: 30,70
+    :delim: |
+
+    ``numpy.floating`` | ``float16, float32, float64, float128``
+    ``numpy.integer`` | ``int8, int16, int32, int64``
+    ``numpy.unsignedinteger`` | ``uint8, uint16, uint32, uint64``
+    ``numpy.object_`` | ``object_``
+    ``numpy.bool_`` | ``bool_``
+    ``numpy.character`` | ``string_, unicode_``
+
+The R language, by contrast, only has a handful of built-in data types:
+``integer``, ``numeric`` (floating-point), ``character``, and
+``boolean``. ``NA`` types are implemented by reserving special bit patterns for
+each type to be used as the missing value. While doing this with the full NumPy
+type hierarchy would be possible, it would be a more substantial trade-off
+(especially for the 8- and 16-bit data types) and implementation undertaking.
+
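
To make the hierarchy above concrete, a short editorial aside using standard
NumPy introspection (not part of the commit):

.. ipython:: python

   import numpy as np

   np.issubdtype(np.dtype('int16'), np.integer)          # True
   np.issubdtype(np.dtype('float32'), np.floating)       # True
   np.issubdtype(np.dtype('uint8'), np.unsignedinteger)  # True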
+An alternate approach is that of using masked arrays. A masked array is an
+array of data with an associated boolean *mask* denoting whether each value
+should be considered ``NA`` or not. I am personally not in love with this
+approach as I feel that overall it places a fairly heavy burden on the user and
+the library implementer. Additionally, it exacts a fairly high performance cost
+when working with numerical data compared with the simple approach of using
+``NaN``. Thus, I have chosen the Pythonic "practicality beats purity" approach
+and traded integer ``NA`` capability for a much simpler approach of using a
+special value in float and object arrays to denote ``NA``, and promoting
+integer arrays to floating when NAs must be introduced.
+
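
For readers unfamiliar with that alternative, a minimal editorial sketch of
NumPy's own masked arrays (``numpy.ma``); this is the approach pandas did not
adopt:

.. ipython:: python

   import numpy as np

   data = np.array([1, 2, 3, 4], dtype=np.int64)
   m = np.ma.masked_array(data, mask=[False, True, False, False])
   m        # the second value is treated as missing
   m.sum()  # reductions skip masked entries; the integer dtype is preserved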
 Integer indexing
 ----------------

@@ -71,7 +152,8 @@ index can be somewhat complicated. For example, the following does not work:
    s.ix['c':'e'+1]

 A very common use case is to limit a time series to start and end at two
-specific dates. To enable this, we made the design design to make label-based slicing include both endpoints:
+specific dates. To enable this, we made the design decision to make label-based
+slicing include both endpoints:

 .. ipython:: python
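
As an editorial sketch of endpoint-inclusive label slicing (reusing the ``s``
Series with index ``list('abcde')`` from earlier; ``.ix`` was the indexer of
this era):

.. ipython:: python

   s.ix['c':'e']  # both 'c' and 'e' are included, unlike integer slicing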
