
Commit 99e2eec

DOC: more on gotchas, review integer indexing API changes, GH #627
1 parent 63a1239 commit 99e2eec

2 files changed: +86, -1 lines changed

RELEASE.rst (+3)
@@ -232,6 +232,8 @@ pandas 0.7.0
   (GH #666)
 - Use sort kind in Series.sort / argsort (GH #668)
 - Fix DataFrame operations on non-scalar, non-pandas objects (GH #672)
+- Don't convert DataFrame column to integer type when passing integer to
+  __setitem__ (GH #669)

 Thanks
 ------
@@ -261,6 +263,7 @@ Thanks
 - Dieter Vandenbussche
 - Texas P.
 - Pinxing Ye
+- ... and everyone I forgot

 pandas 0.6.1
 ============

doc/source/gotchas.rst (+83, -1)
@@ -27,13 +27,94 @@ general, we were given the difficult choice between either
 - Using a special sentinel value, bit pattern, or set of sentinel values to
   denote ``NA`` across the dtypes

+For many reasons we chose the latter. After years of production use it has
+proven, at least in my opinion, to be the best decision given the state of
+affairs in NumPy and Python in general. The special value ``NaN``
+(Not-A-Number) is used everywhere as the ``NA`` value, and there are API
+functions ``isnull`` and ``notnull`` which can be used across the dtypes to
+detect NA values.
+
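
As an editorial aside (a minimal sketch, not part of the commit, assuming the
top-level pandas namespace), the two functions in action:

.. ipython:: python

   import numpy as np
   from pandas import Series, isnull, notnull

   s = Series([1.5, np.nan, 3.0, np.nan])
   isnull(s)      # True wherever the value is the NaN sentinel
   notnull(s)     # the element-wise complement
   s[notnull(s)]  # boolean indexing drops the NA values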
+However, it comes with a couple of trade-offs which I most certainly have
+not ignored.

 Support for integer ``NA``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

+In the absence of high performance ``NA`` support being built into NumPy from
+the ground up, the primary casualty is the ability to represent NAs in integer
+arrays. For example:
+
+.. ipython:: python
+
+   s = Series([1, 2, 3, 4, 5], index=list('abcde'))
+   s
+   s.dtype
+
+   s2 = s.reindex(['a', 'b', 'c', 'f', 'u'])
+   s2
+   s2.dtype
+
+This trade-off is made largely for memory and performance reasons, and also so
+that the resulting Series continues to be "numeric". One possibility is to use
+``dtype=object`` arrays instead.
+
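
To sketch that alternative (an editorial example, not part of the commit): with
``dtype=object`` the original integers survive the reindexing, at the cost of
the Series no longer having a numeric dtype:

.. ipython:: python

   s_obj = Series([1, 2, 3, 4, 5], index=list('abcde'), dtype=object)
   s_obj.reindex(['a', 'b', 'c', 'f', 'u'])        # integers kept, NaN inserted
   s_obj.reindex(['a', 'b', 'c', 'f', 'u']).dtype  # stays object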
 ``NA`` type promotions
 ~~~~~~~~~~~~~~~~~~~~~~

+When introducing NAs into an existing Series or DataFrame via ``reindex`` or
+some other means, boolean and integer types will be promoted to a different
+dtype in order to store the NAs. These are summarized by this table:
+
+.. csv-table::
+    :header: "Typeclass","Promotion dtype for storing NAs"
+    :widths: 40,60
+
+    ``floating``, no change
+    ``object``, no change
+    ``integer``, cast to ``float64``
+    ``boolean``, cast to ``object``
+
+While this may seem like a heavy trade-off, I have found very few cases where
+this is an issue in practice. Some explanation for the motivation is given in
+the next section.
+
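
The promotions in the table can be checked directly; a brief editorial sketch
(again using ``reindex`` to introduce NAs):

.. ipython:: python

   from pandas import Series

   Series([1, 2, 3]).reindex(range(4)).dtype      # integer -> float64
   Series([True, False]).reindex(range(3)).dtype  # boolean -> object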
+Why not make NumPy like R?
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Many people have suggested that NumPy should simply emulate the ``NA`` support
+present in the more domain-specific statistical programming language `R
+<http://r-project.org>`__. Part of the reason is the NumPy type hierarchy:
+
+.. csv-table::
+    :header: "Typeclass","Dtypes"
+    :widths: 30,70
+    :delim: |
+
+    ``numpy.floating`` | ``float16, float32, float64, float128``
+    ``numpy.integer`` | ``int8, int16, int32, int64``
+    ``numpy.unsignedinteger`` | ``uint8, uint16, uint32, uint64``
+    ``numpy.object_`` | ``object_``
+    ``numpy.bool_`` | ``bool_``
+    ``numpy.character`` | ``string_, unicode_``
+
+The R language, by contrast, only has a handful of built-in data types:
+``integer``, ``numeric`` (floating-point), ``character``, and
+``boolean``. ``NA`` types are implemented by reserving special bit patterns for
+each type to be used as the missing value. While doing this with the full NumPy
+type hierarchy would be possible, it would be a more substantial trade-off
+(especially for the 8- and 16-bit data types) and implementation undertaking.
+
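
To make the hierarchy above concrete, a short editorial aside using standard
NumPy introspection (not part of the commit):

.. ipython:: python

   import numpy as np

   np.issubdtype(np.dtype('int16'), np.integer)          # True
   np.issubdtype(np.dtype('float32'), np.floating)       # True
   np.issubdtype(np.dtype('uint8'), np.unsignedinteger)  # True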
+An alternate approach is that of using masked arrays. A masked array is an
+array of data with an associated boolean *mask* denoting whether each value
+should be considered ``NA`` or not. I am personally not in love with this
+approach as I feel that overall it places a fairly heavy burden on the user and
+the library implementer. Additionally, it exacts a fairly high performance cost
+when working with numerical data compared with the simple approach of using
+``NaN``. Thus, I have chosen the Pythonic "practicality beats purity" approach
+and traded integer ``NA`` capability for a much simpler approach of using a
+special value in float and object arrays to denote ``NA``, and promoting
+integer arrays to floating when NAs must be introduced.
+
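
For readers unfamiliar with that alternative, a minimal editorial sketch of
NumPy's own masked arrays (``numpy.ma``); this is the approach pandas did not
adopt:

.. ipython:: python

   import numpy as np

   data = np.array([1, 2, 3, 4], dtype=np.int64)
   m = np.ma.masked_array(data, mask=[False, True, False, False])
   m        # the second value is treated as missing
   m.sum()  # reductions skip masked entries; the integer dtype is preserved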
 Integer indexing
 ----------------

@@ -71,7 +152,8 @@ index can be somewhat complicated. For example, the following does not work:
    s.ix['c':'e'+1]

 A very common use case is to limit a time series to start and end at two
-specific dates. To enable this, we made the design design to make label-based slicing include both endpoints:
+specific dates. To enable this, we made the design decision to make label-based
+slicing include both endpoints:

 .. ipython:: python
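
As an editorial sketch of endpoint-inclusive label slicing (reusing the ``s``
Series with index ``list('abcde')`` from earlier; ``.ix`` was the indexer of
this era):

.. ipython:: python

   s.ix['c':'e']  # both 'c' and 'e' are included, unlike integer slicing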
