Merge pull request #8858 from bashtage/stata-doc-fixup

jorisvandenbossche · jorisvandenbossche · commit 6fb9c041e238 · 2014-11-20T08:38:44.000+01:00
DOC: Clean up Stata documentation
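
The casting rule this documentation describes (an integer column is stored in the smallest Stata type whose non-missing range holds every value) can be illustrated with a small standalone sketch. This is not pandas' implementation: `smallest_stata_int_type` is a hypothetical helper written only for this note. The ``int8`` bounds come from the text in this diff, while the ``int16`` and ``int32`` bounds are assumptions taken from Stata's documented limits for its `int` and `long` storage types.

```python
# Hypothetical helper for illustration only; not part of pandas.
# Non-missing ranges per Stata storage type. The int8 bounds are the ones
# stated in the documentation; the int16/int32 bounds are assumptions taken
# from Stata's documented limits for its int and long types.
STATA_INT_RANGES = [
    ("int8", -127, 100),
    ("int16", -32767, 32740),
    ("int32", -2147483647, 2147483620),
]


def smallest_stata_int_type(values):
    """Return the name of the smallest type whose range holds all values."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    for name, lower, upper in STATA_INT_RANGES:
        if lower <= lo and hi <= upper:
            return name
    raise ValueError("values do not fit in any Stata integer type")


print(smallest_stata_int_type([0, 50, 100]))  # fits the int8 range
print(smallest_stata_int_type([0, 50, 101]))  # 101 exceeds the int8 bound
```

The first call stays within the ``int8`` range; the second has a value above 100 and so falls through to ``int16``, which mirrors the ``uint8`` example in the documentation text.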
diff --git a/doc/source/io.rst b/doc/source/io.rst
@@ -3670,22 +3670,27 @@ into a .dta file. The format version of this file is always 115 (Stata 12).
    df.to_stata('stata.dta')
 
 *Stata* data files have limited data type support; only strings with 244 or
-fewer characters, ``int8``, ``int16``, ``int32`` and ``float64`` can be stored
-in ``.dta`` files.  *Stata* reserves certain values to represent
-missing data. Furthermore, when a value is encountered outside of the
-permitted range, the data type is upcast to the next larger size.  For
-example, ``int8`` values are restricted to lie between -127 and 100, and so
-variables with values above 100 will trigger a conversion to ``int16``. ``nan``
-values in floating points data types are stored as the basic missing data type
-(``.`` in *Stata*).  It is not possible to indicate missing data values for
-integer data types.
+fewer characters, ``int8``, ``int16``, ``int32``, ``float32`` and ``float64``
+can be stored
+in ``.dta`` files.  Additionally, *Stata* reserves certain values to represent
+missing data. Exporting a non-missing value that is outside of the
+permitted range in Stata for a particular data type will retype the variable
+to the next larger size.  For example, ``int8`` values are restricted to lie
+between -127 and 100 in Stata, and so variables with values above 100 will
+trigger a conversion to ``int16``. ``nan`` values in floating point data
+types are stored as the basic missing data type (``.`` in *Stata*).
+
+.. note::
+
+    It is not possible to export missing data values for integer data types.
+
 
 The *Stata* writer gracefully handles other data types including ``int64``,
-``bool``, ``uint8``, ``uint16``, ``uint32`` and ``float32`` by upcasting to
+``bool``, ``uint8``, ``uint16`` and ``uint32`` by casting to
 the smallest supported type that can represent the data.  For example, data
 with a type of ``uint8`` will be cast to ``int8`` if all values are less than
 100 (the upper bound for non-missing ``int8`` data in *Stata*), or, if values are
-outside of this range, the data is cast to ``int16``.
+outside of this range, the variable is cast to ``int16``.
 
 
 .. warning::
@@ -3701,50 +3706,41 @@ outside of this range, the data is cast to ``int16``.
   115 dta file format. Attempting to write *Stata* dta files with strings
   longer than 244 characters raises a ``ValueError``.
 
-.. warning::
-
-  *Stata* data files only support text labels for categorical data.  Exporting
-  data frames containing categorical data will convert non-string categorical values
-  to strings.
-
-Writing data to/from Stata format files with a ``category`` dtype was implemented in 0.15.2.
-
 .. _io.stata_reader:
 
-Reading from STATA format
+Reading from Stata format
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The top-level function ``read_stata`` will read a dta format file
-and return a DataFrame:
-The class :class:`~pandas.io.stata.StataReader` will read the header of the
-given dta file at initialization. Its method
-:func:`~pandas.io.stata.StataReader.data` will read the observations,
-converting them to a DataFrame which is returned:
+The top-level function ``read_stata`` will read a dta file
+and return a DataFrame.  Alternatively, the class :class:`~pandas.io.stata.StataReader`
+can be used if more granular access is required. :class:`~pandas.io.stata.StataReader`
+reads the header of the dta file at initialization. The method
+:func:`~pandas.io.stata.StataReader.data` reads and converts observations to a DataFrame.
 
 .. ipython:: python
 
    pd.read_stata('stata.dta')
 
-Currently the ``index`` is retrieved as a column on read back.
+Currently the ``index`` is retrieved as a column.
 
 The parameter ``convert_categoricals`` indicates whether value labels should be
 read and used to create a ``Categorical`` variable from them. Value labels can
 also be retrieved by the function ``variable_labels``, which requires data to be
-called before (see ``pandas.io.stata.StataReader``).
+called before use (see ``pandas.io.stata.StataReader``).
 
 The parameter ``convert_missing`` indicates whether missing value
 representations in Stata should be preserved.  If ``False`` (the default),
 missing values are represented as ``np.nan``.  If ``True``, missing values are
 represented using ``StataMissingValue`` objects, and columns containing missing
-values will have ``dtype`` set to ``object``.
+values will have ``object`` data type.
 
-The StataReader supports .dta Formats 104, 105, 108, 113-115 and 117.
-Alternatively, the function :func:`~pandas.io.stata.read_stata` can be used
+:func:`~pandas.read_stata` and :class:`~pandas.io.stata.StataReader` support .dta
+formats 104, 105, 108, 113-115 (Stata 10-12) and 117 (Stata 13+).
 
 .. note::
 
-   Setting ``preserve_dtypes=False`` will upcast all integer data types to
-   ``int64`` and all floating point data types to ``float64``.  By default,
+   Setting ``preserve_dtypes=False`` will upcast to the standard pandas data types:
+   ``int64`` for all integer types and ``float64`` for floating point data.  By default,
    the Stata data types are preserved when importing.
 
 .. ipython:: python
@@ -3775,14 +3771,13 @@ is lost when exporting.
 
 Labeled data can similarly be imported from *Stata* data files as ``Categorical``
 variables using the keyword argument ``convert_categoricals`` (``True`` by default).  
-By default, imported ``Categorical`` variables are ordered according to the 
-underlying numerical data. However, setting ``order_categoricals=False`` will 
-import labeled data as ``Categorical`` variables without an order.
+The keyword argument ``order_categoricals`` (``True`` by default) determines
+whether imported ``Categorical`` variables are ordered.
 
 .. note::
 
     When importing categorical data, the values of the variables in the *Stata*
-    data file are not generally preserved since ``Categorical`` variables always
+    data file are not preserved since ``Categorical`` variables always
     use integer data types between ``-1`` and ``n-1`` where ``n`` is the number
     of categories. If the original values in the *Stata* data file are required,
     these can be imported by setting ``convert_categoricals=False``, which will
@@ -3795,7 +3790,7 @@ import labeled data as ``Categorical`` variables without an order.
 
 .. note::
 
-    *Stata* suppots partially labeled series.  These series have value labels for
+    *Stata* supports partially labeled series.  These series have value labels for
     some but not all data values. Importing a partially labeled series will produce
     a ``Categorical`` with string categories for the values that are labeled and
     numeric categories for values with no label.