diff --git a/doc/source/io.rst b/doc/source/io.rst index 9686a72d43cf8..bf8776d4bc396 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -3670,22 +3670,27 @@ into a .dta file. The format version of this file is always 115 (Stata 12). df.to_stata('stata.dta') *Stata* data files have limited data type support; only strings with 244 or -fewer characters, ``int8``, ``int16``, ``int32`` and ``float64`` can be stored -in ``.dta`` files. *Stata* reserves certain values to represent -missing data. Furthermore, when a value is encountered outside of the -permitted range, the data type is upcast to the next larger size. For -example, ``int8`` values are restricted to lie between -127 and 100, and so -variables with values above 100 will trigger a conversion to ``int16``. ``nan`` -values in floating points data types are stored as the basic missing data type -(``.`` in *Stata*). It is not possible to indicate missing data values for -integer data types. +fewer characters, ``int8``, ``int16``, ``int32``, ``float32`` and ``float64`` +can be stored +in ``.dta`` files. Additionally, *Stata* reserves certain values to represent +missing data. Exporting a non-missing value that is outside of the +permitted range in Stata for a particular data type will retype the variable +to the next larger size. For example, ``int8`` values are restricted to lie +between -127 and 100 in Stata, and so variables with values above 100 will +trigger a conversion to ``int16``. ``nan`` values in floating point data +types are stored as the basic missing data type (``.`` in *Stata*). + +.. note:: + + It is not possible to export missing data values for integer data types. + The *Stata* writer gracefully handles other data types including ``int64``, -``bool``, ``uint8``, ``uint16``, ``uint32`` and ``float32`` by upcasting to +``bool``, ``uint8``, ``uint16`` and ``uint32`` by casting to the smallest supported type that can represent the data.
For example, data with a type of ``uint8`` will be cast to ``int8`` if all values are less than 100 (the upper bound for non-missing ``int8`` data in *Stata*), or, if values are -outside of this range, the data is cast to ``int16``. +outside of this range, the variable is cast to ``int16``. .. warning:: @@ -3701,50 +3706,41 @@ outside of this range, the data is cast to ``int16``. 115 dta file format. Attempting to write *Stata* dta files with strings longer than 244 characters raises a ``ValueError``. -.. warning:: - - *Stata* data files only support text labels for categorical data. Exporting - data frames containing categorical data will convert non-string categorical values - to strings. - -Writing data to/from Stata format files with a ``category`` dtype was implemented in 0.15.2. - .. _io.stata_reader: -Reading from STATA format +Reading from Stata format ~~~~~~~~~~~~~~~~~~~~~~~~~ -The top-level function ``read_stata`` will read a dta format file -and return a DataFrame: -The class :class:`~pandas.io.stata.StataReader` will read the header of the -given dta file at initialization. Its method -:func:`~pandas.io.stata.StataReader.data` will read the observations, -converting them to a DataFrame which is returned: +The top-level function ``read_stata`` will read a dta file +and return a DataFrame. Alternatively, the class :class:`~pandas.io.stata.StataReader` +can be used if more granular access is required. :class:`~pandas.io.stata.StataReader` +reads the header of the dta file at initialization. The method +:func:`~pandas.io.stata.StataReader.data` reads and converts observations to a DataFrame. .. ipython:: python pd.read_stata('stata.dta') -Currently the ``index`` is retrieved as a column on read back. +Currently the ``index`` is retrieved as a column. The parameter ``convert_categoricals`` indicates whether value labels should be read and used to create a ``Categorical`` variable from them.
Value labels can also be retrieved by the function ``variable_labels``, which requires data to be -called before (see ``pandas.io.stata.StataReader``). +called before use (see ``pandas.io.stata.StataReader``). The parameter ``convert_missing`` indicates whether missing value representations in Stata should be preserved. If ``False`` (the default), missing values are represented as ``np.nan``. If ``True``, missing values are represented using ``StataMissingValue`` objects, and columns containing missing -values will have ``dtype`` set to ``object``. +values will have ``object`` data type. -The StataReader supports .dta Formats 104, 105, 108, 113-115 and 117. -Alternatively, the function :func:`~pandas.io.stata.read_stata` can be used +:func:`~pandas.read_stata` and :class:`~pandas.io.stata.StataReader` support .dta +formats 104, 105, 108, 113-115 (Stata 10-12) and 117 (Stata 13+). .. note:: - Setting ``preserve_dtypes=False`` will upcast all integer data types to - ``int64`` and all floating point data types to ``float64``. By default, + Setting ``preserve_dtypes=False`` will upcast to the standard pandas data types: + ``int64`` for all integer types and ``float64`` for floating point data. By default, the Stata data types are preserved when importing. .. ipython:: python @@ -3775,14 +3771,13 @@ is lost when exporting. Labeled data can similarly be imported from *Stata* data files as ``Categorical`` variables using the keyword argument ``convert_categoricals`` (``True`` by default). -By default, imported ``Categorical`` variables are ordered according to the -underlying numerical data. However, setting ``order_categoricals=False`` will -import labeled data as ``Categorical`` variables without an order. +The keyword argument ``order_categoricals`` (``True`` by default) determines + whether imported ``Categorical`` variables are ordered. .. 
note:: When importing categorical data, the values of the variables in the *Stata* - data file are not generally preserved since ``Categorical`` variables always + data file are not preserved since ``Categorical`` variables always use integer data types between ``-1`` and ``n-1`` where ``n`` is the number of categories. If the original values in the *Stata* data file are required, these can be imported by setting ``convert_categoricals=False``, which will @@ -3795,7 +3790,7 @@ import labeled data as ``Categorical`` variables without an order. .. note:: - *Stata* suppots partially labeled series. These series have value labels for + *Stata* supports partially labeled series. These series have value labels for some but not all data values. Importing a partially labeled series will produce a ``Categorial`` with string categories for the values that are labeled and numeric categories for values with no label.