DOC: Clean up Stata documentation #8858

Merged
merged 1 commit into from
Nov 20, 2014
71 changes: 33 additions & 38 deletions doc/source/io.rst
@@ -3670,22 +3670,27 @@ into a .dta file. The format version of this file is always 115 (Stata 12).
df.to_stata('stata.dta')

*Stata* data files have limited data type support; only strings with 244 or
fewer characters, ``int8``, ``int16``, ``int32`` and ``float64`` can be stored
in ``.dta`` files. *Stata* reserves certain values to represent
missing data. Furthermore, when a value is encountered outside of the
permitted range, the data type is upcast to the next larger size. For
example, ``int8`` values are restricted to lie between -127 and 100, and so
variables with values above 100 will trigger a conversion to ``int16``. ``nan``
values in floating points data types are stored as the basic missing data type
(``.`` in *Stata*). It is not possible to indicate missing data values for
integer data types.
fewer characters, ``int8``, ``int16``, ``int32``, ``float32`` and ``float64``
can be stored
in ``.dta`` files. Additionally, *Stata* reserves certain values to represent
missing data. Exporting a non-missing value that is outside of the
permitted range in Stata for a particular data type will retype the variable
to the next larger size. For example, ``int8`` values are restricted to lie
between -127 and 100 in Stata, and so variables with values above 100 will
trigger a conversion to ``int16``. ``nan`` values in floating-point data
types are stored as the basic missing data type (``.`` in *Stata*).

.. note::

It is not possible to export missing data values for integer data types.


The *Stata* writer gracefully handles other data types including ``int64``,
``bool``, ``uint8``, ``uint16``, ``uint32`` and ``float32`` by upcasting to
``bool``, ``uint8``, ``uint16`` and ``uint32`` by casting to
the smallest supported type that can represent the data. For example, data
with a type of ``uint8`` will be cast to ``int8`` if all values are at most
100 (the upper bound for non-missing ``int8`` data in *Stata*), or, if values are
outside of this range, the data is cast to ``int16``.
outside of this range, the variable is cast to ``int16``.
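
The casting rule described above can be sketched in plain Python. This is a minimal illustration, not the writer's actual implementation: the helper ``smallest_stata_int_type`` and the table ``STATA_INT_RANGES`` are hypothetical names, and the ``int16`` and ``int32`` bounds are Stata's non-missing ranges for its ``int`` and ``long`` storage types (only the ``int8`` bounds are stated in the text).

```python
# Non-missing integer ranges for Stata's byte, int and long storage types.
# Values above each upper bound are reserved by Stata for missing-value codes.
STATA_INT_RANGES = [
    ("int8", -127, 100),
    ("int16", -32767, 32740),
    ("int32", -2147483647, 2147483620),
]


def smallest_stata_int_type(values):
    """Return the smallest type name whose Stata range holds all values.

    Hypothetical helper for illustration only; it does not call pandas.
    """
    lowest, highest = min(values), max(values)
    for name, lower, upper in STATA_INT_RANGES:
        if lower <= lowest and highest <= upper:
            return name
    raise ValueError("values fall outside the integer ranges in this sketch")


# uint8 data that stays within the int8 range keeps the smallest type ...
small = smallest_stata_int_type([0, 57, 100])
# ... while a single value above 100 forces the next larger type.
larger = smallest_stata_int_type([0, 57, 255])
```

The sketch only covers integers; how values beyond the ``int32`` range are handled is deliberately left out.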


.. warning::
@@ -3701,50 +3706,41 @@ outside of this range, the data is cast to ``int16``.
115 dta file format. Attempting to write *Stata* dta files with strings
longer than 244 characters raises a ``ValueError``.

.. warning::

*Stata* data files only support text labels for categorical data. Exporting
data frames containing categorical data will convert non-string categorical values
to strings.

Writing data to/from Stata format files with a ``category`` dtype was implemented in 0.15.2.

.. _io.stata_reader:

Reading from STATA format
Reading from Stata format
~~~~~~~~~~~~~~~~~~~~~~~~~

The top-level function ``read_stata`` will read a dta format file
and return a DataFrame:
The class :class:`~pandas.io.stata.StataReader` will read the header of the
given dta file at initialization. Its method
:func:`~pandas.io.stata.StataReader.data` will read the observations,
converting them to a DataFrame which is returned:
The top-level function ``read_stata`` will read a dta file
and return a DataFrame. Alternatively, the class :class:`~pandas.io.stata.StataReader`
can be used if more granular access is required. :class:`~pandas.io.stata.StataReader`
reads the header of the dta file at initialization. The method
:func:`~pandas.io.stata.StataReader.data` reads and converts observations to a DataFrame.

.. ipython:: python

pd.read_stata('stata.dta')

Currently the ``index`` is retrieved as a column on read back.
Currently the ``index`` is retrieved as a column.

The parameter ``convert_categoricals`` indicates whether value labels should be
read and used to create a ``Categorical`` variable from them. Value labels can
also be retrieved by the function ``value_labels``, which requires ``data`` to be
called before (see ``pandas.io.stata.StataReader``).
called before use (see ``pandas.io.stata.StataReader``).

The parameter ``convert_missing`` indicates whether missing value
representations in Stata should be preserved. If ``False`` (the default),
missing values are represented as ``np.nan``. If ``True``, missing values are
represented using ``StataMissingValue`` objects, and columns containing missing
values will have ``dtype`` set to ``object``.
values will have ``object`` data type.
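
A short round trip may make the effect of ``convert_missing`` easier to see. This is a sketch under assumptions: the column name ``x`` and the temporary file path are made up for the example, and it relies only on ``DataFrame.to_stata`` and ``pd.read_stata`` with their default settings.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# A float column with one missing value; the name "x" is illustrative.
df = pd.DataFrame({"x": [1.5, np.nan, 3.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stata.dta")
    df.to_stata(path)

    # Default: Stata missing values come back as np.nan.
    default = pd.read_stata(path)

    # convert_missing=True: missing values come back as StataMissingValue
    # objects, so the affected column has object dtype.
    preserved = pd.read_stata(path, convert_missing=True)

print(default.dtypes)
print(preserved.dtypes)
```

Because the index is written by default, ``default`` also contains an ``index`` column alongside ``x``.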

The StataReader supports .dta Formats 104, 105, 108, 113-115 and 117.
Alternatively, the function :func:`~pandas.io.stata.read_stata` can be used
:func:`~pandas.read_stata` and :class:`~pandas.io.stata.StataReader` support .dta
formats 104, 105, 108, 113-115 (Stata 10-12) and 117 (Stata 13+).

.. note::

Setting ``preserve_dtypes=False`` will upcast all integer data types to
``int64`` and all floating point data types to ``float64``. By default,
Setting ``preserve_dtypes=False`` will upcast to the standard pandas data types:
``int64`` for all integer types and ``float64`` for floating point data. By default,
the Stata data types are preserved when importing.

.. ipython:: python
@@ -3775,14 +3771,13 @@ is lost when exporting.

Labeled data can similarly be imported from *Stata* data files as ``Categorical``
variables using the keyword argument ``convert_categoricals`` (``True`` by default).
By default, imported ``Categorical`` variables are ordered according to the
underlying numerical data. However, setting ``order_categoricals=False`` will
import labeled data as ``Categorical`` variables without an order.
The keyword argument ``order_categoricals`` (``True`` by default) determines
whether imported ``Categorical`` variables are ordered.
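
As a rough illustration of ``order_categoricals``, the sketch below writes a small categorical column and reads it back twice. The column name ``quality``, its labels and the temporary file path are assumptions chosen for the example.

```python
import os
import tempfile

import pandas as pd

# Labeled data; the column name and labels are illustrative only.
df = pd.DataFrame({"quality": pd.Categorical(["small", "large", "small"])})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labeled.dta")
    df.to_stata(path, write_index=False)

    # Default (order_categoricals=True): the Categorical is ordered.
    ordered = pd.read_stata(path)

    # order_categoricals=False: same labels, but no order is attached.
    unordered = pd.read_stata(path, order_categoricals=False)

print(ordered["quality"].cat.ordered)
print(unordered["quality"].cat.ordered)
```

Both reads return the same labels; only the ``ordered`` flag of the resulting ``Categorical`` differs.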

.. note::

When importing categorical data, the values of the variables in the *Stata*
data file are not generally preserved since ``Categorical`` variables always
data file are not preserved since ``Categorical`` variables always
use integer data types between ``-1`` and ``n-1`` where ``n`` is the number
of categories. If the original values in the *Stata* data file are required,
these can be imported by setting ``convert_categoricals=False``, which will
@@ -3795,7 +3790,7 @@ import labeled data as ``Categorical`` variables without an order.

.. note::

*Stata* suppots partially labeled series. These series have value labels for
*Stata* supports partially labeled series. These series have value labels for
some but not all data values. Importing a partially labeled series will produce
a ``Categorical`` with string categories for the values that are labeled and
numeric categories for values with no label.