Commit 6fb9c04

Merge pull request #8858 from bashtage/stata-doc-fixup
DOC: Clean up Stata documentation
2 parents 3274f27 + d15b2d3 commit 6fb9c04

1 file changed

+33
-38
lines changed

doc/source/io.rst

+33-38
Original file line number | Diff line number | Diff line change
@@ -3670,22 +3670,27 @@ into a .dta file. The format version of this file is always 115 (Stata 12).
36703670
df.to_stata('stata.dta')
36713671
36723672
*Stata* data files have limited data type support; only strings with 244 or
3673-
fewer characters, ``int8``, ``int16``, ``int32`` and ``float64`` can be stored
3674-
in ``.dta`` files. *Stata* reserves certain values to represent
3675-
missing data. Furthermore, when a value is encountered outside of the
3676-
permitted range, the data type is upcast to the next larger size. For
3677-
example, ``int8`` values are restricted to lie between -127 and 100, and so
3678-
variables with values above 100 will trigger a conversion to ``int16``. ``nan``
3679-
values in floating points data types are stored as the basic missing data type
3680-
(``.`` in *Stata*). It is not possible to indicate missing data values for
3681-
integer data types.
3673+
fewer characters, ``int8``, ``int16``, ``int32``, ``float32`` and ``float64``
3674+
can be stored
3675+
in ``.dta`` files. Additionally, *Stata* reserves certain values to represent
3676+
missing data. Exporting a non-missing value that is outside of the
3677+
permitted range in Stata for a particular data type will retype the variable
3678+
to the next larger size. For example, ``int8`` values are restricted to lie
3679+
between -127 and 100 in Stata, and so variables with values above 100 will
3680+
trigger a conversion to ``int16``. ``nan`` values in floating point data
3681+
types are stored as the basic missing data type (``.`` in *Stata*).
3682+
3683+
.. note::
3684+
3685+
It is not possible to export missing data values for integer data types.
3686+
36823687

36833688
The *Stata* writer gracefully handles other data types including ``int64``,
3684-
``bool``, ``uint8``, ``uint16``, ``uint32`` and ``float32`` by upcasting to
3689+
``bool``, ``uint8``, ``uint16``, ``uint32`` by casting to
36853690
the smallest supported type that can represent the data. For example, data
36863691
with a type of ``uint8`` will be cast to ``int8`` if all values are less than
36873692
100 (the upper bound for non-missing ``int8`` data in *Stata*), or, if values are
3688-
outside of this range, the data is cast to ``int16``.
3693+
outside of this range, the variable is cast to ``int16``.
36893694

36903695

36913696
.. warning::
@@ -3701,50 +3706,41 @@ outside of this range, the data is cast to ``int16``.
37013706
115 dta file format. Attempting to write *Stata* dta files with strings
37023707
longer than 244 characters raises a ``ValueError``.
37033708

3704-
.. warning::
3705-
3706-
*Stata* data files only support text labels for categorical data. Exporting
3707-
data frames containing categorical data will convert non-string categorical values
3708-
to strings.
3709-
3710-
Writing data to/from Stata format files with a ``category`` dtype was implemented in 0.15.2.
3711-
37123709
.. _io.stata_reader:
37133710

3714-
Reading from STATA format
3711+
Reading from Stata format
37153712
~~~~~~~~~~~~~~~~~~~~~~~~~
37163713

3717-
The top-level function ``read_stata`` will read a dta format file
3718-
and return a DataFrame:
3719-
The class :class:`~pandas.io.stata.StataReader` will read the header of the
3720-
given dta file at initialization. Its method
3721-
:func:`~pandas.io.stata.StataReader.data` will read the observations,
3722-
converting them to a DataFrame which is returned:
3714+
The top-level function ``read_stata`` will read a dta file
3715+
and return a DataFrame. Alternatively, the class :class:`~pandas.io.stata.StataReader`
3716+
can be used if more granular access is required. :class:`~pandas.io.stata.StataReader`
3717+
reads the header of the dta file at initialization. The method
3718+
:func:`~pandas.io.stata.StataReader.data` reads and converts observations to a DataFrame.
37233719

37243720
.. ipython:: python
37253721
37263722
pd.read_stata('stata.dta')
37273723
3728-
Currently the ``index`` is retrieved as a column on read back.
3724+
Currently the ``index`` is retrieved as a column.
37293725

37303726
The parameter ``convert_categoricals`` indicates whether value labels should be
37313727
read and used to create a ``Categorical`` variable from them. Value labels can
37323728
also be retrieved by the function ``variable_labels``, which requires data to be
3733-
called before (see ``pandas.io.stata.StataReader``).
3729+
called before use (see ``pandas.io.stata.StataReader``).
37343730

37353731
The parameter ``convert_missing`` indicates whether missing value
37363732
representations in Stata should be preserved. If ``False`` (the default),
37373733
missing values are represented as ``np.nan``. If ``True``, missing values are
37383734
represented using ``StataMissingValue`` objects, and columns containing missing
3739-
values will have ``dtype`` set to ``object``.
3735+
values will have ``object`` data type.
37403736

3741-
The StataReader supports .dta Formats 104, 105, 108, 113-115 and 117.
3742-
Alternatively, the function :func:`~pandas.io.stata.read_stata` can be used
3737+
:func:`~pandas.read_stata` and :class:`~pandas.io.stata.StataReader` support .dta
3738+
formats 104, 105, 108, 113-115 (Stata 10-12) and 117 (Stata 13+).
37433739

37443740
.. note::
37453741

3746-
Setting ``preserve_dtypes=False`` will upcast all integer data types to
3747-
``int64`` and all floating point data types to ``float64``. By default,
3742+
Setting ``preserve_dtypes=False`` will upcast to the standard pandas data types:
3743+
``int64`` for all integer types and ``float64`` for floating point data. By default,
37483744
the Stata data types are preserved when importing.
37493745

37503746
.. ipython:: python
@@ -3775,14 +3771,13 @@ is lost when exporting.
37753771

37763772
Labeled data can similarly be imported from *Stata* data files as ``Categorical``
37773773
variables using the keyword argument ``convert_categoricals`` (``True`` by default).
3778-
By default, imported ``Categorical`` variables are ordered according to the
3779-
underlying numerical data. However, setting ``order_categoricals=False`` will
3780-
import labeled data as ``Categorical`` variables without an order.
3774+
The keyword argument ``order_categoricals`` (``True`` by default) determines
3775+
whether imported ``Categorical`` variables are ordered.
37813776

37823777
.. note::
37833778

37843779
When importing categorical data, the values of the variables in the *Stata*
3785-
data file are not generally preserved since ``Categorical`` variables always
3780+
data file are not preserved since ``Categorical`` variables always
37863781
use integer data types between ``-1`` and ``n-1`` where ``n`` is the number
37873782
of categories. If the original values in the *Stata* data file are required,
37883783
these can be imported by setting ``convert_categoricals=False``, which will
@@ -3795,7 +3790,7 @@ import labeled data as ``Categorical`` variables without an order.
37953790

37963791
.. note::
37973792

3798-
*Stata* suppots partially labeled series. These series have value labels for
3793+
*Stata* supports partially labeled series. These series have value labels for
37993794
some but not all data values. Importing a partially labeled series will produce
38003795
a ``Categorical`` with string categories for the values that are labeled and
38013796
numeric categories for values with no label.
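
The casting behaviour described in the revised text can be checked with a small round trip. The following is a minimal sketch, not part of this commit: the column names and the temporary file name are made up for illustration, and it assumes a recent pandas in which ``DataFrame.to_stata`` accepts ``write_index`` and ``read_stata`` preserves Stata data types by default.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical example data: two uint8 columns (one within the Stata
# int8 range, one exceeding 100) and a float64 column with a missing value.
df = pd.DataFrame(
    {
        "small": np.array([1, 2, 3], dtype="uint8"),
        "large": np.array([1, 2, 200], dtype="uint8"),
        "x": np.array([0.5, np.nan, 1.5], dtype="float64"),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta")
    # write_index=False keeps the index from coming back as an extra column.
    df.to_stata(path, write_index=False)
    roundtrip = pd.read_stata(path)

# Inspect the data types that came back from the .dta file.
print(roundtrip.dtypes)
# The float column should still contain exactly one missing value.
print(roundtrip["x"].isna().sum())
```

If the documentation is accurate, ``small`` should come back as ``int8``, ``large`` as ``int16`` (because 200 is above the non-missing ``int8`` bound of 100), and the ``nan`` in ``x`` should survive as a missing value.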
