@@ -3670,22 +3670,27 @@ into a .dta file. The format version of this file is always 115 (Stata 12).
     df.to_stata('stata.dta')

  *Stata* data files have limited data type support; only strings with 244 or
- fewer characters, ``int8``, ``int16``, ``int32`` and ``float64`` can be stored
- in ``.dta`` files. *Stata* reserves certain values to represent
- missing data. Furthermore, when a value is encountered outside of the
- permitted range, the data type is upcast to the next larger size. For
- example, ``int8`` values are restricted to lie between -127 and 100, and so
- variables with values above 100 will trigger a conversion to ``int16``. ``nan``
- values in floating points data types are stored as the basic missing data type
- (``.`` in *Stata*). It is not possible to indicate missing data values for
- integer data types.
+ fewer characters, ``int8``, ``int16``, ``int32``, ``float32`` and ``float64``
+ can be stored
+ in ``.dta`` files. Additionally, *Stata* reserves certain values to represent
+ missing data. Exporting a non-missing value that is outside of the
+ permitted range in Stata for a particular data type will retype the variable
+ to the next larger size. For example, ``int8`` values are restricted to lie
+ between -127 and 100 in Stata, and so variables with values above 100 will
+ trigger a conversion to ``int16``. ``nan`` values in floating point data
+ types are stored as the basic missing data type (``.`` in *Stata*).
+
+ .. note::
+
+    It is not possible to export missing data values for integer data types.
+

  The *Stata* writer gracefully handles other data types including ``int64``,
- ``bool``, ``uint8``, ``uint16``, ``uint32`` and ``float32`` by upcasting to
+ ``bool``, ``uint8``, ``uint16`` and ``uint32`` by casting to
  the smallest supported type that can represent the data. For example, data
  with a type of ``uint8`` will be cast to ``int8`` if all values are less than
  100 (the upper bound for non-missing ``int8`` data in *Stata*), or, if values are
- outside of this range, the data is cast to ``int16``.
+ outside of this range, the variable is cast to ``int16``.
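
As a rough illustration of the casting rule described above, the sketch below writes two ``uint8`` columns and reads them back. It assumes pandas and NumPy are installed; the column names and the temporary file name are hypothetical and exist only for this example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical data: both columns start as uint8, which Stata cannot store.
df = pd.DataFrame(
    {
        "small": np.array([1, 2, 3], dtype=np.uint8),    # all values within the int8 limit
        "large": np.array([1, 2, 200], dtype=np.uint8),  # 200 exceeds the int8 limit of 100
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "casting_example.dta")
    df.to_stata(path, write_index=False)
    roundtrip = pd.read_stata(path)

# Inspect which Stata-compatible types the writer selected.
print(roundtrip.dtypes)
```

With the default reader settings, ``small`` should come back as ``int8`` and ``large`` as ``int16``, while the values themselves are unchanged.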


  .. warning::
@@ -3701,50 +3706,41 @@ outside of this range, the data is cast to ``int16``.
     115 dta file format. Attempting to write *Stata* dta files with strings
     longer than 244 characters raises a ``ValueError``.

- .. warning::
-
-    *Stata* data files only support text labels for categorical data. Exporting
-    data frames containing categorical data will convert non-string categorical values
-    to strings.
-
- Writing data to/from Stata format files with a ``category`` dtype was implemented in 0.15.2.
-
  .. _io.stata_reader:

- Reading from STATA format
+ Reading from Stata format
  ~~~~~~~~~~~~~~~~~~~~~~~~~

- The top-level function ``read_stata`` will read a dta format file
- and return a DataFrame:
- The class :class:`~pandas.io.stata.StataReader` will read the header of the
- given dta file at initialization. Its method
- :func:`~pandas.io.stata.StataReader.data` will read the observations,
- converting them to a DataFrame which is returned:
+ The top-level function ``read_stata`` will read a dta file
+ and return a DataFrame. Alternatively, the class :class:`~pandas.io.stata.StataReader`
+ can be used if more granular access is required. :class:`~pandas.io.stata.StataReader`
+ reads the header of the dta file at initialization. The method
+ :func:`~pandas.io.stata.StataReader.data` reads and converts observations to a DataFrame.

  .. ipython:: python

     pd.read_stata('stata.dta')

- Currently the ``index`` is retrieved as a column on read back.
+ Currently the ``index`` is retrieved as a column.

  The parameter ``convert_categoricals`` indicates whether value labels should be
  read and used to create a ``Categorical`` variable from them. Value labels can
  also be retrieved by the function ``value_labels``, which requires ``data`` to be
- called before (see ``pandas.io.stata.StataReader``).
+ called before use (see ``pandas.io.stata.StataReader``).

  The parameter ``convert_missing`` indicates whether missing value
  representations in Stata should be preserved. If ``False`` (the default),
  missing values are represented as ``np.nan``. If ``True``, missing values are
  represented using ``StataMissingValue`` objects, and columns containing missing
- values will have ``dtype`` set to ``object``.
+ values will have ``object`` data type.

- The StataReader supports .dta Formats 104, 105, 108, 113-115 and 117.
- Alternatively, the function :func:`~pandas.io.stata.read_stata` can be used
+ :func:`~pandas.read_stata` and :class:`~pandas.io.stata.StataReader` support .dta
+ formats 104, 105, 108, 113-115 (Stata 10-12) and 117 (Stata 13+).

  .. note::

-    Setting ``preserve_dtypes=False`` will upcast all integer data types to
-    ``int64`` and all floating point data types to ``float64``. By default,
+    Setting ``preserve_dtypes=False`` will upcast to the standard pandas data types:
+    ``int64`` for all integer types and ``float64`` for floating point data. By default,
     the Stata data types are preserved when importing.
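
The effect of ``preserve_dtypes`` can be sketched as follows. This is a minimal example assuming pandas and NumPy are available; the column names and the temporary file are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame using narrow types that Stata stores natively.
df = pd.DataFrame(
    {
        "n_obs": np.array([1, 2, 3], dtype=np.int8),
        "ratio": np.array([0.5, 1.5, 2.5], dtype=np.float32),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dtypes_example.dta")
    df.to_stata(path, write_index=False)
    preserved = pd.read_stata(path)                      # default: keep the Stata types
    upcast = pd.read_stata(path, preserve_dtypes=False)  # widen to int64 / float64

print(preserved.dtypes)
print(upcast.dtypes)
```

The first read should report ``int8`` and ``float32``, and the second ``int64`` and ``float64``.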

  .. ipython:: python
@@ -3775,14 +3771,13 @@ is lost when exporting.

  Labeled data can similarly be imported from *Stata* data files as ``Categorical``
  variables using the keyword argument ``convert_categoricals`` (``True`` by default).
- By default, imported ``Categorical`` variables are ordered according to the
- underlying numerical data. However, setting ``order_categoricals=False`` will
- import labeled data as ``Categorical`` variables without an order.
+ The keyword argument ``order_categoricals`` (``True`` by default) determines
+ whether imported ``Categorical`` variables are ordered.
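
A small sketch of how these two keyword arguments interact, assuming pandas is installed. The ``grade`` column and the temporary file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical labeled variable: the categories become Stata value labels on export.
df = pd.DataFrame(
    {"grade": pd.Categorical(["low", "high", "low"], categories=["low", "high"])}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labels_example.dta")
    df.to_stata(path, write_index=False)
    ordered = pd.read_stata(path)                              # defaults: labeled and ordered
    unordered = pd.read_stata(path, order_categoricals=False)  # labeled, no order
    raw = pd.read_stata(path, convert_categoricals=False)      # underlying integer codes

print(ordered["grade"].cat.ordered)
print(unordered["grade"].cat.ordered)
print(raw["grade"].tolist())
```

Reading with ``convert_categoricals=False`` returns the integer codes stored in the file rather than the labels, which is useful when the original numeric values matter.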

  .. note::

     When importing categorical data, the values of the variables in the *Stata*
-    data file are not generally preserved since ``Categorical`` variables always
+    data file are not preserved since ``Categorical`` variables always
     use integer data types between ``-1`` and ``n-1`` where ``n`` is the number
     of categories. If the original values in the *Stata* data file are required,
     these can be imported by setting ``convert_categoricals=False``, which will
@@ -3795,7 +3790,7 @@ import labeled data as ``Categorical`` variables without an order.

  .. note::

-    *Stata* suppots partially labeled series. These series have value labels for
+    *Stata* supports partially labeled series. These series have value labels for
     some but not all data values. Importing a partially labeled series will produce
     a ``Categorical`` with string categories for the values that are labeled and
     numeric categories for values with no label.