___  ____  ____  ____  ____(R)
                                                      /__    /   ____/   /   ____/   
                                                     ___/   /   /___/   /   /___/    
                                                       Statistics/Data analysis      
      
      Title
      
          [P] File formats .dta -- Description of .dta file format 103
      
      
          Warning:  The entry below describes the contents of an old Stata .dta
          file format.  Newer versions of Stata continue to read, and perhaps to
          write, this old format.  What follows is the original help file for the
          .dta file format when it was the current file format.
      
      
      Description
      
          (The information contained in this highly technical entry probably does
          not interest you.  We describe in detail the format of Stata .dta data
          sets for those interested in writing programs in C or other languages
          that read and write our data sets.)
      
      
      Remarks
      
          Remarks are presented under the following headings:
      
              1. Introduction
              2. Representation of strings
              3. Representation of numbers
              4. Final cautions
              5. Data set format definition
                  5.1 Header
                  5.2 Descriptors
                  5.3 Variable labels
                  5.4 Data
                  5.5 Value labels
      
      
      1. Introduction
      
          Stata-format data sets record data in a way generalized to work across
          computers that do not agree on how data are recorded.  Given a computer,
          data sets are divided into two categories: native-format and
          foreign-format data sets.  Stata uses the following two rules:
      
              R1. On any computer, Stata knows how to write only native-format data
                  sets.
      
              R2. On all computers, Stata can read foreign-format as well as
                  native-format data sets.
      
          Rules R1 and R2 ensure that Stata users need not be concerned with data
          set formats.  If you are writing a program to read and write Stata data
          sets, you will have to determine whether you want to follow the same
          rules or instead restrict your program to operate on only native-format
          data sets.  Since Stata follows rules R1 and R2, such a restriction would
          not be too limiting.  If the user had a foreign-format data set, he or
          she could enter Stata, use the data, and then save it again.
      
      
      
      2. Representation of strings
      
          1. Strings in Stata may be from 2 to 80 bytes long, in steps of 2. The
             file format is general for strings from 1 to 128 bytes long, including
             odd-length strings.  Stata does not currently support strings with
             lengths in excess of 80 characters.  Stata will operate properly with
             odd-length strings, but there is a performance penalty.
      
          2. Stata records a string with a trailing binary 0 (\0) delimiter if the
             length of the string is less than the maximum declared length.  The
             string is recorded without the delimiter if the string is of the
             maximum length.
      
          3. Leading and trailing blanks are significant.
      
          4. Strings use ASCII encoding.
      
          5. Strings may contain only the ASCII codes 1-127.  The remaining codes
             (0, 128-255) are reserved.
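
          The delimiter rule in item 2 means a reader cannot treat a string
          field as an ordinary C string.  The following is a minimal sketch of
          one way to copy such a field safely; the function name
          dta_copy_string and its calling convention are hypothetical and are
          not part of the file format.

```c
#include <stddef.h>

/* Copy a fixed-width .dta string field into a C string.
 *   field: raw bytes of the field as stored in the file
 *   width: declared maximum length of the field (1 to 128)
 *   out:   caller's buffer of at least width + 1 bytes
 * Returns the length of the decoded string.
 * (Hypothetical helper, for illustration only.) */
size_t dta_copy_string(const unsigned char *field, size_t width, char *out)
{
    size_t n = 0;

    /* Stop at the \0 delimiter, or at the declared width when the
     * string fills the field and no delimiter was written. */
    while (n < width && field[n] != '\0') {
        out[n] = (char)field[n];
        n++;
    }
    out[n] = '\0';
    return n;
}
```

          Sizing the output buffer at width + 1 bytes leaves room for the
          terminator that a maximum-length string lacks in the file.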
      
      
      3. Representation of numbers
      
          1. Numbers are represented as 2-byte and 4-byte integers and 4-byte and
             8-byte floats.  In the case of floats, IEEE format is used.
      
          2. Byte ordering varies across machines for all numeric types.  Bytes are
             ordered either least-significant to most-significant, dubbed LOHI, or
             most-significant to least-significant, dubbed HILO.  The IBM PC and
             DEC VAX, for instance, use LOHI encoding.  Machines based on the
             Motorola 68000 chip, such as the Sun, use HILO encoding.
      
          3. For purposes of written documentation, numbers are written with the
             most significant byte listed first.  Thus, x'0001' refers to an int
             taking on the logical value 1 on all machines.
      
          4. When reading a HILO number on a LOHI machine or a LOHI number on a
             HILO machine, perform the following before interpreting the number:
      
                  byte          no translation necessary
                  2-byte int    swap bytes 0 and 1
                  4-byte int    swap bytes 0 and 3, 1 and 2
                  4-byte float  swap bytes 0 and 3, 1 and 2
                  8-byte float  swap bytes 0 and 7, 1 and 6, 2 and 5, 3 and 4
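
          Every row of the table above amounts to reversing the bytes of the
          value.  A minimal sketch in C, assuming the value has already been
          read into a byte buffer (the name dta_swap_bytes is hypothetical):

```c
#include <stddef.h>

/* Reverse the bytes of one numeric value in place.
 * size is the width of the value in bytes: 2, 4, or 8.
 * A size of 1 leaves the byte untouched, matching the table.
 * (Hypothetical helper, for illustration only.) */
void dta_swap_bytes(unsigned char *p, size_t size)
{
    size_t i;

    for (i = 0; i < size / 2; i++) {
        unsigned char tmp = p[i];
        p[i] = p[size - 1 - i];
        p[size - 1 - i] = tmp;
    }
}
```

          The swap is needed only when the byteorder recorded in the file
          differs from that of the machine doing the reading.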
      
      
      4. Final cautions
      
          1. The number of observations in a data set is stored in a 4-byte
             integer.  Nevertheless, Stata on a PC under DOS cannot process more
             than 32,754 (sic) observations.  This limit does not apply under Unix.
      
      
          2. All variable numbers are recorded as 2-byte integers. Nevertheless, no
             version of Stata will allow more than 254 (sic) variables in a data
             set.  This may be relaxed in future versions.
      
      
      5. Data set format definition
      
          Stata-format data sets contain five components, which are, in order:
      
              1. Header
              2. Descriptors
              3. Variable Labels
              4. Data
              5. Value Labels
      
      
      5.1 Header
      
          The Header is defined as
      
              Contents            Length    Format    Comments
              -----------------------------------------------------------------------
              release                  1    byte      contains 103 = x'67'
              byteorder                1    byte      x'01' -> HILO, x'02' -> LOHI
              filetype                 1    byte      x'01' -> data, x'02' -> xp
              unused                   1    byte      x'00'
              nvar (number of vars)    2    int       encoded per byteorder
              nobs (number of obs)     4    int       encoded per byteorder
              data_label              32    char      data set label, \0 terminated
              -----------------------------------------------------------------------
              Total                  42
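
          The table above fixes the byte offset of every header field: release
          at 0, byteorder at 1, filetype at 2, nvar at 4, nobs at 6, and
          data_label at 10.  The following sketch decodes a 42-byte header
          from memory.  The struct, the function names, the choice to reject
          any release other than 103, and the treatment of nobs as unsigned
          are all assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

#define DTA_HEADER_LEN 42

/* Hypothetical in-memory form of the header. */
struct dta_header {
    uint8_t  release;         /* 103 for this format */
    uint8_t  byteorder;       /* 1 = HILO, 2 = LOHI */
    uint8_t  filetype;
    uint16_t nvar;
    uint32_t nobs;            /* assumption: treated as unsigned here */
    char     data_label[33];  /* 32 bytes plus a guaranteed terminator */
};

/* Assemble a 2-byte integer from bytes stored in the given order. */
static uint16_t dta_get16(const unsigned char *p, int hilo)
{
    return hilo ? (uint16_t)((p[0] << 8) | p[1])
                : (uint16_t)((p[1] << 8) | p[0]);
}

/* Assemble a 4-byte integer from bytes stored in the given order. */
static uint32_t dta_get32(const unsigned char *p, int hilo)
{
    return hilo
        ? ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
          ((uint32_t)p[2] << 8)  |  (uint32_t)p[3]
        : ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
          ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/* Decode the header.  Returns 0 on success, or -1 if the release or
 * byteorder byte is not one described in this entry. */
int dta_parse_header(const unsigned char *buf, struct dta_header *h)
{
    int hilo;

    if (buf[0] != 103 || (buf[1] != 1 && buf[1] != 2))
        return -1;
    hilo = (buf[1] == 1);

    h->release   = buf[0];
    h->byteorder = buf[1];
    h->filetype  = buf[2];
    /* buf[3] is the unused byte */
    h->nvar = dta_get16(buf + 4, hilo);
    h->nobs = dta_get32(buf + 6, hilo);
    memcpy(h->data_label, buf + 10, 32);
    h->data_label[32] = '\0';
    return 0;
}
```

          Assembling the integers byte by byte avoids any dependence on the
          byte order of the machine running the reader.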
      
      
      5.2 Descriptors
      
          The Descriptor is defined as
      
              Contents            Length    Format       Comments
              ----------------------------------------------------------------
              typlist               nvar    byte array
              varlist             9*nvar    char array
              srtlist         2*(nvar+1)    int array    encoded per byteorder
              fmtlist             7*nvar    char array
              lbllist             9*nvar    char array
              ----------------------------------------------------------------
      
      
          typlist stores the type of each variable, 1, ..., nvar. Stata stores four
          numeric types: double, float, long, and int, encoded 'd', 'f', 'l', and
          'i'.  If nvar == 4, a typlist of 'idlf' indicates that variable 1 is an
          int, variable 2 a double, variable 3 a long, and variable 4 a float.
          Types 0x80 and above represent strings, encoded as length + 0x7f.
          For example, a string with maximum length 8 would have type 0x87.  If
          typlist is read into the C-array char typlist[], then typlist[i-1]
          indicates the type of variable i.
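
          A reader needs the storage width of each type to step through the
          data.  The following sketch maps a typlist code to its width using
          the sizes given in section 3 and the string encoding above; the
          function name dta_type_width and the -1 error convention are
          hypothetical.

```c
/* Return the number of bytes one value of the given typlist code
 * occupies, or -1 if the code is not one described in this entry.
 * (Hypothetical helper, for illustration only.) */
int dta_type_width(unsigned char type)
{
    if (type >= 0x80)
        return type - 0x7f;   /* string: length + 0x7f, so 0x87 -> 8 */

    switch (type) {
    case 'd': return 8;       /* double: 8-byte float */
    case 'f': return 4;       /* float:  4-byte float */
    case 'l': return 4;       /* long:   4-byte int   */
    case 'i': return 2;       /* int:    2-byte int   */
    default:  return -1;
    }
}
```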
      
          varlist contains the names of the Stata variables 1, ..., nvar, each up
          to 8 characters in length, and each terminated by a binary zero
          (hereafter denoted as \0). For instance, if nvar == 4,
      
              vbl1\0....myvar\0...thisvar\0.lstvar\0..
      
      
          would indicate that variable 1 is named vbl1, variable 2 myvar, variable
          3 thisvar, and variable 4 lstvar.  The byte positions indicated by
          periods will contain random numbers.  If varlist is read into the C-array
          char varlist[], then &varlist[(i-1)*9] points to the name of the ith
          variable.
      
          srtlist specifies the sort-order of the data set and is terminated by an
          (int) 0.  Each 2 bytes is a single int and contains either a variable
          number or zero.  The zero marks the end of the srtlist, and the array
          positions after that contain random junk.  For instance, if the data are
          not sorted, the first int will contain a zero and the ints thereafter
          will contain junk.  If nvar == 4, the record will appear as:
      
              '0000................'
      
          If the data set is sorted by a single variable myvar and if that variable
          is the second variable in the varlist, the record will appear as:
      
              '02000000............'  (if byteorder==LOHI)
              '00020000............'  (if byteorder==HILO)
      
          If the data are sorted by myvar and within myvar by vbl1, and vbl1 is the
          first variable in the data, the record will appear as:
      
              '020001000000........'  (if byteorder==LOHI)
              '000200010000........'  (if byteorder==HILO)
      
      
          If srtlist were read into the C-array int srtlist[] (where int is 2
          bytes), then srtlist[0] would be the number of the first sort variable
          or, if the data were not sorted, 0.  If the number is not zero,
          srtlist[1] would be the number of the second sort variable or, if there
          is not a second sort variable, 0, and so on.
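
          The srtlist rules can be sketched as a loop that stops at the first
          zero and ignores whatever follows it.  The function below works on
          the raw bytes so that it does not depend on the size of a C int; its
          name and signature are hypothetical.

```c
#include <stddef.h>

/* Decode srtlist from its raw 2*(nvar+1) bytes.
 *   raw:  the srtlist bytes as stored in the file
 *   nvar: number of variables, from the header
 *   hilo: nonzero if the file's byteorder is HILO
 *   out:  room for nvar variable numbers
 * Returns the number of sort variables (0 if the data are unsorted).
 * (Hypothetical helper, for illustration only.) */
size_t dta_decode_srtlist(const unsigned char *raw, size_t nvar,
                          int hilo, unsigned *out)
{
    size_t k;

    for (k = 0; k < nvar; k++) {
        unsigned v = hilo ? ((unsigned)raw[2*k] << 8) | raw[2*k + 1]
                          : ((unsigned)raw[2*k + 1] << 8) | raw[2*k];
        if (v == 0)
            break;            /* terminator: the rest is junk */
        out[k] = v;
    }
    return k;
}
```

          Applied to the LOHI record '020001000000........' with nvar == 4,
          the function returns 2 and stores the variable numbers 2 and 1.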
      
          fmtlist contains the formats of the variables 1, ..., nvar.  Each format
          is 7 bytes long and includes a binary zero end-of-string marker.  For
          instance,
      
              %9.0f\0.%8.2g\0.%20.0g\0%3.0f\0.
      
          indicates that variable 1 has a %9.0f format, variable 2 a %8.2g format,
          variable 3 a %20.0g format, and variable 4 a %3.0f format.  If fmtlist is
          read into the C-array char fmtlist[], then &fmtlist[7*(i-1)] refers to
          the starting address of the format for the ith variable.  Users are
          warned that the Stata g-format is not the same as the C g-format.
          Stata's g-format is far more sophisticated.
      
          lbllist contains the names of the value labels associated with the
          variables 1, ..., nvar.  Each value-label name is 9 bytes long and
          includes a binary zero end-of-string marker.  For instance,
      
              \0........yesno\0...\0........yesno\0...
      
          indicates that variables 1 and 3 have no value label associated with
          them, whereas variables 2 and 4 are both associated with the value label
          named yesno.  If lbllist is read into the C-array char lbllist[], then
          &lbllist[9*(i-1)] points to the start of the label name associated with
          the ith variable.
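
          Because the five descriptor arrays are stored back to back, the
          starting offset of each follows from the lengths in the table.  The
          following sketch computes those offsets relative to the start of the
          Descriptors and looks up a variable name; the struct and function
          names are hypothetical.

```c
#include <stddef.h>

/* Offsets of each descriptor array from the start of the Descriptors.
 * (Hypothetical type, for illustration only.) */
struct dta_desc_offsets {
    size_t typlist, varlist, srtlist, fmtlist, lbllist, total;
};

/* Derive the offsets from the lengths in the Descriptors table. */
struct dta_desc_offsets dta_desc_layout(size_t nvar)
{
    struct dta_desc_offsets o;

    o.typlist = 0;
    o.varlist = o.typlist + nvar;            /* typlist: nvar bytes      */
    o.srtlist = o.varlist + 9 * nvar;        /* varlist: 9*nvar bytes    */
    o.fmtlist = o.srtlist + 2 * (nvar + 1);  /* srtlist: 2*(nvar+1)      */
    o.lbllist = o.fmtlist + 7 * nvar;        /* fmtlist: 7*nvar bytes    */
    o.total   = o.lbllist + 9 * nvar;        /* lbllist: 9*nvar bytes    */
    return o;
}

/* Name of variable i (1-based), given the Descriptors read into desc. */
const char *dta_var_name(const unsigned char *desc, size_t nvar, size_t i)
{
    return (const char *)desc + dta_desc_layout(nvar).varlist + 9 * (i - 1);
}
```

          With nvar == 4, the arrays start at offsets 0, 4, 40, 50, and 78,
          and the Descriptors occupy 114 bytes in all.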
      
      
      5.3 Variable labels
      
          The Variable Labels are recorded as
      
              Contents            Length    Format     Comments
              ------------------------------------------------------
              Variable 1's label      32    char       \0 terminated
              Variable 2's label      32    char       \0 terminated
              .                        .    .          .
              .                        .    .          .
              Variable nvar's label   32    char       \0 terminated
      
          If there is no variable label, the first character of its label is \0.
      
      5.4 Data
      
          The Data are recorded as
      
              Contents                Length         Format
              -----------------------------------------------
              var 1, obs 1         per typlist    per typlist
              var 2, obs 1         per typlist    per typlist
              .                              .    .
              var nvar, obs 1      per typlist    per typlist
              var 1, obs 2         per typlist    per typlist
              .                              .    .
              .                              .    .
              var 1, obs nobs      per typlist    per typlist
              .                              .    .
              var nvar, obs nobs   per typlist    per typlist
      
          The data are written as all the variables on the first observation,
          followed by all the data on the second observation, and so on.  Each
          variable is written in its own internal format, as given in typlist.  All
          values are written per byteorder.  Strings are null terminated if they
          are shorter than the allowed space, but they are not terminated if they
          occupy the full width.
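
          Since every section before the Data has a length determined by nvar,
          and every observation has the same width, the position of any single
          value can be computed directly.  The following sketch does so under
          the layout given in sections 5.1 through 5.3; all function names are
          hypothetical.

```c
#include <stddef.h>

/* Width in bytes of one value of a typlist code; 0 if unrecognized. */
static size_t dta_width_of(unsigned char type)
{
    if (type >= 0x80)
        return (size_t)type - 0x7f;   /* string of that many bytes */
    switch (type) {
    case 'd': return 8;
    case 'f': return 4;
    case 'l': return 4;
    case 'i': return 2;
    default:  return 0;
    }
}

/* Byte offset of the Data section from the start of the file. */
size_t dta_data_offset(size_t nvar)
{
    size_t header      = 42;
    size_t descriptors = nvar + 9*nvar + 2*(nvar + 1) + 7*nvar + 9*nvar;
    size_t varlabels   = 32 * nvar;

    return header + descriptors + varlabels;
}

/* Number of bytes occupied by one observation. */
size_t dta_obs_width(const unsigned char *typlist, size_t nvar)
{
    size_t w = 0, j;

    for (j = 0; j < nvar; j++)
        w += dta_width_of(typlist[j]);
    return w;
}

/* File offset of variable var in observation obs (both 1-based). */
size_t dta_value_offset(const unsigned char *typlist, size_t nvar,
                        size_t obs, size_t var)
{
    size_t off = dta_data_offset(nvar)
               + (obs - 1) * dta_obs_width(typlist, nvar);
    size_t j;

    for (j = 0; j + 1 < var; j++)
        off += dta_width_of(typlist[j]);
    return off;
}
```

          For nvar == 4 and a typlist of 'idlf', the Data begin at byte 284,
          each observation occupies 18 bytes, and variable 3 of observation 2
          starts at byte 312.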
      
          End-of-file may occur at this point.  If it does, there are no value
          labels to be read.  End-of-file may similarly occur between value labels.
          On end-of-file, all data have been processed.
      
      
      5.5 Value labels
      
          Each Value Label is written as:
      
              Contents            Length    Format    Comments
              -------------------------------------------------------------------
              entries                  2    int       Number of entries in label
              labname                  9    char      \0 terminated
              padding                  1    byte
              value 1                  2    int
              value 2                  2    int
              .                        .    .
              value entries            2    int
              name 1                   8    char      see below
              name 2                   8    char      see below
              .                        .    .
              name entries             8    char      see below
              -------------------------------------------------------------------
      
          Each name is up to 8 characters in length.  If it is 7 or fewer
          characters, it is terminated by a binary zero.  If it is exactly 8
          characters in length, no binary 0 end-of-string marker is included.
      
          The above layout is repeated for each value label included in the file.
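
          From the table, a value label with a given number of entries
          occupies 2 + 9 + 1 + 2*entries + 8*entries bytes, with the values
          starting at byte 12 of the record and the names following them.  The
          following sketch reads individual values and names from one record
          held in memory.  The function names are hypothetical, and treating
          the 2-byte values as signed is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Total size in bytes of one value label record. */
size_t dta_vallab_size(unsigned entries)
{
    return 2 + 9 + 1 + 2 * (size_t)entries + 8 * (size_t)entries;
}

/* Value number k (0-based) of the record, decoded per byteorder.
 * Assumption: the 2-byte value is interpreted as signed. */
int dta_vallab_value(const unsigned char *rec, unsigned k, int hilo)
{
    const unsigned char *p = rec + 12 + 2 * (size_t)k;
    unsigned v = hilo ? ((unsigned)p[0] << 8) | p[1]
                      : ((unsigned)p[1] << 8) | p[0];

    return v >= 0x8000u ? (int)v - 0x10000 : (int)v;
}

/* Copy name number k (0-based) into out, which must hold 9 bytes. */
void dta_vallab_name(const unsigned char *rec, unsigned entries,
                     unsigned k, char *out)
{
    const unsigned char *names = rec + 12 + 2 * (size_t)entries;

    memcpy(out, names + 8 * (size_t)k, 8);
    out[8] = '\0';   /* supplies the marker that 8-character names lack */
}
```

          Copying all 8 bytes and then appending a terminator handles both
          cases: a shorter name ends at its own binary zero, and a full-width
          name ends at the appended one.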
      
          See dta for the current version formats.