Statistics/Data analysis


Title

    [P] File formats .dta -- Description of .dta file format 102


Warning:  The entry below describes the contents of an old Stata .dta file
format.  Newer versions of Stata continue to read, and perhaps to write, this
old format.  What follows is the original help file for the .dta file format
when it was the current file format.


Description

    (The information contained in this highly technical entry probably does
    not interest you.  We describe in detail the format of Stata .dta data
    sets for those interested in writing programs in C or other languages
    that read and write our data sets.)


Remarks

    Remarks are presented under the following headings:

        1.  Representation of strings
        2.  Representation of numbers
        3.  Final cautions
        4.  Data set format definition
            4.1  Header
            4.2  Descriptors
            4.3  Variable labels
            4.4  Data
            4.5  Value labels


1.  Representation of strings

    1.  Stata records a string with a trailing binary 0 (\0) delimiter if
        the length of the string is less than the maximum declared length.
        The string is recorded without the delimiter if the string is of
        the maximum length.

    2.  Leading and trailing blanks are significant.

    3.  Strings use ASCII encoding.

    4.  Strings may contain only the ASCII codes 1-127.  The remaining
        codes (0, 128-255) are reserved.


2.  Representation of numbers

    1.  Numbers are represented as 2-byte and 4-byte integers and 4-byte
        and 8-byte floats.  In the case of floats, IEEE format is used.

    2.  Bytes are ordered least-significant to most-significant, dubbed
        LOHI.

    3.  For purposes of written documentation, numbers are written with the
        most significant byte listed first.  Thus, x'0001' refers to an int
        taking on the logical value 1.

    4.  Each type allows for a missing value code, known as . (a period).
        For each type, the range allowed for nonmissing values and the
        missing value code are

            int      minimum nonmissing    -32768  (0x8000)
                     maximum nonmissing    +32766  (0x7ffe)
                     code for .            +32767  (0x7fff)

            long     minimum nonmissing    -2,147,483,648  (0x80000000)
                     maximum nonmissing    +2,147,483,646  (0x7ffffffe)
                     code for .            +2,147,483,647  (0x7fffffff)

            float    minimum nonmissing    -1e+37
                     maximum nonmissing    +1e+37
                     code for .            +2^128

            double   minimum nonmissing    -1e+99
                     maximum nonmissing    +1e+99
                     code for .            +2^333


3.  Final cautions

    1.  All variable numbers are recorded as 2-byte integers.  Nevertheless,
        no version of Stata will allow more than 254 (sic) variables in a
        data set.  This may be relaxed in future versions.


4.  Data set format definition

    Stata-format data sets contain five components, which are, in order:

        1.  Header
        2.  Descriptors
        3.  Variable Labels
        4.  Data
        5.  Value Labels


4.1  Header

    The Header is defined as

    Contents                Length   Format   Comments
    -----------------------------------------------------------------------
    release                      1   byte     contains 102 = x'66'
    unused                       1   byte     x'00'
    filetype                     1   byte     x'01' -> data, x'02' -> xp
    unused                       1   byte     x'00'
    nvar (number of vars)        2   int      encoded per byteorder
    nobs (number of obs)         2   int      encoded per byteorder
    data_label                  32   char     data set label, \0 terminated
    -----------------------------------------------------------------------
    Total                       40


4.2  Descriptors

    The Descriptor is defined as

    Contents    Length        Format        Comments
    ----------------------------------------------------------------
    typlist     nvar          byte array
    varlist     9*nvar        char array
    srtlist     2*(nvar+1)    int array     encoded per byteorder
    fmtlist     7*nvar        char array
    lbllist     9*nvar        char array
    ----------------------------------------------------------------

    typlist stores the type of each variable, 1, ..., nvar.  Stata stores
    four numeric types: double, float, long, and int, encoded 'd', 'f',
    'l', and 'i'.  If nvar == 4, a typlist of 'idlf' indicates that
    variable 1 is an int, variable 2 a double, variable 3 a long, and
    variable 4 a float.  If typlist is read into the C-array char
    typlist[], then typlist[i-1] indicates the type of variable i.
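
    The header layout and typlist codes described above can be sketched in
    C as follows.  This is a minimal illustration, not an official reader.
    The names dta102_header, dta102_u16_lohi, dta102_parse_header, and
    dta102_type_width are hypothetical.  The sketch assumes the 40 header
    bytes are already in memory, assumes LOHI byte order as stated in
    section 2, and treats nvar and nobs as unsigned, which the description
    above does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { DTA102_HEADER_LEN = 40 };

/* Hypothetical struct; fields follow the Header table in section 4.1. */
struct dta102_header {
    uint8_t  release;         /* expected to be 102 (x'66') */
    uint8_t  filetype;        /* x'01' = data, x'02' = xp */
    uint16_t nvar;            /* number of variables */
    uint16_t nobs;            /* number of observations */
    char     data_label[33];  /* 32 bytes from the file plus a forced \0 */
};

/* Decode a 2-byte LOHI value (least-significant byte first).
 * Assumption: the value is treated as unsigned. */
uint16_t dta102_u16_lohi(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Parse the 40-byte header held in buf.
 * Returns 0 on success, -1 if the release byte is not 102. */
int dta102_parse_header(const unsigned char *buf, struct dta102_header *h)
{
    assert(buf != NULL && h != NULL);
    if (buf[0] != 102)
        return -1;
    h->release  = buf[0];
    h->filetype = buf[2];                 /* buf[1] and buf[3] are unused */
    h->nvar     = dta102_u16_lohi(buf + 4);
    h->nobs     = dta102_u16_lohi(buf + 6);
    memcpy(h->data_label, buf + 8, 32);
    h->data_label[32] = '\0';             /* guard against a missing \0 */
    return 0;
}

/* Storage width in bytes of a numeric typlist code, or -1 for any code
 * other than the four numeric codes listed in section 4.2. */
int dta102_type_width(char code)
{
    switch (code) {
    case 'i': return 2;   /* int */
    case 'l': return 4;   /* long */
    case 'f': return 4;   /* float */
    case 'd': return 8;   /* double */
    default:  return -1;
    }
}
```

    A caller would read DTA102_HEADER_LEN bytes from the start of the
    file, pass them to dta102_parse_header, and then read nvar bytes of
    typlist.  Codes for which dta102_type_width returns -1 are left
    unhandled here because the text above describes only the four numeric
    codes.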

    varlist contains the names of the Stata variables 1, ..., nvar, each up
    to 8 characters in length, and each terminated by a binary zero
    (hereafter denoted as \0).  For instance, if nvar == 4,

        vbl1\0....myvar\0...thisvar\0.lstvar\0..

    would indicate that variable 1 is named vbl1, variable 2 myvar,
    variable 3 thisvar, and variable 4 lstvar.  The byte positions
    indicated by periods will contain random numbers.  If varlist is read
    into the C-array char varlist[], then &varlist[(i-1)*9] points to the
    name of the ith variable.

    srtlist specifies the sort-order of the data set and is terminated by
    an (int) 0.  Each 2 bytes is a single int and contains either a
    variable number or zero.  The zero marks the end of the srtlist, and
    the array positions after that contain random junk.  For instance, if
    the data are not sorted, the first int will contain a zero and the ints
    thereafter will contain junk.  If nvar == 4, the record will appear as:

        '0000................'

    If the data set is sorted by a single variable myvar and if that
    variable is the second variable in the varlist, the record will appear
    as:

        '02000000............'

    If the data are sorted by myvar and within myvar by vbl1, and vbl1 is
    the first variable in the data, the record will appear as:

        '020001000000........'

    If srtlist were read into the C-array int srtlist[], then srtlist[0]
    would be the number of the first sort variable or, if the data were not
    sorted, 0.  If the number is not zero, srtlist[1] would be the number
    of the second sort variable or, if there is not a second sort variable,
    0, and so on.

    fmtlist contains the formats of the variables 1, ..., nvar.  Each
    format is 7 bytes long and includes a binary zero end-of-string marker.
    For instance,

        %9.0f\0.%8.2g\0.%20.0g\0%3.0f\0.

    indicates that variable 1 has a %9.0f format, variable 2 a %8.2g
    format, variable 3 a %20.0g format, and variable 4 a %3.0f format.
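
    The varlist and srtlist layouts described above can be read along the
    following lines.  This is a minimal C sketch under stated assumptions,
    not a definitive implementation.  The function names dta102_var_name,
    dta102_find_var, and dta102_sort_order are hypothetical.  The sketch
    assumes both arrays have already been read into memory and that every
    name carries its \0 terminator, as the text states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the name of variable i (1-based) inside varlist,
 * where each name occupies 9 bytes including its \0 terminator. */
const char *dta102_var_name(const char *varlist, int i)
{
    assert(varlist != NULL && i >= 1);
    return &varlist[(size_t)(i - 1) * 9];
}

/* Return the 1-based number of the variable called name, or 0 if no
 * variable has that name. */
int dta102_find_var(const char *varlist, int nvar, const char *name)
{
    for (int i = 1; i <= nvar; i++)
        if (strcmp(dta102_var_name(varlist, i), name) == 0)
            return i;
    return 0;
}

/* Decode srtlist, held as raw bytes in LOHI order, into out[].
 * out must have room for nvar entries.  Decoding stops at the first
 * zero; the bytes after it are junk and are never examined.
 * Returns the number of sort variables (0 if the data are unsorted). */
int dta102_sort_order(const unsigned char *srtlist, int nvar, int *out)
{
    int n = 0;

    assert(srtlist != NULL && out != NULL && nvar >= 0);
    for (int k = 0; k < nvar; k++) {
        int v = srtlist[2 * k] | (srtlist[2 * k + 1] << 8);
        if (v == 0)
            break;
        out[n++] = v;
    }
    return n;
}
```

    Decoding from raw bytes rather than casting the buffer to int[] keeps
    the sketch independent of the host's byte order and integer width.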
    If fmtlist is read into the C-array char fmtlist[], then
    &fmtlist[7*(i-1)] refers to the starting address of the format for the
    ith variable.

    Formats are denoted by a leading percent sign (%) followed by the
    string "#.#", where # stands for an integer.  The first integer
    specifies the width of the format.  The second integer, which must be
    less than or equal to the first, specifies the number of digits that
    are to follow the decimal point.  A character denoting the format type
    (e, f, or g) is then listed.

    lbllist contains the names of the value labels associated with the
    variables 1, ..., nvar.  Each value-label name is 9 bytes long and
    includes a binary zero end-of-string marker.  For instance,

        \0........yesno\0...\0........yesno\0...

    indicates that variables 1 and 3 have no value label associated with
    them, whereas variables 2 and 4 are both associated with the value
    label named yesno.  If lbllist is read into the C-array char
    lbllist[], then &lbllist[9*(i-1)] points to the start of the label name
    associated with the ith variable.


4.3  Variable labels

    The Variable Labels are recorded as

    Contents                 Length   Format   Comments
    ------------------------------------------------------
    Variable 1's label           32   char     \0 terminated
    Variable 2's label           32   char     \0 terminated
    .                             .   .        .
    .                             .   .        .
    Variable nvar's label        32   char     \0 terminated
    ------------------------------------------------------

    If there is no variable label, the first character of its label is \0.


4.4  Data

    The Data are recorded as

    Contents              Length         Format
    -----------------------------------------------
    var 1, obs 1          per typlist    per typlist
    var 2, obs 1          per typlist    per typlist
    .                     .              .
    var nvar, obs 1       per typlist    per typlist
    var 1, obs 2          per typlist    per typlist
    .                     .              .
    .                     .              .
    var 1, obs nobs       per typlist    per typlist
    .                     .              .
    var nvar, obs nobs    per typlist    per typlist
    -----------------------------------------------

    The data are written as all the variables on the first observation,
    followed by all the data on the second observation, and so on.  Each
    variable is written in its own internal format, as given in typlist.
    All values are written per byteorder.  Strings are null terminated if
    they are shorter than the allowed space, but they are not terminated if
    they occupy the full width.

    End-of-file may occur at this point.  If it does, there are no value
    labels to be read.  End-of-file may similarly occur between value
    labels.  On end-of-file, all data have been processed.


4.5  Value labels

    Each Value Label is written as:

    Contents          Length   Format   Comments
    -------------------------------------------------------------------
    entries                2   int      Number of entries in label
    labname                9   char     \0 terminated
    padding                1   byte
    value 1                2   int
    value 2                2   int
    .                      .   .
    value entries          2   int
    name 1                 8   char     see below
    name 2                 8   char     see below
    .                      .   .
    name entries           8   char     see below
    -------------------------------------------------------------------

    Each name is up to 8 characters in length.  If it is 7 or fewer
    characters, it is terminated by a binary zero.  If it is exactly 8
    characters in length, no binary 0 end-of-string marker is included.

    The above layout is repeated for each value label included in the file.


See dta for the current version formats.
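
    The value-label layout of section 4.5 can be sketched in C as follows.
    This is a minimal illustration under stated assumptions, not a
    definitive implementation.  The function names dta102_vallab_entries,
    dta102_vallab_size, dta102_vallab_value, and dta102_vallab_text are
    hypothetical.  The sketch assumes one complete value-label record is
    already in memory starting at rec, assumes LOHI byte order as stated in
    section 2, and treats each value as a signed 2-byte int.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of entries in the value label that starts at rec. */
int dta102_vallab_entries(const unsigned char *rec)
{
    assert(rec != NULL);
    return rec[0] | (rec[1] << 8);
}

/* Total size in bytes of one value-label record:
 * 2 (entries) + 9 (labname) + 1 (padding) + 2 per value + 8 per name. */
size_t dta102_vallab_size(int entries)
{
    assert(entries >= 0);
    return 12 + 10 * (size_t)entries;
}

/* Return the k-th value (0-based), decoded as a signed 2-byte LOHI int. */
int dta102_vallab_value(const unsigned char *rec, int k)
{
    int entries = dta102_vallab_entries(rec);
    const unsigned char *p;
    int v;

    assert(k >= 0 && k < entries);
    p = rec + 12 + 2 * (size_t)k;
    v = p[0] | (p[1] << 8);
    if (v >= 0x8000)          /* sign-extend without relying on casts */
        v -= 0x10000;
    return v;
}

/* Copy the k-th name (0-based) into out, which must hold 9 bytes.
 * A \0 is always appended because a name of exactly 8 characters is
 * stored without a terminator. */
void dta102_vallab_text(const unsigned char *rec, int k, char out[9])
{
    int entries = dta102_vallab_entries(rec);
    const unsigned char *names;

    assert(out != NULL && k >= 0 && k < entries);
    names = rec + 12 + 2 * (size_t)entries;   /* names follow all values */
    memcpy(out, names + 8 * (size_t)k, 8);
    out[8] = '\0';
}
```

    Because the layout is repeated for each value label, a reader can
    advance by dta102_vallab_size(entries) bytes to reach the next record
    and stop when end-of-file is reached.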