ENH: Add support for reading 110-format Stata dta files #58044
Could you add a whatsnew note to v3.0.0.rst under Other Enhancements?
I have now added the whatsnew line as requested.
cc @bashtage
Do we have documentation that there are no differences between 110 and 111? This seems to be the assumption here.
There is a difference between 110 and 111: the 110 format uses the older typlist codes, which limit string variables to a maximum of 80 characters. There is official documentation for the 110 format in the Stata 7 manual. However, I have not been able to track down any documentation for the already supported 111 format to compare it with.
Small comment. Is it really the case that there are no differences between 108 and 110 aside from the version number? Have you done a diff of the documentation to verify?
pandas/io/stata.py (outdated)
@@ -1407,7 +1407,7 @@ def _read_old_header(self, first_char: bytes) -> None:
         self._time_stamp = self._get_time_stamp()

         # descriptors
-        if self._format_version > 108:
+        if self._format_version > 110:
Why was this increased?
This was previously trying to use the newer typlist encoding for everything higher than the 108 format whereas the 110 format kept using the older version (presumably as string variables in Small and Intercooled Stata 7 were still limited to 80 characters). To the best of my knowledge this is the only difference between the 110 and 111 formats.
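The distinction described above can be sketched roughly as follows. This is an illustration, not the code in pandas/io/stata.py: `decode_typlist`, `OLD_NUMERIC_CODES` and `OLD_STRING_OFFSET` are hypothetical names, and the specific numeric codes (letter codes for numeric types, and a string code of 127 plus the string length in the older encoding) are assumptions that do not come from this thread.

```python
# Sketch only: the code values below are assumptions, not taken from this PR.
OLD_NUMERIC_CODES = {
    ord("b"): 251,  # byte
    ord("i"): 252,  # int
    ord("l"): 253,  # long
    ord("f"): 254,  # float
    ord("d"): 255,  # double
}
OLD_STRING_OFFSET = 127  # assumed: old string code = 127 + string length


def decode_typlist(raw: bytes, format_version: int) -> list[int]:
    """Return type codes in the newer numbering for either encoding."""
    if format_version > 110:
        # 111 and later: each byte already uses the newer numbering.
        return list(raw)
    decoded = []
    for code in raw:
        if code in OLD_NUMERIC_CODES:
            decoded.append(OLD_NUMERIC_CODES[code])
        else:
            # Anything else is treated as a string; recover its length.
            decoded.append(code - OLD_STRING_OFFSET)
    return decoded
```

With this gate, a 110-format typlist goes through the older mapping, which is the behaviour the changed comparison is meant to restore.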
pandas/io/stata.py (outdated)
@@ -1408,7 +1408,7 @@ def _read_old_header(self, first_char: bytes) -> None:
         self._time_stamp = self._get_time_stamp()

         # descriptors
-        if self._format_version > 108:
+        if self._format_version > 110:
Can we switch this logic to >= and use only versions that have explicit support?
Yes, I'll update it to >= 111.
The differences between 108 and 110 are that the maximum variable name length was increased from 8 to 32 characters (https://www.stata.com/stata7/language.html#longnames) and the expansion record field size was increased from 2 to 4 bytes. The display format also began to allow European decimals (https://www.stata.com/stata7/language.html#andmore), but that doesn't affect the file structure.
My understanding is that the 110 and 111 formats are the same apart from their typlist encodings, which would make sense as they're both for Stata 7, but for editions with different limits. It appears that Stata 7/SE was released around a year after Stata 7/IC and Small Stata 7 (see the 01feb2002 entry in https://www.stata.com/help.cgi?whatsnew7), which might explain the lack of documentation.
Assuming that the 111 format is implemented correctly, it seems that the 113 format is the same as 111, except that the values used to encode missing values were changed to allow 26 additional missing codes (see https://www.stata.com/help.cgi?whatsnew7to8).
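The differences listed in this comment can be laid out as a small lookup table. This is a sketch for comparison only: `FormatTraits`, `FORMAT_TRAITS` and `changed_fields` are hypothetical names, the values are transcribed from the comment, and the 111 and 113 rows rest on the commenter's stated understanding rather than on located documentation.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FormatTraits:
    max_varname_chars: int    # longest variable name, in characters
    expansion_len_bytes: int  # size of the expansion record field
    new_typlist: bool         # True if the newer typlist encoding is used
    extra_missing_codes: bool # True if the 26 additional missing codes exist


# Transcribed from the comment above; not verified against the manuals here.
FORMAT_TRAITS = {
    108: FormatTraits(8, 2, False, False),
    110: FormatTraits(32, 4, False, False),
    111: FormatTraits(32, 4, True, False),
    113: FormatTraits(32, 4, True, True),
}


def changed_fields(old: int, new: int) -> list[str]:
    """Names of the traits that differ between two format versions."""
    a, b = FORMAT_TRAITS[old], FORMAT_TRAITS[new]
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]
```

For example, `changed_fields(110, 111)` reports only the typlist encoding, which matches the claim that the two Stata 7 formats are otherwise the same.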
In case it helps, here are the changes that I have determined between each format version (excluding changes to the display format codes) from looking at the available documentation:
102 (confirmed as undocumented but can be inferred from the next version, the Stata 1 manual and a "history of Stata" article)
103 (documented in Stata 2 manual)
104 (documentation not yet located - probably in Stata 3 manual)
105 (documented in Stata 4 and 5 manuals)
108 (documented in Stata 6 manual)
110 (documented in Stata 7 manual)
111 (documentation not found - maybe in Stata 7/SE manual if this exists)
113 (documented in on-line help)
114 (documented in on-line help)
115 (documented in on-line help)
117 (documented in on-line help)
118 (documented in on-line help)
119 (documented in on-line help)
120 (documented in on-line help)
121 (documented in on-line help)
Can you rebase and ping on green?
…d or new typlist version
0028ff9 to ee3bae8
I have now rebased this, and all checks pass.
Thanks. LGTM.
This change adds the ability to read 110-format (Stata 7) dta files. A test data file is included in the same style as other supported versions.
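To check whether a given file is in the 110 format before reading it, something like the following could be used. This is a minimal sketch, assuming that older dta files store the format version number in their first byte and that 117 and later begin with `<stata_dta><header><release>` followed by a three-digit version; `detect_dta_version` is a hypothetical helper, not part of pandas.

```python
NEW_HEADER_PREFIX = b"<stata_dta><header><release>"


def detect_dta_version(head: bytes) -> int:
    """Guess the dta format version from the first bytes of a file."""
    if not head:
        raise ValueError("empty input")
    if head.startswith(NEW_HEADER_PREFIX):
        # Assumed layout for 117+: the version follows the prefix as ASCII digits.
        start = len(NEW_HEADER_PREFIX)
        return int(head[start:start + 3])
    # Assumed layout for older files: the first byte is the version number.
    return head[0]
```

In practice one would pass in the first few dozen bytes of the file (for example `open(path, "rb").read(32)`) and then load the data with `pandas.read_stata(path)` as for any other supported version.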