pandas cannot actually read Stata file format 104 #26667

Closed
miker985 opened this issue Jun 5, 2019 · 8 comments · Fixed by #34116
Labels
Enhancement IO Stata read_stata, to_stata
Comments

@miker985
Contributor

miker985 commented Jun 5, 2019

Code Sample, a copy-pastable example if possible

Data usage agreements prevent me from attaching a Stata file of format version 104, so I cannot easily provide a copy/paste example. This is unfortunate as I have several of these files...

. dtaversion CIV_LSMS_1987_ANTHROPOMETRICS_A.DTA
  (file "CIV_LSMS_1987_ANTHROPOMETRICS_A.DTA" is .dta-format 104 from Stata 4)
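For readers without Stata at hand, the same check can be approximated in Python. This is a minimal sketch, assuming the older binary .dta layouts store the format version in the first byte of the file and that newer files (format 117 and later) begin with a `<` character instead. The helper name `sniff_old_dta_version` is hypothetical and is not part of pandas.

```python
import io
import struct


def sniff_old_dta_version(stream):
    """Return the format version stored in the first byte of an old-style .dta stream.

    Returns None when the stream starts with "<", which is assumed here to
    mark the newer tagged header used by format 117 and later.
    """
    first = stream.read(1)
    if not first:
        raise ValueError("empty stream")
    if first == b"<":
        return None
    # The version is read as a single signed byte.
    return struct.unpack("b", first)[0]


# Usage with an in-memory stand-in for a real file:
print(sniff_old_dta_version(io.BytesIO(bytes([104, 2, 1, 0]))))
```

With a real file, the stream would come from `open(path, "rb")`.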

Problem description

The stata.py module raises errors stating that it can only read certain format versions, and it explicitly lists version 104 both in the _version_error string and in StataReader._read_old_header.

However, StataReader cannot actually read files with format version 104, because:

  1. StataReader.__init__ calls _read_header
  2. _read_header calls _read_old_header
  3. _read_old_header calls _get_time_stamp
  4. _get_time_stamp raises a ValueError for any format version that is not > 104, which includes 104 itself

None of this behavior can be overridden or otherwise configured when instantiating pandas.io.stata.StataReader.
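The call chain can be modelled with a small stand-in class. This is only a sketch of the control flow described in the list, not the pandas source: `ToyStataReader` is a hypothetical name, and the 18-byte null-terminated timestamp field for versions above 104 is an assumption made for illustration.

```python
import io


class ToyStataReader:
    """Hypothetical stand-in for the call chain in the list; not pandas code."""

    def __init__(self, buf, format_version):
        self.buf = buf
        self.format_version = format_version
        self._read_header()  # step 1: header parsing starts in the constructor

    def _read_header(self):
        self._read_old_header()  # step 2

    def _read_old_header(self):
        self.time_stamp = self._get_time_stamp()  # step 3

    def _get_time_stamp(self):
        # Step 4. Assumed layout: versions above 104 carry an 18-byte,
        # null-terminated timestamp; anything else is rejected.
        if self.format_version > 104:
            return self.buf.read(18)[:-1].decode("latin-1")
        raise ValueError("no timestamp handling for this format version")


try:
    ToyStataReader(io.BytesIO(b""), format_version=104)
except ValueError as exc:
    print("construction failed:", exc)
```

Because the failure happens inside `__init__`, no constructor argument can route around it; the only way through is to replace `_get_time_stamp` in a subclass.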

Expected Output

Remove 104 from the list of supported formats, so that the error reads: ValueError: Version of given Stata file is not 105, 108, 111 (Stata 7SE), 113 (Stata 8/9), 114 (Stata 10/11), 115 (Stata 12), 117 (Stata 13), or 118 (Stata 14)

Alternatively, _get_time_stamp could be changed to return "" or a similar value to reflect that Stata did not include a file timestamp in that format version. In my experiments, subclassing StataReader to suppress the error, reading the data file into a DataFrame, and exporting it to CSV yields a file equivalent to the output of an export delimited call in Stata.
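The alternative behavior might look roughly like the following. This is a sketch under stated assumptions rather than a patch: `read_time_stamp` is a hypothetical free function standing in for the private method, and the 18-byte timestamp field for versions above 104 is assumed, not confirmed here.

```python
import io


def read_time_stamp(buf, format_version):
    """Hypothetical helper: return the header timestamp, or "" if the format has none."""
    if format_version > 104:
        # Assumed layout: 18-byte field whose last byte is a null terminator.
        return buf.read(18)[:-1].decode("latin-1")
    # Format 104 and older: assume no timestamp field, so consume no bytes.
    return ""


old = io.BytesIO(b"data that follows the header")
print(repr(read_time_stamp(old, 104)))
print(old.tell())
```

The point of returning early without reading is that the stream position stays where the next header field begins, so the rest of the parse is unaffected.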

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-50-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Contributor

cc @bashtage does this make sense?

@bashtage
Contributor

bashtage commented Jun 5, 2019

By design. AFAIK there is no documentation for this ancient format.

@bashtage
Contributor

bashtage commented Jun 5, 2019 via email

@TomAugspurger
Contributor

Sounds good. Would you recommend bumping our minimum supported Stata version?

@jorisvandenbossche added the IO Stata (read_stata, to_stata) label Jun 5, 2019
@bashtage
Contributor

bashtage commented Jun 5, 2019

Should figure out the oldest dta file available for testing and probably use that as the minimum officially supported. If someone could provide some 104 files (and a dump of their contents) then it could be possible to verify the 104 compliance.

@bashtage
Contributor

bashtage commented Jun 5, 2019

Just looked it up, dta file format 113 comes from 2003, so I can't imagine when a dta 104 file dates to.

@miker985
Contributor Author

miker985 commented Jun 5, 2019

Per Stata, version 4 was released in 1995.
https://www.stata.com/support/faqs/resources/history-of-stata/

@bashtage
Contributor

bashtage commented Jun 5, 2019

I found a format 108 file online. pandas can read the numeric values but doesn't handle all of the data correctly. Might see if I can figure out a patch.

bashtage added a commit to bashtage/pandas that referenced this issue May 11, 2020
Test against old Stata versions and remove text indicating support
for versions which do not work reliably

closes pandas-dev#26667
@jreback jreback added this to the 1.1 milestone May 11, 2020
jreback pushed a commit that referenced this issue May 12, 2020
Test against old Stata versions and remove text indicating support
for versions which do not work reliably

closes #26667