read_sas() is not robust enough #15825

hmeine · 2017-03-28T14:58:43Z

Code Sample, a copy-pastable example if possible

# requires registration-only downloads from https://oai.epi-ucsf.org/datarelease/DataClinical.asp
df1 = pandas.read_sas('allclinical00.sas7bdat')
df2 = pandas.read_sas('AllClinical00.xpt')

Problem description

I downloaded SAS datasets from the Osteoarthritis Initiative (OAI) and tried loading them with Pandas. I do not have SAS myself, and I don't have prior experience with it. The OAI data offers different formats for download, as you can see from the above filenames.

Expected Output

I would expect large datasets (roughly 4800 rows with several hundreds columns), and (as far as I understood) the same data from the .sas7bdat and .xpt files. In fact, out of 12 .sas7bdat files, I can open 3, all others fail with:

('Warning: column count mismatch (%d + %d != %d)\n', 143, 191, 1226)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-9a49a9b73cd0> in <module>()
----> 1 df = pandas.read_sas('allclinical00.sas7bdat')

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/sas/sasreader.pyc in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
     59         return reader
     60
---> 61     data = reader.read()
     62     reader.close()
     63     return data

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/sas/sas7bdat.pyc in read(self, nrows)
    602             nrows = m
    603
--> 604         nd = (self.column_types == b'd').sum()
    605         ns = (self.column_types == b's').sum()
    606

AttributeError: 'bool' object has no attribute 'sum'

Note the (improperly formatted) warning at the top. self.column_types turns out to be an empty list ([]).

For df2, I also tried read_sas(), but got

ValueError: Header record is not an XPORT file.

Probably, this is related to the following part of the OAI documentation:

Using SAS Transport Files
The SAS dataset(s) in this zip file were created using SAS CPORT and the SAS V9 engine in the Windows environment. We strongly recommend that you use SAS V9 or higher to access the OAI data. …
PROC CPORT creates files in transport format, which uses an environment-independent standard for character encoding and numeric representation. Transport files that are created by PROC CPORT can be transferred across operating environments and read using PROC CIMPORT.
Note: SAS transport files that are created using PROC CPORT are not interchangeable with transport files that are created using the XPORT engine.

I think if I can load the .sas7bdat files, it would be OK if Pandas cannot read CPORT files. It could be
helpful, though, if it would recognize them and be more specific ("This seems to be a CPORT file. CPORT files are currently not supported, only XPORT and SAS7BDAT." or so).

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Darwin OS-release: 16.4.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: de_DE.UTF-8 LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 34.3.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.3
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.3.1
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: 2.4.5
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.5.3
html5lib: 1.0b10
httplib2: None
apiclient: None
sqlalchemy: 1.1.6
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2017-03-28T15:05:52Z

@hmeine well we need sample files. you can push them up on a gist / repo if you want.

hmeine · 2017-03-29T09:27:25Z

The problem is that I cannot redistribute the files, and I don't know how to "strip them down" to make them redistributable.

jreback · 2017-03-29T11:52:12Z

@hmeine need a specific example that doesn't work.

TheKonanKouassi · 2017-11-21T21:59:06Z

Hello @hmeine, I'm facing the same issue; And I don't know how to fix it. I'll be happy, if you can help me. My table has around 600,000 rows. Thanks.

hmeine · 2017-11-22T09:53:05Z

I am now using https://pypi.python.org/pypi/sas7bdat which works much better.

TheKonanKouassi · 2017-11-22T09:59:46Z

I'll try it and let you know. Thanks

sasutils · 2018-03-03T04:46:07Z

It is pretty trivial to tell the different between a XPORT file and CPORT file. Just look at the first few characters.

A version 5 XPORT file will start with a header line like:
HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!000000000000000000000000000000

A CPORT file will start with a header line like:
**COMPRESSED** **COMPRESSED** **COMPRESSED** **COMPRESSED** **COMPRESSED********

akravetz · 2019-04-30T20:26:59Z

Example CPORT file. It would be great if pandas could load these files. Many US government data files are provided in this format. This one, for example, comes from CMS and represents risk adjustment coefficients.

C2318P1Q.zip

hmeine · 2019-05-01T10:42:49Z

@akravetz Did you try sas7bdat? (See my comment above)
Maybe read support does not have to be in pandas itself. However, as I wrote above, it would be nice if pandas could give a more helpful message like "this is a CPORT file for which you may try installing the sas7bdat package". It could also try importing sas7bdat and use it if available.

akravetz · 2019-05-01T11:57:53Z

@hmeine sas7bdat does not support CPORT files either unfortunately.

hmeine · 2019-09-03T04:13:53Z

@akravetz can you clarify how you came to that conclusion? (“citation needed” ;-) )

I am very low on time ATM, and I would need some motivation to look into it again. It contradicts my statements above, and I know I spent quite some effort into researching this issue. So I guess your statement is not true, at least not in its generality.

It would be helpful to know if you found that statement somewhere (and where of course) or if you concluded it from some experiment (and which).

bghill · 2021-08-06T15:02:46Z

@hmeine I just downloaded sas7bdat and confirmed it cannot parse CPORT files. It fails with:

[AllClinical00.xpt] magic number mismatch
[AllClinical00.xpt] Failed to read a meta data page from file
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/brandong.hill/.pyenv/versions/test_sas/lib/python3.9/site-packages/sas7bdat.py", line 512, in readlines
    self._file.seek(self.properties.header_length)
TypeError: 'NoneType' object cannot be interpreted as an integer

Like several others, I have just lost a day digging into this same situation. I was not able to find any code out there that reads CPORT files (I went through the source code of them all including that of Pandas). Further digging shows that while SAS publishes the details needed to parse XPORT files, they do not do so for CPORT files. This post from the Library of Congress also states that CPORT is a proprietary format (as do other old blogs from folks quoting the FDA).

After digging through a CPORT with a hex editor, I don't think it would be terribly difficult to reverse engineer the format. I don't currently have the time. However, I agree it would be a nice addition for libraries to detect CPORT files and give a better warning. The first 80 in the file contain the text (unless the option NOCOMPRESS was used):
COMPRESSED COMPRESSED COMPRESSED COMPRESSED COMPRESSED******
Following this, the is a string of bytes that list the OS and SAS versions used to create the file:
LIB CONTROL X64_SR12¼^F SAS9.4

These should be straight-forward to write code to detect. Ideally, someone who has access to SAS just needs to create a minimal file and post it for the authors to tests against. Sadly, I don't have a copy.

hmeine · 2021-11-15T08:39:42Z

Ok, so I don't know much about SAS or its file formats; maybe the quote from the OAI website that I did above misled me to think that my files were CPORT files. They called them "transport files … created using SAS CPORT" and later "transport files that are created using PROC CPORT are not interchangeable with transport files that are created using the XPORT engine" which sounded as if these were the two variants there are.
Facts that I am sure about:

I have these OAI files for which the official documentation states that they were produced with "PROC CPORT".
pandas.read_sas cannot read them, although pandas version 1.2.4 which I just checked does produce some output, besides a warning ("column count mismatch (143 + 191 != 1226)"), but it does not look right.
I only could (and still can) read these with sas7bdat (I'll put the code below, to be precise about what I'm doing).
The file does not start with COMPRESSED...

f = sas7bdat.SAS7BDAT('ClinicalData/allclinical00.sas7bdat')
raw = f.to_data_frame()
raw['ID'] = raw.ID.astype(int)
df = raw.set_index('ID')

chris-b1 added IO SAS SAS: read_sas Needs Info Clarification about behavior needed to assess issue labels Mar 28, 2017

jreback closed this as completed Mar 29, 2017

lordgrenville mentioned this issue Nov 3, 2021

Detect CPORT header in SAS files #44300

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

read_sas() is not robust enough #15825

read_sas() is not robust enough #15825

hmeine commented Mar 28, 2017

jreback commented Mar 28, 2017

hmeine commented Mar 29, 2017

jreback commented Mar 29, 2017

TheKonanKouassi commented Nov 21, 2017

hmeine commented Nov 22, 2017

TheKonanKouassi commented Nov 22, 2017

sasutils commented Mar 3, 2018

akravetz commented Apr 30, 2019

hmeine commented May 1, 2019 •

edited

Loading

akravetz commented May 1, 2019 •

edited

Loading

hmeine commented Sep 3, 2019

bghill commented Aug 6, 2021

hmeine commented Nov 15, 2021

read_sas() is not robust enough #15825

read_sas() is not robust enough #15825

Comments

hmeine commented Mar 28, 2017

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

jreback commented Mar 28, 2017

hmeine commented Mar 29, 2017

jreback commented Mar 29, 2017

TheKonanKouassi commented Nov 21, 2017

hmeine commented Nov 22, 2017

TheKonanKouassi commented Nov 22, 2017

sasutils commented Mar 3, 2018

akravetz commented Apr 30, 2019

hmeine commented May 1, 2019 • edited Loading

akravetz commented May 1, 2019 • edited Loading

hmeine commented Sep 3, 2019

bghill commented Aug 6, 2021

hmeine commented Nov 15, 2021

Output of `pd.show_versions()`

hmeine commented May 1, 2019 •

edited

Loading

akravetz commented May 1, 2019 •

edited

Loading