Skip to content

read_sas() is not robust enough #15825

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
hmeine opened this issue Mar 28, 2017 · 13 comments
Closed

read_sas() is not robust enough #15825

hmeine opened this issue Mar 28, 2017 · 13 comments
Labels
IO SAS SAS: read_sas Needs Info Clarification about behavior needed to assess issue

Comments

@hmeine
Copy link

hmeine commented Mar 28, 2017

Code Sample, a copy-pastable example if possible

# requires registration-only downloads from https://oai.epi-ucsf.org/datarelease/DataClinical.asp
df1 = pandas.read_sas('allclinical00.sas7bdat')
df2 = pandas.read_sas('AllClinical00.xpt')

Problem description

I downloaded SAS datasets from the Osteoarthritis Initiative (OAI) and tried loading them with Pandas. I do not have SAS myself, and I don't have prior experience with it. The OAI data offers different formats for download, as you can see from the above filenames.

Expected Output

I would expect large datasets (roughly 4800 rows with several hundreds columns), and (as far as I understood) the same data from the .sas7bdat and .xpt files. In fact, out of 12 .sas7bdat files, I can open 3, all others fail with:

('Warning: column count mismatch (%d + %d != %d)\n', 143, 191, 1226)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-9a49a9b73cd0> in <module>()
----> 1 df = pandas.read_sas('allclinical00.sas7bdat')

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/sas/sasreader.pyc in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
     59         return reader
     60
---> 61     data = reader.read()
     62     reader.close()
     63     return data

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/sas/sas7bdat.pyc in read(self, nrows)
    602             nrows = m
    603
--> 604         nd = (self.column_types == b'd').sum()
    605         ns = (self.column_types == b's').sum()
    606

AttributeError: 'bool' object has no attribute 'sum'

Note the (improperly formatted) warning at the top. self.column_types turns out to be an empty list ([]).

For df2, I also tried read_sas(), but got

ValueError: Header record is not an XPORT file.

Probably, this is related to the following part of the OAI documentation:

Using SAS Transport Files
The SAS dataset(s) in this zip file were created using SAS CPORT and the SAS V9 engine in the Windows environment. We strongly recommend that you use SAS V9 or higher to access the OAI data. …
PROC CPORT creates files in transport format, which uses an environment-independent standard for character encoding and numeric representation. Transport files that are created by PROC CPORT can be transferred across operating environments and read using PROC CIMPORT.
Note: SAS transport files that are created using PROC CPORT are not interchangeable with transport files that are created using the XPORT engine.

I think if I can load the .sas7bdat files, it would be OK if Pandas cannot read CPORT files. It could be
helpful, though, if it would recognize them and be more specific ("This seems to be a CPORT file. CPORT files are currently not supported, only XPORT and SAS7BDAT." or so).

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Darwin OS-release: 16.4.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: de_DE.UTF-8 LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 34.3.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.3
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.3.1
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: 2.4.5
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.5.3
html5lib: 1.0b10
httplib2: None
apiclient: None
sqlalchemy: 1.1.6
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Mar 28, 2017

@hmeine well we need sample files. you can push them up on a gist / repo if you want.

@chris-b1 chris-b1 added IO SAS SAS: read_sas Needs Info Clarification about behavior needed to assess issue labels Mar 28, 2017
@hmeine
Copy link
Author

hmeine commented Mar 29, 2017

The problem is that I cannot redistribute the files, and I don't know how to "strip them down" to make them redistributable.

@jreback jreback closed this as completed Mar 29, 2017
@jreback
Copy link
Contributor

jreback commented Mar 29, 2017

@hmeine need a specific example that doesn't work.

@TheKonanKouassi
Copy link

Hello @hmeine, I'm facing the same issue; And I don't know how to fix it. I'll be happy, if you can help me. My table has around 600,000 rows. Thanks.

@hmeine
Copy link
Author

hmeine commented Nov 22, 2017

I am now using https://pypi.python.org/pypi/sas7bdat which works much better.

@TheKonanKouassi
Copy link

I'll try it and let you know. Thanks

@sasutils
Copy link

sasutils commented Mar 3, 2018

It is pretty trivial to tell the different between a XPORT file and CPORT file. Just look at the first few characters.

A version 5 XPORT file will start with a header line like:
HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!000000000000000000000000000000

A CPORT file will start with a header line like:
**COMPRESSED** **COMPRESSED** **COMPRESSED** **COMPRESSED** **COMPRESSED********

@akravetz
Copy link

Example CPORT file. It would be great if pandas could load these files. Many US government data files are provided in this format. This one, for example, comes from CMS and represents risk adjustment coefficients.

C2318P1Q.zip

@hmeine
Copy link
Author

hmeine commented May 1, 2019

@akravetz Did you try sas7bdat? (See my comment above)
Maybe read support does not have to be in pandas itself. However, as I wrote above, it would be nice if pandas could give a more helpful message like "this is a CPORT file for which you may try installing the sas7bdat package". It could also try importing sas7bdat and use it if available.

@akravetz
Copy link

akravetz commented May 1, 2019

@hmeine sas7bdat does not support CPORT files either unfortunately.

@hmeine
Copy link
Author

hmeine commented Sep 3, 2019

@akravetz can you clarify how you came to that conclusion? (“citation needed” ;-) )

I am very low on time ATM, and I would need some motivation to look into it again. It contradicts my statements above, and I know I spent quite some effort into researching this issue. So I guess your statement is not true, at least not in its generality.

It would be helpful to know if you found that statement somewhere (and where of course) or if you concluded it from some experiment (and which).

@bghill
Copy link

bghill commented Aug 6, 2021

@hmeine I just downloaded sas7bdat and confirmed it cannot parse CPORT files. It fails with:

[AllClinical00.xpt] magic number mismatch
[AllClinical00.xpt] Failed to read a meta data page from file
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/brandong.hill/.pyenv/versions/test_sas/lib/python3.9/site-packages/sas7bdat.py", line 512, in readlines
    self._file.seek(self.properties.header_length)
TypeError: 'NoneType' object cannot be interpreted as an integer

Like several others, I have just lost a day digging into this same situation. I was not able to find any code out there that reads CPORT files (I went through the source code of them all including that of Pandas). Further digging shows that while SAS publishes the details needed to parse XPORT files, they do not do so for CPORT files. This post from the Library of Congress also states that CPORT is a proprietary format (as do other old blogs from folks quoting the FDA).

After digging through a CPORT with a hex editor, I don't think it would be terribly difficult to reverse engineer the format. I don't currently have the time. However, I agree it would be a nice addition for libraries to detect CPORT files and give a better warning. The first 80 in the file contain the text (unless the option NOCOMPRESS was used):
COMPRESSED COMPRESSED COMPRESSED COMPRESSED COMPRESSED******
Following this, the is a string of bytes that list the OS and SAS versions used to create the file:
LIB CONTROL X64_SR12¼^F SAS9.4

These should be straight-forward to write code to detect. Ideally, someone who has access to SAS just needs to create a minimal file and post it for the authors to tests against. Sadly, I don't have a copy.

@hmeine
Copy link
Author

hmeine commented Nov 15, 2021

Ok, so I don't know much about SAS or its file formats; maybe the quote from the OAI website that I did above misled me to think that my files were CPORT files. They called them "transport files … created using SAS CPORT" and later "transport files that are created using PROC CPORT are not interchangeable with transport files that are created using the XPORT engine" which sounded as if these were the two variants there are.
Facts that I am sure about:

  • I have these OAI files for which the official documentation states that they were produced with "PROC CPORT".
  • pandas.read_sas cannot read them, although pandas version 1.2.4 which I just checked does produce some output, besides a warning ("column count mismatch (143 + 191 != 1226)"), but it does not look right.
  • I only could (and still can) read these with sas7bdat (I'll put the code below, to be precise about what I'm doing).
  • The file does not start with COMPRESSED...
f = sas7bdat.SAS7BDAT('ClinicalData/allclinical00.sas7bdat')
raw = f.to_data_frame()
raw['ID'] = raw.ID.astype(int)
df = raw.set_index('ID')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SAS SAS: read_sas Needs Info Clarification about behavior needed to assess issue
Projects
None yet
Development

No branches or pull requests

7 participants