read_sas() is not robust enough #15825
Comments
@hmeine well, we need sample files. You can push them up on a gist / repo if you want.
The problem is that I cannot redistribute the files, and I don't know how to "strip them down" to make them redistributable.
@hmeine need a specific example that doesn't work.
Hello @hmeine, I'm facing the same issue, and I don't know how to fix it. I'd be happy if you could help me. My table has around 600,000 rows. Thanks.
I am now using https://pypi.python.org/pypi/sas7bdat which works much better.
I'll try it and let you know. Thanks.
It is pretty trivial to tell the difference between an XPORT file and a CPORT file. Just look at the first few characters. A version 5 XPORT file will start with a header line like: A CPORT file will start with a header line like:
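A minimal sketch of that check in Python. The header strings did not survive in this thread, so the two prefixes below are assumptions based on commonly reported headers: a version 5 XPORT file starting with a LIBRARY HEADER RECORD line, and a compressed CPORT file starting with repeated `**COMPRESSED**` markers. The function name `sniff_transport_format` is hypothetical and not part of pandas.

```python
# Assumed header prefixes (not quoted in this thread): a version 5 XPORT file
# begins with a LIBRARY HEADER RECORD line, while a compressed CPORT file
# begins with repeated "**COMPRESSED**" markers.
XPORT_PREFIX = b"HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!"
CPORT_MARKER = b"**COMPRESSED**"


def sniff_transport_format(path):
    """Return 'xport', 'cport', or 'unknown' based on the first 80 bytes."""
    with open(path, "rb") as fh:
        head = fh.read(80)  # transport files use 80-byte header records
    if head.startswith(XPORT_PREFIX):
        return "xport"
    if CPORT_MARKER in head:
        return "cport"
    return "unknown"
```

A CPORT file written with the NOCOMPRESS option would presumably not carry the marker, so this sketch would report it as "unknown".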
Example CPORT file. It would be great if pandas could load these files. Many US government data files are provided in this format. This one, for example, comes from CMS and represents risk adjustment coefficients.
@akravetz Did you try sas7bdat? (See my comment above)
@hmeine sas7bdat does not support CPORT files either, unfortunately.
@akravetz can you clarify how you came to that conclusion? (“citation needed” ;-) ) I am very low on time ATM, and I would need some motivation to look into it again. It contradicts my statements above, and I know I put quite some effort into researching this issue. So I guess your statement is not true, at least not in its generality. It would be helpful to know if you found that statement somewhere (and where, of course) or if you concluded it from some experiment (and which).
@hmeine I just downloaded sas7bdat and confirmed it cannot parse CPORT files. It fails with:
Like several others, I have just lost a day digging into this same situation. I was not able to find any code out there that reads CPORT files (I went through the source code of them all, including that of Pandas). Further digging shows that while SAS publishes the details needed to parse XPORT files, they do not do so for CPORT files. This post from the Library of Congress also states that CPORT is a proprietary format (as do other old blogs from folks quoting the FDA). After digging through a CPORT file with a hex editor, I don't think it would be terribly difficult to reverse engineer the format. I don't currently have the time. However, I agree it would be a nice addition for libraries to detect CPORT files and give a better warning. The first 80 characters in the file contain the text (unless the option NOCOMPRESS was used): It should be straightforward to write code to detect these. Ideally, someone who has access to SAS just needs to create a minimal file and post it for the authors to test against. Sadly, I don't have a copy.
Ok, so I don't know much about SAS or its file formats; maybe the quote from the OAI website that I gave above misled me into thinking that my files were CPORT files. They called them "transport files … created using SAS CPORT" and later "transport files that are created using PROC CPORT are not interchangeable with transport files that are created using the XPORT engine", which sounded as if these were the only two variants.
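import sas7bdat

# Read the file with the sas7bdat package instead of pandas.read_sas()
f = sas7bdat.SAS7BDAT('ClinicalData/allclinical00.sas7bdat')
raw = f.to_data_frame()
# Cast the ID column to int before using it as the index
raw['ID'] = raw.ID.astype(int)
df = raw.set_index('ID')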
Code Sample, a copy-pastable example if possible
Problem description
I downloaded SAS datasets from the Osteoarthritis Initiative (OAI) and tried loading them with Pandas. I do not have SAS myself, and I don't have prior experience with it. The OAI data offers different formats for download, as you can see from the above filenames.
Expected Output
I would expect large datasets (roughly 4800 rows with several hundred columns), and (as far as I understood) the same data from the .sas7bdat and .xpt files. In fact, out of 12 .sas7bdat files, I can open 3; all others fail with:
Note the (improperly formatted) warning at the top.
self.column_types turns out to be an empty list ([]). For df2, I also tried read_sas(), but got:
Probably, this is related to the following part of the OAI documentation:
I think if I can load the .sas7bdat files, it would be OK if Pandas cannot read CPORT files. It could be helpful, though, if it would recognize them and be more specific ("This seems to be a CPORT file. CPORT files are currently not supported, only XPORT and SAS7BDAT." or so).
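The more specific error suggested here could be sketched roughly as follows. This is an illustration, not pandas code: `ensure_not_cport` is a hypothetical helper, and the `**COMPRESSED**` marker it looks for is an assumption about how compressed CPORT files begin.

```python
def ensure_not_cport(path):
    """Raise ValueError with a specific message if path looks like a CPORT file.

    Assumption: compressed CPORT files carry "**COMPRESSED**" markers in
    their first 80 bytes. Files written with NOCOMPRESS would slip through.
    """
    with open(path, "rb") as fh:
        head = fh.read(80)
    if b"**COMPRESSED**" in head:
        raise ValueError(
            "This seems to be a CPORT file. CPORT files are currently not "
            "supported, only XPORT and SAS7BDAT."
        )
```

Calling `ensure_not_cport(path)` before `pd.read_sas(path)` would turn an obscure parsing failure into a message that names the actual problem.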
Output of pd.show_versions()
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 34.3.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.3
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.3.1
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: 2.4.5
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.5.3
html5lib: 1.0b10
httplib2: None
apiclient: None
sqlalchemy: 1.1.6
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None