Read SAS (sas7bdat) #12654

ywhcuhk · 2016-03-17T04:39:45Z

Very excited to have this new feature in Pandas. I have a few comments to share:

pd.read_sas() doesn't read SAS date variables correctly (this is noted in the docs). Dates are read as numpy.float64. Note that in SAS, dates are recorded as numbers relative to 1960-01-01. It would be helpful to have some argument for parsing date variables correctly.
Moreover, SAS has some special missing values such as .B or .R. How are these cases treated?
It is not nearly as fast as read_csv(). To read a 700 MB SAS file, the time is

CPU times: user 1min 47s, sys: 955 ms, total: 1min 48s
Wall time: 1min 48s

The time for the same data as CSV (I converted the same file to CSV using SAS) is

CPU times: user 3.93 s, sys: 343 ms, total: 4.28 s
Wall time: 4.29 s
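
On the date point above, a minimal sketch of the conversion a caller can apply by hand to the float columns. It assumes the usual SAS convention that date values count days and datetime values count seconds from 1960-01-01; the function names are hypothetical and are not part of pandas.

```python
import math
from datetime import datetime, timedelta

# SAS counts dates and datetimes from this epoch.
SAS_EPOCH = datetime(1960, 1, 1)


def sas_date_to_date(days):
    """Convert a SAS date value (days since 1960-01-01) to a datetime.date.

    Missing values (None or NaN) are returned as None.
    """
    if days is None or math.isnan(days):
        return None
    return (SAS_EPOCH + timedelta(days=days)).date()


def sas_datetime_to_datetime(seconds):
    """Convert a SAS datetime value (seconds since 1960-01-01 00:00:00).

    Missing values (None or NaN) are returned as None.
    """
    if seconds is None or math.isnan(seconds):
        return None
    return SAS_EPOCH + timedelta(seconds=seconds)
```

For a whole column already loaded as floats, pd.to_datetime(values, unit="D", origin="1960-01-01") should give the same result for date values in recent pandas versions.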


benjello · 2016-03-17T09:14:26Z

To improve performance, the path to follow might be the one indicated in #10517.

jreback · 2016-03-17T12:02:06Z

cc @kshedden

This is mostly in Python at the moment. You always write for correctness first, then profile, then if necessary use things like Cython in critical sections. Pull requests are welcome for an improved implementation.

kshedden · 2016-03-17T12:32:30Z

Thanks for the feedback. A few comments:

Performance is a major issue. The pandas version is currently not usable for my own use case (files with billions of rows). FWIW I have a golang version that is about 20x faster (https://github.com/kshedden/datareader) which I use for big files. Most of the slowness in pandas.read_sas is in process_byte_array_with_data (already in cython but probably not optimally set up). I have a local version with improvements that is about 30% faster, but I was looking for a much bigger improvement.
The other open-source SAS7BDAT readers that I know of, including the one from wizard, don't support compression properly or at all, e.g. WizardMac/ReadStat#21 ("Support binary compression in sas7bdat"). I put a lot of time into getting the compression support to work.
Date support needs some work, but this should be relatively easy to fix. There are several different date formats in SAS. We detect and autoconvert some but not all of them.
I don't have documentation for missing value codes. If you can point me to it (i.e. which float values correspond to which codes), I should be able to add it.
I just noticed that wizard supports sas7bcat (categorical data format codes). I will look into porting this over at some point.

marks · 2016-09-04T21:19:16Z

@kshedden any update on sas7bcat support by any chance? Thanks!!

xappppp · 2018-06-16T03:19:08Z

Is there any update on sas7bdat? I have a 5 GB SAS dataset to read into Python, but the performance is still a hassle. Thanks!

ofajardo · 2018-08-22T10:29:13Z

@marks @xappppp @kshedden I have released a wrapper around the ReadStat C library called pyreadstat. Because most of the code is C, it is faster than pandas. It can also read sas7bcat files. It handles value labels and column labels. Missing value tags will come soon (@kshedden it would be nice if you could provide a sample file with tagged missing values).
https://github.com/Roche/pyreadstat

kshedden · 2018-08-22T13:48:44Z

This looks useful. I agree that pandas.read_sas is slow, but I have a few questions about your approach.
The links on your page to "module documentation" are all broken so I can't comment on the API.
Does it support reading compressed SAS files? Last time I checked, haven could not do that.
On your page, you say that Pandas.read_sas does not support reading only the header, but this is not true. If chunksize or iterator is passed, then the class initialization will read the header and no data, then the data can be retrieved if desired. All the SAS metadata is available in the SAS7BDATReader class instance without reading any data.
Beyond reading the header, I don't know if your approach allows incremental reading of data (which is essential for files that do not fit into memory). Pandas.read_sas supports reading the data in arbitrary chunk sizes.
On your page, you say that Pandas.read_sas does not extract variable labels, but I don't think this is true. The labels should be available through the 'columns' attribute of the SAS7BDATReader class instance.
Encoding is very tricky. I am not aware of any way to take a raw byte array and always guess correctly the encoding as you claim. In a SAS7BDAT file, the correct encoding is stored as a 1 byte code at offset 70, so it shouldn't be necessary to guess the encoding in any case. However the mapping from the 1-byte SAS encoding flag to actual encodings is not published as far as I know. I have been able to reverse engineer a few of the common ones.
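
The incremental reading described above might be used roughly as follows. This is a sketch, not the reader's documented contract: the helper names and the path argument are hypothetical, and it assumes the object returned by pd.read_sas(..., chunksize=...) can be iterated chunk by chunk and released with close().

```python
def summarize_chunks(chunks):
    """Consume an iterable of chunks one at a time.

    Only one chunk is held in memory at once. Returns (n_chunks, n_rows).
    """
    n_chunks = 0
    n_rows = 0
    for chunk in chunks:
        n_chunks += 1
        n_rows += len(chunk)
    return n_chunks, n_rows


def summarize_sas_file(path, chunksize=100_000):
    """Count chunks and rows in a SAS7BDAT file without loading it whole."""
    # Imported here so summarize_chunks stays usable without pandas.
    import pandas as pd

    # Passing chunksize returns a reader object instead of a DataFrame;
    # per the comment above, only the header is read at this point.
    reader = pd.read_sas(path, format="sas7bdat", chunksize=chunksize)
    try:
        return summarize_chunks(reader)
    finally:
        reader.close()
```

Calling summarize_sas_file("data.sas7bdat") (a placeholder file name) would walk the file in pieces of at most 100,000 rows.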


ofajardo · 2018-08-22T15:01:20Z

@kshedden Thanks a lot for your useful comments.

I'll repair the documentation page, thanks for reporting that.
Since I am wrapping ReadStat, and ReadStat currently has no support for compressed SAS files or incremental reading of data, I cannot do those. I hope that is fixed in the future.
I'll check about pandas reading value labels and reading the header only. Probably I was not careful enough when reading the documentation. I will remove those statements if, as I guess, you are right.
True, ReadStat cannot guess encodings 100% of the time, but in my hands it got it right every time except for one file (given your comment that the encoding is stored in the file itself, ReadStat is probably not guessing but just using that information). For pandas, however, I had to type in the encoding every time. But maybe I overlooked something again? I will remove that statement too if it is wrong.

What is true is that the main motivation was speed as we need to do some heavy lifting.

ofajardo · 2018-08-22T16:08:18Z

@kshedden

API documentation fixed (quick fix, have to work on that a bit more later).
Known limitations section added to the documentation, listing what it cannot do at the moment.
You are right regarding value labels and reading the header only for pandas. However, as far as I can see those features are not documented (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sas.html and https://github.com/pandas-dev/pandas/blob/master/pandas/io/sas/sas7bdat.py) or at least I was not able to find them. In my opinion it would be nice to advertise them more clearly. In any case I will remove those statements from my documentation.
Regarding the encodings, if you can read the encoding from the file, why not use it instead of returning bytes? I found it a bit annoying to have to specify it every time. Of course it could fail from time to time, but then you can fall back to returning bytes. Well, just an opinion/suggestion.

kshedden · 2018-08-23T13:19:20Z

AFAIK the encoding flag at byte offset 70 in a SAS7BDAT file uses a secret mapping between byte values and encodings. It would be a massive reverse engineering effort to recover the mapping entirely. I just discovered a few common ones by trial and error. I'm not even sure that we know for certain that this byte value actually represents the encoding. Like everything else in the SAS format, it has been inferred empirically.
I haven't worked on this code in a while. It is far too slow for my needs and I was not successful at making it faster.
It would be a good idea for someone to expand on the documentation. Since the SAS7BDATReader class itself is not part of the pandas public API, the documentation for it does not show up in the official pandas docs. But someone could add a "Notes" section in the docstring for read_sas.
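
To make the encoding discussion concrete, here is a rough sketch of "use the flag when it is recognized, otherwise fall back to bytes". Only the offset of 70 comes from this thread. The function names are hypothetical, and the two entries in the mapping are illustrative placeholders, not a verified table of SAS encoding codes.

```python
ENCODING_OFFSET = 70  # byte offset of the encoding flag, as stated in the thread

# ASSUMPTION: placeholder entries for illustration only. The real mapping is
# unpublished and has only been partially recovered by trial and error.
EXAMPLE_CODES = {20: "utf-8", 29: "latin-1"}


def read_encoding_code(header):
    """Return the 1-byte encoding flag from the raw file header bytes."""
    if len(header) <= ENCODING_OFFSET:
        raise ValueError("header is too short to contain the encoding flag")
    return header[ENCODING_OFFSET]


def decode_or_bytes(raw, code, known_codes=EXAMPLE_CODES):
    """Decode raw using the flagged encoding, or return raw bytes unchanged.

    The bytes are returned as-is when the code is not in known_codes or
    when decoding with the mapped encoding fails.
    """
    name = known_codes.get(code)
    if name is None:
        return raw
    try:
        return raw.decode(name)
    except UnicodeDecodeError:
        return raw
```

With this shape, an unrecognized or wrong flag degrades to the current behavior (bytes) instead of raising.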


seemasinghh · 2019-08-30T07:25:18Z

Hi, I don't know anything about reading SAS data into pandas. This is new to me and I would like to learn more about it, because I am currently a SAS student at CETPA, where I am learning SAS from basic to advanced.

jreback added the Performance (memory or execution speed performance) and IO SAS (SAS: read_sas) labels Mar 17, 2016

jreback added this to the Next Major Release milestone Mar 17, 2016

jreback added Difficulty Intermediate labels Mar 17, 2016

jreback modified the milestones: 0.18.1, Next Major Release Mar 17, 2016

kshedden mentioned this issue Mar 19, 2016

Modest performance, address #12647 #12656

Closed

jreback closed this as completed in 33683cc Apr 22, 2016
