Read SAS (sas7bdat) #12654
Comments
To improve performance, the path to follow might be the one indicated in #10517. |
cc @kshedden. This is mostly in Python ATM. You always write for correctness first, then profile, then if necessary use things like Cython in critical sections. Pull requests are welcome for an improved implementation. |
Thanks for the feedback. A few comments:
|
@kshedden any update on |
Is there any update on sas7bdat? I have a 5 GB SAS dataset to read into Python, but the performance is still a hassle. Thanks! |
@marks @xappppp @kshedden I have released a wrapper around the ReadStat C library called pyreadstat. Because most of the code is C, it is faster than pandas. It can also read sas7bcat files. It handles value labels and column labels. Missing value tags will come soon (@kshedden it would be nice if you could provide a sample file with tagged missing values). |
This looks useful. I agree that pandas.read_sas is slow, but I have a few
questions about your approach.
The links on your page to "module documentation" are all broken so I can't
comment on the API.
Does it support reading compressed SAS files? Last time I checked, haven
could not do that.
On your page, you say that pandas.read_sas does not support reading only
the header, but this is not true. If chunksize or iterator is passed, then
the class initialization will read the header and no data, then the data
can be retrieved if desired. All the SAS metadata is available in the
SAS7BDATReader class instance without reading any data.
Beyond reading the header, I don't know if your approach allows incremental
reading of data (which is essential for files that do not fit into
memory). pandas.read_sas supports reading the data in arbitrary chunk
sizes.
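A rough sketch of the header-only and chunked reading described above. Passing `chunksize` (or `iterator=True`) to `pandas.read_sas` returns a reader object instead of a DataFrame. The file path comes from the command line, the chunk size of 100,000 rows is arbitrary, and `summarize_chunks` is a hypothetical helper, not part of pandas. The `name` and `label` attributes on each entry of `reader.columns` are an assumption about the non-public SAS7BDATReader class and may differ between pandas versions.

```python
import sys


def summarize_chunks(chunks):
    """Count chunks and rows from any iterable of DataFrame-like chunks.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    n_chunks = 0
    n_rows = 0
    for chunk in chunks:
        n_chunks += 1
        n_rows += len(chunk)
    return n_chunks, n_rows


if __name__ == "__main__" and len(sys.argv) > 1:
    import pandas as pd

    # With chunksize set, read_sas returns a reader; per the comment above,
    # constructing it reads the header but no data rows.
    reader = pd.read_sas(sys.argv[1], format="sas7bdat", chunksize=100_000)

    # Metadata is available before any data is read. Assumption: each
    # entry of reader.columns exposes .name and .label.
    for col in reader.columns:
        print(getattr(col, "name", None), getattr(col, "label", None))

    # Iterating the reader yields one DataFrame per chunk.
    print(summarize_chunks(reader))
    reader.close()
```

Because the helper only needs `len()` on each chunk, it can be tried out on plain lists before pointing it at a large file.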
On your page, you say that pandas.read_sas does not extract variable
labels, but I don't think this is true. The labels should be available
through the 'columns' attribute of the SAS7BDATReader class instance.
Encoding is very tricky. I am not aware of any way to take a raw byte
array and always guess correctly the encoding as you claim. In a SAS7BDAT
file, the correct encoding is stored as a 1 byte code at offset 70, so it
shouldn't be necessary to guess the encoding in any case. However the
mapping from the 1-byte SAS encoding flag to actual encodings is not
published as far as I know. I have been able to reverse engineer a few of
the common ones.
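To make the offset concrete, here is a minimal sketch of reading that 1-byte flag directly. The function name and the two-entry lookup table are hypothetical. The two codes shown are my understanding of values that have been reverse engineered and should be treated as assumptions, since the full mapping is not published.

```python
ENCODING_OFFSET = 70  # position of the 1-byte encoding flag, per the comment above

# Assumption: a tiny, partial table for illustration only. Unknown flags
# map to None rather than to a guess.
KNOWN_SAS_ENCODINGS = {20: "utf-8", 29: "latin1"}


def read_encoding_flag(path):
    """Return (flag, encoding_name_or_None) for a SAS7BDAT file."""
    with open(path, "rb") as f:
        f.seek(ENCODING_OFFSET)
        raw = f.read(1)
    if len(raw) != 1:
        raise ValueError("file is too short to contain an encoding flag")
    flag = raw[0]
    return flag, KNOWN_SAS_ENCODINGS.get(flag)
```

Returning the raw flag alongside the name keeps unrecognized values visible, which helps when extending the table by trial and error.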
…On Wed, Aug 22, 2018 at 6:29 AM Otto Fajardo ***@***.***> wrote:
@marks @xappppp @kshedden I have released a wrapper around
the ReadStat C library called pyreadstat. Because most of the code is C is
faster than pandas. It can also read sas7bcat files. It handles value
labels and column labels. Missing values tags will come soon.
https://github.com/Roche/pyreadstat
|
@kshedden Thanks a lot for your useful comments.
What is true is that the main motivation was speed as we need to do some heavy lifting. |
|
AFAIK the encoding flag at byte offset 70 in a SAS7BDAT file uses a secret
mapping between byte values and encodings. It would be a massive reverse
engineering effort to recover the mapping entirely. I just discovered a
few common ones by trial and error. I'm not even sure that we know for
certain that this byte value actually represents the encoding. Like
everything else in the SAS format, it has been inferred empirically.
I haven't worked on this code in a while. It is far too slow for my needs
and I was not successful at making it faster.
It would be a good idea for someone to expand on the documentation. Since
the SAS7BDATReader class itself is not part of the pandas public API, the
documentation for it does not show up in the official pandas docs. But
someone could add a "Notes" section in the docstring for read_sas.
…On Wed, Aug 22, 2018 at 12:08 PM Otto Fajardo ***@***.***> wrote:
@kshedden
- API documentation fixed (quick fix, have to work on that a bit more
later).
- You are right regarding value labels and reading the header only for
pandas. However, as far as I can see those features are not documented (
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sas.html
and
https://github.com/pandas-dev/pandas/blob/master/pandas/io/sas/sas7bdat.py)
or at least I was not able to find them. In my opinion it would be nice to
advertise them more clearly. In any case I will remove those statements
from my documentation.
- Regarding the encodings, if you can read the encoding from the file,
why not use it instead of returning bytes? I found it a bit annoying to
have to specify it every time. Of course it could fail from time to time,
but then you can fall back to returning bytes. Well, just an
opinion/suggestion.
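The fallback suggested in the last point could look roughly like this. It is only a sketch: `decode_or_bytes` is a hypothetical helper, not something pandas or pyreadstat provides, and it assumes the encoding name has already been derived from the file (or is `None` when unknown).

```python
def decode_or_bytes(raw, encoding):
    """Decode raw bytes with the given encoding, or return them unchanged.

    Falls back to the original bytes when the encoding is unknown (None),
    is not a codec Python recognizes, or cannot decode the data.
    """
    if encoding is None:
        return raw
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return raw
```

One caveat: single-byte encodings such as latin1 can decode any byte sequence, so a wrong flag there produces garbled text rather than triggering the fallback.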
|
Hi, I don't know anything about going from SAS to pandas. This is a new term for me and I want to know more about it, because I am now a student of SAS at CETPA, where I am learning SAS from basic to advanced. |
Very excited to have this new feature in pandas. I have a few comments to share:
- pd.read_sas() doesn't read SAS date variables correctly (this is noted in the doc). Dates are read as numpy.float64. Note that in SAS, dates are recorded as numbers relative to 1960-1-1. It would be helpful to allow some sort of argument to parse date variables correctly.
- [...] .B or .R. I wonder how these cases are treated?
- [...] read_csv(). To read a 700 MB SAS dataset, the time is [...] The time for the same CSV file (I converted the same file to CSV using SAS) is [...]
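For the date point above, a minimal sketch of the conversion, assuming the float column holds SAS date values (days counted from 1960-01-01) and that missing values arrive as NaN. The function name is hypothetical, and SAS datetime values (which count seconds) are not handled here.

```python
import math
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)


def sas_date_to_date(value):
    """Convert a SAS date value (days since 1960-01-01) to datetime.date.

    Returns None for NaN, which is assumed to represent a missing value.
    """
    if math.isnan(value):
        return None
    return SAS_EPOCH + timedelta(days=int(value))
```

For a whole column, something like `pd.to_datetime(series, unit="D", origin="1960-01-01")` should give the same result in a vectorized way.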