
Reading SAS files from AWS S3 #13939


Closed

ankitdhingra opened this issue Aug 8, 2016 · 7 comments
Labels
Enhancement IO Network Local or Cloud (AWS, GCS, etc.) IO Issues IO SAS SAS: read_sas

Comments

@ankitdhingra

ankitdhingra commented Aug 8, 2016

Currently the read_sas method doesn't support reading SAS7BDAT files from AWS S3 the way read_csv does.
Can this be added?
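
For context, a minimal sketch of the requested behavior (the bucket and file names below are placeholders, not a working example):

import pandas as pd

# read_csv already accepts S3 URLs
df_csv = pd.read_csv("s3://my-bucket/data.csv")

# requested: the same for SAS7BDAT files
df_sas = pd.read_sas("s3://my-bucket/data.sas7bdat")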

@ankitdhingra ankitdhingra changed the title Reading SAS flles from S3 Reading SAS files from AWS S3 Aug 8, 2016
@jreback jreback added the IO SAS SAS: read_sas label Aug 8, 2016
@jreback jreback added this to the Next Major Release milestone Aug 8, 2016
@jreback
Contributor

jreback commented Aug 8, 2016

Actually, this should already work; have you tried it?

This goes through the same file-path handling used by read_csv.

@jreback jreback removed this from the Next Major Release milestone Aug 8, 2016
@ankitdhingra
Author

ankitdhingra commented Aug 8, 2016

I have tried it, but unfortunately it doesn't work. I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-009ad5a2cf9f> in <module>()
      2 pd.show_versions()
      3 
----> 4 pd.read_sas("s3:/bucket-url/temp.sas7bdat")

/home/anaconda2/lib/python2.7/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
     52         reader = SAS7BDATReader(filepath_or_buffer, index=index,
     53                                 encoding=encoding,
---> 54                                 chunksize=chunksize)
     55     else:
     56         raise ValueError('unknown SAS format')

/home/anaconda2/lib/python2.7/site-packages/pandas/io/sas/sas7bdat.py in __init__(self, path_or_buf, index, convert_dates, blank_missing, chunksize, encoding, convert_text, convert_header_text)
     94             self._path_or_buf = open(self._path_or_buf, 'rb')
     95 
---> 96         self._get_properties()
     97         self._parse_metadata()
     98 

/home/anaconda2/lib/python2.7/site-packages/pandas/io/sas/sas7bdat.py in _get_properties(self)
    100 
    101         # Check magic number
--> 102         self._path_or_buf.seek(0)
    103         self._cached_page = self._path_or_buf.read(288)
    104         if self._cached_page[0:len(const.magic)] != const.magic:

AttributeError: 'BotoFileLikeReader' object has no attribute 'seek'

The documentation for pandas.read_sas doesn't mention the ability to read from S3 either.

Output from pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.13-18.26.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@jreback
Contributor

jreback commented Aug 8, 2016

Hmm, this may require something more, then.

cc @kshedden
cc @TomAugspurger; xref #13137

@TomAugspurger
Contributor

Switching over to s3fs should handle this, since s3fs file objects do implement seek.
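
To illustrate, a minimal sketch (bucket and key are placeholders) of why s3fs avoids the AttributeError above: its file objects support seek and read, which the SAS7BDAT header check needs.

import s3fs

fs = s3fs.S3FileSystem(anon=False)
with fs.open("my-bucket/temp.sas7bdat", "rb") as f:  # placeholder bucket/key
    f.seek(0)             # works: s3fs file objects are seekable
    header = f.read(288)  # the bytes read by the SAS7BDAT magic-number check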

Although I think there's a second bug: read_sas only accepts file paths, not buffers.

In [31]: with open('test1.sas7bdat', 'rb') as f:
    ...:     pd.read_sas(f)
    ...:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-85bbca72eb65> in <module>()
      1 with open('test1.sas7bdat') as f:
----> 2     pd.read_sas(f)
      3

/Users/tom.augspurger/Envs/surveyer/lib/python3.5/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
     43             pass
     44
---> 45     if format.lower() == 'xport':
     46         from pandas.io.sas.sas_xport import XportReader
     47         reader = XportReader(filepath_or_buffer, index=index,

AttributeError: 'NoneType' object has no attribute 'lower'

will make an issue later.

Working around that, and reading from S3:

In [32]: import s3fs

In [33]: from pandas.io.sas.sas7bdat import SAS7BDATReader
In [34]: fs = s3fs.S3FileSystem(anon=False)

In [35]: f = fs.open('s3://<bucket>/test.sas7bdat')

In [36]: SAS7BDATReader(f).read()
Out[36]:
   Column1       Column2  Column3    Column4  Column5       Column6  Column7  \
0    0.636       b'pear'     84.0 1965-12-10    0.103      b'apple'     20.0
1    0.283        b'dog'     49.0 1977-03-07    0.398       b'pear'     50.0
2    0.452       b'pear'     35.0 1983-08-15    0.117       b'pear'     70.0
3    0.557        b'dog'     29.0 1974-06-28    0.640       b'pear'     34.0
4    0.138           NaN     55.0 1965-03-18    0.583  b'crocodile'     34.0
5    0.948        b'dog'     33.0 1984-07-15    0.691       b'pear'      NaN
6    0.162  b'crocodile'     17.0 1982-06-03    0.002       b'pear'     30.0
7    0.148  b'crocodile'     37.0 1964-10-06    0.411      b'apple'     23.0
8      NaN       b'pear'     15.0 1970-01-27    0.102       b'pear'      1.0
9    0.663       b'pear'      NaN 1981-03-06    0.086      b'apple'     80.0

@ankitdhingra
Author

@TomAugspurger Are you talking about this s3fs?
It is based on boto3, whereas pandas is based on boto2. Will that be fine?

@TomAugspurger
Contributor

@ankitdhingra yeah, sorry. See #11915 for context. We were having issues with boto2 and are probably going to depend on s3fs for S3-related stuff in the future. That'll make supporting things like this (and writing S3 files) easier. I just haven't had time to finish that pull request.

In the meantime, getting a file buffer object from s3fs (e.g. fs.open('s3://bucket/key')) and passing that to any of the read_* methods should work (once we fix that other bug I mentioned).
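
A minimal sketch of that pattern (bucket and keys are placeholders); passing format='sas7bdat' sidesteps the filename-based format detection that currently trips up buffers:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=False)

# any read_* function that accepts a file-like object works this way
with fs.open("s3://my-bucket/data.csv") as f:
    df_csv = pd.read_csv(f)

# for SAS, pass the format explicitly since it can't be inferred
# from a buffer's (nonexistent) file extension
with fs.open("s3://my-bucket/temp.sas7bdat") as f:
    df_sas = pd.read_sas(f, format="sas7bdat")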

@jbrockmendel jbrockmendel added the IO Network Local or Cloud (AWS, GCS, etc.) IO Issues label Dec 11, 2019
@mroeschke
Member

I think this fully works with fsspec support now, so closing.
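
For anyone landing here later, a minimal sketch of the current usage (assumes a recent pandas with fsspec and s3fs installed; bucket and key are placeholders):

import pandas as pd

# read_sas now accepts S3 URLs directly via fsspec; credentials are
# picked up from the usual AWS environment/configuration
df = pd.read_sas("s3://my-bucket/temp.sas7bdat")

# or open the file explicitly and pass the handle
import fsspec
with fsspec.open("s3://my-bucket/temp.sas7bdat", "rb") as f:
    df = pd.read_sas(f, format="sas7bdat")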
