
Reading SAS files from AWS S3 #13939


Closed

ankitdhingra opened this issue Aug 8, 2016 · 7 comments
Labels
Enhancement IO Network Local or Cloud (AWS, GCS, etc.) IO Issues IO SAS SAS: read_sas

Comments

@ankitdhingra

ankitdhingra commented Aug 8, 2016

Currently the read_sas method doesn't support reading SAS7BDAT files from AWS S3 the way read_csv does.
Can this be added?
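
For context, a minimal sketch of the requested behavior (the bucket and file names below are placeholders, not a working example):

import pandas as pd

# read_csv already accepts S3 URLs
df_csv = pd.read_csv("s3://my-bucket/data.csv")

# requested: the same for SAS7BDAT files
df_sas = pd.read_sas("s3://my-bucket/data.sas7bdat")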

@ankitdhingra ankitdhingra changed the title Reading SAS flles from S3 Reading SAS files from AWS S3 Aug 8, 2016
@jreback jreback added the IO SAS SAS: read_sas label Aug 8, 2016
@jreback jreback added this to the Next Major Release milestone Aug 8, 2016
@jreback
Contributor

jreback commented Aug 8, 2016

Actually, this should already work; have you tried it?

This goes through the same file-path handling used by read_csv.

@jreback jreback removed this from the Next Major Release milestone Aug 8, 2016
@ankitdhingra
Author

ankitdhingra commented Aug 8, 2016

I have tried it, but unfortunately it doesn't work. I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-009ad5a2cf9f> in <module>()
      2 pd.show_versions()
      3 
----> 4 pd.read_sas("s3:/bucket-url/temp.sas7bdat")

/home/anaconda2/lib/python2.7/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
     52         reader = SAS7BDATReader(filepath_or_buffer, index=index,
     53                                 encoding=encoding,
---> 54                                 chunksize=chunksize)
     55     else:
     56         raise ValueError('unknown SAS format')

/home/anaconda2/lib/python2.7/site-packages/pandas/io/sas/sas7bdat.py in __init__(self, path_or_buf, index, convert_dates, blank_missing, chunksize, encoding, convert_text, convert_header_text)
     94             self._path_or_buf = open(self._path_or_buf, 'rb')
     95 
---> 96         self._get_properties()
     97         self._parse_metadata()
     98 

/home/anaconda2/lib/python2.7/site-packages/pandas/io/sas/sas7bdat.py in _get_properties(self)
    100 
    101         # Check magic number
--> 102         self._path_or_buf.seek(0)
    103         self._cached_page = self._path_or_buf.read(288)
    104         if self._cached_page[0:len(const.magic)] != const.magic:

AttributeError: 'BotoFileLikeReader' object has no attribute 'seek'

The documentation for pandas.read_sas doesn't mention the ability to read from S3 either.

Output from pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.13-18.26.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@jreback
Contributor

jreback commented Aug 8, 2016

Hmm, this may require something more, then.

cc @kshedden
cc @TomAugspurger; xref #13137

@TomAugspurger
Contributor

Switching over to s3fs should handle this, since s3fs file objects do implement seek.
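
To illustrate, a minimal sketch (bucket and key are placeholders) of why s3fs avoids the AttributeError above: its file objects support seek and read, which the SAS7BDAT header check needs.

import s3fs

fs = s3fs.S3FileSystem(anon=False)
with fs.open("my-bucket/temp.sas7bdat", "rb") as f:  # placeholder bucket/key
    f.seek(0)             # works: s3fs file objects are seekable
    header = f.read(288)  # the bytes read by the SAS7BDAT magic-number check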

Although I think there's a second bug: read_sas only accepts file paths, not buffers.

In [31]: with open('test1.sas7bdat', 'rb') as f:
    ...:     pd.read_sas(f)
    ...:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-85bbca72eb65> in <module>()
      1 with open('test1.sas7bdat') as f:
----> 2     pd.read_sas(f)
      3

/Users/tom.augspurger/Envs/surveyer/lib/python3.5/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
     43             pass
     44
---> 45     if format.lower() == 'xport':
     46         from pandas.io.sas.sas_xport import XportReader
     47         reader = XportReader(filepath_or_buffer, index=index,

AttributeError: 'NoneType' object has no attribute 'lower'

will make an issue later.

Working around that, and reading from S3:

In [32]: import s3fs

In [33]: from pandas.io.sas.sas7bdat import SAS7BDATReader
In [34]: fs = s3fs.S3FileSystem(anon=False)

In [35]: f = fs.open('s3://<bucket>/test.sas7bdat')

In [36]: SAS7BDATReader(f).read()
Out[36]:
   Column1       Column2  Column3    Column4  Column5       Column6  Column7  \
0    0.636       b'pear'     84.0 1965-12-10    0.103      b'apple'     20.0
1    0.283        b'dog'     49.0 1977-03-07    0.398       b'pear'     50.0
2    0.452       b'pear'     35.0 1983-08-15    0.117       b'pear'     70.0
3    0.557        b'dog'     29.0 1974-06-28    0.640       b'pear'     34.0
4    0.138           NaN     55.0 1965-03-18    0.583  b'crocodile'     34.0
5    0.948        b'dog'     33.0 1984-07-15    0.691       b'pear'      NaN
6    0.162  b'crocodile'     17.0 1982-06-03    0.002       b'pear'     30.0
7    0.148  b'crocodile'     37.0 1964-10-06    0.411      b'apple'     23.0
8      NaN       b'pear'     15.0 1970-01-27    0.102       b'pear'      1.0
9    0.663       b'pear'      NaN 1981-03-06    0.086      b'apple'     80.0

@ankitdhingra
Author

@TomAugspurger Are you talking about this s3fs?
It is based on boto3, whereas pandas is based on boto2. Will that be fine?

@TomAugspurger
Contributor

@ankitdhingra yeah, sorry. See #11915 for context. We were having issues with boto2 and are probably going to depend on s3fs for S3-related stuff in the future. That'll make supporting things like this (and writing S3 files) easier. I just haven't had time to finish that pull request.

In the meantime, getting a file buffer object from s3fs (e.g. fs.open('s3://bucket/key')) and passing that to any of the read_* methods should work (once we fix that other bug I mentioned).
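
A minimal sketch of that pattern (bucket and keys are placeholders); passing format='sas7bdat' sidesteps the filename-based format detection that currently trips up buffers:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=False)

# any read_* function that accepts a file-like object works this way
with fs.open("s3://my-bucket/data.csv") as f:
    df_csv = pd.read_csv(f)

# for SAS, pass the format explicitly since it can't be inferred
# from a buffer's (nonexistent) file extension
with fs.open("s3://my-bucket/temp.sas7bdat") as f:
    df_sas = pd.read_sas(f, format="sas7bdat")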

@jbrockmendel jbrockmendel added the IO Network Local or Cloud (AWS, GCS, etc.) IO Issues label Dec 11, 2019
@mroeschke
Member

I think this fully works with fsspec support now, so closing.
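
For anyone landing here later, a minimal sketch of the current usage (assumes a recent pandas with fsspec and s3fs installed; bucket and key are placeholders):

import pandas as pd

# read_sas now accepts S3 URLs directly via fsspec; credentials are
# picked up from the usual AWS environment/configuration
df = pd.read_sas("s3://my-bucket/temp.sas7bdat")

# or open the file explicitly and pass the handle
import fsspec
with fsspec.open("s3://my-bucket/temp.sas7bdat", "rb") as f:
    df = pd.read_sas(f, format="sas7bdat")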
