Reading a SAS file with 0 rows gives None #18198

Closed
AbdealiLoKo opened this issue Nov 9, 2017 · 10 comments · Fixed by #47116
Labels: Bug; Error Reporting (Incorrect or improved errors from pandas); good first issue; IO SAS (SAS: read_sas)

Comments

@AbdealiLoKo
Contributor

Code Sample, a copy-pastable example if possible

In [2]: import pandas as pd

In [3]: pd.read_csv("zero_observations.csv")
Out[3]: 
Empty DataFrame
Columns: [a, b, c, d]
Index: []

In [4]: pd.read_sas("zero_observations.sas7bdat")

In [5]: 

This is the CSV file:

$ cat zero_observations.csv
a,b,c,d

Problem description

When reading a SAS file with 0 records, read_sas() returns None. But reading a CSV file with 0 records returns an empty DataFrame.

With SAS, we can at least identify the datatypes correctly and build an empty DataFrame from them.
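To illustrate what an empty result that still carries datatypes could look like, here is a minimal sketch. The schema dict and the helper name `empty_frame` are hypothetical; this is not how read_sas() builds its result, only a way to construct a zero-row DataFrame with known column names and dtypes.

```python
import pandas as pd

# Hypothetical schema: column name -> dtype, as a SAS header might describe it.
schema = {"a": "float64", "b": "object"}


def empty_frame(schema):
    """Build a zero-row DataFrame whose columns keep the given dtypes."""
    return pd.DataFrame(
        {name: pd.Series(dtype=dtype) for name, dtype in schema.items()}
    )


df = empty_frame(schema)
print(df.dtypes)
```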

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 23.0.0
Cython: 0.27.3
numpy: 1.13.3
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.4.9
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 0.9.8
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Nov 9, 2017

Is this just an issue for #18184?

Normally we open issues first, but since you opened the PR we don't actually need an issue (FYI for the future).

@AbdealiLoKo
Contributor Author

This is actually an issue for the 0-rows case, not the 0-variables case.
PR #18184 handles the 0-variables case. I'm not sure how to fix the 0-rows case, hence I created an issue for it.

@jreback jreback added Difficulty Novice Error Reporting Incorrect or improved errors from pandas IO SAS SAS: read_sas labels Nov 10, 2017
@jreback jreback added this to the Next Major Release milestone Nov 10, 2017
@AbdealiLoKo
Contributor Author

AbdealiLoKo commented Nov 10, 2017

When going through this, I realized there may be some confusion in expected behavior.

In read_csv(), if I pass a file with 0 rows and iterator=True, it gives:

In [30]: for i, idf in enumerate(pd.read_csv("/tmp/a.csv", iterator=True)):
    ...:     print("IterCount:", i)
    ...:     print(idf)
    ...:     
IterCount: 0
Empty DataFrame
Columns: [a, b, c, d]
Index: []

This seems non-intuitive because even though there are 0 rows, the iterator runs at least once, while in read_sas() the equivalent iterator runs 0 times.
In read_csv(), when I pass a file with 2 records and chunksize=2, it gives an iterator with 1 item, not two items (the first with 2 rows, the second with 0 rows):

In [32]: for i, idf in enumerate(pd.read_csv("/tmp/b.csv", iterator=True, chunksize=2)):
    ...:     print("IterCount:", i)
    ...:     print(idf)
    ...:     
IterCount: 0
   a  b  c  d
0  1  2  3  4
1  5  6  7  8
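The read_csv() behavior can be reproduced without files on disk. The following is a small sketch that uses in-memory strings in place of the files from the sessions above (the variable names are made up for illustration): a header-only CSV yields an empty DataFrame with its column names, and a 2-row CSV read with chunksize=2 yields a single chunk.

```python
import io

import pandas as pd

# In-memory stand-ins for the CSV files used in the sessions above.
header_only = "a,b,c,d\n"
two_rows = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"

# Whole-file read of a CSV with a header but no records.
empty_df = pd.read_csv(io.StringIO(header_only))

# Chunked read: collect every chunk the iterator produces.
chunks = list(pd.read_csv(io.StringIO(two_rows), chunksize=2))

print(list(empty_df.columns), len(empty_df))
print("number of chunks:", len(chunks))
```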

This is definitely a bit of a corner case, but I'd like to know what the expected behavior is.

I believe the following makes sense:

  • If it is an iterator, give None so that the iterator can stop correctly
  • If it is not an iterator and the whole DataFrame is being read, build an empty DataFrame with the header names and column types and return it
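The two cases above can be sketched with a toy reader over in-memory rows. This is only an illustration of the proposed semantics, not the read_sas() implementation: `read_all` and `iter_chunks` are hypothetical names, and column types are left out to keep the sketch short.

```python
import pandas as pd


def read_all(columns, rows):
    """Whole-file read: always return a DataFrame, even for zero rows."""
    return pd.DataFrame(rows, columns=columns)


def iter_chunks(columns, rows, chunksize):
    """Chunked read: yield nothing at all when there are zero rows."""
    for start in range(0, len(rows), chunksize):
        yield pd.DataFrame(rows[start:start + chunksize], columns=columns)


columns = ["a", "b", "c", "d"]
no_rows = []
two_rows = [(1, 2, 3, 4), (5, 6, 7, 8)]

whole = read_all(columns, no_rows)                        # empty, columns kept
empty_chunks = list(iter_chunks(columns, no_rows, 2))     # iterator runs 0 times
full_chunks = list(iter_chunks(columns, two_rows, 2))     # exactly one chunk
```

With this split, a caller looping over chunks never sees a zero-row chunk, while a caller reading the whole file always gets a DataFrame back.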

@AbdealiLoKo
Contributor Author

Any thoughts on this one?

@mukesh5

mukesh5 commented Jan 12, 2019

Is this issue still open? Can I pick it up?

@jbrockmendel
Member

@AbdealiJK this shouldn't be too hard to fix, but in order to test it we will need a zero-observation sas7bdat file. Can you provide one?

@mroeschke mroeschke added the Bug label Apr 5, 2020
@paul-lilley
Contributor

paul-lilley commented Aug 31, 2020

Hi

data one_row (compress=no);
	char_field = "abc";
	num_field = 123.4;
run;
data zero_rows_no_compress (compress=no);
	set one_row (obs=0);
run;

The above SAS code created the attached SAS file with 0 rows. Hope this helps @jbrockmendel @mukesh5

zero_rows_no_compress.zip

@AbdealiLoKo
Contributor Author

Hi, sorry @jbrockmendel, it seems I missed your ping earlier.
I haven't had access to a SAS environment for the past year, so I haven't been able to follow up on this or provide an example file.

@ofajardo

ofajardo commented Dec 9, 2020

Just in case it helps: pyreadstat can read this file OK (if this is what you are expecting):

>>> import pyreadstat
>>> df, meta = pyreadstat.read_sas7bdat("zero_rows_no_compress.sas7bdat")
>>> df
Empty DataFrame
Columns: [char_field, num_field]
Index: []

So one option could be to add pyreadstat as a backend for read_sas (it is already used in read_spss). It would also help solve other issues: #37088, #35545, #22720

@Condielj

take
