-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Detect CPORT header in SAS files #44300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/io/sas/sas_xport.py
Outdated
@@ -279,6 +279,8 @@ def _read_header(self): | |||
# read file header | |||
line1 = self._get_row() | |||
if line1 != _correct_line1: | |||
if "**COMPRESSED**" in _correct_line1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a unit test and test file that could hit this condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to. Is it ok to use assertRaises? I ask because the other tests don't use the unittests
module, and I couldn't find an example of testing for a specific exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with pytest.raises(FooError, match=regexp_for_exception_message):
func()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Adding the unit test showed that I had a bug, which I fixed, and the tests are passing now. (As an aside, I missed that the instructions to add unit tests for changes and to use pytest are right there in the Contributing to pandas doc. I really appreciate the patience and guidance 🙂)
96d8828
to
5b43e1b
Compare
lgtm. @bashtage if any comments. |
I just deleted most of the CPT binary to cut down on wasted space - unit tests still pass. |
@@ -279,6 +279,10 @@ def _read_header(self): | |||
# read file header | |||
line1 = self._get_row() | |||
if line1 != _correct_line1: | |||
if "**COMPRESSED**" in line1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this value in any file format document provided by SAS? If so, would be good to include a reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've added it in a new commit
|
||
def test_cport_header_found_raises(self): | ||
# Test with DEMO_PUF.cpt, the beginning of puf2019_1_fall.xpt | ||
# from https://www.cms.gov/files/zip/puf2019.zip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this public domain or similar? If not, might be good to have someone create a CPT file from scratch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's publicly available data (from the US government's CMS, stored at https://www.cms.gov/files/zip/puf2019.zip). I don't have SAS so can't create a file. But the compressed CPORT files will have **COMPRESSED** **COMPRESSED** **COMPRESSED** **COM
as the first 40 characters (as documented here), so I just opened this file in a text editor and deleted everything after the first few lines.
If you want to keep the shortened CPT file (and why not), you should be sure to rebase and squash to a single commit so that any history is removed from the commits. |
16df8c3
to
8485358
Compare
Oh yeah, my bad! Done. |
thanks @lordgrenville |
Refers #15825 . Doesn't fix the issue (might be unfixable), but gives a more helpful warning.