Skip to content

Detect CPORT header in SAS files #44300

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 6, 2021

Conversation

lordgrenville
Copy link
Contributor

Refers #15825 . Doesn't fix the issue (might be unfixable), but gives a more helpful warning.

@@ -279,6 +279,8 @@ def _read_header(self):
# read file header
line1 = self._get_row()
if line1 != _correct_line1:
if "**COMPRESSED**" in _correct_line1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a unit test and test file that could hit this condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to. Is it ok to use assertRaises? I ask because the other tests don't use the unittests module, and I couldn't find an example of testing for a specific exception.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with pytest.raises(FooError, match=regexp_for_exception_message):
     func()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Adding the unit test showed that I had a bug, which I fixed, and the tests are passing now. (As an aside, I missed that the instructions to add unit tests for changes and to use pytest are right there in the Contributing to pandas doc. I really appreciate the patience and guidance 🙂)

@mroeschke mroeschke added Error Reporting Incorrect or improved errors from pandas IO SAS SAS: read_sas labels Nov 3, 2021
@jreback jreback added this to the 1.4 milestone Nov 5, 2021
@jreback
Copy link
Contributor

jreback commented Nov 5, 2021

lgtm. @bashtage if any comments.

@lordgrenville
Copy link
Contributor Author

I just deleted most of the CPT binary to cut down on wasted space - unit tests still pass.

@@ -279,6 +279,10 @@ def _read_header(self):
# read file header
line1 = self._get_row()
if line1 != _correct_line1:
if "**COMPRESSED**" in line1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this value in any file format document provided by SAS? If so, would be good to include a reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've added it in a new commit


def test_cport_header_found_raises(self):
# Test with DEMO_PUF.cpt, the beginning of puf2019_1_fall.xpt
# from https://www.cms.gov/files/zip/puf2019.zip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this public domain or similar? If not, might be good to have someone create a CPT file from scratch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's publicly available data (from the US government's CMS, stored at https://www.cms.gov/files/zip/puf2019.zip). I don't have SAS so can't create a file. But the compressed CPORT files will have **COMPRESSED** **COMPRESSED** **COMPRESSED** **COM as the first 40 characters (as documented here), so I just opened this file in a text editor and deleted everything after the first few lines.

@bashtage
Copy link
Contributor

bashtage commented Nov 5, 2021

I just deleted most of the CPT binary to cut down on wasted space - unit tests still pass.

If you want to keep the shortened CPT file (and why not), you should be sure to rebase and squash to a single commit so that any history is removed from the commits.

@lordgrenville
Copy link
Contributor Author

I just deleted most of the CPT binary to cut down on wasted space - unit tests still pass.

If you want to keep the shortened CPT file (and why not), you should be sure to rebase and squash to a single commit so that any history is removed from the commits.

Oh yeah, my bad! Done.

@jreback jreback merged commit fe62783 into pandas-dev:master Nov 6, 2021
@jreback
Copy link
Contributor

jreback commented Nov 6, 2021

thanks @lordgrenville

@lordgrenville lordgrenville deleted the detect_SAS_CPORT branch November 7, 2021 09:50
@lithomas1 lithomas1 mentioned this pull request Nov 14, 2021
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas IO SAS SAS: read_sas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants