-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix SAS7BDAT run-length encoding formula #47099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Similar case here #31243 (comment) |
pyreadstat has the same logic https://github.com/Roche/pyreadstat/blob/master/src/sas/readstat_sas_rle.c#L62-L63 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test case / sample file that replicates the bug & a whatsnew note (I/O bug fixes for 1.5)
See my note above about a test case. Can add a what’s new |
Unfortunately all (shareable) test files I've seen so far, including the ones in pyreadstat, don't reproduce the bug. |
A user provided an example SAS file in #44216. Could be helpful, if this SAS file triggers the error you fixed. |
Unfortunately both of the files do not reproduce the bug |
cc @bashtage WDYT? |
Would be good to have a file that requires this offset Is it possible to create one? |
@bashtage Unfortunately I don't have a file that I can share and pyreadstat can't write SAS7BDAT files. Do you have any ideas where to get such a file? |
LGTM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
# GH 47099 | ||
fname = datapath("io", "sas", "data", "0x00controlbyte.sas7bdat.bz2") | ||
df = next(pd.read_sas(fname, chunksize=11_000)) | ||
assert df.shape == (11_000, 20) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a review comment so I can approve.
Hello @jonashaag! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-07-02 16:18:30 UTC |
@jreback @mroeschke mind reviewing this? |
@@ -940,6 +940,7 @@ I/O | |||
- Bug in :meth:`DataFrame.to_excel` when writing an empty dataframe with :class:`MultiIndex` (:issue:`19543`) | |||
- Bug in :func:`read_sas` with RLE-compressed SAS7BDAT files that contain 0x40 control bytes (:issue:`31243`) | |||
- Bug in :func:`read_sas` that scrambled column names (:issue:`31243`) | |||
- Bug in :func:`read_sas` with RLE-compressed SAS7BDAT files that contain 0x00 control bytes (:issue:`47099`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this have the same note as two lines above? If so, you can just add to the issue references (:issue:..., :issue:...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it’s a different control byte
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks missed that 😅
Thanks @jonashaag |
Thanks :) |
* Fix SAS7BDAT run-length encoding formula See https://bitbucket.org/jaredhobbs/sas7bdat/src/18cbd14407918c1aa90f136c1d6c5d83f307dba0/sas7bdat.py#lines-110:114 * Undo safety change * Undo unnecessary bit mask * Undo syntax change * Update v1.5.0.rst * Update v1.5.0.rst * Add test file * Fix tests * Update test_sas7bdat.py
Run across an error/simplification? of a RLE formula in the SAS7BDAT parser. The original code https://bitbucket.org/jaredhobbs/sas7bdat/src/18cbd14407918c1aa90f136c1d6c5d83f307dba0/sas7bdat.py#lines-110:114 has the correct formula. I am not sure why we have another version.
Unfortunately I do not know how to add a regression test. The SAS file that reproduces the issue is proprietary, as is the SAS7BDAT file format, and I'm not aware of any means of creating a SAS file.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.