Skip to content

Fix SAS7BDAT run-length encoding formula #47099

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jul 5, 2022
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -940,6 +940,7 @@ I/O
- Bug in :meth:`DataFrame.to_excel` when writing an empty dataframe with :class:`MultiIndex` (:issue:`19543`)
- Bug in :func:`read_sas` with RLE-compressed SAS7BDAT files that contain 0x40 control bytes (:issue:`31243`)
- Bug in :func:`read_sas` that scrambled column names (:issue:`31243`)
- Bug in :func:`read_sas` with RLE-compressed SAS7BDAT files that contain 0x00 control bytes (:issue:`47099`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have the same note as two lines above? If so, you can just add to the issue references (:issue:..., :issue:...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it’s a different control byte

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks missed that 😅

-

Period
Expand Down
4 changes: 1 addition & 3 deletions pandas/io/sas/sas.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -28,9 +28,7 @@ cdef const uint8_t[:] rle_decompress(int result_length, const uint8_t[:] inbuff)
ipos += 1

if control_byte == 0x00:
if end_of_first_byte != 0:
raise ValueError("Unexpected non-zero end_of_first_byte")
nbytes = <int>(inbuff[ipos]) + 64
nbytes = <int>(inbuff[ipos]) + 64 + end_of_first_byte * 256
ipos += 1
for _ in range(nbytes):
result[rpos] = inbuff[ipos]
Expand Down
Binary file not shown.
7 changes: 7 additions & 0 deletions pandas/tests/io/sas/test_sas7bdat.py
Original file line number Diff line number Diff line change
Expand Up @@ -390,3 +390,10 @@ def test_0x40_control_byte(datapath):
fname = datapath("io", "sas", "data", "0x40controlbyte.csv")
df0 = pd.read_csv(fname, dtype="object")
tm.assert_frame_equal(df, df0)


def test_0x00_control_byte(datapath):
# GH 47099
fname = datapath("io", "sas", "data", "0x00controlbyte.sas7bdat.bz2")
df = next(pd.read_sas(fname, chunksize=11_000))
assert df.shape == (11_000, 20)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a review comment so I can approve.