Commit cdc2531

jonashaag authored and yehoshuadimarsky committed
Fix SAS7BDAT run-length encoding formula (pandas-dev#47099)
* Fix SAS7BDAT run-length encoding formula

  See https://bitbucket.org/jaredhobbs/sas7bdat/src/18cbd14407918c1aa90f136c1d6c5d83f307dba0/sas7bdat.py#lines-110:114
* Undo safety change
* Undo unnecessary bit mask
* Undo syntax change
* Update v1.5.0.rst
* Update v1.5.0.rst
* Add test file
* Fix tests
* Update test_sas7bdat.py
1 parent f463ef0 commit cdc2531

4 files changed, +9 -3 lines changed

doc/source/whatsnew/v1.5.0.rst  +1

@@ -945,6 +945,7 @@ I/O
 - Bug in :meth:`DataFrame.to_excel` when writing an empty dataframe with :class:`MultiIndex` (:issue:`19543`)
 - Bug in :func:`read_sas` with RLE-compressed SAS7BDAT files that contain 0x40 control bytes (:issue:`31243`)
 - Bug in :func:`read_sas` that scrambled column names (:issue:`31243`)
+- Bug in :func:`read_sas` with RLE-compressed SAS7BDAT files that contain 0x00 control bytes (:issue:`47099`)
 -
 
 Period

pandas/io/sas/sas.pyx  +1 -3

@@ -28,9 +28,7 @@ cdef const uint8_t[:] rle_decompress(int result_length, const uint8_t[:] inbuff)
         ipos += 1
 
         if control_byte == 0x00:
-            if end_of_first_byte != 0:
-                raise ValueError("Unexpected non-zero end_of_first_byte")
-            nbytes = <int>(inbuff[ipos]) + 64
+            nbytes = <int>(inbuff[ipos]) + 64 + end_of_first_byte * 256
             ipos += 1
             for _ in range(nbytes):
                 result[rpos] = inbuff[ipos]

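The length formula changed in the hunk above can be sketched in plain Python. This is an illustrative sketch, not the Cython code from `sas.pyx`: the helper names `copy_run_length` and `decode_copy_run` are hypothetical, only the 0x00 command is handled, and the split of the first byte into a high-nibble command and a low-nibble `end_of_first_byte` is assumed from the variable names in the diff.

```python
from typing import Tuple


def copy_run_length(first_byte: int, second_byte: int) -> int:
    """Length of the literal run for a 0x00 command (hypothetical helper).

    Assumes the high nibble of first_byte is the command and the low
    nibble is end_of_first_byte, as the names in the diff suggest.
    """
    control_byte = first_byte & 0xF0
    end_of_first_byte = first_byte & 0x0F
    if control_byte != 0x00:
        raise ValueError("not a 0x00 control byte")
    # Patched formula: the low nibble contributes multiples of 256.
    return second_byte + 64 + end_of_first_byte * 256


def decode_copy_run(inbuff: bytes, ipos: int = 0) -> Tuple[bytes, int]:
    """Copy one literal run starting at ipos; return (data, next position)."""
    nbytes = copy_run_length(inbuff[ipos], inbuff[ipos + 1])
    start = ipos + 2  # skip the control byte and the length byte
    end = start + nbytes
    if end > len(inbuff):
        raise ValueError("input buffer ends before the literal run does")
    return inbuff[start:end], end
```

With the removed check in place, any non-zero low nibble raised `ValueError`, so a 0x00 command could describe at most 255 + 64 = 319 literal bytes. Under the new formula a first byte of 0x01 followed by 0x00 describes a run of 320 bytes.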
Binary file not shown.

pandas/tests/io/sas/test_sas7bdat.py  +7

@@ -390,3 +390,10 @@ def test_0x40_control_byte(datapath):
     fname = datapath("io", "sas", "data", "0x40controlbyte.csv")
     df0 = pd.read_csv(fname, dtype="object")
     tm.assert_frame_equal(df, df0)
+
+
+def test_0x00_control_byte(datapath):
+    # GH 47099
+    fname = datapath("io", "sas", "data", "0x00controlbyte.sas7bdat.bz2")
+    df = next(pd.read_sas(fname, chunksize=11_000))
+    assert df.shape == (11_000, 20)
