Skip to content

Fix SAS7BDAT run-length encoding formula #47099

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jul 5, 2022
Merged

Conversation

jonashaag
Copy link
Contributor

@jonashaag jonashaag commented May 23, 2022

Run across an error/simplification? of a RLE formula in the SAS7BDAT parser. The original code https://bitbucket.org/jaredhobbs/sas7bdat/src/18cbd14407918c1aa90f136c1d6c5d83f307dba0/sas7bdat.py#lines-110:114 has the correct formula. I am not sure why we have another version.

Unfortunately I do not know how to add a regression test. The SAS file that reproduces the issue is proprietary, as is the SAS7BDAT file format, and I'm not aware of any means of creating a SAS file.

  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@jonashaag
Copy link
Contributor Author

Similar case here #31243 (comment)

@jonashaag
Copy link
Contributor Author

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test case / sample file that replicates the bug & a whatsnew note (I/O bug fixes for 1.5)

@jonashaag
Copy link
Contributor Author

See my note above about a test case. Can add a what’s new

@jonashaag
Copy link
Contributor Author

jonashaag commented May 24, 2022

Unfortunately all (shareable) test files I've seen so far, including the ones in pyreadstat, don't reproduce the bug.

@twoertwein
Copy link
Member

A user provided an example SAS file in #44216. Could be helpful, if this SAS file triggers the error you fixed.

@jonashaag
Copy link
Contributor Author

Unfortunately both of the files do not reproduce the bug

@jonashaag jonashaag requested a review from jreback May 30, 2022 14:06
@jreback jreback added the IO SAS SAS: read_sas label Jun 5, 2022
@jreback
Copy link
Contributor

jreback commented Jun 5, 2022

cc @bashtage WDYT?

@bashtage
Copy link
Contributor

bashtage commented Jun 6, 2022

Would be good to have a file that requires this offset Is it possible to create one?

@jonashaag
Copy link
Contributor Author

@bashtage Unfortunately I don't have a file that I can share and pyreadstat can't write SAS7BDAT files. Do you have any ideas where to get such a file?

@jonashaag
Copy link
Contributor Author

@bashtage came up with a test file by zeroing almost all of the bytes of a real world file. PR now depends on #47154

@bashtage
Copy link
Contributor

@bashtage came up with a test file by zeroing almost all of the bytes of a real world file. PR now depends on #47154

Once that is in rebase and then should be good.

@bashtage
Copy link
Contributor

LGTM.

@bashtage bashtage self-requested a review June 21, 2022 07:19
Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

# GH 47099
fname = datapath("io", "sas", "data", "0x00controlbyte.sas7bdat.bz2")
df = next(pd.read_sas(fname, chunksize=11_000))
assert df.shape == (11_000, 20)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a review comment so I can approve.

@pep8speaks
Copy link

pep8speaks commented Jun 26, 2022

Hello @jonashaag! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-07-02 16:18:30 UTC

@jonashaag
Copy link
Contributor Author

@jreback @mroeschke mind reviewing this?

@@ -940,6 +940,7 @@ I/O
- Bug in :meth:`DataFrame.to_excel` when writing an empty dataframe with :class:`MultiIndex` (:issue:`19543`)
- Bug in :func:`read_sas` with RLE-compressed SAS7BDAT files that contain 0x40 control bytes (:issue:`31243`)
- Bug in :func:`read_sas` that scrambled column names (:issue:`31243`)
- Bug in :func:`read_sas` with RLE-compressed SAS7BDAT files that contain 0x00 control bytes (:issue:`47099`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have the same note as two lines above? If so, you can just add to the issue references (:issue:..., :issue:...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it’s a different control byte

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks missed that 😅

@mroeschke mroeschke added this to the 1.5 milestone Jul 5, 2022
@mroeschke mroeschke merged commit 8e8f627 into pandas-dev:main Jul 5, 2022
@mroeschke
Copy link
Member

Thanks @jonashaag

@jonashaag
Copy link
Contributor Author

Thanks :)

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
* Fix SAS7BDAT run-length encoding formula

See https://bitbucket.org/jaredhobbs/sas7bdat/src/18cbd14407918c1aa90f136c1d6c5d83f307dba0/sas7bdat.py#lines-110:114

* Undo safety change

* Undo unnecessary bit mask

* Undo syntax change

* Update v1.5.0.rst

* Update v1.5.0.rst

* Add test file

* Fix tests

* Update test_sas7bdat.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SAS SAS: read_sas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants