-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Dont include deleted rows from sas7bdat files (#15963) #22650
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @troels! Thanks for submitting the PR.
|
1b1e8ec
to
b8696f9
Compare
Codecov Report
@@ Coverage Diff @@
## master #22650 +/- ##
==========================================
+ Coverage 92.17% 92.17% +<.01%
==========================================
Files 169 169
Lines 50708 50721 +13
==========================================
+ Hits 46740 46754 +14
+ Misses 3968 3967 -1
Continue to review full report at Codecov.
|
Haven't looked in detail due to size of PR but I noticed your original post only mentions one issue whereas the PR references 3 in the whatsnew. Is that intentional or have you simply pulled in some other concurrent developments accidentally? |
Hi @WillAyd I made a separate PR #22628 for the two first commits. This one I though was large enough to have its own PR, but as there was a dependency on the first two commits I continued this PR from the branch #22628 had sprung from. |
db6ba8b
to
69779d5
Compare
Codecov Report
@@ Coverage Diff @@
## master #22650 +/- ##
==========================================
+ Coverage 92.18% 92.18% +<.01%
==========================================
Files 169 169
Lines 50812 50821 +9
==========================================
+ Hits 46842 46851 +9
Misses 3970 3970
Continue to review full report at Codecov.
|
69779d5
to
fcb1699
Compare
Sas7bdat may contain rows which are actually deleted. If the page_type has bit 128 set, there is a bitmap following the normal row data with a bit set for a given row if it has been deleted. Use that information to not include deleted rows in the resulting dataframe.
page_type = self.current_page_type | ||
subheader_pointers_length = \ | ||
self.subheader_pointer_length * self.current_page_subheaders_count | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some comments in here
self.calculate_deleted_rows_bitmap_offset() | ||
|
||
cdef bint is_row_deleted(self, int row_number): | ||
cdef: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc-string & comments
Can you address comments and merge master? |
can you merge master and address comments |
Closing as stale. Ping if you'd like to continue |
This would be pretty useful as the issue makes read_sas unusable for a lot of sas datasets. |
Sas7bdat files may contain rows which are actually deleted.
After reverse engineering the file format a bit, I found that
if the page_type has bit 7 set, there is a bitmap following
the normal row data with a bit set for a given row if it has been
deleted. Use that information to not include deleted rows in
the resulting dataframe.
This PR is built on top of #22628
I've added two test-cases one from the issue and another constructed
example, which tests parsing on different page types and across multiple sas7bdat-pages.
The constructed example is rather large, mainly to have a test case where the data
flows across several pages.
git diff upstream/master -u -- "*.py" | flake8 --diff