-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Some sas7bdat files with many columns are not parseable by read_sas #22628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @troels! Thanks for updating the PR.
Comment last updated on September 07, 2018 at 19:51 Hours UTC |
@troels : Thanks for the report! Do you mind sharing in this PR what output you get without your patch for reference? A small, reproducible example would be great. |
Codecov Report
@@ Coverage Diff @@
## master #22628 +/- ##
==========================================
+ Coverage 92.17% 92.17% +<.01%
==========================================
Files 169 169
Lines 50708 50712 +4
==========================================
+ Hits 46740 46745 +5
+ Misses 3968 3967 -1
Continue to review full report at Codecov.
|
Hi @gfyoung So the test case I added will fail in the current version of pandas, with the following error (these files can't be parsed):
|
2d81f92
to
8f498d5
Compare
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -690,7 +690,8 @@ I/O | |||
- :func:`read_html()` no longer ignores all-whitespace ``<tr>`` within ``<thead>`` when considering the ``skiprows`` and ``header`` arguments. Previously, users had to decrease their ``header`` and ``skiprows`` values on such tables to work around the issue. (:issue:`21641`) | |||
- :func:`read_excel()` will correctly show the deprecation warning for previously deprecated ``sheetname`` (:issue:`17994`) | |||
- :func:`read_csv()` will correctly parse timezone-aware datetimes (:issue:`22256`) | |||
- | |||
- :func:`read_sas` will correctly parse sas7bdat files with many columns (:issue:`22628`) | |||
- :func:`read_sas` will correctly parse sas7bdat files with odd data page types (:issue:`16615`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you expand on 'odd' here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to elaborate a bit, but the meaning of bit 7 is still rather unclear. It may have something to do with page also containing a bit map of deleted rows, but don't know precisely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is certain is that it can be parsed as a normal data page (possibly including the deleted rows)
pandas/tests/io/sas/test_sas7bdat.py
Outdated
@@ -183,6 +183,28 @@ def test_date_time(datapath): | |||
tm.assert_frame_equal(df, df0) | |||
|
|||
|
|||
def test_many_columns(datapath): | |||
fname = datapath("io", "sas", "data", "many_columns.sas7bdat") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback: likewise.
The test failure here is quite unrelated to the PR. |
The reason is that column definitions may be split up into different pages. Allow column information to be parsed from different pages and add a test for it.
Hi @jreback Anything else preventing this from being merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not @jreback , but I don't think there are any barriers to merging this. LGTM!
thanks! |
Thanks both of you :) |
git diff upstream/master -u -- "*.py" | flake8 --diff
The reason is that column definitions may be split up into different pages.
Allow column information to be parsed from different pages
and add a test for it.