Meta issue: SAS7BDAT parser improvements #47339
In general, smaller PRs are easier to review, assuming they can be split into pieces that make sense independently.
@jonashaag looks like the PRs have all been merged. Is this closable? BTW, I spent a little time trying to figure out what it would take to parallelize read_sas (more relevant for dask/modin than for pandas directly), and mostly concluded that the lower-level functions would need something like skip_rows. Any interest in this topic if I revisit it?
Looks like it! Can you share more context about your use case with parallel processing?
Btw, I've recently worked on an entirely new parser implementation; not sure what to do with it yet. I had some code that implemented row and page skipping, but it was more complicated to get right than I had hoped. Currently working on a pure C++ converter from SAS7BDAT to Parquet: https://github.com/jonashaag/sas7bdat
For my day job I spend some time working on modin (a distributed pandas-alike). The easiest way to make that work would be if pd.read_sas supported something like a skip_rows parameter.
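
A minimal sketch of how a distributed frontend could use such an API, assuming hypothetical skiprows/nrows keyword arguments that pd.read_sas does not currently accept:

```python
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def read_partition(path, start, count):
    # Hypothetical kwargs: pd.read_sas has no skiprows/nrows today.
    return pd.read_sas(path, skiprows=start, nrows=count)

def parallel_read_sas(path, total_rows, n_workers=4):
    # Split the row range into one contiguous chunk per worker.
    step = -(-total_rows // n_workers)  # ceiling division
    starts = list(range(0, total_rows, step))
    counts = [min(step, total_rows - s) for s in starts]
    with ProcessPoolExecutor(n_workers) as pool:
        parts = list(pool.map(read_partition,
                              [path] * len(starts), starts, counts))
    return pd.concat(parts, ignore_index=True)
```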
IIRC, some of the trouble came from the fact that the pandas SASReader/Parser classes are really weird about how they share and update state: the higher-level one passes itself as a parameter to the lower-level one's constructor, which seems like an antipattern.
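
A minimal sketch of the coupling pattern being described; the class and attribute names here are illustrative, not the actual pandas internals:

```python
class PageParser:
    def __init__(self, reader):
        self.reader = reader  # back-reference to the object that created it

    def parse(self, nrows):
        self.reader.current_row += nrows  # mutates the caller's state


class SAS7BDATReaderSketch:
    def __init__(self, path):
        self.path = path
        self.current_row = 0            # state shared with the parser
        self.parser = PageParser(self)  # passes itself into the constructor

    def read(self, nrows):
        self.parser.parse(nrows)  # the parser advances current_row for us
        return self.current_row
```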
Yeah, page skipping sounds like a good approach; it should be relatively simple to implement and efficient. The data I'm dealing with is almost always compressed (e.g. gzip or zstd), so skipping is slowed down considerably by the inability to skip bytes at the filesystem level.
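
To make that cost concrete, here is a small self-contained illustration using Python's gzip module (the principle is the same for zstd): seeking forward in a compressed stream still decompresses every byte up to the target offset, so "skipping" pages saves parsing work but not decompression work.

```python
import gzip
import io
import time

# Build an in-memory gzip stream (zeros compress extremely well).
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(bytes(50_000_000))  # 50 MB uncompressed
buf.seek(0)

with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    start = time.perf_counter()
    f.seek(45_000_000)  # "skip" 45 MB: gzip still decompresses all of it
    print(f"forward seek took {time.perf_counter() - start:.3f}s")
```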
I have a ~20x SAS7BDAT parser speedup ready to PR. It's a lot of changes; the goal is to avoid Python operations as much as possible. A preview of all of the changes can be found here: jonashaag#7
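
As a generic illustration of that direction (this is not code from the linked preview PR): replacing per-value struct.unpack calls with a single vectorized numpy decode of the raw page bytes eliminates one interpreter round-trip per value.

```python
import struct
import numpy as np

# Stand-in for a decompressed data page of little-endian doubles.
values = [float(i) for i in range(100_000)]
raw = struct.pack(f"<{len(values)}d", *values)

# Per-value Python loop: one interpreter round-trip per value.
slow = [struct.unpack_from("<d", raw, off)[0] for off in range(0, len(raw), 8)]

# Vectorized: a single C-level pass over the same buffer.
fast = np.frombuffer(raw, dtype="<f8")

assert slow == fast.tolist()
```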
I want to contribute those changes to pandas. Is it easier to review if I make many small PRs, or do you prefer reviewing one large PR? Multiple small PRs would look something like: five PRs with 10% of the changes each, plus one larger PR containing the remaining 50%. It will be very difficult to split the large PR any further.
One drawback of multiple small PRs is that it's more work for me, and some of the changes may not seem useful in isolation.
PRs: