Meta issue: SAS7BDAT parser improvements #47339

jonashaag opened this issue Jun 13, 2022 · 6 comments
Labels: IO SAS (SAS: read_sas), Performance (Memory or execution speed performance)

Comments

@jonashaag (Contributor)
jonashaag commented Jun 13, 2022

I have a ~20x SAS7BDAT parser speedup ready to PR. It's a lot of changes; the goal is to avoid Python-level operations as much as possible. A preview of all of the changes can be found here: jonashaag#7

I want to contribute those changes to pandas. Is it easier to review if I make a lot of small PRs, or do you prefer reviewing one large PR? Multiple small PRs would look something like five PRs covering ~10% of the changes each, plus one larger PR with the remaining ~50%. It will be very difficult to split that large PR any further.

One drawback of multiple small PRs is that it's more work for me and some of the changes may not seem useful if done in an isolated fashion.
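As a rough illustration of the "avoid Python operations" idea above (a generic sketch using the stdlib `struct` module, not the actual parser code): parsing fixed-width binary fields in one bulk call removes the per-value interpreter overhead of a Python loop.

```python
import struct

# Build a buffer of 8 little-endian uint32 values.
buf = struct.pack("<8I", *range(8))

# Per-value loop: one Python-level unpack call per field (slow at scale).
slow = [struct.unpack_from("<I", buf, i * 4)[0] for i in range(8)]

# Bulk parse: a single call for the whole buffer, far less overhead.
fast = list(struct.unpack(f"<{len(buf) // 4}I", buf))

assert slow == fast == list(range(8))
```

The same principle applies when moving hot loops into vectorized or Cython code: the fewer times the interpreter touches each value, the faster the parse.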


PRs:

@jbrockmendel (Member)

> Is it easier to review if I make a lot of small PRs, or do you prefer reviewing one large PR?

In general smaller PRs are easier to review. This is assuming that they can be split into bits that make sense independently.

@phofl added the Performance (Memory or execution speed performance) and IO SAS (SAS: read_sas) labels on Jun 14, 2022
@jbrockmendel (Member)

@jonashaag looks like the PRs have all been merged. Is this closable?

BTW I spent a little time trying to figure out what it would take to parallelize read_sas (more relevant for dask/modin than pandas directly), and mostly concluded that the lower-level functions need something like skip_rows. Any interest in this topic if I revisit it?

@jonashaag (Contributor, Author)

Looks like it!

Can you share more context about your use case with parallel processing?

@jonashaag (Contributor, Author)

jonashaag commented Jan 12, 2023

Btw I’ve recently worked on an entirely new parser implementation; not sure what to do with it yet. I had some code that implemented row and page skipping, but it was more complicated to get right than I had hoped. I'm currently working on a direct C++ converter from SAS7BDAT to Parquet: https://github.com/jonashaag/sas7bdat

@jbrockmendel (Member)

> Can you share more context about your use case with parallel processing?

For my day job I spend some time working on modin (a distributed pandas-alike). The easiest way to make that work would be if pd.read_sas supported something like `pd.read_sas(path, first_page=foo, last_page=bar)`.
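A sketch of how such a hypothetical `first_page`/`last_page` API could be driven from a scheduler. The `partition_pages` helper and the inclusive-bounds convention are assumptions for illustration; this is not an existing pandas API.

```python
def partition_pages(n_pages: int, n_workers: int) -> list[tuple[int, int]]:
    """Split pages 0..n_pages-1 into contiguous (first_page, last_page)
    ranges, one per worker, with inclusive bounds."""
    base, extra = divmod(n_pages, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        if size == 0:
            continue  # more workers than pages
        ranges.append((start, start + size - 1))
        start += size
    return ranges

print(partition_pages(10, 3))  # [(0, 3), (4, 6), (7, 9)]
```

Each worker would then read only its slice, e.g. the hypothetical `pd.read_sas(path, first_page=a, last_page=b)`, and the scheduler would concatenate the partial frames.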

> I had some code that implemented row and page skipping but it was more complicated to get right than I would have hoped for

IIRC some of the trouble came from the fact that the pandas SASReader/Parser classes are really weird about how they share/update state. The higher-level one passes itself as a param to the lower-level one's constructor, which seems like it must be an antipattern.
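A minimal sketch of the state-sharing pattern being described, with invented class names (this is not the actual pandas internals): the low-level parser holds a reference to the whole high-level reader and mutates its state, which makes either class hard to reason about in isolation.

```python
class LowLevelParser:
    # Antipattern sketch: the parser receives the entire high-level
    # reader and reaches into its mutable state.
    def __init__(self, reader):
        self.reader = reader

    def parse_page(self):
        self.reader.current_page += 1  # mutates shared state
        return self.reader.current_page


class HighLevelReader:
    def __init__(self):
        self.current_page = 0
        self._parser = LowLevelParser(self)  # passes itself down


# A more testable alternative: pass only the values the parser needs
# and return results instead of mutating the caller.
def parse_page(page_index: int) -> int:
    return page_index + 1


reader = HighLevelReader()
assert reader._parser.parse_page() == 1  # side effect on reader
assert parse_page(0) == 1                # pure, no shared state
```

With the pure-function shape, page-skipping becomes a matter of choosing which indices to pass in, rather than rewinding shared mutable state.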

@jonashaag (Contributor, Author)

Yeah, page skipping sounds like a good approach; it should be relatively simple to implement and efficient. The data I'm dealing with is almost always compressed (e.g. gzip or zstd), though, so skipping is slowed down considerably by the inability to skip bytes at the filesystem level.
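A small stdlib demonstration of why skipping doesn't avoid decompression work for gzip streams (illustrative only): gzip has no random access, so a forward `seek` on a `GzipFile` is implemented by decompressing and discarding all bytes before the target offset.

```python
import gzip
import io
import os

raw = os.urandom(1 << 16)  # 64 KiB standing in for "pages" of data
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(raw)

buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    f.seek(32768)   # "skips" 32 KiB, but decompresses it under the hood
    tail = f.read()

assert tail == raw[32768:]
```

So with compressed input, page skipping saves parsing work but not decompression work; formats with block-level indexes (or seekable-zstd-style framing) would be needed to skip I/O as well.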
