Meta issue: SAS7BDAT parser improvements #47339

jonashaag opened this issue Jun 13, 2022 · 6 comments
Labels: IO SAS (SAS: read_sas), Performance (Memory or execution speed performance)

Comments

@jonashaag (Contributor)
jonashaag commented Jun 13, 2022

I have a ~20x SAS7BDAT parser speedup ready to PR. It's a lot of changes; the goal is to avoid Python-level operations as much as possible. A preview of all of the changes can be found here: jonashaag#7

I want to contribute those changes to pandas. Is it easier to review if I make a lot of small PRs, or do you prefer reviewing one large PR? Multiple small PRs would look something like five PRs covering ~10% of the changes each, plus one larger PR with the remaining ~50%. It will be very difficult to split that large PR any further.

One drawback of multiple small PRs is that it's more work for me and some of the changes may not seem useful if done in an isolated fashion.
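As a rough illustration of the "avoid Python operations" idea above (a generic sketch using the stdlib `struct` module, not the actual parser code): parsing fixed-width binary fields in one bulk call removes the per-value interpreter overhead of a Python loop.

```python
import struct

# Build a buffer of 8 little-endian uint32 values.
buf = struct.pack("<8I", *range(8))

# Per-value loop: one Python-level unpack call per field (slow at scale).
slow = [struct.unpack_from("<I", buf, i * 4)[0] for i in range(8)]

# Bulk parse: a single call for the whole buffer, far less overhead.
fast = list(struct.unpack(f"<{len(buf) // 4}I", buf))

assert slow == fast == list(range(8))
```

The same principle applies when moving hot loops into vectorized or Cython code: the fewer times the interpreter touches each value, the faster the parse.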


PRs:

@jbrockmendel (Member)

> Is it easier to review if I make a lot of small PRs, or do you prefer reviewing one large PR?

In general smaller PRs are easier to review. This is assuming that they can be split into bits that make sense independently.

@phofl added the Performance (Memory or execution speed performance) and IO SAS (SAS: read_sas) labels on Jun 14, 2022
@jbrockmendel (Member)

@jonashaag looks like the PRs have all been merged. Is this closable?

BTW I spent a little time trying to figure out what it would take to parallelize read_sas (more relevant for dask/modin than pandas directly), and mostly concluded that the lower-level functions need something like skip_rows. Any interest in this topic if I revisit it?

@jonashaag (Contributor, Author)

Looks like it!

Can you share more context about your use case with parallel processing?

@jonashaag (Contributor, Author)

jonashaag commented Jan 12, 2023

Btw I’ve recently worked on an entirely new parser implementation; not sure what to do with it yet. I had some code that implemented row and page skipping, but it was more complicated to get right than I had hoped. I'm currently working on a direct C++ converter from SAS7BDAT to Parquet: https://github.com/jonashaag/sas7bdat

@jbrockmendel (Member)

> Can you share more context about your use case with parallel processing?

For my day job I spend some time working on modin (a distributed pandas-alike). The easiest way to make that work would be if pd.read_sas supported something like `pd.read_sas(path, first_page=foo, last_page=bar)`.
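A sketch of how such a hypothetical `first_page`/`last_page` API could be driven from a scheduler. The `partition_pages` helper and the inclusive-bounds convention are assumptions for illustration; this is not an existing pandas API.

```python
def partition_pages(n_pages: int, n_workers: int) -> list[tuple[int, int]]:
    """Split pages 0..n_pages-1 into contiguous (first_page, last_page)
    ranges, one per worker, with inclusive bounds."""
    base, extra = divmod(n_pages, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        if size == 0:
            continue  # more workers than pages
        ranges.append((start, start + size - 1))
        start += size
    return ranges

print(partition_pages(10, 3))  # [(0, 3), (4, 6), (7, 9)]
```

Each worker would then read only its slice, e.g. the hypothetical `pd.read_sas(path, first_page=a, last_page=b)`, and the scheduler would concatenate the partial frames.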

> I had some code that implemented row and page skipping but it was more complicated to get right than I would have hoped for

IIRC some of the trouble came from the fact that the pandas SASReader/Parser classes are really weird about how they share/update state. The higher-level one passes itself as a param to the lower-level one's constructor, which seems like it must be an antipattern.
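A minimal sketch of the state-sharing pattern being described, with invented class names (this is not the actual pandas internals): the low-level parser holds a reference to the whole high-level reader and mutates its state, which makes either class hard to reason about in isolation.

```python
class LowLevelParser:
    # Antipattern sketch: the parser receives the entire high-level
    # reader and reaches into its mutable state.
    def __init__(self, reader):
        self.reader = reader

    def parse_page(self):
        self.reader.current_page += 1  # mutates shared state
        return self.reader.current_page


class HighLevelReader:
    def __init__(self):
        self.current_page = 0
        self._parser = LowLevelParser(self)  # passes itself down


# A more testable alternative: pass only the values the parser needs
# and return results instead of mutating the caller.
def parse_page(page_index: int) -> int:
    return page_index + 1


reader = HighLevelReader()
assert reader._parser.parse_page() == 1  # side effect on reader
assert parse_page(0) == 1                # pure, no shared state
```

With the pure-function shape, page-skipping becomes a matter of choosing which indices to pass in, rather than rewinding shared mutable state.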

@jonashaag (Contributor, Author)

Yeah, page skipping sounds like a good approach; it should be relatively simple to implement and efficient. The data I'm dealing with is almost always compressed (e.g. gzip or zstd), though, so skipping is slowed down considerably by the inability to skip bytes at the filesystem level.
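A small stdlib demonstration of why skipping doesn't avoid decompression work for gzip streams (illustrative only): gzip has no random access, so a forward `seek` on a `GzipFile` is implemented by decompressing and discarding all bytes before the target offset.

```python
import gzip
import io
import os

raw = os.urandom(1 << 16)  # 64 KiB standing in for "pages" of data
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(raw)

buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    f.seek(32768)   # "skips" 32 KiB, but decompresses it under the hood
    tail = f.read()

assert tail == raw[32768:]
```

So with compressed input, page skipping saves parsing work but not decompression work; formats with block-level indexes (or seekable-zstd-style framing) would be needed to skip I/O as well.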
