Skip to content

SAS7BDAT parser: Faster string parsing #47404

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 10, 2022
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions pandas/io/sas/sas.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -426,8 +426,10 @@ cdef class Parser:
jb += 1
elif column_types[j] == column_type_string:
# string
string_chunk[js, current_row] = np.array(source[start:(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the perf bottleneck in the np.array call?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like both the .tobytes and the .rstrip are non-optimized. could cython optimize it if the bytes object were declared as such?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check. IIRC tobytes will never be optimized and it’s very slow. The call to np.array is completely useless so I suggest to remove that in any case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some benchmarks on 2 files, times in ms for min duration of 10 reads of each file:

  • Baseline: 7.6 / 15
  • Remove redundant np.array(): 6.6 / 12.7 (0.87x / 0.85x)
  • Additionally decare the slice as cdef bytes: 6.5 / 12.5
  • Additionally use rfind plus manually slicing instead of rstrip: 6.4 / 11.5
  • My solution: 6.2 / 10.6 (0.82x / 0.71x)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason for slowness is that calls to bytes.xyz() are not optimized by Cython.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for looking into this. is it worth opening an issue in cython about optimizing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start + lngt)]).tobytes().rstrip(b"\x00 ")
# Skip trailing whitespace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some comments here (e.g. this is like ..... but slower so we are doing xyz). also do we have asv's for this case? and can you add a whatsnew note.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a what's new for all 3 PRs here, that ok?

while lngt > 0 and source[start+lngt-1] in b"\x00 ":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source[start+lngt-1] is a uint8_t isnt it? am i wrong to think this looks weird?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read this as: check if the last byte is a null byte or white space.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah i get what it means, i just suspect there's an implicit casting going on in here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of casting? Actually I think it doesn’t matter because the both characters are on the positive side of char

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cython will compile this to a switch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        switch ((*((__pyx_t_6pandas_2io_3sas_4_sas_uint8_t const  *) ( /* dim=0 */ (__pyx_v_source.data + __pyx_t_17 * __pyx_v_source.strides[0]) )))) {
          case '\x00':
          case ' ':
          __pyx_t_8 = 1;
          break;
          default:
          __pyx_t_8 = 0;
          break;
        }

with

typedef unsigned char __pyx_t_6pandas_2io_3sas_4_sas_uint8_t

lngt -= 1
string_chunk[js, current_row] = (&source[start])[:lngt]
js += 1

self.current_row_on_page_index += 1
Expand Down