PERF: read_csv macro updates #52632
Thanks for the shout-out @WillAyd. Back in the day, I was part of a team writing performance-critical code in C, and we used to do optimizations like this all the time.
If anyone cares for the details, here is the verbose annotated assembly output. Before the change:
.L246:
# pandas/_libs/src/parser/tokenizer.c:929: } else if (IS_DELIMITER(c)) {
.loc 3 929 28 is_stmt 1
movq -136(%rbp), %rax # self, tmp683
movl 188(%rax), %eax # self_452(D)->delim_whitespace, _180
# pandas/_libs/src/parser/tokenizer.c:929: } else if (IS_DELIMITER(c)) {
.loc 3 929 27
testl %eax, %eax # _180
jne .L247 #,
# pandas/_libs/src/parser/tokenizer.c:929: } else if (IS_DELIMITER(c)) {
.loc 3 929 28 discriminator 1
movq -136(%rbp), %rax # self, tmp684
movzbl 184(%rax), %eax # self_452(D)->delimiter, _181
cmpb %al, -49(%rbp) # _181, c
je .L248 #,
.L247:
# pandas/_libs/src/parser/tokenizer.c:929: } else if (IS_DELIMITER(c)) {
.loc 3 929 28 is_stmt 0 discriminator 3
movq -136(%rbp), %rax # self, tmp685
movl 188(%rax), %eax # self_452(D)->delim_whitespace, _182
testl %eax, %eax # _182
je .L249 #,
# pandas/_libs/src/parser/tokenizer.c:929: } else if (IS_DELIMITER(c)) {
.loc 3 929 28 discriminator 4
call __ctype_b_loc@PLT #
movq (%rax), %rdx # *_183, _184
movsbq -49(%rbp), %rax # c, _185
addq %rax, %rax # _186
addq %rdx, %rax # _184, _187
movzwl (%rax), %eax # *_187, _188
movzwl %ax, %eax # _188, _189
andl $1, %eax #, _190
testl %eax, %eax # _190
je .L249 #,
.L248:
# pandas/_libs/src/parser/tokenizer.c:930: if (self->delim_whitespace) {
and after:
.L243:
# pandas/_libs/src/parser/tokenizer.c:930: } else if (IS_DELIMITER(c)) {
.loc 3 930 27 is_stmt 1
movzbl -57(%rbp), %eax # c, tmp657
cmpb -41(%rbp), %al # delimiter, tmp657
je .L244 #,
# pandas/_libs/src/parser/tokenizer.c:930: } else if (IS_DELIMITER(c)) {
.loc 3 930 28 discriminator 1
cmpl $0, -40(%rbp) #, delim_whitespace
je .L245 #,
# pandas/_libs/src/parser/tokenizer.c:930: } else if (IS_DELIMITER(c)) {
.loc 3 930 28 is_stmt 0 discriminator 2
call __ctype_b_loc@PLT #
movq (%rax), %rdx # *_171, _172
movsbq -57(%rbp), %rax # c, _173
addq %rax, %rax # _174
addq %rdx, %rax # _172, _175
movzwl (%rax), %eax # *_175, _176
movzwl %ax, %eax # _176, _177
andl $1, %eax #, _178
testl %eax, %eax # _178
je .L245 #,
.L244:
# pandas/_libs/src/parser/tokenizer.c:931: if (self->delim_whitespace) {
Very low level, but this is executed for potentially every character in a file.
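For readers who prefer C to assembly, here is a minimal sketch of the pattern the two listings suggest. It is inferred from the assembly above, not copied from `tokenizer.c`: the struct `parser_sketch_t` and both function names are hypothetical, and only the field names `delimiter` and `delim_whitespace` come from the annotated output. The first function re-reads both fields through the pointer for every character; the second copies them into locals once before the loop.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser state; NOT the real pandas struct. */
typedef struct {
    char delimiter;
    int delim_whitespace;
} parser_sketch_t;

/* "Before" shape: each delimiter test loads the fields through `self`,
 * mirroring the repeated loads of self->delim_whitespace and
 * self->delimiter in the first listing. */
static size_t count_delims_via_struct(const parser_sketch_t *self,
                                      const char *buf, size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        char c = buf[i];
        if ((!self->delim_whitespace && c == self->delimiter) ||
            (self->delim_whitespace && isblank((unsigned char)c))) {
            n++;
        }
    }
    return n;
}

/* "After" shape: the fields are copied into locals once, so the per-character
 * test is a compare against a local, then an optional isblank() call.
 * Note the condition order follows the second listing, so the two functions
 * agree only when, in whitespace mode, `delimiter` is itself a blank. */
static size_t count_delims_via_locals(const parser_sketch_t *self,
                                      const char *buf, size_t len) {
    const char delimiter = self->delimiter;
    const int delim_whitespace = self->delim_whitespace;
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        char c = buf[i];
        if (c == delimiter ||
            (delim_whitespace && isblank((unsigned char)c))) {
            n++;
        }
    }
    return n;
}
```

Whether an optimizing build shows the same gap as these listings is not something the sketch demonstrates; the stack-relative loads above look like unoptimized output, so treat this as an illustration of the idea only.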
Can you add a whatsnew?
doc/source/whatsnew/v2.1.0.rst
Outdated
@@ -88,7 +88,7 @@ Other enhancements
- :meth:`arrays.SparseArray.map` now supports ``na_action`` (:issue:`52096`).
- Add dtype of categories to ``repr`` information of :class:`CategoricalDtype` (:issue:`52179`)
- Adding ``engine_kwargs`` parameter to :meth:`DataFrame.read_excel` (:issue:`52214`)
-
- Performance improvement in :func:`read_csv` (:issue:`52632`)
It's probably worth clarifying that this is for the C engine only.
Thanks @WillAyd
This gives us a few percentage points back in read_csv performance, especially with larger files. Credit to @Dr-Irv for the idea.