-
Notifications
You must be signed in to change notification settings - Fork 74
👌 IMPROVE: Parsing performance #32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This commit aims at lowering the amount of redundant calls to charCodeAt to improve performances. Ord codes are computed once and stored in an attribute for StateBlock, StateCore and StateInline. We then check this attribute rather than calling the function. Transfer ord codes whenever possible between StateCore and StateBlocks so we don't recompute them.
Thanks for submitting your first pull request! You are awesome! 🤗 |
Also, I just realized my pull request included a pre-commit change. The git:// protocol is blocked at my workplace, I think https:// is better suited in general, but let me know if I should make another pull request or just discard it. |
Yeh interesting thanks, I'll have a proper look later but two thoughts:
|
It looks like from the document build failure, that the try/except is necessary (unless its just masking an underlying issue 😏) so yeh that could be added to the method I mention above? |
Also, if you could fix #31 while your at it, that would be great 😆 |
Regarding 1.: It would be slightly worse since name resolution takes some time in Python but it should be negligible. And it would fix the documentation error since we could easily add a try/except clause. However, I think the documentation failing is more of an underlying issue:
I think since we're looking ahead this should probably be:
But this might happen elsewhere. Let me check this, and if we can refrain from using a try/except we should do it since this tends to silence errors that we might want to catch early. |
yep and you might also want to look if there has been any "porting errors" compared to https://github.com/markdown-it/markdown-it/blob/master/lib/rules_inline/image.js |
it was a bit fiddly in some instances to work out what the python equivalent behaviour of the JS |
… exist Checks that the next character exists before trying to access it.
Didn't expect StateCore to be called with None in MyST-NB (parser.py: line 161):
We could explicitly add an Optional[str] in the StateCore definition if this is allowed (rather than just the str type). There's an easy fix to this bug:
but it seems a bit odd. I'll think about it tomorrow :) |
I added the easy fix I mentioned earlier. This allows us to pass tests without relying on a try/except clause. On the benchmark with 100 iterations we go from
to
Returning to your suggestion to add a charCodeAt() that would call the _ords attribute in order to be closer to the js implementation, I'm not sure it's the best choice. This would imply:
I fear it would be harder to keep in mind the differences between both types of charCodeAt than having two clearly different ways of handling those situations (state.ords[pos] for state operations and charCodeAt(str, pos) for str operations). Let me know if you feel strongly about using a charCodeAt method rather than an attribute and/or if I should change the attribute name. |
Heya, in #33 I have added the benchmarking CI I mentioned above (see these benchmarking graphs 😄) and refactored the unit/benchmark tests. I've updated your branch here. Hopefully, its clear how to run them; you can just install tox and run:
or run them directly with pytest, let me know if you have any issues with this In the CI tests here you'll also see an artefact gets created with the basic results, which currently shows the mean run going from 0.266 secs -> 0.228 secs (-14.29%) 👌 back to the actual PR:
yeh its not a massive problem, I just want to make sure it's very easy to compare the JS/Python source codes. I think change Also, If you could add a note of this "port divergence" (and why it was done) to https://github.com/executablebooks/markdown-it-py/blob/master/markdown_it/port.yaml ta |
thanks! |
Thank you, I was about to rebase the commits to have 3 entries (improvment/test/bugfix) but I'm glad this was enough, rebases are often painful. |
Yeh I didn't want to hold this up any longer 😄 |
Hi!
I was looking at performance and noticed that the charCodeAt function was called a lot with some redundancy.
We very often compare ord() codes, and I think it's justified to store them in an attribute for StateCore, StateBlock and StateInline.
Eg:
Then we just replace each variant of:
charCodeAt(state.src, pos)
by
state.ords[pos]
Furthermore, StateCore and StateBlock can share a significant part of their ord codes. So we can add an optional parameter to the StateBlock constructor to copy the StateCore ord codes:
Here are some benchmark numbers (100 iterations with benchmark.py):
This is a ~10% performance boost.
However, these changes do not strictly copy the behavior of charCodeAt since it bypasses its try/except clause.
Tests are OK but I wonder if this can have an impact on illformed markdown ?
I could create a specific structure to alleviate this issue (a defaultlist that returns None when there is an IndexError, as is done in charCodeAt), but there would be a small downside in code readability, so I'd like to have your input on this.
Let me know what you think!