Skip to content

👌 IMPROVE: Parsing performance #32

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Aug 17, 2020
Merged

👌 IMPROVE: Parsing performance #32

merged 7 commits into from
Aug 17, 2020

Conversation

sildar
Copy link
Contributor

@sildar sildar commented Aug 14, 2020

Hi!

I was looking at performance and noticed that the charCodeAt function was called a lot with some redundancy.

We very often compare ord() codes, and I think it's justified to store them in an attribute for StateCore, StateBlock and StateInline.
Eg:

class StateCore(StateBase):
    def __init__(self, src: str, md, env, tokens=None):
        self.src = src
        self.ords = [ord(c) for c in src]
        ...

Then we just replace each variant of:
charCodeAt(state.src, pos)

by

state.ords[pos]

Furthermore, StateCore and StateBlock can share a significant part of their ord codes. So we can add an optional parameter to the StateBlock constructor to copy the StateCore ord codes:

class StateBlock(StateBase):
    def __init__(self, src: str, md, env, tokens: List[Token], ords: List[int] = None):
        self.src = src

        if ords is not None:
            self.ords = ords
        else:
            self.ords = [ord(c) for c in src]

Here are some benchmark numbers (100 iterations with benchmark.py):

markdown-it-py (0.4.9): 18.66 s => Original
markdown-it-py (0.4.9): 17.93 s => store ord codes as attributes to remove charCodeAt where possible
markdown-it-py (0.4.9): 16.43 s => share StateCore ord codes with StateBlock

This is a ~10% performance boost.

However, these changes do not strictly copy the behavior of charCodeAt since it bypasses its try/except clause.

def charCodeAt(src: str, pos: int):
    try:
        return ord(src[pos])
    except IndexError:
        return None

Tests are OK but I wonder if this can have an impact on illformed markdown ?
I could create a specific structure to alleviate this issue (a defaultlist that returns None when there is an IndexError, as is done in charCodeAt), but there would be a small downside in code readability, so I'd like to have your input on this.

Let me know what you think!

This commit aims at lowering the amount of redundant calls to charCodeAt
to improve performances.

Ord codes are computed once and stored in an attribute for StateBlock, StateCore and StateInline.
We then check this attribute rather than calling the function.

Transfer ord codes whenever possible between StateCore and StateBlocks
so we don't recompute them.
@welcome
Copy link

welcome bot commented Aug 14, 2020

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out EBP's Code of Conduct and our Contributing Guide, as this will greatly help the review process.

Welcome to the EBP community! 🎉

@sildar
Copy link
Contributor Author

sildar commented Aug 14, 2020

Also, I just realized my pull request included a pre-commit change. The git:// protocol is blocked at my workplace, I think https:// is better suited in general, but let me know if I should make another pull request or just discard it.

@chrisjsewell
Copy link
Member

chrisjsewell commented Aug 14, 2020

Yeh interesting thanks, I'll have a proper look later but two thoughts:

  1. It may be better, in terms of mapping to the original JS code, to change self.ord to self._ord then adding a method:

    def charCodeAt(self, position):
        return self._ord[position]

    I think that would still give you the performance increase?

  2. I was thinking at some point to look into adding the performance benchmarking to the CI in some way, maybe via https://github.com/ionelmc/pytest-benchmark and https://github.com/marketplace/actions/continuous-benchmark
    Not that it necessarily has to be done here, but food for thought

@chrisjsewell
Copy link
Member

It looks like from the document build failure, that the try/except is necessary (unless its just masking an underlying issue 😏) so yeh that could be added to the method I mention above?

@chrisjsewell
Copy link
Member

chrisjsewell commented Aug 14, 2020

Also, if you could fix #31 while your at it, that would be great 😆

@sildar
Copy link
Contributor Author

sildar commented Aug 14, 2020

Regarding 1.:

It would be slightly worse since name resolution takes some time in Python but it should be negligible. And it would fix the documentation error since we could easily add a try/except clause.

However, I think the documentation failing is more of an underlying issue:

  File "/home/docs/checkouts/readthedocs.org/user_builds/markdown-it-py/envs/32/lib/python3.7/site-packages/markdown_it/rules_inline/image.py", line 20, in image
    if state.ords[state.pos + 1] != 0x5B:
IndexError: list index out of range

I think since we're looking ahead this should probably be:

if state.pos + 1 < state.posMax and state.ords[state.pos + 1] != 0x5B:

But this might happen elsewhere.

Let me check this, and if we can refrain from using a try/except we should do it since this tends to silence errors that we might want to catch early.

@chrisjsewell
Copy link
Member

yep and you might also want to look if there has been any "porting errors" compared to https://github.com/markdown-it/markdown-it/blob/master/lib/rules_inline/image.js

@chrisjsewell
Copy link
Member

it was a bit fiddly in some instances to work out what the python equivalent behaviour of the JS for loops was

… exist

Checks that the next character exists before trying to access it.
@sildar
Copy link
Contributor Author

sildar commented Aug 14, 2020

Didn't expect StateCore to be called with None in MyST-NB (parser.py: line 161):

# Note we assume here that these rules never require the actual source text,
# only acting on the existing tokens
state = StateCore(None, md, env, block_tokens)

We could explicitly add an Optional[str] in the StateCore definition if this is allowed (rather than just the str type).

There's an easy fix to this bug:

self.ords = [ord(c) for c in str] if str is not None else []

but it seems a bit odd.

I'll think about it tomorrow :)

@sildar
Copy link
Contributor Author

sildar commented Aug 15, 2020

I added the easy fix I mentioned earlier. This allows us to pass tests without relying on a try/except clause.

On the benchmark with 100 iterations we go from

markdown-it-py (0.4.9): 19.53 s

to

markdown-it-py (0.4.9): 16.62 s

Returning to your suggestion to add a charCodeAt() that would call the _ords attribute in order to be closer to the js implementation, I'm not sure it's the best choice. This would imply:

  • 3 identical state.charCodeAt(pos) methods (one for each StateX) to keep identical (not that big of a deal)
  • the charCodeAt(str, pos) still exists and is used in the code when str does not come directly from a state.src

I fear it would be harder to keep in mind the differences between both types of charCodeAt than having two clearly different ways of handling those situations (state.ords[pos] for state operations and charCodeAt(str, pos) for str operations).
The ords attribute name is perhaps non-intuititve, we could call it srcOrds instead to emphasize its connection with the src attribute.

Let me know if you feel strongly about using a charCodeAt method rather than an attribute and/or if I should change the attribute name.

@chrisjsewell
Copy link
Member

chrisjsewell commented Aug 17, 2020

Heya, in #33 I have added the benchmarking CI I mentioned above (see these benchmarking graphs 😄) and refactored the unit/benchmark tests.

I've updated your branch here. Hopefully, its clear how to run them; you can just install tox and run:

$ tox -p # for unit tests
$ tox -e py38-bench-core  # for basic benchmark
$ tox -e py38-bench-packages # for comparison to other packages
$ tox -e py38-bench-plugins # for benchmarking plugins

or run them directly with pytest, let me know if you have any issues with this

In the CI tests here you'll also see an artefact gets created with the basic results, which currently shows the mean run going from 0.266 secs -> 0.228 secs (-14.29%) 👌

back to the actual PR:

Let me know if you feel strongly about using a charCodeAt method rather than an attribute and/or if I should change the attribute name.

yeh its not a massive problem, I just want to make sure it's very easy to compare the JS/Python source codes. I think change self.ords to self.srcCharCodes, so that its very obvious that state.src.charCodeAt(pos) is equivalent to state.srcCharCodes[pos].

Also, If you could add a note of this "port divergence" (and why it was done) to https://github.com/executablebooks/markdown-it-py/blob/master/markdown_it/port.yaml ta

@chrisjsewell chrisjsewell changed the title 👌 IMPROVE: Performance improvment 👌 IMPROVE: Parsing performance Aug 17, 2020
@chrisjsewell chrisjsewell merged commit 43f956b into executablebooks:master Aug 17, 2020
@welcome
Copy link

welcome bot commented Aug 17, 2020

Congrats on your first merged pull request in this project! 🎉
congrats

Thank you for contributing, we are very proud of you! ❤️

@chrisjsewell
Copy link
Member

thanks!

@sildar
Copy link
Contributor Author

sildar commented Aug 17, 2020

Thank you, I was about to rebase the commits to have 3 entries (improvment/test/bugfix) but I'm glad this was enough, rebases are often painful.

@chrisjsewell
Copy link
Member

Yeh I didn't want to hold this up any longer 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants