👌 IMPROVE: Parsing performance #32

sildar · 2020-08-14T14:56:30Z

Hi!

I was looking at performance and noticed that the charCodeAt function was called a lot with some redundancy.

We very often compare ord() codes, and I think it's justified to store them in an attribute for StateCore, StateBlock and StateInline.
Eg:

class StateCore(StateBase):
    def __init__(self, src: str, md, env, tokens=None):
        self.src = src
        self.ords = [ord(c) for c in src]
        ...

Then we just replace each variant of:
charCodeAt(state.src, pos)

by

state.ords[pos]

Furthermore, StateCore and StateBlock can share a significant part of their ord codes. So we can add an optional parameter to the StateBlock constructor to copy the StateCore ord codes:

class StateBlock(StateBase):
    def __init__(self, src: str, md, env, tokens: List[Token], ords: List[int] = None):
        self.src = src

        if ords is not None:
            self.ords = ords
        else:
            self.ords = [ord(c) for c in src]

Here are some benchmark numbers (100 iterations with benchmark.py):

markdown-it-py (0.4.9): 18.66 s => Original
markdown-it-py (0.4.9): 17.93 s => store ord codes as attributes to remove charCodeAt where possible
markdown-it-py (0.4.9): 16.43 s => share StateCore ord codes with StateBlock

This is a ~10% performance boost.

However, these changes do not strictly copy the behavior of charCodeAt since it bypasses its try/except clause.

def charCodeAt(src: str, pos: int):
    try:
        return ord(src[pos])
    except IndexError:
        return None

Tests are OK but I wonder if this can have an impact on illformed markdown ?
I could create a specific structure to alleviate this issue (a defaultlist that returns None when there is an IndexError, as is done in charCodeAt), but there would be a small downside in code readability, so I'd like to have your input on this.

Let me know what you think!

This commit aims at lowering the amount of redundant calls to charCodeAt to improve performances. Ord codes are computed once and stored in an attribute for StateBlock, StateCore and StateInline. We then check this attribute rather than calling the function. Transfer ord codes whenever possible between StateCore and StateBlocks so we don't recompute them.

welcome · 2020-08-14T14:56:32Z

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out EBP's Code of Conduct and our Contributing Guide, as this will greatly help the review process.

Welcome to the EBP community! 🎉

sildar · 2020-08-14T14:58:22Z

Also, I just realized my pull request included a pre-commit change. The git:// protocol is blocked at my workplace, I think https:// is better suited in general, but let me know if I should make another pull request or just discard it.

chrisjsewell · 2020-08-14T15:27:31Z

Yeh interesting thanks, I'll have a proper look later but two thoughts:

It may be better, in terms of mapping to the original JS code, to change self.ord to self._ord then adding a method:
```
def charCodeAt(self, position):
    return self._ord[position]
```
I think that would still give you the performance increase?
I was thinking at some point to look into adding the performance benchmarking to the CI in some way, maybe via https://github.com/ionelmc/pytest-benchmark and https://github.com/marketplace/actions/continuous-benchmark
Not that it necessarily has to be done here, but food for thought

chrisjsewell · 2020-08-14T15:32:32Z

It looks like from the document build failure, that the try/except is necessary (unless its just masking an underlying issue 😏) so yeh that could be added to the method I mention above?

chrisjsewell · 2020-08-14T15:33:03Z

Also, if you could fix #31 while your at it, that would be great 😆

sildar · 2020-08-14T15:51:33Z

Regarding 1.:

It would be slightly worse since name resolution takes some time in Python but it should be negligible. And it would fix the documentation error since we could easily add a try/except clause.

However, I think the documentation failing is more of an underlying issue:

  File "/home/docs/checkouts/readthedocs.org/user_builds/markdown-it-py/envs/32/lib/python3.7/site-packages/markdown_it/rules_inline/image.py", line 20, in image
    if state.ords[state.pos + 1] != 0x5B:
IndexError: list index out of range

I think since we're looking ahead this should probably be:

if state.pos + 1 < state.posMax and state.ords[state.pos + 1] != 0x5B:

But this might happen elsewhere.

Let me check this, and if we can refrain from using a try/except we should do it since this tends to silence errors that we might want to catch early.

chrisjsewell · 2020-08-14T16:09:01Z

yep and you might also want to look if there has been any "porting errors" compared to https://github.com/markdown-it/markdown-it/blob/master/lib/rules_inline/image.js

chrisjsewell · 2020-08-14T16:10:25Z

it was a bit fiddly in some instances to work out what the python equivalent behaviour of the JS for loops was

… exist Checks that the next character exists before trying to access it.

sildar · 2020-08-14T22:21:54Z

Didn't expect StateCore to be called with None in MyST-NB (parser.py: line 161):

# Note we assume here that these rules never require the actual source text,
# only acting on the existing tokens
state = StateCore(None, md, env, block_tokens)

We could explicitly add an Optional[str] in the StateCore definition if this is allowed (rather than just the str type).

There's an easy fix to this bug:

self.ords = [ord(c) for c in str] if str is not None else []

but it seems a bit odd.

I'll think about it tomorrow :)

sildar · 2020-08-15T08:34:31Z

I added the easy fix I mentioned earlier. This allows us to pass tests without relying on a try/except clause.

On the benchmark with 100 iterations we go from

markdown-it-py (0.4.9): 19.53 s

to

markdown-it-py (0.4.9): 16.62 s

Returning to your suggestion to add a charCodeAt() that would call the _ords attribute in order to be closer to the js implementation, I'm not sure it's the best choice. This would imply:

3 identical state.charCodeAt(pos) methods (one for each StateX) to keep identical (not that big of a deal)
the charCodeAt(str, pos) still exists and is used in the code when str does not come directly from a state.src

I fear it would be harder to keep in mind the differences between both types of charCodeAt than having two clearly different ways of handling those situations (state.ords[pos] for state operations and charCodeAt(str, pos) for str operations).
The ords attribute name is perhaps non-intuititve, we could call it srcOrds instead to emphasize its connection with the src attribute.

Let me know if you feel strongly about using a charCodeAt method rather than an attribute and/or if I should change the attribute name.

chrisjsewell · 2020-08-17T13:42:44Z

Heya, in #33 I have added the benchmarking CI I mentioned above (see these benchmarking graphs 😄) and refactored the unit/benchmark tests.

I've updated your branch here. Hopefully, its clear how to run them; you can just install tox and run:

$ tox -p # for unit tests
$ tox -e py38-bench-core  # for basic benchmark
$ tox -e py38-bench-packages # for comparison to other packages
$ tox -e py38-bench-plugins # for benchmarking plugins

or run them directly with pytest, let me know if you have any issues with this

In the CI tests here you'll also see an artefact gets created with the basic results, which currently shows the mean run going from 0.266 secs -> 0.228 secs (-14.29%) 👌

back to the actual PR:

Let me know if you feel strongly about using a charCodeAt method rather than an attribute and/or if I should change the attribute name.

yeh its not a massive problem, I just want to make sure it's very easy to compare the JS/Python source codes. I think change self.ords to self.srcCharCodes, so that its very obvious that state.src.charCodeAt(pos) is equivalent to state.srcCharCodes[pos].

Also, If you could add a note of this "port divergence" (and why it was done) to https://github.com/executablebooks/markdown-it-py/blob/master/markdown_it/port.yaml ta

….yaml file

welcome · 2020-08-17T18:11:33Z

Congrats on your first merged pull request in this project! 🎉

Thank you for contributing, we are very proud of you! ❤️

chrisjsewell · 2020-08-17T18:21:11Z

thanks!

sildar · 2020-08-17T18:23:20Z

Thank you, I was about to rebase the commits to have 3 entries (improvment/test/bugfix) but I'm glad this was enough, rebases are often painful.

chrisjsewell · 2020-08-17T18:27:15Z

Yeh I didn't want to hold this up any longer 😄

🐛 FIX: Fix a bug where we look at the next character while it may not…

a90945f

… exist Checks that the next character exists before trying to access it.

Account for possible None as src

e7ea3ef

sildar and others added 2 commits August 15, 2020 10:57

🧪 TEST: Added two tests for empty inputs and States without src content

3f7fd44

Merge branch 'master' into master

4be5f8b

sildar added 2 commits August 17, 2020 18:51

Rename ords to srcCharCode

b92d4a5

Add some explanation for the use of srcCharCode attribute in the port…

c2364d9

….yaml file

chrisjsewell changed the title ~~👌 IMPROVE: Performance improvment~~ 👌 IMPROVE: Parsing performance Aug 17, 2020

chrisjsewell merged commit 43f956b into executablebooks:master Aug 17, 2020

chrisjsewell mentioned this pull request Nov 4, 2020

🔧 MAINTAIN: Don't allow None value for StateBase.src #73

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

👌 IMPROVE: Parsing performance #32

👌 IMPROVE: Parsing performance #32

Uh oh!

sildar commented Aug 14, 2020

Uh oh!

welcome bot commented Aug 14, 2020

Uh oh!

sildar commented Aug 14, 2020

Uh oh!

chrisjsewell commented Aug 14, 2020 •

edited

Loading

Uh oh!

chrisjsewell commented Aug 14, 2020

Uh oh!

chrisjsewell commented Aug 14, 2020 •

edited

Loading

Uh oh!

sildar commented Aug 14, 2020

Uh oh!

chrisjsewell commented Aug 14, 2020

Uh oh!

chrisjsewell commented Aug 14, 2020

Uh oh!

sildar commented Aug 14, 2020 •

edited

Loading

Uh oh!

sildar commented Aug 15, 2020 •

edited

Loading

Uh oh!

chrisjsewell commented Aug 17, 2020 •

edited

Loading

Uh oh!

welcome bot commented Aug 17, 2020

Uh oh!

chrisjsewell commented Aug 17, 2020

Uh oh!

sildar commented Aug 17, 2020

Uh oh!

chrisjsewell commented Aug 17, 2020

Uh oh!

Uh oh!

👌 IMPROVE: Parsing performance #32

👌 IMPROVE: Parsing performance #32

Uh oh!

Conversation

sildar commented Aug 14, 2020

Uh oh!

welcome bot commented Aug 14, 2020

Uh oh!

sildar commented Aug 14, 2020

Uh oh!

chrisjsewell commented Aug 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chrisjsewell commented Aug 14, 2020

Uh oh!

chrisjsewell commented Aug 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sildar commented Aug 14, 2020

Uh oh!

chrisjsewell commented Aug 14, 2020

Uh oh!

chrisjsewell commented Aug 14, 2020

Uh oh!

sildar commented Aug 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sildar commented Aug 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chrisjsewell commented Aug 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

welcome bot commented Aug 17, 2020

Uh oh!

chrisjsewell commented Aug 17, 2020

Uh oh!

sildar commented Aug 17, 2020

Uh oh!

chrisjsewell commented Aug 17, 2020

Uh oh!

Uh oh!

chrisjsewell commented Aug 14, 2020 •

edited

Loading

chrisjsewell commented Aug 14, 2020 •

edited

Loading

sildar commented Aug 14, 2020 •

edited

Loading

sildar commented Aug 15, 2020 •

edited

Loading

chrisjsewell commented Aug 17, 2020 •

edited

Loading