Skip to content

Performance issue with long lines #8

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
4 tasks done
llimllib opened this issue Aug 14, 2024 · 23 comments
Closed
4 tasks done

Performance issue with long lines #8

llimllib opened this issue Aug 14, 2024 · 23 comments
Labels
🏁 area/perf This affects performance 💪 phase/solved Post is done

Comments

@llimllib
Copy link

llimllib commented Aug 14, 2024

Initial checklist

Affected packages and versions

at least the latest version

Link to runnable example

https://gist.github.com/llimllib/4b87fd4042359b4812350038562dba03

Steps to reproduce

  1. Run the test.mjs file from the gist above in the root of this repository. Its output on my machine is:
$ node test.mjs
parsing a file with breaks: 32.809ms
parsing a file without breaks: 10.127s

In this test case, the "file with breaks" is 8000 lines of 100 a characters. The "file without breaks" is 800,000 'a' characters without line breaks, so actually a slightly smaller file but with no line breaks.

here's an observable notebook that demonstrates the behavior, you can play around with the regular expression there and see how performance looks

Expected behavior

parsing a file with a long line should not have such severe performance effects.

(In the terms that affect my team, our customers should not be able to easily DoS us by providing markdown files with long lines)

Actual behavior

The findEmail regular expression here takes a very long time to parse a long line.

(I think this is due to using nested quantifiers, but I am not an expert here)

I noticed this when investigating a slowdown in our app, developed a test case, and profiled it. This regular expression jumped out:

image

Affected runtime and version

node 20.16.0

Affected package manager and version

all

Affected OS and version

all

Build and bundle tools

No response

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Aug 14, 2024
@ChristianMurphy ChristianMurphy added the 🏁 area/perf This affects performance label Aug 14, 2024
@llimllib
Copy link
Author

for what it's worth, I tried finding a regular expression that worked better and was not successful; perhaps it is sensible to do something like check the line for an @ before running the regular expression?

That would still leave a long line full of @ as a DDoS vector, but it'd be an improvement at least?

@ChristianMurphy
Copy link
Member

Welcome @llimllib! 👋
Performance improvements are welcome!

as a DDoS vector

Respectfully a major of "DDoS vectors", including this, are a nothingburger.
Longer content takes longer, if you let users put content of unrestricted length in, it could be slow.

@llimllib
Copy link
Author

@ChristianMurphy we render mdx on our servers for our clients; if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody

@llimllib
Copy link
Author

llimllib commented Aug 14, 2024

Limiting the length of + matches to 1024 speeds up the test file, but it also makes tests fail and I don't quote have a handle on why:

// Use {0,1024} instead of `+` to avoid pathological regular expression
// behavior; see
// https://github.com/syntax-tree/mdast-util-gfm-autolink-literal/issues/8
[
  /([-.\w+]{0,1024})@([-\w]{0,1024}(?:\.[-\w]{0,1024}){0,1024})/g,
  findEmail
]

That expression matches emails the same as the current one:

> "[email protected]".match(/([-.\w+]{0,1024})@([-\w]{0,1024}(?:\.[-\w]{0,1024}){0,1024})/g)
Array [ "[email protected]" ]

ah ha! it's matching things it ought not to, and of course the proper replacement for + is {1,1024} not {0,...}, that's what + means

@ChristianMurphy
Copy link
Member

we render mdx on our servers for our clients; if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody

Yes, that is how programming languages work.
Please read the MDX docs https://mdxjs.com/docs/getting-started/#security this has already been addressed.

@llimllib
Copy link
Author

@ChristianMurphy I truly have no idea why you're being mean here

@ChristianMurphy
Copy link
Member

ChristianMurphy commented Aug 14, 2024

MDX is a full programming language, it compiles to JavaScript.
If you are, as you imply, letting your untrusted user run content without any restrictions.
You are letting untrusted users run arbitrary JS on your server, and all the security and performance risks that go along with it. That is your choice, but that isn't a "vulnerability" or a "security issue" of MDX.


To use MDX safely you either:

  1. trust your users
  2. or, sandbox and timeout content

In both circumstances this perf issue you bring up, is a non-issue from a security perspective.


Again, happy to discuss improving parser speed.
I'm mostly bringing this up because you seem to fundamentally misunderstand what is and is not a security concern.

@llimllib

This comment was marked as spam.

llimllib added a commit to llimllib/mdast-util-gfm-autolink-literal that referenced this issue Aug 14, 2024
@llimllib
Copy link
Author

@ChristianMurphy I presented a fix in #9

  • I understand exactly what this is
  • For us it is a security issue, I apologize if that bothers you, I guess? I'm not going to be filing a CVE or anything, just trying to make it so that long lines process in a reasonable period of time so we're not vulnerable to CPU usage attacks

@ChristianMurphy
Copy link
Member

ChristianMurphy commented Aug 14, 2024

This parser matches GFM.
GFM allows emails to include more than 1024 characters, for example this has 1500 before @ and 1500 after:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa@aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.com

yes I know that!

Then you are aware the premise of this issue is a non-issue.

just trying to make it so that long lines process in a reasonable period of time so we're not vulnerable to CPU usage attacks

I sympathize and agree with this.
This can, and should be done.
You can do this by isolating the the parse and render process and giving it a timeout.

@llimllib
Copy link
Author

llimllib commented Aug 14, 2024

The maximum legal email address is 254 characters

An email address must not exceed 254 characters.

This was accepted by the IETF following submitted erratum. A full diagnosis of any given address is available online. The original version of RFC 3696 described 320 as the maximum length, but John Klensin subsequently accepted an incorrect value, since a Path is defined as


You can do this by isolating the the parse and render process and giving it a timeout.

we do that already! but I would like to have the parser not have to fail if my customers include a long line in it

llimllib added a commit to llimllib/mdast-util-gfm-autolink-literal that referenced this issue Aug 14, 2024
@ChristianMurphy
Copy link
Member

valid emails are defined by https://datatracker.ietf.org/doc/html/rfc5322 and https://datatracker.ietf.org/doc/html/rfc6854
These have no such limitation.

https://datatracker.ietf.org/doc/html/rfc3696 are a collection of recommendations for validators but not a specification of a valid email.
From the introduction

This document draws summaries of the
applicable rules together in one place and supplies references to the
actual standards. It does not add anything to those standards; it
merely draws the information together into a form that may be more
accessible.

@llimllib
Copy link
Author

I have closed my PR, you're clearly not interested.

I feel that you behaved very poorly towards me, so I'll just wish you an excellent day - I feel quite terrible after this interaction.

@ChristianMurphy
Copy link
Member

Thank you for your input and contributions.

I understand your concerns about performance, but it's essential that any solution aligns with GFM specifications. This project must remain compliant with those standards, and it's not designed to cater to specific performance needs of individual organizations.

I’m open to discussing performance improvements, but they must adhere to GFM. If you'd like to explore this further within those constraints, I'm happy to collaborate. Otherwise, I respect your decision to step back.

@wooorm
Copy link
Member

wooorm commented Aug 15, 2024

if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody

a) MDX is dangerous itself, if you let people write MDX, you are vulnerable, this is extensively documented
b) the point you are making is indeed something that you should prevent, but it is not something that we should prevent; people can always pass 100mb documents to you, and that is dangerous and you should prevent it; but there are also 100mb documents possible that are (slow but) safe


@llimllib I think you are are being disrespectful. This is an open source project that owes you nothing. You owe us everything. You have never helped us. You have never given us money. You misunderstand what MDX is, what our goals are, and what this problem is. Your PR breaks GFM.

@wooorm
Copy link
Member

wooorm commented Aug 15, 2024

I’ve played with a couple things, and the below code gives the same problem:

const find = /([-.a-z\d+]+)@([-a-z\d]+(?:\.[-a-z\d]+)+)/gi

const hundredA = 'a'.repeat(100)
const fileWithBreaks = `${hundredA}\n`.repeat(800)
const fileWithoutBreaks = `${hundredA}`.repeat(800)

console.time('parsing a file with breaks')
find.lastIndex = 0
find.exec(fileWithBreaks)
console.timeEnd('parsing a file with breaks')

console.time('parsing a file without breaks')
find.lastIndex = 0
find.exec(fileWithoutBreaks)
console.timeEnd('parsing a file without breaks')

So, it has nothing to do with all the projects. This regex with those giant documents has this problem.
It’s also noticeable on Safari btw, so not just V8.
And, changing the regex: changing the other +s doesn‘t do anything. The slowing down happens because the first +.

Also, important to note, we’re talking about 80kb of data here.

@wooorm
Copy link
Member

wooorm commented Aug 15, 2024

Found the limit! copy('a'.repeat(32758) + '@b.c') does not link on GH, copy('a'.repeat(32757) + '@b.c') does.
Weird how it’s close but different to 2**15 (32768)?
Regardless, using {1,32757} instead of + just makes things slower 🤷‍♂️

@wooorm
Copy link
Member

wooorm commented Aug 15, 2024

We can make the regex incredibly fast by adding (?<=^|\p{P}|\p{S}|\s) (and /u) to the regex.

parsing a file with breaks: 0.951ms
parsing a file without breaks: 0.593ms

Without it, it was:

parsing a file with breaks: 12.837ms
parsing a file without breaks: 8.809s

So that’s a giant difference.

The actual disallowed previous is a bit more complex:

(match.index === 0 ||
unicodeWhitespace(code) ||
unicodePunctuation(code)) &&
(!email || code !== 47)

Particularly around the /? Not sure how to best create this regex: nothing, all whitespace, all punctuation, all symbols, not /?

@wooorm wooorm closed this as completed in 2402c21 Aug 19, 2024

This comment has been minimized.

@wooorm wooorm added the 💪 phase/solved Post is done label Aug 19, 2024
@github-actions github-actions bot removed the 🤞 phase/open Post is being triaged manually label Aug 19, 2024
@wooorm
Copy link
Member

wooorm commented Aug 19, 2024

@erunion @bnikom @ilias-t @kellyjosephprice @rafegoldberg @llimllib I will block you if you don’t change your childish ways. @llimllib proposed a bad change. I came up with a good change. And then you act unprofessional? This reflects poorly on your company, its culture.

@kellyjosephprice
Copy link

I'm so sorry that this thread has gone astray. @llimllib is only trying to help. Is there some place in the code of conduct or other supporting documentation that would describe how @llimllib or others have acted inappropriately?

@wooorm
Copy link
Member

wooorm commented Aug 19, 2024

I’d recommend not getting hung up on particular solutions. Users know their problems well. Maintainers have a better idea on solutions. I believe Bill ran into an XY problem. Some places on the internet mentioned 254, 1024 was found somewhere, but those are not numbers GFM uses. Regexes can be improved in other ways. I’d also recommend a thicker skin on the internet, particularly dry technical places like GH. Christian is correct in saying that if you allow attackers to write in a programming language things will get slow and it will cost you. It’s not mean to point that out.

I don’t worry too much about that though. Maybe Christian feels different.

What I found annoying is that despite that, I researched and presented a solution, a good regex, which was ignored. The URL was shared internally, everyone upvoted their friend, downvoted our feedback. That makes me feel like my work is not appreciated.

As for community info, see contributing.md, and code-of-conduct.md.

@ChristianMurphy
Copy link
Member

ChristianMurphy commented Aug 19, 2024

The recent conversation crossed a professional line.

  1. Mischaracterizing the Issue: @llimllib, by persistently framing a performance issue as a security vulnerability, despite multiple clarifications, you diverted the discussion from productive problem-solving. This behavior didn’t align with the code of conduct’s emphasis on focusing on what is best for the community and gracefully accepting constructive criticism.

  2. Argumentative Behavior: @llimllib, your approach in the discussion was not in line with the collaborative and respectful environment we strive to maintain. By dismissing the maintainers' technical explanations and continuing to argue a point that had already been addressed, you created unnecessary conflict and hindered the problem-solving process. This behavior goes against our code of conduct’s standard of maintaining a constructive and respectful dialogue.

  3. Escalation: Closing the PR and expressing that you felt mistreated without further discussion was counterproductive and wasted time that could have been better spent resolving the issue. This action goes against the spirit of our contributing guide, which emphasizes collaboration and open discussion before making decisions.

@kellyjosephprice, while I understand the desire to support a coworker, it’s essential to encourage constructive behavior, especially in challenging situations. Backing unproductive actions reflects poorly on the README team and undermines the professionalism we strive to maintain.

This project serves millions of adopters, and both @wooorm and I have been generous with our time and energy to ensure it runs smoothly. @llimllib, your behavior in this situation was unprofessional and disruptive, which is not acceptable under our code of conduct.

Additionally, I want to express broader frustration with README's approach. Relying heavily on Unified, MDX, and Remark to build an entire product without contributing back, either through code or financially, and then participating in issues like this with a lack of good faith shows a disregard for the principles of open source. It’s frustrating and disappointing to see adopters, whose livelihoods depend on these tools, engage with us in such a way.

Let’s all aim to maintain a respectful and constructive tone in future interactions, ensuring that we adhere to both the contributing guide and the code of conduct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏁 area/perf This affects performance 💪 phase/solved Post is done
Development

Successfully merging a pull request may close this issue.

4 participants