Performance issue with long lines #8

llimllib · 2024-08-14T15:38:19Z

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

at least the latest version

Link to runnable example

https://gist.github.com/llimllib/4b87fd4042359b4812350038562dba03

Steps to reproduce

Run the test.mjs file from the gist above in the root of this repository. Its output on my machine is:

$ node test.mjs
parsing a file with breaks: 32.809ms
parsing a file without breaks: 10.127s

In this test case, the "file with breaks" is 8000 lines of 100 a characters. The "file without breaks" is 800,000 'a' characters without line breaks, so actually a slightly smaller file but with no line breaks.

here's an observable notebook that demonstrates the behavior, you can play around with the regular expression there and see how performance looks

Expected behavior

parsing a file with a long line should not have such severe performance effects.

(In the terms that affect my team, our customers should not be able to easily DoS us by providing markdown files with long lines)

Actual behavior

The findEmail regular expression here takes a very long time to parse a long line.

(I think this is due to using nested quantifiers, but I am not an expert here)

I noticed this when investigating a slowdown in our app, developed a test case, and profiled it. This regular expression jumped out:

Affected runtime and version

node 20.16.0

Affected package manager and version

all

Affected OS and version

all

Build and bundle tools

No response

The text was updated successfully, but these errors were encountered:

llimllib · 2024-08-14T19:56:59Z

for what it's worth, I tried finding a regular expression that worked better and was not successful; perhaps it is sensible to do something like check the line for an @ before running the regular expression?

That would still leave a long line full of @ as a DDoS vector, but it'd be an improvement at least?

ChristianMurphy · 2024-08-14T20:15:36Z

Welcome @llimllib! 👋
Performance improvements are welcome!

as a DDoS vector

Respectfully a major of "DDoS vectors", including this, are a nothingburger.
Longer content takes longer, if you let users put content of unrestricted length in, it could be slow.

llimllib · 2024-08-14T20:16:56Z

@ChristianMurphy we render mdx on our servers for our clients; if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody

llimllib · 2024-08-14T20:18:32Z

Limiting the length of + matches to 1024 speeds up the test file, but it also makes tests fail and I don't quote have a handle on why:

// Use {0,1024} instead of `+` to avoid pathological regular expression
// behavior; see
// https://github.com/syntax-tree/mdast-util-gfm-autolink-literal/issues/8
[
  /([-.\w+]{0,1024})@([-\w]{0,1024}(?:\.[-\w]{0,1024}){0,1024})/g,
  findEmail
]

That expression matches emails the same as the current one:

> "[email protected]".match(/([-.\w+]{0,1024})@([-\w]{0,1024}(?:\.[-\w]{0,1024}){0,1024})/g)
Array [ "[email protected]" ]

ah ha! it's matching things it ought not to, and of course the proper replacement for + is {1,1024} not {0,...}, that's what + means

ChristianMurphy · 2024-08-14T20:19:24Z

we render mdx on our servers for our clients; if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody

Yes, that is how programming languages work.
Please read the MDX docs https://mdxjs.com/docs/getting-started/#security this has already been addressed.

llimllib · 2024-08-14T20:19:55Z

@ChristianMurphy I truly have no idea why you're being mean here

ChristianMurphy · 2024-08-14T20:26:06Z

MDX is a full programming language, it compiles to JavaScript.
If you are, as you imply, letting your untrusted user run content without any restrictions.
You are letting untrusted users run arbitrary JS on your server, and all the security and performance risks that go along with it. That is your choice, but that isn't a "vulnerability" or a "security issue" of MDX.

To use MDX safely you either:

trust your users
or, sandbox and timeout content

In both circumstances this perf issue you bring up, is a non-issue from a security perspective.

Again, happy to discuss improving parser speed.
I'm mostly bringing this up because you seem to fundamentally misunderstand what is and is not a security concern.

closes syntax-tree#8

llimllib · 2024-08-14T20:37:56Z

@ChristianMurphy I presented a fix in #9

I understand exactly what this is
For us it is a security issue, I apologize if that bothers you, I guess? I'm not going to be filing a CVE or anything, just trying to make it so that long lines process in a reasonable period of time so we're not vulnerable to CPU usage attacks

ChristianMurphy · 2024-08-14T20:43:01Z

This parser matches GFM.
GFM allows emails to include more than 1024 characters, for example this has 1500 before @ and 1500 after:

yes I know that!

Then you are aware the premise of this issue is a non-issue.

just trying to make it so that long lines process in a reasonable period of time so we're not vulnerable to CPU usage attacks

I sympathize and agree with this.
This can, and should be done.
You can do this by isolating the the parse and render process and giving it a timeout.

llimllib · 2024-08-14T20:45:16Z

The maximum legal email address is 254 characters

An email address must not exceed 254 characters.

This was accepted by the IETF following submitted erratum. A full diagnosis of any given address is available online. The original version of RFC 3696 described 320 as the maximum length, but John Klensin subsequently accepted an incorrect value, since a Path is defined as

You can do this by isolating the the parse and render process and giving it a timeout.

we do that already! but I would like to have the parser not have to fail if my customers include a long line in it

closes syntax-tree#8

ChristianMurphy · 2024-08-14T20:53:29Z

valid emails are defined by https://datatracker.ietf.org/doc/html/rfc5322 and https://datatracker.ietf.org/doc/html/rfc6854
These have no such limitation.

https://datatracker.ietf.org/doc/html/rfc3696 are a collection of recommendations for validators but not a specification of a valid email.
From the introduction

This document draws summaries of the
applicable rules together in one place and supplies references to the
actual standards. It does not add anything to those standards; it
merely draws the information together into a form that may be more
accessible.

llimllib · 2024-08-14T20:58:45Z

I have closed my PR, you're clearly not interested.

I feel that you behaved very poorly towards me, so I'll just wish you an excellent day - I feel quite terrible after this interaction.

ChristianMurphy · 2024-08-14T21:10:49Z

Thank you for your input and contributions.

I understand your concerns about performance, but it's essential that any solution aligns with GFM specifications. This project must remain compliant with those standards, and it's not designed to cater to specific performance needs of individual organizations.

I’m open to discussing performance improvements, but they must adhere to GFM. If you'd like to explore this further within those constraints, I'm happy to collaborate. Otherwise, I respect your decision to step back.

wooorm · 2024-08-15T12:04:38Z

if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody

a) MDX is dangerous itself, if you let people write MDX, you are vulnerable, this is extensively documented
b) the point you are making is indeed something that you should prevent, but it is not something that we should prevent; people can always pass 100mb documents to you, and that is dangerous and you should prevent it; but there are also 100mb documents possible that are (slow but) safe

@llimllib I think you are are being disrespectful. This is an open source project that owes you nothing. You owe us everything. You have never helped us. You have never given us money. You misunderstand what MDX is, what our goals are, and what this problem is. Your PR breaks GFM.

wooorm · 2024-08-15T13:38:57Z

I’ve played with a couple things, and the below code gives the same problem:

const find = /([-.a-z\d+]+)@([-a-z\d]+(?:\.[-a-z\d]+)+)/gi

const hundredA = 'a'.repeat(100)
const fileWithBreaks = `${hundredA}\n`.repeat(800)
const fileWithoutBreaks = `${hundredA}`.repeat(800)

console.time('parsing a file with breaks')
find.lastIndex = 0
find.exec(fileWithBreaks)
console.timeEnd('parsing a file with breaks')

console.time('parsing a file without breaks')
find.lastIndex = 0
find.exec(fileWithoutBreaks)
console.timeEnd('parsing a file without breaks')

So, it has nothing to do with all the projects. This regex with those giant documents has this problem.
It’s also noticeable on Safari btw, so not just V8.
And, changing the regex: changing the other +s doesn‘t do anything. The slowing down happens because the first +.

Also, important to note, we’re talking about 80kb of data here.

wooorm · 2024-08-15T13:47:59Z

Found the limit! copy('a'.repeat(32758) + '@b.c') does not link on GH, copy('a'.repeat(32757) + '@b.c') does.
Weird how it’s close but different to 2**15 (32768)?
Regardless, using {1,32757} instead of + just makes things slower 🤷‍♂️

wooorm · 2024-08-15T15:27:31Z

We can make the regex incredibly fast by adding (?<=^|\p{P}|\p{S}|\s) (and /u) to the regex.

parsing a file with breaks: 0.951ms
parsing a file without breaks: 0.593ms

Without it, it was:

parsing a file with breaks: 12.837ms
parsing a file without breaks: 8.809s

So that’s a giant difference.

The actual disallowed previous is a bit more complex:

mdast-util-gfm-autolink-literal/lib/index.js

Lines 274 to 277 in c990a62

    
           (match.index === 0 || 
        
             unicodeWhitespace(code) || 
        
             unicodePunctuation(code)) && 
        
           (!email || code !== 47)

Particularly around the /? Not sure how to best create this regex: nothing, all whitespace, all punctuation, all symbols, not /?

wooorm · 2024-08-19T09:43:14Z

@erunion @bnikom @ilias-t @kellyjosephprice @rafegoldberg @llimllib I will block you if you don’t change your childish ways. @llimllib proposed a bad change. I came up with a good change. And then you act unprofessional? This reflects poorly on your company, its culture.

kellyjosephprice · 2024-08-19T10:16:14Z

I'm so sorry that this thread has gone astray. @llimllib is only trying to help. Is there some place in the code of conduct or other supporting documentation that would describe how @llimllib or others have acted inappropriately?

wooorm · 2024-08-19T10:41:17Z

I’d recommend not getting hung up on particular solutions. Users know their problems well. Maintainers have a better idea on solutions. I believe Bill ran into an XY problem. Some places on the internet mentioned 254, 1024 was found somewhere, but those are not numbers GFM uses. Regexes can be improved in other ways. I’d also recommend a thicker skin on the internet, particularly dry technical places like GH. Christian is correct in saying that if you allow attackers to write in a programming language things will get slow and it will cost you. It’s not mean to point that out.

I don’t worry too much about that though. Maybe Christian feels different.

What I found annoying is that despite that, I researched and presented a solution, a good regex, which was ignored. The URL was shared internally, everyone upvoted their friend, downvoted our feedback. That makes me feel like my work is not appreciated.

As for community info, see contributing.md, and code-of-conduct.md.

ChristianMurphy · 2024-08-19T14:59:48Z

The recent conversation crossed a professional line.

Mischaracterizing the Issue: @llimllib, by persistently framing a performance issue as a security vulnerability, despite multiple clarifications, you diverted the discussion from productive problem-solving. This behavior didn’t align with the code of conduct’s emphasis on focusing on what is best for the community and gracefully accepting constructive criticism.
Argumentative Behavior: @llimllib, your approach in the discussion was not in line with the collaborative and respectful environment we strive to maintain. By dismissing the maintainers' technical explanations and continuing to argue a point that had already been addressed, you created unnecessary conflict and hindered the problem-solving process. This behavior goes against our code of conduct’s standard of maintaining a constructive and respectful dialogue.
Escalation: Closing the PR and expressing that you felt mistreated without further discussion was counterproductive and wasted time that could have been better spent resolving the issue. This action goes against the spirit of our contributing guide, which emphasizes collaboration and open discussion before making decisions.

@kellyjosephprice, while I understand the desire to support a coworker, it’s essential to encourage constructive behavior, especially in challenging situations. Backing unproductive actions reflects poorly on the README team and undermines the professionalism we strive to maintain.

This project serves millions of adopters, and both @wooorm and I have been generous with our time and energy to ensure it runs smoothly. @llimllib, your behavior in this situation was unprofessional and disruptive, which is not acceptable under our code of conduct.

Additionally, I want to express broader frustration with README's approach. Relying heavily on Unified, MDX, and Remark to build an entire product without contributing back, either through code or financially, and then participating in issues like this with a lack of good faith shows a disregard for the principles of open source. It’s frustrating and disappointing to see adopters, whose livelihoods depend on these tools, engage with us in such a way.

Let’s all aim to maintain a respectful and constructive tone in future interactions, ensuring that we adhere to both the contributing guide and the code of conduct.

github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Aug 14, 2024

ChristianMurphy added the 🏁 area/perf This affects performance label Aug 14, 2024

This comment was marked as spam.

Sign in to view

llimllib added a commit to llimllib/mdast-util-gfm-autolink-literal that referenced this issue Aug 14, 2024

fix: limit match length of email regular expression

588c484

closes syntax-tree#8

llimllib mentioned this issue Aug 14, 2024

fix: limit match length of email regular expression #9

Closed

5 tasks

llimllib added a commit to llimllib/mdast-util-gfm-autolink-literal that referenced this issue Aug 14, 2024

fix: limit match length of email regular expression

e3c3d38

closes syntax-tree#8

wooorm closed this as completed in 2402c21 Aug 19, 2024

This comment has been minimized.

Sign in to view

wooorm added the 💪 phase/solved Post is done label Aug 19, 2024

github-actions bot removed the 🤞 phase/open Post is being triaged manually label Aug 19, 2024

wooorm mentioned this issue Oct 4, 2024

New 'negative lookbehind' regex breaks in IOS & Safari versions < 16.4 #10

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance issue with long lines #8

Performance issue with long lines #8

llimllib commented Aug 14, 2024 •

edited

Loading

llimllib commented Aug 14, 2024

ChristianMurphy commented Aug 14, 2024

llimllib commented Aug 14, 2024

llimllib commented Aug 14, 2024 •

edited

Loading

ChristianMurphy commented Aug 14, 2024

llimllib commented Aug 14, 2024

ChristianMurphy commented Aug 14, 2024 •

edited

Loading

This comment was marked as spam.

llimllib commented Aug 14, 2024

ChristianMurphy commented Aug 14, 2024 •

edited

Loading

llimllib commented Aug 14, 2024 •

edited

Loading

ChristianMurphy commented Aug 14, 2024

llimllib commented Aug 14, 2024

ChristianMurphy commented Aug 14, 2024

wooorm commented Aug 15, 2024

wooorm commented Aug 15, 2024 •

edited

Loading

wooorm commented Aug 15, 2024 •

edited

Loading

wooorm commented Aug 15, 2024 •

edited

Loading

This comment has been minimized.

wooorm commented Aug 19, 2024

kellyjosephprice commented Aug 19, 2024

wooorm commented Aug 19, 2024

ChristianMurphy commented Aug 19, 2024 •

edited

Loading

Performance issue with long lines #8

Performance issue with long lines #8

Comments

llimllib commented Aug 14, 2024 • edited Loading

Initial checklist

Affected packages and versions

Link to runnable example

Steps to reproduce

Expected behavior

Actual behavior

Affected runtime and version

Affected package manager and version

Affected OS and version

Build and bundle tools

llimllib commented Aug 14, 2024

ChristianMurphy commented Aug 14, 2024

llimllib commented Aug 14, 2024

llimllib commented Aug 14, 2024 • edited Loading

ChristianMurphy commented Aug 14, 2024

llimllib commented Aug 14, 2024

ChristianMurphy commented Aug 14, 2024 • edited Loading

This comment was marked as spam.

llimllib commented Aug 14, 2024

ChristianMurphy commented Aug 14, 2024 • edited Loading

llimllib commented Aug 14, 2024 • edited Loading

ChristianMurphy commented Aug 14, 2024

llimllib commented Aug 14, 2024

ChristianMurphy commented Aug 14, 2024

wooorm commented Aug 15, 2024

wooorm commented Aug 15, 2024 • edited Loading

wooorm commented Aug 15, 2024 • edited Loading

wooorm commented Aug 15, 2024 • edited Loading

This comment has been minimized.

wooorm commented Aug 19, 2024

kellyjosephprice commented Aug 19, 2024

wooorm commented Aug 19, 2024

ChristianMurphy commented Aug 19, 2024 • edited Loading

llimllib commented Aug 14, 2024 •

edited

Loading

llimllib commented Aug 14, 2024 •

edited

Loading

ChristianMurphy commented Aug 14, 2024 •

edited

Loading

ChristianMurphy commented Aug 14, 2024 •

edited

Loading

llimllib commented Aug 14, 2024 •

edited

Loading

wooorm commented Aug 15, 2024 •

edited

Loading

wooorm commented Aug 15, 2024 •

edited

Loading

wooorm commented Aug 15, 2024 •

edited

Loading

ChristianMurphy commented Aug 19, 2024 •

edited

Loading