-
-
Notifications
You must be signed in to change notification settings - Fork 10
Performance issue with long lines #8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
for what it's worth, I tried finding a regular expression that worked better and was not successful; perhaps it is sensible to do something like check the line for an That would still leave a long line full of |
Welcome @llimllib! 👋
Respectfully a major of "DDoS vectors", including this, are a nothingburger. |
@ChristianMurphy we render mdx on our servers for our clients; if somebody were to start spamming us with long strings, it would cost a lot of money and slow down our system for everybody |
Limiting the length of // Use {0,1024} instead of `+` to avoid pathological regular expression
// behavior; see
// https://github.com/syntax-tree/mdast-util-gfm-autolink-literal/issues/8
[
/([-.\w+]{0,1024})@([-\w]{0,1024}(?:\.[-\w]{0,1024}){0,1024})/g,
findEmail
] That expression matches emails the same as the current one:
ah ha! it's matching things it ought not to, and of course the proper replacement for |
Yes, that is how programming languages work. |
@ChristianMurphy I truly have no idea why you're being mean here |
MDX is a full programming language, it compiles to JavaScript. To use MDX safely you either:
In both circumstances this perf issue you bring up, is a non-issue from a security perspective. Again, happy to discuss improving parser speed. |
This comment was marked as spam.
This comment was marked as spam.
@ChristianMurphy I presented a fix in #9
|
This parser matches GFM.
Then you are aware the premise of this issue is a non-issue.
I sympathize and agree with this. |
The maximum legal email address is 254 characters
we do that already! but I would like to have the parser not have to fail if my customers include a long line in it |
valid emails are defined by https://datatracker.ietf.org/doc/html/rfc5322 and https://datatracker.ietf.org/doc/html/rfc6854 https://datatracker.ietf.org/doc/html/rfc3696 are a collection of recommendations for validators but not a specification of a valid email.
|
I have closed my PR, you're clearly not interested. I feel that you behaved very poorly towards me, so I'll just wish you an excellent day - I feel quite terrible after this interaction. |
Thank you for your input and contributions. I understand your concerns about performance, but it's essential that any solution aligns with GFM specifications. This project must remain compliant with those standards, and it's not designed to cater to specific performance needs of individual organizations. I’m open to discussing performance improvements, but they must adhere to GFM. If you'd like to explore this further within those constraints, I'm happy to collaborate. Otherwise, I respect your decision to step back. |
a) MDX is dangerous itself, if you let people write MDX, you are vulnerable, this is extensively documented @llimllib I think you are are being disrespectful. This is an open source project that owes you nothing. You owe us everything. You have never helped us. You have never given us money. You misunderstand what MDX is, what our goals are, and what this problem is. Your PR breaks GFM. |
I’ve played with a couple things, and the below code gives the same problem: const find = /([-.a-z\d+]+)@([-a-z\d]+(?:\.[-a-z\d]+)+)/gi
const hundredA = 'a'.repeat(100)
const fileWithBreaks = `${hundredA}\n`.repeat(800)
const fileWithoutBreaks = `${hundredA}`.repeat(800)
console.time('parsing a file with breaks')
find.lastIndex = 0
find.exec(fileWithBreaks)
console.timeEnd('parsing a file with breaks')
console.time('parsing a file without breaks')
find.lastIndex = 0
find.exec(fileWithoutBreaks)
console.timeEnd('parsing a file without breaks') So, it has nothing to do with all the projects. This regex with those giant documents has this problem. Also, important to note, we’re talking about 80kb of data here. |
Found the limit! |
We can make the regex incredibly fast by adding
Without it, it was:
So that’s a giant difference. The actual disallowed mdast-util-gfm-autolink-literal/lib/index.js Lines 274 to 277 in c990a62
Particularly around the |
This comment has been minimized.
This comment has been minimized.
@erunion @bnikom @ilias-t @kellyjosephprice @rafegoldberg @llimllib I will block you if you don’t change your childish ways. @llimllib proposed a bad change. I came up with a good change. And then you act unprofessional? This reflects poorly on your company, its culture. |
I’d recommend not getting hung up on particular solutions. Users know their problems well. Maintainers have a better idea on solutions. I believe Bill ran into an XY problem. Some places on the internet mentioned I don’t worry too much about that though. Maybe Christian feels different. What I found annoying is that despite that, I researched and presented a solution, a good regex, which was ignored. The URL was shared internally, everyone upvoted their friend, downvoted our feedback. That makes me feel like my work is not appreciated. As for community info, see |
The recent conversation crossed a professional line.
@kellyjosephprice, while I understand the desire to support a coworker, it’s essential to encourage constructive behavior, especially in challenging situations. Backing unproductive actions reflects poorly on the README team and undermines the professionalism we strive to maintain. This project serves millions of adopters, and both @wooorm and I have been generous with our time and energy to ensure it runs smoothly. @llimllib, your behavior in this situation was unprofessional and disruptive, which is not acceptable under our code of conduct. Additionally, I want to express broader frustration with README's approach. Relying heavily on Unified, MDX, and Remark to build an entire product without contributing back, either through code or financially, and then participating in issues like this with a lack of good faith shows a disregard for the principles of open source. It’s frustrating and disappointing to see adopters, whose livelihoods depend on these tools, engage with us in such a way. Let’s all aim to maintain a respectful and constructive tone in future interactions, ensuring that we adhere to both the contributing guide and the code of conduct. |
Initial checklist
Affected packages and versions
at least the latest version
Link to runnable example
https://gist.github.com/llimllib/4b87fd4042359b4812350038562dba03
Steps to reproduce
test.mjs
file from the gist above in the root of this repository. Its output on my machine is:In this test case, the "file with breaks" is 8000 lines of 100
a
characters. The "file without breaks" is 800,000 'a' characters without line breaks, so actually a slightly smaller file but with no line breaks.here's an observable notebook that demonstrates the behavior, you can play around with the regular expression there and see how performance looks
Expected behavior
parsing a file with a long line should not have such severe performance effects.
(In the terms that affect my team, our customers should not be able to easily DoS us by providing markdown files with long lines)
Actual behavior
The
findEmail
regular expression here takes a very long time to parse a long line.(I think this is due to using nested quantifiers, but I am not an expert here)
I noticed this when investigating a slowdown in our app, developed a test case, and profiled it. This regular expression jumped out:
Affected runtime and version
node 20.16.0
Affected package manager and version
all
Affected OS and version
all
Build and bundle tools
No response
The text was updated successfully, but these errors were encountered: