Skip to content

Sanitize RegexSet input so alternation is properly handled #1756

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Oct 24, 2022

Conversation

agausmann
Copy link
Contributor

@agausmann agausmann commented Mar 25, 2020

Resolves #1755 by wrapping the regexes provided to a RegexSet in a capture group, so the added anchors work properly with alternation.

This doesn't resolve the issue of implicit/hidden regex modifications performed by this crate, which should still be properly documented.

@agausmann agausmann changed the title Sanitize RegexSet input to properly handle alternation Sanitize RegexSet input so alternation is properly handled Mar 25, 2020
@agausmann
Copy link
Contributor Author

One thing to note: This will cause breakages to crates that make incorrect assumptions about how the regex inputs are handled and do not account for the implicitly-added anchors (such as emscripten-sys).

Copy link
Contributor

@emilio emilio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really sorry for the lag in getting to this, this somehow slipped through my inbox.

This looks fine to me though as you note is a (minor) breaking change.

I've retriggered the CI builds just to confirm they're green but.. would there be a chance of adding a test for this? It should be trivial by specifying the bindgen-flags: in the test. There should be plenty of examples available but let me know if you find any issue.

Really sorry for the massive lag again.

@agausmann
Copy link
Contributor Author

agausmann commented May 14, 2020

Sorry this got stalled, I've added a test case.

@agausmann agausmann requested a review from emilio May 14, 2020 13:48
@emilio
Copy link
Contributor

emilio commented May 14, 2020

This is breaking a couple existing test-cases unfortunately. That needs looking into. Depending on how much of an edge case they are, it may be ok to change their arguments or such.

@agausmann
Copy link
Contributor Author

agausmann commented May 14, 2020

Those failures are because of * being provided where a regex is expected. That used to be converted to ^*$ which is accepted, it is now being converted to ^(*)$ which is not accepted.

(Playground)

@agausmann
Copy link
Contributor Author

I did some research/analysis on dependent crates in #1783 , which might be useful for determining how much this breaks. The list of all uses of | may be of particular interest.

@emilio
Copy link
Contributor

emilio commented Jun 8, 2020

I think changing those tests to use .* and landing this should be ok, I'll make sure to document this change prominently for the next release if so. what are your thoughts?

Thanks again for doing that research.

@kulp kulp self-assigned this Jun 2, 2022
@kulp kulp force-pushed the sanitize_regex_alternates branch from 0aba94c to cbce27c Compare June 7, 2022 01:20
@kulp
Copy link
Member

kulp commented Jun 7, 2022

I took the liberty of rebasing upon master at c27578f 74b3867 and fixing up the failing tests. If CI passes, this might be ready for the next major release, since it is a breaking change.

@kulp kulp force-pushed the sanitize_regex_alternates branch 2 times, most recently from 74b3867 to c14079d Compare June 7, 2022 23:14
@kulp
Copy link
Member

kulp commented Jun 7, 2022

I think changing those tests to use .* and landing this should be ok

@emilio Now each commit in this branch independently passes tests. I assume this should wait until v0.61.0, whenever that happens, since it is a breaking change, but I thought I would make it clear I think the PR is ready.

@pvdrz
Copy link
Contributor

pvdrz commented Sep 22, 2022

@kulp. Do you think it would make sense to emit a warning if the user still uses "*" explaining this change?

@kulp
Copy link
Member

kulp commented Sep 24, 2022

@kulp. Do you think it would make sense to emit a warning if the user still uses "*" explaining this change?

@pvdrz that seems like a good idea to me!

@bors-servo
Copy link

☔ The latest upstream changes (presumably #2284) made this pull request unmergeable. Please resolve the merge conflicts.

@pvdrz
Copy link
Contributor

pvdrz commented Oct 5, 2022

@kulp @emilio do you think we should wait to merge this?

@pvdrz pvdrz force-pushed the sanitize_regex_alternates branch from c14079d to ca3d861 Compare October 13, 2022 15:04
@pvdrz
Copy link
Contributor

pvdrz commented Oct 13, 2022

I took the liberty of rebasing against master again, adding a warning when using wildcard patterns and update the changelog. Do you think we can merge this before the next release? @emilio @kulp

@kulp
Copy link
Member

kulp commented Oct 22, 2022

I took the liberty of rebasing against master again, adding a warning when using wildcard patterns and update the changelog. Do you think we can merge this before the next release? @emilio @kulp

@pvdrz thanks for rebasing. It looks like I missed v0.61.0.

Maybe I was being paranoid though, and we could merge this any time, since although it is a breaking change, that becomes visible only the next time we do a release. So maybe there is nothing that should hold this up any longer?

I will leave it to your collective judgment since I cannot spend more cycles on it in the near term.

@kulp
Copy link
Member

kulp commented Oct 22, 2022

Maybe I was being paranoid though, and we could merge this any time, since although it is a breaking change, that becomes visible only the next time we do a release. So maybe there is nothing that should hold this up any longer?

Oh, now I remember; I was thinking it needed to be ensured not to be released in a point-release like 0.60.5, but instead in a semver-ish breaking release like 0.61.0.

Probably this is putting the cart before the horse, and instead we should just merge this, and then make sure the next release bumps the correct version number.

@pvdrz
Copy link
Contributor

pvdrz commented Oct 24, 2022

I'm going to bump versions in this PR and merge it

@pvdrz pvdrz force-pushed the sanitize_regex_alternates branch from 697273a to 8338caa Compare October 24, 2022 15:21
@pvdrz pvdrz force-pushed the sanitize_regex_alternates branch from 8338caa to 94167f3 Compare October 24, 2022 15:25
@pvdrz pvdrz merged commit 6086694 into rust-lang:master Oct 24, 2022
qsdrqs pushed a commit to qsdrqs/rust-bindgen that referenced this pull request Oct 26, 2022
…#1756)

* tests: Avoid using globs as regexes

* Sanitize regex set input to properly handle alternation

* Add test case for alternates/anchors interaction

* emit warning if wildcard pattern is used

* update changelog and bump versions

Co-authored-by: Darren Kulp <[email protected]>
Co-authored-by: Christian Poveda <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Last alternative gets dropped from whitelist regex
6 participants