Skip to content

Last alternative gets dropped from whitelist regex #1755

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
agausmann opened this issue Mar 24, 2020 · 1 comment · Fixed by #1756
Closed

Last alternative gets dropped from whitelist regex #1755

agausmann opened this issue Mar 24, 2020 · 1 comment · Fixed by #1756

Comments

@agausmann
Copy link
Contributor

agausmann commented Mar 24, 2020

When using whitelists with alternation, it seems that the last alternative is not checked during matching, which can result in some items not getting matched.

Input C/C++ Header

test.h:

#define EMSCRIPTEN_RESULT_SUCCESS 0
#define EMSCRIPTEN_RESULT_DEFERRED 1
#define EM_TRUE 1
#define EM_FALSE 0
int emscripten_test();

Bindgen Invocation

# Pick one of the following whitelists:
WHITELIST='^_?em_|^_?emscripten_|^_?EM_|^_?EMSCRIPTEN_'
#WHITELIST='^_?em_|^_?EMSCRIPTEN_|^_?EM_|^_?emscripten_'

bindgen --whitelist-type="$WHITELIST" --whitelist-function="$WHITELIST" --whitelist-var="$WHITELIST" test.h

Actual Results

With the first whitelist:

pub const EM_TRUE: u32 = 1;
pub const EM_FALSE: u32 = 0;
extern "C" {
    pub fn emscripten_test() -> ::std::os::raw::c_int;
}

With the second whitelist:

pub const EMSCRIPTEN_RESULT_SUCCESS: u32 = 0;
pub const EMSCRIPTEN_RESULT_DEFERRED: u32 = 1;
pub const EM_TRUE: u32 = 1;
pub const EM_FALSE: u32 = 0;

Expected Results

All of the definitions from the given header should match under both whitelists, resulting in the following bindings under either whitelist:

pub const EMSCRIPTEN_RESULT_SUCCESS: u32 = 0;
pub const EMSCRIPTEN_RESULT_DEFERRED: u32 = 1;
pub const EM_TRUE: u32 = 1;
pub const EM_FALSE: u32 = 0;
extern "C" {
    pub fn emscripten_test() -> ::std::os::raw::c_int;
}

Appending a non-matching alternative to the whitelist produces the expected output:

WHITELIST='^_?em_|^_?emscripten_|^_?EM_|^_?EMSCRIPTEN_|^$'
@agausmann agausmann changed the title Last OR-clause gets dropped from whitelist regex Last alternative gets dropped from whitelist regex Mar 24, 2020
@agausmann
Copy link
Contributor Author

agausmann commented Mar 24, 2020

The problem is in RegexSet::build(). When the regular expressions are given to the inner regex::RegexSet, they are wrapped in anchors using format!("^{}$", item), which isn't completely sanitary.

In my case, my whitelist regex becomes ^^_?em_|^_?emscripten_|^_?EM_|^_?EMSCRIPTEN_$. Since alternation has the lowest precedence, the leading anchor only applies to the first alternative, and the trailing anchor only applies to the last alternative.

A quick fix without modifying the behavior for other uses would be to change the format string to "^({})$", since alternation stays within capture groups.

This makes me wonder why the regular expression is implicitly wrapped in the first place. It might not be a good practice, especially if it is not well-documented. It's possible that the original author of this whitelist regex thought it was a raw regular expression input since they added their own anchors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant