Skip to content

Tests for mixed Arabic-Indic digits and Extended Arabic-Indic digits violates the Bidi rule #737

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
OptimumCode opened this issue Apr 15, 2024 · 10 comments · Fixed by #734
Closed

Comments

@OptimumCode
Copy link

OptimumCode commented Apr 15, 2024

The current test cases (like this one) violate the Bidi rule and does not test what they are supposed to test.

The Bidi rule is violated because:

  • it contains a character of type AN (Arabic Number) meaning it is an RTL label and the Bidi rule is applied

    An RTL label is a label that contains at least one character of type R, AL, or AN.

  • the first character is not of type R or AL

    The first character must be a character with Bidi property L, R, or AL.

By adding the first character with type AL the string will pass the Bidi rule validation but still will fail the Arabic-Indic digits rule.

Please, correct me if I missed something here and my conclusions are wrong.

@OptimumCode
Copy link
Author

Hi, @Julian @gregsdennis
Could you please take a look at the issue and share your thoughts about it? Or maybe you could add somebody else from contributors who could review the issue.
Thank you!

@gregsdennis
Copy link
Member

I'm not sure what the issue is here. Is there a technical reason this character needs to be added (i.e. is this testing something that's not currently covered?), or is it merely a linguistic problem?

@Julian
Copy link
Member

Julian commented Apr 24, 2024

@OptimumCode I think but haven't re-checked that this is the same as #675 -- can you perhaps have a look there, I recall having a look into this and concluding it seemed correct as is to me (and OP on that issue didn't follow up). But yeah perhaps let me know if you see something wrong there and/or whether you agree you're referring to the same thing.

@OptimumCode
Copy link
Author

Thank you, @Julian. I did not see this issue when tried to look for an existing one.
Yes, this issue seems to be similar to this one except the author was wrong about KATAKANA MIDDLE DOT cases.

I am not an expert here either so I might be wrong somewhere. But let me try to explain why I think this change should be made.

Answering @gregsdennis question:
This change should be made for a technical reason. Right now this string is an invalid idn-hostname for two reasons:

  • It violates the Bidi rule (because the first character is not L, R, or AL);
  • It mixes Arabic-Indic digits and Extended Arabic-Indic digits.

These are 3 tests for Arabic-Indic digits and Extended Arabic-Indic digits in the JSON schema test suite. One test has them mixed and two tests have only either Arabic-Indic digits or Extended Arabic-Indic digits.
Let's now imagine an implementation that does not check for mixed Arabic-Indic digits and Extended Arabic-Indic digits at all. Two later tests will pass because nothing is wrong with them. The first test will fail because it violates the Bidi rule. As a result, all tests for idn-hostname are green.

Now the hardest part - why does it violate the Bidi rule?

According to paragraph 4.2.3.4 in RFC5891 the labels that have right-to-left characters MUST meet Bidi criteria

If the proposed label contains any characters from scripts that are written from right to left, it MUST meet the Bidi criteria [RFC5893]

What is right-to-left is explained in RFC5893 which is referenced above. In section 1.4 we can find definition or right-to-left:

An RTL label is a label that contains at least one character of type R, AL, or AN.

In our case, it contains characters with the types AN (Arabic Number) and UN (European Number). This would mean that it is an RTL label.

Also, the header for RFC5893 is Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA). Because of that, I believe the Bidi domain name and IDN hostname with RTL label are the same thing.
As a result, the Bidi rule should be applied here.

Could you please share your thoughts on this considering all the above?

There is also one interesting thing: if we take a look at Bidi criteria in point 4 we will see that mix of AN and EN in RTL label is not allowed

In an RTL label, if an EN is present, no AN may be present, and vice versa.

This would make a mix of Arabic-Indic digits and Extended Arabic-Indic digits invalid in any case because Extended Arabic-Indic digits have type EN.
Considering this, the string will still fail the Bidi rule even after adding the first character with type AL. But at least it would violate the 4th point of Bidi criteria

@gregsdennis
Copy link
Member

Okay. Thanks for the explanation. I just didn't know about these kinds of language issues, and it appeared like it might just be a language thing.

@OptimumCode
Copy link
Author

Hi, could you please advise what should I do with PR?

If you disagree with some points in my explanation please let me know - I will try to elaborate.

Also, what do you think about adding test cases for Bidi rule violations? (as a separate PR)

@Julian
Copy link
Member

Julian commented Apr 28, 2024

It will probably be a few days until I can reread the spec but haven't forgotten.

@OptimumCode
Copy link
Author

Thank you, @Julian. If you need something from my side please let me know

@Julian
Copy link
Member

Julian commented Apr 30, 2024

In our case, it contains characters with the types AN (Arabic Number) and UN (European Number). This would mean that it is an RTL label.

Took another quick look today -- yes I think you look correct, the other four examples in #675 were not RTL I think (they weren't R, AL or AN as you quoted) but this one is, so yeah I agree this one is indeed a bidi label and needs to follow the rule.

@OptimumCode
Copy link
Author

Thank you, @Julian for reviewing this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants