Normalize language tag component of i18n datatype IRI? #337
Comments
@gkellogg right, I do not think we discussed this. That being said, normalizing for the purpose of URI handling does make sense, imho. For the sake of consistency, we should follow the same rules as in #167. We have to realize, however, that language tags have a tradition that is more complex than that. AFAIK, people usually write 'en-US' instead of 'en-us'. The rules in #167 leave that intact; they just say that the tag must be valid, and we should probably say the same thing for this datatype. That does mean the datatype comparison cannot be character by character but must instead be based on a regex. Ugly, but that is the world we live in :-)
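To make the comparison problem concrete, here is a minimal sketch of a datatype IRI check that ignores case only in the i18n part. It uses a case-insensitive fragment comparison rather than the regex mentioned above, and lowercasing the whole fragment (tag plus direction) is a simplification. The function name `same_i18n_datatype` is hypothetical and is not taken from any JSON-LD or RDF library.

```python
# Namespace used by JSON-LD 1.1 for i18n datatype IRIs.
I18N_NS = "https://www.w3.org/ns/i18n#"


def same_i18n_datatype(a: str, b: str) -> bool:
    """Compare two datatype IRIs (hypothetical helper).

    IRIs in the i18n namespace are compared with the fragment
    (language tag + "_" + direction) lowercased; every other IRI
    is compared character by character.
    """
    if a.startswith(I18N_NS) and b.startswith(I18N_NS):
        return a[len(I18N_NS):].lower() == b[len(I18N_NS):].lower()
    return a == b
```

A processor that only compares IRIs character by character would treat `...i18n#en-US_rtl` and `...i18n#en-us_rtl` as different datatypes, which is the interoperability concern raised in this thread.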
This issue was discussed in a meeting.
3.1. Normalize language tags (link: #337)
Rob Sanderson: This is about our workaround language tags in i18n namespace.
Gregg Kellogg: We removed requirements to normalize language tags to lowercase, because it is problematic for many people in i18n community. When creating RDF, we have possibility that 2 processors create different data types.
… The question is if that is what we intended. To allow 2 diff datatypes by 2 different processors.
Ivan Herman: That would be wrong in terms of RDF.
… If you have 2 datatypes with different cases, RDF sees those as not equal.
… Maybe pchampin_ will say I’m wrong…
Pierre-Antoine Champin: I slightly disagree. 2 different URLs may denote different things, but also the same, but this depends on implementation.
… It’s hard to require all impls to support these different i18n datatypes and consider them equal.
… We could say that they are semantically equivalent.
Ivan Herman: Yes, we can do that. I don’t know if we are discussing something that is insignificant.
… If we do that, and have an implementation that does datatype reasoning, then that impl will likely fall on its face.
… Datatype reasoning is quite a challenge. Many implementations just check char-by-char.
… We could say that if you use these datatypes, that you are supposed to lowercase language tags.
… It’s ugly, but I don’t see a better choice.
Rob Sanderson: Is there some i18n requirement?
Ivan Herman: No, it’s a habit that there is mixing of cases.
… Usual way is:
Ivan Herman: the usual way is: en-US and not en-us
Ivan Herman: We should not require normalization when using lang tags the old way, but we should when using i18n datatypes.
Rob Sanderson: Is the set of characters that is permissible in URIs and language tags compatible?
Ivan Herman: Just ASCII characters.
Pierre-Antoine Champin: There will be 2 kinds of RDF impls: ones not recognizing our custom IRIs, and those that do.
… Those that will take into account our custom datatypes, can interpret them as lang tags and do smart things.
… The roundtripping would be lost when direction is used. I’m still in favour of not normalizing them.
Gregg Kellogg: I’m neutral on normalization. We should add a non-normative note in any case.
Dave Longley: Was it a mistake to not normalize language tags when they were invented?
Ivan Herman: Invented by whom?
Dave Longley: Not JSON-LD, but the group that came up with it 30 years ago…
Ivan Herman: We can not change it because it’s out there already.
Rob Sanderson: We can fix it for reduced datatype IRI.
Dave Longley: It looks like this grew organically, so the spec was built around it.
… What we introduce is new, so we can enforce normalization.
… So we simplify part of the space.
Ivan Herman: I don’t disagree.
… How important is it to roundtrip on such a detail?
… Because that is why we are discussing this.
… I don’t think it’s important.
… So I would normalize it.
Gregg Kellogg: From RDF Concepts: “A literal is a language-tagged string if the third element is present. Lexical representations of language tags may be converted to lower case. The value space of language tags is always in lower case.”
Gregg Kellogg: We did change the language of JSON-LD, which always normalized language tags, which was over-strict.
… RDF spec says that language tags may be lowercased.
… We are talking here about special case: roundtripping.
… It’s a minor thing what we are going to do.
… I would support that we change the language in toRdf, that language tags be normalized in compound literals and i18n.
Rob Sanderson: We ran into this in practice when having to do case-insensitive language tag comparison.
Proposed resolution: In toRDF recommend normalization of language tag based URIs (Rob Sanderson)
Pierre-Antoine Champin: +1
Rob Sanderson: +1
Ivan Herman: +1
Benjamin Young: +0
Gregg Kellogg: +1
Dave Longley: +1
Ruben Taelman: +1
Harold Solbrig: +0
David I. Lehn: +1
Adam Soroka: +1
Resolution #2: In toRDF recommend normalization of language tag based URIs
Proposed resolution: … also compound literal form (Rob Sanderson)
Rob Sanderson: +1
Pierre-Antoine Champin: +1
Ruben Taelman: +1
Dave Longley: +1
Gregg Kellogg: +1
Adam Soroka: +1
Resolution #3: … also compound literal form
Ivan Herman: +1
David I. Lehn: +1
Benjamin Young: +1
…e language tags. For w3c/json-ld-api#337.
The syntax document describes both i18n datatypes and compound literals as experimental and non-normative. The algorithms condition this behavior on the |
Note that toRdf/di10 and toRdf/di12 require this normalization.
FWIW, tests like this in SPARQL use new values asserted for the |
Yes, adding something like |
Object to RDF Conversion step 13.2 constructs a datatype IRI based on language and direction:
While there is "MAY" normative text about normalizing language tags, there doesn't seem to be any text about normalizing the language tag values when included in a datatype IRI like this. Test tdi10 seems to assume the value is normalized in lowercase. Moreover, I would think that not normalizing could cause lots of trouble with unexpected data (e.g. two literals that differ only in the case of the language tag component of this datatype IRI; shouldn't such literals have the same value?).
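As a rough illustration of the question, the sketch below builds an i18n datatype IRI from a language tag and a direction, with lowercasing as an optional step. It assumes the IRI is the `https://www.w3.org/ns/i18n#` namespace followed by the language tag, an underscore, and the direction, and that a missing language is treated as an empty string. The function name `i18n_datatype_iri` and the `normalize` flag are hypothetical and do not come from the algorithm text.

```python
from typing import Optional

# Namespace used by JSON-LD 1.1 for i18n datatype IRIs.
I18N_NS = "https://www.w3.org/ns/i18n#"


def i18n_datatype_iri(language: Optional[str], direction: str,
                      normalize: bool = True) -> str:
    """Build an i18n datatype IRI (hypothetical helper).

    Assumption: a missing language is treated as the empty string.
    With normalize=True the language tag is lowercased before it is
    embedded in the IRI, so 'en-US' and 'en-us' yield the same IRI.
    """
    if direction not in ("ltr", "rtl"):
        raise ValueError(f"unsupported direction: {direction!r}")
    tag = language or ""
    if normalize:
        tag = tag.lower()
    return f"{I18N_NS}{tag}_{direction}"


# Without normalization, two processors can emit different datatypes
# for literals that differ only in the case of the language tag.
print(i18n_datatype_iri("en-US", "rtl", normalize=False))
print(i18n_datatype_iri("en-US", "rtl"))
```

With `normalize=False` the two spellings `en-US` and `en-us` produce distinct IRIs, which is the situation this issue asks about.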