Normalize language tag component of i18n datatype IRI? #337

Closed
kasei opened this issue Jan 17, 2020 · 9 comments
Labels
i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. wr:commenter-agreed wr:spec-updated

Comments

@kasei
Contributor

kasei commented Jan 17, 2020

Object to RDF Conversion step 13.2 constructs a datatype IRI based on language and direction:

If rdfDirection is i18n-datatype, set datatype to the result of appending language and the value of @direction in item separated by an underscore ("_") to https://www.w3.org/ns/i18n#. Initialize literal as an RDF literal using value and datatype.

While there is "MAY" normative text about normalizing language tags, there doesn't seem to be any text about normalizing the language tag values when included in a datatype IRI like this. Test tdi10 seems to assume the value is normalized in lowercase. Moreover, I would think that not normalizing could cause lots of trouble with unexpected data (e.g. two literals that differ only in the case of the language tag component of this datatype IRI; shouldn't such literals have the same value?).
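The construction in step 13.2, with the lowercase normalization this issue asks about, can be sketched roughly as follows. This is only an illustration: the function name `i18n_datatype` and the `normalize` flag are hypothetical, and treating a missing language as the empty string is an assumption made here for the sketch.

```python
I18N_NS = "https://www.w3.org/ns/i18n#"


def i18n_datatype(language, direction, normalize=True):
    """Build an i18n datatype IRI from a language tag and a base direction.

    `normalize` is a hypothetical switch used only to show both behaviours.
    """
    # Assumption: a missing language is treated as the empty string.
    tag = language or ""
    if normalize:
        # Language tags are ASCII, so lower() is a safe case fold here.
        tag = tag.lower()
    return f"{I18N_NS}{tag}_{direction}"


normalized = i18n_datatype("en-US", "ltr")
unnormalized = i18n_datatype("en-US", "ltr", normalize=False)
```

Without normalization, `en-US` and `en-us` yield two distinct IRIs, which is the situation described above where two literals differ only in the case of the language tag component.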

@gkellogg
Member

This was introduced in #167, but I don't think we discussed anything specifically about continuing to do the lower-case normalization when creating the i18n datatype, which would make sense. @iherman do you remember discussing this?

@himorin added the i18n-tracker label Jan 19, 2020

@iherman
Member

iherman commented Jan 24, 2020

@gkellogg right, I do not think we discussed this. That being said, normalizing for the purpose of the URI handling does make sense imho. For the sake of consistency, we should follow the same rules as in #167.

We have to realize, however, that the language tags have a tradition that is more complex than that. AFAIK, they usually write 'en-US' instead of 'en-us'. The rules in #167 leave that intact, they just say that the tag must be valid, and we should probably say the same thing for this datatype.

Which does mean that the datatype comparison cannot be character by character but, instead, based on a regex. Ugly, but that is the world we live in:-)
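A regex-based comparison of this kind might look like the sketch below. The helper names are hypothetical, and the pattern is an assumption: it takes the language part to be ASCII letters, digits and hyphens, and the direction to be `ltr` or `rtl`.

```python
import re

I18N_NS = "https://www.w3.org/ns/i18n#"
# Assumed shape: namespace + language tag + "_" + direction.
_I18N_IRI = re.compile(re.escape(I18N_NS) + r"([A-Za-z0-9-]*)_(ltr|rtl)")


def parse_i18n_datatype(iri):
    """Return (lowercased language, direction), or None if not an i18n IRI."""
    match = _I18N_IRI.fullmatch(iri)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2)


def same_i18n_datatype(a, b):
    """Compare two datatype IRIs, ignoring case in the language component."""
    parsed_a, parsed_b = parse_i18n_datatype(a), parse_i18n_datatype(b)
    if parsed_a is None or parsed_b is None:
        # Fall back to plain character-by-character comparison.
        return a == b
    return parsed_a == parsed_b
```

Every consumer would need to carry logic like this, which is the cost of leaving the tag case intact in the IRI.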

@iherman
Member

iherman commented Jan 24, 2020

This issue was discussed in a meeting.

  • RESOLVED: In toRDF recommend normalization of language tag based URIs
  • RESOLVED: … also compound literal form
Transcript: 3.1. Normalize language tags
link: #337
Rob Sanderson: This is about our workaround for language tags in the i18n namespace.
Gregg Kellogg: We removed requirements to normalize language tags to lowercase, because it is problematic for many people in i18n community. When creating RDF, we have possibility that 2 processors create different data types.
… The question is if that is what we intended. To allow 2 diff datatypes by 2 different processors.
Ivan Herman: That would be wrong in terms of RDF.
… If you have 2 datatypes with different cases, RDF sees those as not equal.
… Maybe pchampin_ will say I’m wrong…
Pierre-Antoine Champin: I slightly disagree. 2 different URLs may denote different things, or the same thing; this depends on the implementation.
… It’s hard to require all impls to support these different i18n datatypes and consider them equal.
… We could say that they are semantically equivalent.
Ivan Herman: Yes, we can do that. I don’t know if we are discussing something that is insignificant.
… If we do that, and have an implementation that does datatype reasoning, then that impl will likely fall on its face.
… Datatype reasoning is quite a challenge. Many implementations just check char-by-char.
… We could say that if you use these datatypes, that you are supposed to lowercase language tags.
… It’s ugly, but I don’t see a better choice.
Rob Sanderson: Is there some i18n requirement?
Ivan Herman: No, it’s a habit that there is mixing of cases.
… The usual way is en-US and not en-us.
Ivan Herman: We should not require normalization when using lang tags the old way, but we should when using i18n datatypes.
Rob Sanderson: Is the set of characters that is permissible in URIs and language tags compatible?
Ivan Herman: Just ASCII characters.
Pierre-Antoine Champin: There will be 2 kinds of RDF impls: ones not recognizing our custom IRIs, and those that do.
… Those that will take into account our custom datatypes, can interpret them as lang tags and do smart things.
… The roundtripping would be lost when direction is used. I’m still in favour of not normalizing them.
Gregg Kellogg: I’m neutral on normalization. We should add a non-normative note in any case.
Dave Longley: Was it a mistake to not normalize language tags when they were invented?
Ivan Herman: Invented by whom?
Dave Longley: Not JSON-LD, but the group that came up with it 30 years ago…
Ivan Herman: We cannot change it because it’s out there already.
Rob Sanderson: We can fix it for reduced datatype IRI.
Dave Longley: It looks like this grew organically, so the spec was built around it.
… What we introduce is new, so we can enforce normalization.
… So we simplify part of the space.
Ivan Herman: I don’t disagree.
… How important is it to roundtrip on such a detail?
… Because that is why we are discussing this.
… I don’t think it’s important.
… So I would normalize it.
Gregg Kellogg: From RDF Concepts: “A literal is a language-tagged string if the third element is present. Lexical representations of language tags may be converted to lower case. The value space of language tags is always in lower case.”
Gregg Kellogg: We did change the language of JSON-LD, which always normalized language tags, which was over-strict.
… RDF spec says that language tags may be lowercased.
… We are talking here about special case: roundtripping.
… It’s a minor thing what we are going to do.
… I would support that we change the language in toRdf, that language tags be normalized in compound literals and i18n.
Rob Sanderson: We ran into this in practice when having to do case-insensitive language tag comparison.
Proposed resolution: In toRDF recommend normalization of language tag based URIs (Rob Sanderson)
Pierre-Antoine Champin: +1
Rob Sanderson: +1
Ivan Herman: +1
Benjamin Young: +0
Gregg Kellogg: +1
Dave Longley: +1
Ruben Taelman: +1
Harold Solbrig: +0
David I. Lehn: +1
Adam Soroka: +1
Resolution #2: In toRDF recommend normalization of language tag based URIs
Proposed resolution: … also compound literal form (Rob Sanderson)
Rob Sanderson: +1
Pierre-Antoine Champin: +1
Ruben Taelman: +1
Dave Longley: +1
Gregg Kellogg: +1
Adam Soroka: +1
Resolution #3: … also compound literal form
Ivan Herman: +1
David I. Lehn: +1
Benjamin Young: +1

@gkellogg
Member

@kasei (and @himorin), the algorithm was updated in PR #363 to normalize language tags when creating i18n literals or compound objects. Note that these operations are non-normative.
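For the compound form, the same normalization could be sketched as below. This assumes the compound literal is a blank node carrying `rdf:value`, `rdf:language` and `rdf:direction` properties; the function name and the tuple representation of triples are hypothetical and only meant to show where the lowercasing would apply.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def compound_literal_triples(bnode, value, language, direction):
    """Return (subject, predicate, object) tuples for a compound literal."""
    triples = [(bnode, RDF + "value", value)]
    if language is not None:
        # Normalize the language tag to lowercase, as for the i18n datatype.
        triples.append((bnode, RDF + "language", language.lower()))
    triples.append((bnode, RDF + "direction", direction))
    return triples


triples = compound_literal_triples("_:b0", "Hello", "en-US", "ltr")
```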

@kasei
Contributor Author

kasei commented Jan 29, 2020

@gkellogg Looks good, though I'm not sure what you mean by "these operations are non-normative." Nothing in #363 suggests non-normative operations, does it? If it is non-normative, are there any tests that should be marked as requiring this normalization to pass?

@gkellogg
Member

The syntax document describes both i18n datatypes and compound literals as experimental and non-normative. The algorithms condition this behavior on the rdfDirection option, which defaults to null. The tests which use this are marked using the rdfDirection option. We don't really have a way of describing tests as being non-normative themselves.

@gkellogg
Member

gkellogg commented Jan 29, 2020

Note that toRdf/di10 and toRdf/di12 require this normalization.

@kasei
Contributor Author

kasei commented Jan 29, 2020

FWIW, tests like this in SPARQL use new values asserted for the mf:requires predicate. That was enough to indicate that you could be conformant without passing those tests if you didn't support the named feature.

@gkellogg
Member

Yes, adding something like mf:requires might be a good idea for this and some other cases such as HTML content extraction.
