Define Validation spec vocabularies #697

Closed
handrews opened this issue Dec 17, 2018 · 15 comments

@handrews
Contributor

Now that the $vocabulary PR is nearing approval, we need to figure out how many vocabularies are present in the Validation spec, and what to call them.

Minimal arrangement

I'm guessing something like:

I'm not super-attached to those names, but I do think this nicely encapsulates both the different purposes of some groups of keywords, and the optional nature of format and content*. Optional keyword groups were always particularly confusing for users and implementors.

The reason for basic-format is that it's extensible so I assume other people will probably have vocabularies using format in the name. But I could be convinced to go with standard-format, format, or the plural of any of those (standard-formats, etc.).

Multiple format vocabularies?

While format currently says that if you implement the keyword at all, you SHOULD implement all of the formats, this has become increasingly burdensome as the set of standard formats has expanded.

We could break them up and provide vocabularies that each only declare the semantics for a subset of the standard formats. An example division could be:

  • date-time-formats
  • internet-formats (hostnames, IP addresses, email, URIs, IRIs)
  • json-pointer-formats (JSON Pointer and Relative JSON Pointer)
  • regex-format

The internet-formats could be split up more, although I do wonder at what point it's more trouble than it's worth.
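A minimal Python sketch of how that division might look from a meta-schema's point of view. It assumes $vocabulary maps each vocabulary URI to a boolean, and every example.com URI below is made up for illustration. The grouping of format values is only one possible reading of the list above, not something the spec defines.

```python
# Hypothetical vocabulary URIs mapped to a subset of the standard format values.
# Neither the URIs nor the exact grouping are defined by any published spec.
FORMAT_VOCABULARIES = {
    "https://example.com/vocab/date-time-formats": {"date-time", "date", "time"},
    "https://example.com/vocab/internet-formats": {
        "hostname", "ipv4", "ipv6", "email", "uri", "iri",
    },
    "https://example.com/vocab/json-pointer-formats": {
        "json-pointer", "relative-json-pointer",
    },
    "https://example.com/vocab/regex-format": {"regex"},
}


def declared_formats(meta_schema):
    """Collect the format values covered by the vocabularies a meta-schema declares."""
    formats = set()
    for uri in meta_schema.get("$vocabulary", {}):
        # Vocabularies unknown to this table contribute no format values.
        formats |= FORMAT_VOCABULARIES.get(uri, set())
    return formats


# A meta-schema that opts into only two of the four hypothetical groups.
meta_schema = {
    "$vocabulary": {
        "https://example.com/vocab/date-time-formats": True,
        "https://example.com/vocab/regex-format": False,
    }
}
```

With this arrangement an implementation could advertise support per group instead of for the whole list of standard formats.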

@Julian @johandorland @gregsdennis as implementors what would you find useful here?

@Julian
Member

Julian commented Dec 17, 2018

From a practical standpoint, what I do to be honest is that formats that have an existing library supporting them get supported, and ones that don't... don't :/ until someone writes one.

E.g., IIRC there isn't a decent RFC 3339 parser for time that I could find, so it didn't go in.

Which isn't to say something like this wouldn't be useful -- probably it would, just for categorization purposes.
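The "supported only if a library exists" approach might be sketched roughly as below. This is an illustration rather than any real validator's code: `CHECKERS` and `check_format` are hypothetical names, and the only checker registered is one backed by Python's standard `ipaddress` module.

```python
import ipaddress


def is_ipv4(value):
    """Delegate ipv4 checking to the standard library."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


# Only formats with an existing library behind them are registered (hypothetical registry).
CHECKERS = {"ipv4": is_ipv4}


def check_format(fmt, value):
    """Return True or False when a checker exists, or None when the format is unsupported."""
    checker = CHECKERS.get(fmt)
    if checker is None:
        return None  # no library available, so the format is not checked at all
    return checker(value)
```

Returning None for an unsupported format keeps "not checked" distinct from "checked and valid", which is the distinction an implementation's docs would need to spell out.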

@johandorland
Collaborator

I don't think splitting up the format vocabulary does anything besides making things more complicated. The reality is, as @Julian noted, that implementations will look for existing implementations and use them if available. And even then there will be deviations from spec, especially with the more complicated formats, not to mention those implementations that just wing it by using regular expressions.

I think the current wording is fine. The reality is that probably not a single implementation supports all formats, but in my opinion that's acceptable.

@handrews
Contributor Author

@johandorland to clarify, by "current wording" you mean the wording that is in the spec about optional implementation already? And that would go with my "minimal arrangement" proposal of four vocabularies corresponding to the four sections (currently sections 6-9 on master)?

@handrews
Contributor Author

Also paging @jgonzalezdr @philsturgeon @dlax @awwright @Relequestual and basically everyone 😁

@handrews
Contributor Author

@Julian do you think that we should add wording for format noting that implementations may vary in how thoroughly each keyword is supported? Right now it says that if you support the keyword you SHOULD support all formats listed, but does not really say what that means.

@philsturgeon
Collaborator

I do like the idea of date-time-formats, internet-formats, json-pointer-formats, and regex-format; forcing folks to support a much larger list is a tough one. Whilst sure, some individual formats within these smaller groupings might be lacking a library in whatever particular language the implementation is built in, that's not something that can be avoided unless you make every single format optional.

Implementations can implement most of these small groups and then optionally a few extra formats, and that's just a case for them putting a thing on their README. Over time users will PR support for more formats into their thing, and all is well.

@gregsdennis
Member

gregsdennis commented Dec 18, 2018

Vocabularies were created to define sets of keywords, but you're now proposing that a vocabulary be used to define additional values for an existing keyword. (Sorry, I'm just trying to think it through from the point of view of my implementation.)

(Thinking out loud here. Maybe it'll help someone else...)

I have a handler for each keyword. With vocabularies added in, it becomes a handler for each keyword/vocabulary combination. I would have to register multiple format keywords, each one handling a different set of values. When parsing the schema, I'd have to look at the value of format to know which handler to use. Then the schema is modeled, and it should just work.

(Okay, done musing.)
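The handler-per-keyword/vocabulary idea could look something like this Python sketch. The class and function names are hypothetical, the vocabulary names are borrowed from the example division earlier in the thread, and the predicates are placeholders, not real format checks. Overlapping values between vocabularies are not handled here; the first matching handler simply wins.

```python
class FormatHandler:
    """One handler per keyword/vocabulary combination (hypothetical design)."""

    def __init__(self, vocabulary, checks):
        self.vocabulary = vocabulary
        self.checks = checks  # maps a format value to a predicate on the instance

    def handles(self, format_value):
        return format_value in self.checks


def pick_handler(handlers, format_value):
    """Return the first handler whose vocabulary covers this format value."""
    for handler in handlers:
        if handler.handles(format_value):
            return handler
    return None  # no registered vocabulary claims this format value


# Placeholder predicates only; they do not implement the real formats.
handlers = [
    FormatHandler("date-time-formats", {"date": lambda s: len(s) == 10}),
    FormatHandler("regex-format", {"regex": lambda s: isinstance(s, str)}),
]
```

The value of format drives the lookup at schema-parse time, so the rest of the model never needs to know which vocabulary supplied the handler.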

So it looks like you're not defining new values for an existing keyword; you're redefining the keyword altogether. And when multiple vocabularies that define a single keyword are used in a meta-schema, it merely expands the acceptable set of values for that keyword. Then, the value determines which vocabulary keyword definition to use.

Finally, if two vocabularies define the same keyword and value, this results in an ambiguity and it must be considered an undefined behavior.

@johandorland
Collaborator

johandorland commented Dec 18, 2018

@handrews Yes, that is what I meant by "current wording". However, I wouldn't be opposed to weakening the language further. I'm still kind of new to all this RFC speak, so maybe I'm interpreting it a bit differently than others.

As to the "minimal arrangement" proposal that seems fine by me. When you introduce something fundamental like $vocabulary it makes sense to use it in your own spec to make it more modular. As an implementer I'm not sure if that's also how it will be implemented under the hood. The upcoming changes to draft-08 are kind of drastic, so I might as well just rewrite the whole thing around annotation, $vocabulary, etc.

@gregsdennis
Member

@johandorland don't forget output 😄

@handrews
Contributor Author

@gregsdennis

So it looks like you're not defining new values for an existing keyword; you're redefining the keyword altogether.

No, that's not the intention, and this is covered explicitly in section 6.5 Extending JSON Schema:

Vocabularies may build on each other, such as by defining the behavior of their keywords with respect to the behavior of keywords from another vocabulary, or by using a keyword from another vocabulary with a restricted or expanded set of acceptable values. Not all such vocabulary re-use will result in a new vocabulary that is compatible with the vocabulary on which it is built. Vocabulary authors SHOULD clearly document what level of compatibility, if any, is expected.

The "expanded set of acceptable values" language was added specifically to give us a way to manage custom format values through vocabularies. You can already add custom formats, but (just like custom keywords) there was never a clear way to communicate expectations.

Whether we publish one standard-formats vocabulary, or one for each group of closely related formats, or even one mini-vocabulary for each format value in the standard (which I wouldn't recommend, but could be done), in all cases those vocabularies are talking about the same format keyword.

The format keyword is defined as an open-ended, extensible enumeration, so this is exactly what it's for. As long as you add more string values, and don't try to re-define an existing one, you're fine. Which is how it works now, and we would not want to make it less useful under vocabularies!
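The "add more values, never re-define one" rule can be illustrated with a small Python sketch. The function name and the vocabulary names are hypothetical, and raising an error on a clash is just one choice an implementation could make; the spec text quoted above does not prescribe it.

```python
def merge_format_vocabularies(vocabularies):
    """Combine format values from several vocabularies, rejecting redefinitions.

    `vocabularies` maps a vocabulary name to the set of format values it declares.
    Returns a dict mapping each format value to the vocabulary that owns it.
    """
    owner = {}
    for name, values in vocabularies.items():
        for value in values:
            if value in owner:
                # Two vocabularies claim the same value: treat it as an error here.
                raise ValueError(
                    f"format {value!r} defined by both {owner[value]!r} and {name!r}"
                )
            owner[value] = name
    return owner


# Extending the enumeration with new string values is fine.
merged = merge_format_vocabularies({
    "standard-formats": {"date-time", "email"},
    "my-custom-formats": {"postal-code"},  # hypothetical extension vocabulary
})
```

All the vocabularies here contribute to the same format keyword; only the set of recognised string values grows.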

@gregsdennis
Member

@handrews I think the difference is only apparent in the implementation. I would have to have two separate keyword classes, both declaring they handle "format" but for different vocabularies. Semantically, though, yes, it's the same keyword.

As long as you add more string values, and don't try to re-define an existing one, you're fine.

I think this is the big part, and what I was after. When you have two vocabularies defining the same keyword with different values, I think the behavior has to be undefined. There may be cases where it's obvious which vocabulary applies (like one expects an array while the other expects a string), but I don't know if we can "spec" that beyond "implementations MAY use their own logic to determine which vocabulary to apply."

@handrews
Contributor Author

@gregsdennis

When you have two vocabularies defining the same keyword with different values

Do you mean with different value types here? That's the example you give, and I agree that that behavior is undefined. I don't even want to encourage implementations to attempt to make sense of it, really.

@jgonzalezdr
Contributor

Regarding format validation, my opinion is to keep it simple.

It's fine to separate format into its own vocabulary. I would just name it "format" (no need to qualify it as "basic" or "standard", as it will actually be identified by its URI, so there will be no confusion with any potential "extension" format vocabularies).

I see no need to separate format into different vocabularies. There are right now just a few different formats defined, and all of them are pretty common and useful for almost all JSON Schema users, so I really think that all of them should be supported. As a user, my interpretation of "should" is strictly the one defined in RFC 2119, so I would expect an implementation to support all formats, but I would also accept that an implementation did not support any of them (I would also expect that to be clearly indicated in the implementation's docs).

@handrews
Contributor Author

@jgonzalezdr

There are right now just a few different formats defined, and all of them are pretty common and useful for almost all JSON Schema users, so I really think that all of them should be supported. As a user, my interpretation of "should" is strictly the one defined in RFC 2119, so I would expect an implementation to support all formats, but I would also accept that an implementation did not support any of them (I would also expect that to be clearly indicated in the implementation's docs).

That's not really how it plays out in practice. The degree of "support" provided when format is "supported" is highly variable.

That said, for this draft I will go with one vocabulary for the format keyword as it is currently defined, in the interest of making the least complicated change and waiting for further feedback. We can always rely on people to complain about format, so I'm sure we'll get more data.

@handrews
Contributor Author

OK, I think I'm just going to go with the shortest sensible names. As @jgonzalezdr noted, the fact that there is a whole URI and not just the file name means that they will not be ambiguous:

  • validation (section 6)
  • format (section 7)
  • content (section 8)
  • meta (section 9, formerly section 10)

The hyper-schema vocabulary (links and base) will, of course, be called hyper-schema.
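For illustration, a short Python sketch of how those four names might appear in a meta-schema's $vocabulary declaration. The base URI is a made-up placeholder, and the required/optional flags are an assumption based on the earlier remark that format and content are optional; nothing in this thread settles the actual URIs or flags.

```python
# Hypothetical base URI; the real vocabulary URIs are not given in this thread.
BASE_URI = "https://example.com/vocab/"

# Assumed flags: True means an implementation must understand the vocabulary,
# False means it may ignore it. format and content are treated as optional here.
VOCABULARY_FLAGS = {
    "validation": True,
    "format": False,
    "content": False,
    "meta": True,
}


def build_vocabulary_declaration(base_uri, flags):
    """Build a $vocabulary object mapping full vocabulary URIs to booleans."""
    return {base_uri + name: required for name, required in flags.items()}


meta_schema = {"$vocabulary": build_vocabulary_declaration(BASE_URI, VOCABULARY_FLAGS)}
```

Because each entry is keyed by a full URI, short names such as format stay unambiguous even if third parties publish their own vocabularies with the same short name.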
