Skip to content

Adopt Minimum Union Validity Rules #494

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
chorman0773 opened this issue Mar 1, 2024 · 24 comments
Open

Adopt Minimum Union Validity Rules #494

chorman0773 opened this issue Mar 1, 2024 · 24 comments
Labels
A-unions Topic: Related to unions A-validity Topic: Related to validity invariants T-opsem

Comments

@chorman0773
Copy link
Contributor

chorman0773 commented Mar 1, 2024

I'd like a T-opsem FCP on adopting at least a minimum set of Union Validity Rules. This would allow documenting all type validity invariants in the dynamic.layout spec chapter (spec#15).

Based on discussions on zulip (https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Can.20we.20decide.20the.20validity.20invariant.20of.20unions.3F) the proposed rule is (in prose)

A value of a union type is valid if each value byte it was computed from was either initialized, or in at least one field of the union, that byte of that field is either a padding byte, or valid when computed from an uninitialized value.

The term "value byte" is used in the above chapter to mean "Any byte that is not a padding byte". Notably, any byte that is padding in all fields is padding in the union, and thus always allowed to be uninit.

The minimum rules ensure we can widen the Validity Invariant in the future to something closer to (or exactly) a trivial validity invariant.

@chorman0773
Copy link
Contributor Author

@rustbot label +A-unions +A-validity +T-opsem

@rustbot rustbot added A-unions Topic: Related to unions A-validity Topic: Related to validity invariants T-opsem labels Mar 1, 2024
@chorman0773
Copy link
Contributor Author

I'm not sure whether I should also add S-pending-team-decision.

@Lokathor
Copy link
Contributor

Lokathor commented Mar 1, 2024

I don't wanna pick on your wording too closely because it's a draft and all, but I think I know what you're trying to say, and I can barely tell what that means.

@chorman0773
Copy link
Contributor Author

Do you have a suggestion for better wording?

@Lokathor
Copy link
Contributor

Lokathor commented Mar 2, 2024

I think what's giving me trouble is that it's all one sentence with too many little parts in the sentence. Also, the use of the past tense strikes me as a little odd, though maybe it fits in the larger context. Also, it doesn't matter why a byte can be uninit, just that it can be for whatever reason. Keep things simple by not trying to list particular reasons.

Try this on for size:

A value of a union type is valid if each value byte is one of:

  • initialized
  • uninitialized and at least one field of the union is also allowed to be uninitialized at the same byte position

@chorman0773
Copy link
Contributor Author

chorman0773 commented Mar 2, 2024

The issue is that the way the rest of the chapter is I think makes it ill-defined to ask about the validity invariants of a padding byte. Validity checking is spec'd after value computation (and talks about the actual value), and padding bytes are completely ignored.

@Lokathor
Copy link
Contributor

Lokathor commented Mar 2, 2024

what? Then change the rest of the chapter, I guess.

If a spec can't tell me, very plainly, which bytes within a type must be initialized or not... well then something is very wrong with the organization of that spec.

You could say "padding bytes are valid when initialized and when uninitialized" somewhere, and that seems sufficient explanation.

@chorman0773
Copy link
Contributor Author

If a spec can't tell me, very plainly, which bytes within a type must be initialized or not... well then something is very wrong with the organization of that spec.

The spec tells you when a given value is valid. Parts of the chapter (IE. validity for scalar types), say

A value of is valid when all bytes the value was computed from are initialized.

Talking about validity in terms of individual bytes is actually ill-defined IMO, because validity is a compound property of a whole value (NonZeroU32 doesn't forbid any particular byte from being 0). So the chapter is written with that fact in mind. It follows that padding bytes, which don't participate in determining the value of a type, can be any value - initialized or not.

@Lokathor
Copy link
Contributor

Lokathor commented Mar 2, 2024

I'm not talking about which specific initialized values each byte can be though. Needing to be initialized or not is an "earlier" property than needing to pick a specific value. All four bytes in NonZeroU32 need to be initialized, and we can state this about every byte or not. As far as I know, we can state the initialization requirements of all bytes in all types, actually.

@CAD97
Copy link

CAD97 commented Mar 2, 2024

I think that {initialized INCLUSIVE-OR byte can be uninitialized/padding in at least one variant} being sufficient validity should be fairly non-controversial1, and support an FCP canonicalizing such2. I don't know exactly how the draft spec is handling provenance, but the FCP should explicitly acknowledge it (we've got T-lang FCP accepted that provenance exists, IIRC) even if to only say that nothing is determined yet.

Minor side notes:

  • Since #[repr(Rust)] unions do not specify layout (variant offsets), this spec has no implications at the repr-generic level, but only at the levels after instantiating with a specific layout choice.
  • To maintain ABI transparency, it may be required that union padding can get reset when manipulated by-value (at least for non-#[repr(Rust)] unions).

Footnotes

  1. Although not a given, since "untagged enum" representation wants stronger validity that preserves niche availability.

  2. Provided the FCP has a local copy of the disclaimer that while sufficient, it may be relaxed and not be required, thus is not a guarantee that unsafe may soundly expect from arbitrary code.

@chorman0773
Copy link
Contributor Author

chorman0773 commented Mar 2, 2024

The spec chapter is quite clear the repr(Rust) unions don't have defined offsets for each field. In general, I did my best to encapsilate only what guarantees currently exist while being complete. Unions were the only place I was unable to do that.

@chorman0773
Copy link
Contributor Author

chorman0773 commented Mar 2, 2024

A value of a union type is valid if each value byte is one of:

  • initialized
  • uninitialized and at least one field of the union is either not a value byte in that field or is allowed to be uninitialized at the corresponding offset

@Lokathor
Copy link
Contributor

Lokathor commented Mar 2, 2024

Looks pretty good, but that does still sorta imply that some padding bytes aren't allowed to be uninit.

EDIT: you may also want some wording that bytes past the end of a field count as being padding for this determination.

@chorman0773
Copy link
Contributor Author

Fixed to exclude bytes that aren't part of the field's storage, either before or after that storage.

@RalfJung
Copy link
Member

RalfJung commented Mar 5, 2024

Focusing on "init vs uninit" is very strange to me. Validity can vary between "must be this particular byte" and "can be anything including uninit", why would we special case the init vs uninit distinction? If we require union { u32, i32 } to be init, then why should we allow union { bool, bool } to be 0x03?

Also, validity should be defined as "this is the representation of some value of this type", so really we should be discussing what are the values of union type (in a domain like MiniRust's Value), and their representation, and then derive validity from that.

A value of a union type is valid if each value byte it was computed from was either initialized, or in at least one field of the union, that byte of that field is either a padding byte, or valid when computed from an uninitialized value.

What does it mean for a byte to be "valid when computed from an uninit value"? Validity is a property of sequences of bytes at a given type, you can't ask about just a single byte. And the "computed from" part is also entirely unclear to me.

These things are much better expressed in logical notation than English, IMO.

@RalfJung
Copy link
Member

RalfJung commented Mar 5, 2024

We had tons of prior discussion on this in #438 and the linked issues. Any attempt to come to a decision here should probably start by summarizing the main points we already discussed.

@chorman0773
Copy link
Contributor Author

Focusing on "init vs uninit" is very strange to me. Validity can vary between "must be this particular byte" and "can be anything including uninit", why would we special case the init vs uninit distinction? If we require union { u32, i32 } to be init, then why should we allow union { bool, bool } to be 0x03?

The issue is that the rule gets wildly complex from there. The most simple counterexample is

#[repr(C)]
pub union{
    pub foo: NonZeroU16,
    pub bar: (NonZeroU8, bool),
}

Which, assuming a packed layout for the tuple can be in the state [00 00], which is not valid for any of the variants. Picking a rule that builds this representation will certainly be an exercise is both the minirust spec and the prose spec. I'd rather union validity not taking several essays to specify, when every other type can be done in 1-2 sentences.

@RalfJung
Copy link
Member

RalfJung commented Mar 5, 2024

The issue is that the rule gets wildly complex from there.

I think it's no more complicated than actually defining your "byte is valid when computed from an uninit value".

But I agree it is too complicated, which is why I think unions should allow uninit throughout and leave the rest to explicit opt-in.

@CAD97
Copy link

CAD97 commented Mar 5, 2024

Focusing on "init vs uninit" is very strange to me.

I don't think this is entirely a fair characterization — the property that more fundamentally matters is whether a byte is potentially initialized as a padding byte. The bytes of MaybeUninit<u32> are not padding bytes, but in MaybeUninit::uninit(), they're initialized as padding after the uninit: () field.

So AIUI the spec is acknowledging three "kinds" of bytes — padding bytes (not preserved by value assignment) and value bytes (preserved by value assignment), with value bytes further classified as always initialized or as potentially uninitialized (due to that offset being padding in some enum/union variant). That's how "potentially uninit" can be an independently per-byte property while "valid value byte" isn't.

Although, actually: defining "byte is valid when computed from an uninitialized value" per-byte does need to be careful w.r.t. enums, where whether a particular byte is valid to be uninitialized does depend on other value bytes (namely, the discriminant encoding).

For clarity, I also think trivial validity is the semantics we do eventually want for unions; I conceptualize unions as essentially a set of typed places aliasing an untyped memory region. The benefit of the spec work using a stronger requirement is solely in not committing to trivial validity for unions.

To that end, I think the intermediate spec language should use the maximal requirement (for each construction byte there exists a field where there exists a valid construction of that field with that byte at that offset). The awkward nested "exists" quantification is still required for the "valid when computed from an uninitialized value" construction, just over ADT enum discriminants instead of over value bytes.

@RalfJung
Copy link
Member

I don't think this is entirely a fair characterization — the property that more fundamentally matters is whether a byte is potentially initialized as a padding byte. The bytes of MaybeUninit are not padding bytes, but in MaybeUninit::uninit(), they're initialized as padding after the uninit: () field.

I don't think of padding bytes as a first-class concept that shows up in the opsem, it is merely a derived notion. I don't know what "initialized as padding" is supposed to mean.

So AIUI the spec is acknowledging three "kinds" of bytes — padding bytes (not preserved by value assignment) and value bytes (preserved by value assignment), with value bytes further classified as always initialized or as potentially uninitialized (due to that offset being padding in some enum/union variant). That's how "potentially uninit" can be an independently per-byte property while "valid value byte" isn't.

Yes, it is exactly that that I am objecting to. We can add more classes of bytes, like "non-null" bytes or so. I don't think there is a need to have any such "class of bytes" show up in the definition of validity. They may be a useful way to compare certain validity invariants, but I think introducing them into the basic definitions is unnecessary complexity.

To that end, I think the intermediate spec language should use the maximal requirement (for each construction byte there exists a field where there exists a valid construction of that field with that byte at that offset). The awkward nested "exists" quantification is still required for the "valid when computed from an uninitialized value" construction, just over ADT enum discriminants instead of over value bytes.

I am not convinced that it is very useful to commit to an intermediate spec. This is a spec hole.

If we do for some reason have to commit to something, we could explicitly say: the validity invariant lies somewhere between the following two:

  • all byte sequences are allowed
  • (the thing you wrote)

@chorman0773
Copy link
Contributor Author

chorman0773 commented Mar 10, 2024

If by "somewhere between" you mean "here's a bunch of bounds, the validity invariant is unspecified between them", that's just the same thing as "Here's the strongest validity invariant in the range". It's not really meaningful to place a lower bound on what is considered undefined behaviour.

If you mean literally that the spec says "It something between these two bounds, and Rust knows what it is but isn't going to say", then that's not really useful to anyone - users can't rely on it being anything other than the strictest constraint, and implementors can't rely on it being anything other than the weakest constraint.

@RalfJung
Copy link
Member

If by "somewhere between" you mean "here's a bunch of bounds, the validity invariant is unspecified between them", that's just the same thing as "Here's the strongest validity invariant in the range". It's not really meaningful to place a lower bound on what is considered undefined behaviour.

If this was just about defining UB, you'd be right. But people will inevitably refer to "the validity invariant of union" in their own unsafe reasoning, and it is important to be clear that this may still change. It is also IMO important to note which parts of the spec are final and which are very much still open questions.

So I am arguing that saying "it is between these bounds" is not at all equivalent to saying "it is the strongest possible one". It is logically equivalent when only considering the AM, but not when considering the larger context and human factors.

@chorman0773
Copy link
Contributor Author

chorman0773 commented Apr 16, 2024

If this was just about defining UB, you'd be right. But people will inevitably refer to "the validity invariant of union" in their own unsafe reasoning, and it is important to be clear that this may still change.

That's probably true of a lot more than just union validity, and the spec shouldn't have to worry about specifying too little. That's frankly a lot of excess typing for something that doesn't make a difference for the required observable behavior of a program. It may be worth a note, but certainly not normative text.

@chorman0773
Copy link
Contributor Author

chorman0773 commented Apr 16, 2024

(Constraining ill-formedness to between bounds makes more sense, because the upper bound of well-formedness is semantically meaningful - a valid implementation will not translate a program that exceeds that upper bound - in UB you can't exactly say "The compiler will definitely start doing whatever it wants past this point" because "what it wants" could be completely well-defined anyways)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-unions Topic: Related to unions A-validity Topic: Related to validity invariants T-opsem
Projects
None yet
Development

No branches or pull requests

5 participants