Skip to content

Redo HTTP cookie parsing #510

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jan 6, 2022
Merged

Redo HTTP cookie parsing #510

merged 3 commits into from
Jan 6, 2022

Conversation

karwa
Copy link
Contributor

@karwa karwa commented Dec 2, 2021

This PR does a couple of things:

  • Use strptime for time parsing.

    This is what Apple's documentation recommends for fixed, locale-independent formats. Unfortunately, Glibc needs us to define the _GNU_SOURCE flag, which requires the use of a C shim module.

    Besides URL (which is debatable as to how "required" it is - you can make requests without URLs), this is the only "required" functionality which depends on Foundation.

  • Rewrite cookie parsing

    The previous implementation had more allocations and copies than were necessary, and some really subtle uses of Foundation APIs such as .components(separatedBy: ";"). I've replaced it with something that tries to keep things as Substrings as much as possible and uses only the standard library.

  • Move Foundation APIs on HTTPClient.Cookie to a different file

    It turns out, there are really not many places where AsyncHTTPClient uses Foundation types in public APIs. There are a couple of initializers on Request which take a URL, Cookie has this Date, and Body has a convenience for working with Data. But that's it. By putting these APIs in a separate file, it will be easier to make them opt-out for no-foundation builds.

@karwa
Copy link
Contributor Author

karwa commented Dec 2, 2021

Hmm, seems like some older versions of Swift won't import the function using the header shim. We'll need an actual C function to forward to strptime.

@karwa karwa force-pushed the cookie-monster branch 4 times, most recently from 116faed to d790ad4 Compare December 2, 2021 23:03
Copy link
Collaborator

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a great idea, thanks! I've left some notes in the diff.

@karwa
Copy link
Contributor Author

karwa commented Dec 3, 2021

Also, the API breakage script doesn't seem to be able to build the module (it's failing to build/find the shims library, from what I can gather).

I took a look at it, but it uses a lot of swift commands I'm not familiar with (like dump-package and api-digester), so I'm not sure how to fix it.

@Lukasa Lukasa added the 🔨 semver/patch No public API change. label Dec 3, 2021
@karwa
Copy link
Contributor Author

karwa commented Dec 8, 2021

Just an FYI: I'm a bit swamped at the moment, but I'll push an update to address the feedback later this week

@karwa karwa force-pushed the cookie-monster branch 3 times, most recently from 450ac3e to 36c6010 Compare December 19, 2021 11:13
@karwa
Copy link
Contributor Author

karwa commented Dec 19, 2021

OK apologies for taking so long, and I'm aware people will likely be on holiday so maybe not able to review it for a while.

I think I've addressed all the feedback: C library has been renamed, cookie parsing now uses the UTF-8 view, the Foundation Date's setter has been restored, etc. I've also added a lot more tests for invalid date strings.

The test failures on 5.5/nightly appear to be related to async/Sendable stuff, so nothing to do with this patch.

@karwa karwa force-pushed the cookie-monster branch 2 times, most recently from d4d7188 to 68c51ea Compare December 19, 2021 12:01
return nil
}
let timestampString: String
if timestampUTF8.reversed().starts(with: "TMG".utf8) {
Copy link
Contributor Author

@karwa karwa Dec 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there is no general "hasSuffix" function, so this is the best I could do without adding more one-off helper functions 😅

I thought about doing .reversed().starts(with: "GMT".utf8.reversed()), but the double-reversing could have a performance/size impact.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I think a one-off helper function will help with clarity here. 😉

@karwa karwa requested a review from Lukasa December 20, 2021 11:46
if !hasPrefix(dquote) || !hasSuffix(dquote) {
return self
fileprivate init(utf8Slice: String.UTF8View.SubSequence, in base: String) {
self.init(base[utf8Slice.startIndex..<utf8Slice.endIndex])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary for us to slice the base? Isn't utf8Slice already the subsequence we're passing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • String.init?(String.UTF8View.Subsequence) can fail if the indexes are not aligned to character boundaries. We don't expect that to happen, but the SSWG guidelines also say to avoid force-unwrapping wherever possible.

  • String.init(Substring) (used here) can not fail. If the indexes are not aligned to character boundaries, it will adjust them. It's... not unreasonable fallback behaviour, and means we can avoid a bunch of optionals that won't be ever be nil.

In terms of performance, they both allocate independent String values, so are both O(n). I would suppose the String.init(Substring) initializer can be slightly faster, as by widening the bounds to aligned character boundaries, it may be able to skip UTF-8 validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a doc-comment, explaining how this is different from String.init?(String.UTF8View.Subsequence)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd rather have the optionals, tbh. The SSWG guidelines are understandable, but we're asserting that these are ASCII strings, and so there can never be a case where we have alignment issues. However, I'd like to hear from @fabianfett (SSWG member with a lot of interest in AHC) and @tomerd (SSWG member) for some more guidance here.

Copy link
Contributor Author

@karwa karwa Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, technically we're only asserting that our processing (trimming, etc) operates on whole scalars, which is fine since we only do ASCII-level processing. The strings could contain Unicode and there still wouldn't be any possibility of non-aligned indexes.

That being said, this automatic alignment happens all the time - you've only noticed it here because I wrote it in a weird way, but it happens whenever you create a Substring from a Substring.UTF8View. I'm uploading a patch which makes it a little less ugly, but there isn't a great way to avoid it completely. The options are:

  • String?(Substring.UTF8View), meaning allocating a String, and dealing with the optionals somehow. And you've been consistent about wanting to avoid allocations (and I agree with that sentiment), and the optionals are gross, so I assume this is undesirable.

  • Substring(Substring.UTF8View), which does not allocate but performs automatic alignment.

I don't mind either way, but those are the trade-offs.

case ("expires", .some(let value))?:
self.expires_timestamp = parseCookieTime(value, in: header)
case ("max-age", .some(let value))?:
self.maxAge = Int(String(utf8Slice: value, in: header))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid reifying the String here? We can construct Int from an arbitrary StringProtocol, which will save us the conversion cost.

guard !trimmedName.isEmpty else {
return nil
}
return (String(decoding: trimmedName, as: UTF8.self).lowercased(), trimmedValue)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do a decoding flow here? I think we can just construct the String directly from the Substring.UTF8View. Additionally, can we apply lowercased on the Substring.UTF8View ourselves? As we believe the text is entirely ASCII it's a simple bitwise operation, while lowercasing a Unicode string is complex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, nothing checks that the string is ASCII, and that's the sort of optimisation which I would be very disappointed if the standard library didn't do by itself (especially since it has internal flags which means it already knows whether a String is ASCII).

But adding ASCII checks is only part of it - the set of allowed bytes depends on which part of the cookie string you're talking about (RFC-6265 - note that cookie-octet is not the same as path-value or extension-av, and is actually a restricted subset of ASCII which excludes control characters and certain delimiters).

It would be useful to add those checks at some point, and the move to parsing at the UTF-8 level makes it easier to add that as a follow-up. But those should be accompanied by much more comprehensive tests which would bloat this PR, IMO, and distract from its focus which is timestamp parsing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standard library does perform it, though it still has to alloc a new string to do it. C'est la vie. For now I'm happy to delay on this.

return nil
}
let timestampString: String
if timestampUTF8.reversed().starts(with: "TMG".utf8) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I think a one-off helper function will help with clarity here. 😉

let timestampString: String
if timestampUTF8.reversed().starts(with: "TMG".utf8) {
let timezoneStart = timestampUTF8.index(timestampUTF8.endIndex, offsetBy: -3)
let trimmedTimestampUTF8 = timestampUTF8[..<timezoneStart].trimmingASCIISpaces()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than doing index(_:offsetBy:) here, just do dropLast(3)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need the index, and this function already includes variables called timestampUTF8, trimmedTimestampUTF8, timestampString, so storing a single index called timezoneStart seems most easily understandable IMO.

And we can't do something like:

var trimmedTimestampUTF8 = timestampUTF8.dropLast(3)
trimmedTimestampUTF8 = trimmedTimestampUTF8.trimmingASCIISpaces()

...because the point is we need to check that trimmedTimestampUTF8.endIndex != timezoneStart (i.e. there was at least one space between the timezone and rest of the timestamp).

@karwa
Copy link
Contributor Author

karwa commented Jan 4, 2022

@Lukasa , I've made the following changes:

  • Added a doc-comment explaining how String(utf8Slice:in:) is different from String.init?(String.UTF8View.Subsequence)
  • Parse max-age directly from the substring
  • Added a hasSuffix function, with tests
  • Added FIXMEs to HTTPCookie.init?(header:defaultDomain) and HTTPCookie.init(key:value:...) noting that they need stricter validation. In particular, the latter needs to be failable (which would be an API-breaking change).

The hasSuffix function is generally quite useful, so I've made it generic. It's also worth testing it for edge-cases, and generics helps with that.

@karwa
Copy link
Contributor Author

karwa commented Jan 4, 2022

OK, 2 more commits:

  1. Renamed String(utf8Slice:from:) to String(aligningUTF8:) (dropping the base-string argument), implemented it by going through Substring, and removed it from timestamp parsing (it actually helped make the code a bit clearer).
  2. While taking a look at Section 5.2 of RFC-6265 (which describes the lenient parser we "MUST" use to parse Set-Cookie headers), I noticed that the way we handle parsed values and decide which ones to use was all wrong - we shouldn't be skipping nil values in all cases (such as the path), and we weren't handling invalid values properly for many of the other cases (we should ignore them, not set to nil). So I fixed it, and added a bunch of test cases for each component. I was going to leave it for a follow-up PR, but I figured that this one already rewrites basically the entire rest of the parser, and misinterpreting cookies could have privacy implications so it's important to fix it ASAP.

@karwa karwa changed the title Redo HTTP cookie parsing using strptime Redo HTTP cookie parsing Jan 4, 2022
Copy link
Collaborator

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really minor notes here, this looks fantastic. Happy to take your feedback here, take whichever of the notes you feel good about, then let me know and I'll hit the big green button.

/// - maxAge: The cookie's age in seconds, defaults to nil.
/// - httpOnly: Whether this cookie should be used by HTTP servers only, defaults to false.
/// - secure: Whether this cookie should only be sent using secure channels, defaults to false.
public init(name: String, value: String, path: String = "/", domain: String? = nil, expires: Date? = nil, maxAge: Int? = nil, httpOnly: Bool = false, secure: Bool = false) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is public API, should we make some room for ourselves by making this failable now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making it failable would be an API-breaking change (semver-major) though, right?

It would be good to do, but I'd rather land this and check that out as a follow-up.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I lost track of the fact that this is an existing init. We'd need to deprecate-and-replace it. Happy to delay that to a follow-up.

for component in components {
switch component.parseCookieComponent() {
case ("path", let value)?:
// Unlike other values, unspecified and empty paths reset to the default path.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we cite comments like this? I love to see them and I'm really pleased this one is here, but it makes my life much easier as a reviewer (and future reader) if I can rapidly follow the citation (in this case, RFC 6265 § 5.2.4).

parsedPath = value
case ("domain", let value)?:
guard var value = value, !value.isEmpty else {
break
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really mild nit and you can ignore it if you feel strongly, but I kinda hate break in a switch nested in a loop. It takes me a second or two to figure out to what it applies. Can we use continue here instead?

}
self.expires_timestamp = timestamp
case ("max-age", let value)?:
guard let value = value, let age = Int(Substring(value)) else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun task for the future: let's just write our own Int.init(_ bytes: Substring.UTF8View) initializer. It doesn't really matter, but it'd be fun and very slightly faster. Let's not do it now though, just planting seeds to keep you engaged in the future 😉.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's not too difficult. I already did a similar thing for WebURL, because the standard library's version only accepts StringProtocol rather than arbitrary buffers of bytes: https://github.com/karwa/swift-url/blob/1f1263f94e25eb43d8688dc668de6638e092f748/Sources/WebURL/Util/ASCII.swift#L502

But yeah, it can be a follow-up. I'll have a patch up which addresses the nits in a little bit.

@karwa
Copy link
Contributor Author

karwa commented Jan 6, 2022

@Lukasa okay, addressed 2 nits:

  • Added a citation for how the "path" value is handled
  • Changed continue to break in the component-parsing loop

As for making the memberwise public initialiser failable, I'm not sure how API-breaking/semver-major changes are handled (leave the PR sitting around until the next release is scheduled? conditional compilation?), but it could be done by a follow-up PR.

@Lukasa Lukasa merged commit 972bcdd into swift-server:main Jan 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🔨 semver/patch No public API change.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants