Parity: URLCache adoption for HTTP and FTP invocations. #2401


Merged
merged 4 commits into from
Jul 15, 2019

Conversation

millenomi
Contributor

⚠️ Needs extensive testing and the addition of many of the fixes discussed in the comments of the PR linked below.

  • Includes the changes from the previous PR, Parity: URLCache #2341, which introduce an implementation of URLCache.

  • Store and provide the storage date for cached responses, for internal use only.

  • Implement protocol properties.

  • URLSessionConfiguration now configures the shared and ephemeral sessions with caches of appropriate size and storage setup. Each ephemeral session has a new in-memory-only cache, and the shared session uses the shared `URLCache` (see the sketch after this list).

  • The URLProtocolClient that drives URLSession tasks now asks its delegate for caching, and if a cacheable response is determined, uses the URLSession’s cache to store it. It will also fetch from the cache if appropriate, and will initialize the protocol with that response.

  • Add hooks to _NativeProtocol to affect whether the transfer can be satisfied by the cached response it was initialized with. If it can, the response will be replayed instead.

  • _HTTPURLProtocol implements those hooks to make the URLCache behave as an RFC 7234 private cache. (Note that _FTPURLProtocol uses the default implementation of those hooks, which will cause related responses to be evicted from the cache and no response to be fulfilled from the cache. This is intentional.)
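A minimal sketch of the intended effect, assuming a client sets up the ephemeral cache by hand (this is not the PR's implementation; the ephemeral memory capacity is illustrative, while the shared capacities match the shared URLCache in the diff further down):

import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Shared session: backed by URLCache.shared
// (memoryCapacity: 4 MB, diskCapacity: 20 MB, diskPath: nil; see the diff below).
let sharedSession = URLSession.shared

// Ephemeral session: a fresh in-memory-only cache, no disk storage.
let ephemeralConfiguration = URLSessionConfiguration.ephemeral
ephemeralConfiguration.urlCache = URLCache(memoryCapacity: 4 * 1024 * 1024,
                                           diskCapacity: 0,
                                           diskPath: nil)
let ephemeralSession = URLSession(configuration: ephemeralConfiguration)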

millenomi added 3 commits July 5, 2019 10:25
Implement it all.
 - Store and provide the storage date for cached responses, for internal use only.

 - Implement protocol properties.

 - URLSessionConfiguration now configures the shared and ephemeral sessions with caches of appropriate size and storage setup. Each ephemeral session has a new in-memory-only cache, and the shared session uses the shared URLCache.

 - The URLProtocolClient that drives URLSession tasks now asks its delegate for caching, and if a cacheable response is determined, uses the URLSession’s cache to store it. It will also fetch from the cache if appropriate, and will initialize the protocol with that response.

 - Add hooks to _NativeProtocol to affect whether the transfer can be satisfied by the cached response it was initialized with. If it can, the response will be replayed instead.

 - _HTTPURLProtocol implements those hooks to make the URLCache behave as an RFC 7234 private cache.

Note that _FTPURLProtocol uses the default implementation of those hooks, which will cause related responses to be evicted from the cache and no response to be fulfilled from the cache. This is intentional.
@millenomi
Contributor Author

cc @drodriguez

@millenomi
Contributor Author

cc @ianpartridge

@millenomi mentioned this pull request Jul 10, 2019
@millenomi
Contributor Author

@swift-ci please test linux

}
}

private func identifier(for request: URLRequest) -> String? {
Contributor Author

The identifier now uses only the host, port and path components of a URL to generate a key. — @karthikkeyan @drodriguez

Note that this is not an HTTP cache, but a generic cache usable by any URLProtocol implementation — HTTP concerns go in the appropriate protocol implementation (and are included below).
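For illustration, a minimal sketch of such a key; the function name and exact formatting are hypothetical, and the PR may hash or escape the components differently:

import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Hypothetical sketch: build a cache key from only the host, port, and path
// of the request URL, ignoring scheme, query, and fragment.
func cacheIdentifier(for request: URLRequest) -> String? {
    guard let url = request.url, let host = url.host else { return nil }
    // Use an empty component when no explicit port is present.
    let portPart = url.port.map { ":\($0)" } ?? ""
    return host + portPart + url.path
}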

@millenomi
Contributor Author

@drodriguez I'm also concerned by the absence of locking, but we are both cross-platform (and thus can't rely on the platform having locking) and not all volumes may support locking even on the same OS.

I'm thinking flock() may be an okay thing to do tentatively, but I'd love the current implementation to not have any obvious incorrectness without filesystem locking (assuming that this is a cache and, as all caches do, it services its clients best-effort — I'm okay with a race evicting a cache value that might otherwise have been served.)
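As an illustration only (not part of the PR, and the helper name is hypothetical), the flock() idea could look roughly like this: take an advisory exclusive lock on an entry file while touching it, and skip the work if the lock cannot be taken, which is acceptable for best-effort caching.

#if canImport(Glibc)
import Glibc
#else
import Darwin
#endif

// Hypothetical helper: run `body` while holding an advisory exclusive lock
// on the file at `path`. If the file cannot be opened or locked, do nothing.
func withAdvisoryLock(onFileAt path: String, _ body: () -> Void) {
    let fd = open(path, O_RDWR | O_CREAT, 0o644)
    guard fd >= 0 else { return }
    defer { close(fd) }
    guard flock(fd, LOCK_EX) == 0 else { return }
    defer { _ = flock(fd, LOCK_UN) }
    body()
}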

@millenomi
Contributor Author

@swift-ci please test linux

self.mimeType = encodedMimeType as String
}

let nsmimetype = aDecoder.decodeObject(of: NSString.self, forKey: "NS.mimeType")
Contributor Author

@drodriguez re: differing patterns: you're right, I'll harmonize.

required init?(coder aDecoder: NSCoder) {
guard let response = aDecoder.decodeObject(of: URLResponse.self, forKey: "response"),
let data = aDecoder.decodeObject(of: NSData.self, forKey: "data"),
let storagePolicy = URLCache.StoragePolicy(rawValue: UInt(aDecoder.decodeInt64(forKey: "storagePolicy"))),
Contributor Author

@drodriguez — Re: bitPattern usage: IIRC that will not compile on 32-bit platforms.

Contributor

IMO, the current code will crash on 32-bit platforms if the decoded value is higher than the maximum 32-bit value, and will crash on 64-bit platforms if the decoded value is negative. Using decodeInteger and bitPattern should work, because decodeInteger is supposed to give back an Int, which is 32/64 bits depending on the platform, and those methods are never supposed to crash. Using bitPattern is closer to the old Obj-C pattern of casting the returned value to an unsigned type.
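Concretely, the suggestion amounts to something like the following (a sketch, not the merged change; the wrapper function exists only for illustration):

import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// decodeInteger(forKey:) returns a platform-sized Int and never traps;
// UInt(bitPattern:) reinterprets it as an unsigned value, mirroring the
// old Obj-C cast.
func decodedStoragePolicy(from aDecoder: NSCoder) -> URLCache.StoragePolicy? {
    let rawValue = UInt(bitPattern: aDecoder.decodeInteger(forKey: "storagePolicy"))
    return URLCache.StoragePolicy(rawValue: rawValue)
}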

@@ -154,6 +195,9 @@ open class CachedURLResponse : NSObject, NSCopying {

open class URLCache : NSObject {

private static let sharedLock = NSLock()
private static var _shared = URLCache(memoryCapacity: 4 * 1024 * 1024, diskCapacity: 20 * 1024 * 1024, diskPath: nil)
Contributor Author

@drodriguez IIRC the language guarantees that, for static and global variables, their initializers aren't run until first access anyway.

Contributor

Yes, that's true, but that initialization seems to happen even if the static variable is an lvalue, like in line 226.

You can type the following in an Xcode playground and check:

var globalCounter = 0

class TestValue {
    init() {
        print("TestValue.init")
        globalCounter += 1
    }
}

class TestClass {
    private static var _static: TestValue = TestValue()
    
    class var accessor: TestValue {
        get {
            print("TestClass.accessor.get: before return")
            return _static
        }
        set {
            print("TestClass.accessor.set: before assign")
            _static = newValue
            print("TestClass.accessor.set: after assign")
        }
    }
}

let t = TestValue()
TestClass.accessor = t
print(globalCounter)

// Output:
// TestValue.init
// TestClass.accessor.set: before assign
// TestValue.init
// TestClass.accessor.set: after assign
// 2

inMemoryCacheContents[$0]!.cost
}

var totalSize = sizes.reduce(0, +)
Contributor Author

@drodriguez I'm avoiding duplication of logic here; I'd rather reason about clearer code in this first pass, and address performance as needed.

for identifier in identifiersToRemove {
inMemoryCacheContents.removeValue(forKey: identifier)
}
inMemoryCacheOrder.removeAll(where: { identifiersToRemove.contains($0) })
Contributor Author

@drodriguez I completely forgot about dropFirst, which obviously is always correct here. I'll adopt your suggestion.

var identifier: String

init?(_ url: URL) {
if url.pathExtension.localizedCompare(DiskEntry.pathExtension) != .orderedSame {
Contributor Author

@drodriguez good catch here — I intended to use caseInsensitiveCompare.


private func diskContentsURL(for request: URLRequest, forCreationAt date: Date? = nil) -> URL? {
guard let identifier = self.identifier(for: request) else { return nil }
guard let directory = cacheDirectory else { return nil }
Contributor Author

@drodriguez also good catch here — I'll swap these two.

if let date = date {
// If we're trying to _create_ an entry and it already exists, then we can't -- we should evict the old one first.
if foundURL != nil {
return nil
Contributor Author

@drodriguez I cannot guarantee that this check will not be executed even around L491, because we may be racing. You're correct that a behavior guarantee would be easier to achieve with filesystem locking, but as noted we may be operating without it and not have a choice in the matter.

I need to sit down and think more deeply about the date being part of the filename, since it may ruin what little atomicity we have in store… and lead to duplicate entries for the same URL. It may be that we can just blindly store and then do an eviction pass at the end for all URLs that have the same key and are not the highest timestamp we find on disk (not the one we just stored!), but I need to think through the racing implications of that.
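A hedged sketch of that "store blindly, then evict duplicates" idea (the helper and its inputs are hypothetical, not code from the PR):

import Foundation

// Hypothetical helper: given the on-disk entries that share one cache
// identifier, keep only the entry with the newest timestamp and delete the
// rest, so racing writers eventually converge on a single file.
func evictStaleEntries(sharingIdentifier entries: [(url: URL, date: Date)]) {
    let newestFirst = entries.sorted { $0.date > $1.date }
    for stale in newestFirst.dropFirst() {
        try? FileManager.default.removeItem(at: stale.url)
    }
}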

Contributor

Yes, my comments about locking the file storage weren't about locking it against two processes accessing the same cache, but about the same process accessing its own cache from two threads. My original design serialized much of the access to disk to avoid those kinds of problems. Knowing that the access was serial simplified many of the checks a lot.

Contributor Author

The problem with that reasoning is that, given that URLCache.shared is the default, it is not useful to do that anyway — the moment a second process runs on the same node, it will use the same path. So we need to be flexible enough to work in that case anyway.

Contributor

Two processes of the same binary will create problems with the file system storage. That's unavoidable without file-level locking. However, I don't think the situation happens that often for consumer software (a different story for server software, though). My intention in providing in-process serialization was to honor the documentation, which states that URLCache is thread-safe.

Contributor Author

The moment we have to reason about safety at an interprocess level without locking, if we get it right, we by definition have safety at an intra-process, inter-thread level without locking. At that point, adding locking is at best a performance issue and at worst hides interprocess issues that we would've caught otherwise.

 - Less bookkeeping and looping in `evictFromMemory…`
 - Added a helper for getting a snapshot array of disk entries
 - Added the ability for the URL creation method to return the identifier it used to construct the URL
 - Fix several smaller issues in the logic and feedback
 - Do evict an existing entry before storage (to keep peak disk usage down), but then perform a cleanup pass wherein all but the latest cache entry for a specific identifier are deleted. This avoids leaving orphaned cache entries if two concurrent processes or threads race in writing the same identifier to disk.
@millenomi
Contributor Author

cc @drodriguez many fixes.

@millenomi
Contributor Author

@swift-ci please test

@millenomi
Contributor Author

@swift-ci please test Linux

@millenomi
Contributor Author

@drodriguez merging today if you see no issue; I want to make sure that our existing URLSession tests are hitting the cache before merging, but I would be satisfied if there are no regressions there.

@drodriguez
Contributor

@millenomi if you need to merge, please do. I haven't had time to go over the new code, but we can always fix things after it lands. This has a lot more code than the original PR; I have only been responding to the comments brought over from the previous PR.

@millenomi
Contributor Author

I will merge as soon as merging is re-enabled after the build fix.

@millenomi
Contributor Author

millenomi commented Jul 15, 2019

I have verified that the URLSession tests are using URLSession.shared, meaning they're exercising URLCache.shared.

@millenomi merged commit abf3278 into swiftlang:master Jul 15, 2019
Contributor

@drodriguez left a comment

I just saw that this was merged while I was adding comments. It is a big change and my comments do not cover everything. I am adding them here just in case someone wants to look at them or talk about them. I might try my best to address them in the merged code.


guard let time = Int64(timeString) else { return nil }

self.date = Date(timeIntervalSinceReferenceDate: TimeInterval(time))
Contributor

I want to insist that 1-second resolution might not be enough for current processing speeds. The change to use subsecond resolution is quite easy: divide that time by 1000 here, and multiply by 1000 before generating the URL later.
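A sketch of that suggestion, assuming the timestamp string in the file name switches to milliseconds (the helper names are illustrative, not the PR's code):

import Foundation

// Encoding: multiply by 1000 so the file name carries millisecond resolution.
func timestampString(for date: Date) -> String {
    return String(Int64((date.timeIntervalSinceReferenceDate * 1000).rounded()))
}

// Decoding: divide by 1000 to recover the Date.
func date(fromTimestampString timeString: String) -> Date? {
    guard let milliseconds = Int64(timeString) else { return nil }
    return Date(timeIntervalSinceReferenceDate: TimeInterval(milliseconds) / 1000)
}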

do {
evictFromDiskCache(maximumSize: diskCapacity - entry.cost)

let locators = diskContentLocators(for: request, forCreationAt: Date())
Contributor

Question: I'm trying to understand if there's a reason for doing let locators, and then checking both locators?.url and locators?.identifier. I was wondering why not if let locators = … { /* use locators.url and locators.identifier without more checks */ }

public func getCachedResponse(for dataTask: URLSessionDataTask, completionHandler: (CachedURLResponse?) -> Void) { NSUnimplemented() }
public func removeCachedResponse(for dataTask: URLSessionDataTask) { NSUnimplemented() }
open func storeCachedResponse(_ cachedResponse: CachedURLResponse, for dataTask: URLSessionDataTask) {
guard let request = dataTask.currentRequest else { return }
Contributor

Not completely sure that using currentRequest here is the right thing to do. In most cases, in the code of getCachedResponse below, currentRequest will be the same as originalRequest, since one will check the request before sending it to the server. However, if this code only saves the response under currentRequest, that would mean that a permanent redirection will always go to the server, instead of using the cached response from earlier. I don't know how to exactly solve this, since it might depend on the kind of response that the data task received. Not sure if there's an easy solution; I just want to point out that this might not work as expected.

guard let urlProtocol = urlProtocolClass as? URLProtocol.Type else { fatalError("A protocol class registered with URLProtocol.register… was not a URLProtocol subclass: \(urlProtocolClass)") }
return urlProtocol
}
}
Contributor

Can we not make the checks for the URLProtocol.Type in the places where those protocols are set, instead of on query? It would probably make the problem easier to find. I think URLProtocol.registerClass() already does it, but I think URLSessionConfiguration.protocolClasses can do it in a willSet or similar. That should remove the checks here and move them earlier, where they will always be hit.

That way, a possible improvement might be doing the search only once, since both searches seem to be the same code to me (please point me to the difference if there's one, I might have missed it), by changing L58 to let protocolClasses = (session.configuration.protocolClasses ?? []) + (URLProtocol.getProtocols() ?? []).
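For illustration, a hedged sketch of the willSet idea above (the class here is a stand-in, not the actual URLSessionConfiguration, and the error message simply mirrors the fatalError in the diff):

import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Stand-in type for illustration only; the real change would live on
// URLSessionConfiguration.protocolClasses.
final class ConfigurationSketch {
    var protocolClasses: [AnyClass]? {
        willSet {
            // Validate on assignment, so a bad class is reported where it is
            // set rather than when the protocol list is queried.
            for cls in newValue ?? [] where !(cls is URLProtocol.Type) {
                fatalError("A protocol class set on protocolClasses is not a URLProtocol subclass: \(cls)")
            }
        }
    }
}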

}

func _getProtocol(_ callback: @escaping (URLProtocol?) -> Void) {
_protocolLock.lock() // Must be balanced below, before we call out ⬇
Contributor

With the performLocked from above, this might be able to be rewritten as follows, which should avoid the problem of missing a point where unlocking is needed.

let continuation: () -> Void = _protocolLock.performLocked {
  // most of the code from below
  return {
    cache.getCachedResponse(for: me) { (response) in … }
  }
  // more code from below
  return {}
  // more code from below
  return { callback(urlProtocol) }
  // etc…
}

continuation()

}

func _satisfyProtocolRequest(with urlProtocol: URLProtocol) {
_protocolLock.lock() // Must be balanced below, before we call out ⬇
Contributor

A similar idea can be applied here, but the return type can be a simple [(URLProtocol?) -> Void], which can be empty for the branches that do not need to execute anything.

self.workQueue.async {
self._protocol?.stopLoading()
self._getProtocol { (urlProtocol) in
self.workQueue.async {
Contributor

It is complicated to determine, but with the workQueue and the protocol being invalidated inside _protocolLock, some of these pieces of code might not have the same behaviour as before, which might not be expected. If _getProtocol here returns a non-nil protocol, it might have already been invalidated by the time the workQueue executes the code. That didn't happen in the previous version.

I don't know if that's important or not, however.

let directives = CacheControlDirectives(headerValue: cacheControl)

if directives.noCache || directives.noStore {
return false
Contributor

My understanding of no-cache is that the caches should revalidate with the origin server, but the cached content can still be used if the server answers “Not modified”.
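For reference, a sketch of the distinction being raised (not the merged logic, and directive arguments are ignored for simplicity): no-store forbids caching entirely, while no-cache allows storing but requires revalidation before reuse.

import Foundation

struct CacheUsage {
    var mayStore: Bool
    var mustRevalidateBeforeUse: Bool
}

// Interpret a Cache-Control header value per RFC 7234: no-store means the
// response must not be cached; no-cache means a stored response must be
// revalidated with the origin (and may be reused on "304 Not Modified").
func cacheUsage(forCacheControl headerValue: String) -> CacheUsage {
    let directives = headerValue
        .lowercased()
        .split(separator: ",")
        .map { $0.trimmingCharacters(in: .whitespaces) }
    if directives.contains("no-store") {
        return CacheUsage(mayStore: false, mustRevalidateBeforeUse: false)
    }
    if directives.contains("no-cache") {
        return CacheUsage(mayStore: true, mustRevalidateBeforeUse: true)
    }
    return CacheUsage(mayStore: true, mustRevalidateBeforeUse: false)
}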
