Parity: URLCache adoption for HTTP and FTP invocations. #2401


Merged
merged 4 commits into from
Jul 15, 2019

Conversation

millenomi
Contributor

⚠️ Needs extensive testing and the addition of many of the fixes discussed in the comments of the PR linked below.

  • Includes the changes from the previous PR, Parity: URLCache #2341, which introduce an implementation of URLCache.

  • Store and provide the storage date for cached responses, for internal use only.

  • Implement protocol properties.

  • URLSessionConfiguration now configures the shared and ephemeral sessions with caches of appropriate size and storage setup. Each ephemeral session has a new in-memory-only cache, and the shared session uses the shared `URLCache` (see the sketch after this list).

  • The URLProtocolClient that drives URLSession tasks now asks its delegate for caching, and if a cacheable response is determined, uses the URLSession’s cache to store it. It will also fetch from the cache if appropriate, and will initialize the protocol with that response.

  • Add hooks to _NativeProtocol to affect whether the transfer can be satisfied by the cached response it was initialized with. If it can, the response will be replayed instead.

  • _HTTPURLProtocol implements those hooks to make the URLCache behave as an RFC 7234 private cache. (Note that _FTPURLProtocol uses the default implementation of those hooks, which will cause related responses to be evicted from the cache and no response to be fulfilled from the cache. This is intentional.)
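A minimal sketch of the intended effect, assuming a client sets up the ephemeral cache by hand (this is not the PR's implementation; the ephemeral memory capacity is illustrative, while the shared capacities match the shared URLCache in the diff further down):

import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Shared session: backed by URLCache.shared
// (memoryCapacity: 4 MB, diskCapacity: 20 MB, diskPath: nil; see the diff below).
let sharedSession = URLSession.shared

// Ephemeral session: a fresh in-memory-only cache, no disk storage.
let ephemeralConfiguration = URLSessionConfiguration.ephemeral
ephemeralConfiguration.urlCache = URLCache(memoryCapacity: 4 * 1024 * 1024,
                                           diskCapacity: 0,
                                           diskPath: nil)
let ephemeralSession = URLSession(configuration: ephemeralConfiguration)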

millenomi added 3 commits July 5, 2019 10:25
Implement it all.
 - Store and provide the storage date for cached responses, for internal use only.

 - Implement protocol properties.

 - URLSessionConfiguration now configures the shared and ephemeral sessions with caches of appropriate size and storage setup. Each ephemeral session has a new in-memory-only cache, and the shared session uses the shared URLCache.

 - The URLProtocolClient that drives URLSession tasks now asks its delegate for caching, and if a cacheable response is determined, uses the URLSession’s cache to store it. It will also fetch from the cache if appropriate, and will initialize the protocol with that response.

 - Add hooks to _NativeProtocol to affect whether the transfer can be satisfied by the cached response it was initialized with. If it can, the response will be replayed instead.

 - _HTTPURLProtocol implements those hooks to make the URLCache behave as an RFC 7234 private cache.

Note that _FTPURLProtocol uses the default implementation of those hooks, which will cause related responses to be evicted from the cache and no response to be fulfilled from the cache. This is intentional.
@millenomi
Contributor Author

cc @drodriguez

@millenomi
Contributor Author

cc @ianpartridge

@millenomi mentioned this pull request Jul 10, 2019
@millenomi
Contributor Author

@swift-ci please test linux

}
}

private func identifier(for request: URLRequest) -> String? {
Contributor Author

The identifier now uses only the host, port and path components of a URL to generate a key. — @karthikkeyan @drodriguez

Note that this is not an HTTP cache, but a generic cache usable by any URLProtocol implementation — HTTP concerns go in the appropriate protocol implementation (and are included below).
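For illustration, a minimal sketch of such a key; the function name and exact formatting are hypothetical, and the PR may hash or escape the components differently:

import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Hypothetical sketch: build a cache key from only the host, port, and path
// of the request URL, ignoring scheme, query, and fragment.
func cacheIdentifier(for request: URLRequest) -> String? {
    guard let url = request.url, let host = url.host else { return nil }
    // Use an empty component when no explicit port is present.
    let portPart = url.port.map { ":\($0)" } ?? ""
    return host + portPart + url.path
}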

@millenomi
Contributor Author

@drodriguez I'm also concerned by the absence of locking, but we are both cross-platform (and thus can't rely on the platform having locking) and not all volumes may support locking even on the same OS.

I'm thinking flock() may be an okay thing to do tentatively, but I'd love the current implementation to not have any obvious incorrectness without filesystem locking (assuming that this is a cache and, as all caches do, it services its clients best-effort — I'm okay with a race evicting a cache value that might otherwise have been served.)
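As an illustration only (not part of the PR, and the helper name is hypothetical), the flock() idea could look roughly like this: take an advisory exclusive lock on an entry file while touching it, and skip the work if the lock cannot be taken, which is acceptable for best-effort caching.

#if canImport(Glibc)
import Glibc
#else
import Darwin
#endif

// Hypothetical helper: run `body` while holding an advisory exclusive lock
// on the file at `path`. If the file cannot be opened or locked, do nothing.
func withAdvisoryLock(onFileAt path: String, _ body: () -> Void) {
    let fd = open(path, O_RDWR | O_CREAT, 0o644)
    guard fd >= 0 else { return }
    defer { close(fd) }
    guard flock(fd, LOCK_EX) == 0 else { return }
    defer { _ = flock(fd, LOCK_UN) }
    body()
}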

@millenomi
Contributor Author

@swift-ci please test linux

self.mimeType = encodedMimeType as String
}

let nsmimetype = aDecoder.decodeObject(of: NSString.self, forKey: "NS.mimeType")
Contributor Author

@drodriguez re: differing patterns: you're right, I'll harmonize.

required init?(coder aDecoder: NSCoder) {
guard let response = aDecoder.decodeObject(of: URLResponse.self, forKey: "response"),
let data = aDecoder.decodeObject(of: NSData.self, forKey: "data"),
let storagePolicy = URLCache.StoragePolicy(rawValue: UInt(aDecoder.decodeInt64(forKey: "storagePolicy"))),
Contributor Author

@drodriguez — Re: bitPattern usage: IIRC that will not compile on 32-bit platforms.

Contributor

IMO, the current code will crash on 32-bit platforms if the decoded value is higher than the maximum 32-bit value, and will crash on 64-bit platforms if the decoded value is negative. Using decodeInteger and bitPattern should work, because decodeInteger is supposed to give back an Int, which is 32/64 bits depending on the platform, and those methods are never supposed to crash. Using bitPattern is closer to the old Obj-C pattern of casting the returned value to an unsigned type.
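Concretely, the suggestion amounts to something like the following (a sketch, not the merged change; the wrapper function exists only for illustration):

import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// decodeInteger(forKey:) returns a platform-sized Int and never traps;
// UInt(bitPattern:) reinterprets it as an unsigned value, mirroring the
// old Obj-C cast.
func decodedStoragePolicy(from aDecoder: NSCoder) -> URLCache.StoragePolicy? {
    let rawValue = UInt(bitPattern: aDecoder.decodeInteger(forKey: "storagePolicy"))
    return URLCache.StoragePolicy(rawValue: rawValue)
}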

@@ -154,6 +195,9 @@ open class CachedURLResponse : NSObject, NSCopying {

open class URLCache : NSObject {

private static let sharedLock = NSLock()
private static var _shared = URLCache(memoryCapacity: 4 * 1024 * 1024, diskCapacity: 20 * 1024 * 1024, diskPath: nil)
Contributor Author

@drodriguez IIRC the language guarantees that, for static and global variables, their initializers aren't run until first access anyway.

Contributor

Yes, that's true, but that initialization seems to happen even if the static variable is an lvalue, like in line 226.

You can type the following in an Xcode playground and check:

var globalCounter = 0

class TestValue {
    init() {
        print("TestValue.init")
        globalCounter += 1
    }
}

class TestClass {
    private static var _static: TestValue = TestValue()
    
    class var accessor: TestValue {
        get {
            print("TestClass.accessor.get: before return")
            return _static
        }
        set {
            print("TestClass.accessor.set: before assign")
            _static = newValue
            print("TestClass.accessor.set: after assign")
        }
    }
}

let t = TestValue()
TestClass.accessor = t
print(globalCounter)

// Output:
// TestValue.init
// TestClass.accessor.set: before assign
// TestValue.init
// TestClass.accessor.set: after assign
// 2

inMemoryCacheContents[$0]!.cost
}

var totalSize = sizes.reduce(0, +)
Contributor Author

@drodriguez I'm avoiding duplication of logic here; I'd rather reason about clearer code in this first pass, and address performance as needed.

for identifier in identifiersToRemove {
inMemoryCacheContents.removeValue(forKey: identifier)
}
inMemoryCacheOrder.removeAll(where: { identifiersToRemove.contains($0) })
Contributor Author

@drodriguez I completely forgot about dropFirst, which obviously is always correct here. I'll adopt your suggestion.

var identifier: String

init?(_ url: URL) {
if url.pathExtension.localizedCompare(DiskEntry.pathExtension) != .orderedSame {
Contributor Author

@drodriguez good catch here — I intended to use caseInsensitiveCompare.


private func diskContentsURL(for request: URLRequest, forCreationAt date: Date? = nil) -> URL? {
guard let identifier = self.identifier(for: request) else { return nil }
guard let directory = cacheDirectory else { return nil }
Contributor Author

@drodriguez also good catch here — I'll swap these two.

if let date = date {
// If we're trying to _create_ an entry and it already exists, then we can't -- we should evict the old one first.
if foundURL != nil {
return nil
Contributor Author

@drodriguez I cannot guarantee that this check will not be executed even around L491, because we may be racing. You're correct that a behavior guarantee would be easier to achieve with filesystem locking, but as noted we may be operating without it and not have a choice in the matter.

I need to sit down and think more deeply about the date being part of the filename, since it may ruin what little atomicity we have in store… and lead to duplicate entries for the same URL. It may be that we can just blindly store and then do an eviction pass at the end for all URLs that have the same key and are not the highest timestamp we find on disk (not the one we just stored!), but I need to think through the racing implications of that.
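A hedged sketch of that "store blindly, then evict duplicates" idea (the helper and its inputs are hypothetical, not code from the PR):

import Foundation

// Hypothetical helper: given the on-disk entries that share one cache
// identifier, keep only the entry with the newest timestamp and delete the
// rest, so racing writers eventually converge on a single file.
func evictStaleEntries(sharingIdentifier entries: [(url: URL, date: Date)]) {
    let newestFirst = entries.sorted { $0.date > $1.date }
    for stale in newestFirst.dropFirst() {
        try? FileManager.default.removeItem(at: stale.url)
    }
}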

Contributor

Yes, my comments about locking the file storage weren't about locking it against two processes accessing the same cache, but about the same process accessing its own cache from two threads. My original design serialized much of the access to disk to avoid those kinds of problems. Knowing that the access was serial simplified many of the checks a lot.

Contributor Author

The problem with that reasoning is that, given that URLCache.shared is the default, it is not useful to do that anyway — the moment a second process runs on the same node, it will use the same path. So we need to be flexible enough to work in that case anyway.

Contributor

Two processes of the same binary will create problems with the file system storage. That's unavoidable without file-level locking. However, I don't think the situation happens that often for consumer software (a different story for server software, though). My intention in providing in-process serialization was to honor the documentation, which states that URLCache is thread-safe.

Contributor Author

The moment we have to reason about safety at an interprocess level without locking, if we get it right, we by definition have safety at an intra-process, inter-thread level without locking. At that point, adding locking is at best a performance issue and at worst hides interprocess issues that we would've caught otherwise.

 - Less bookkeeping and looping in `evictFromMemory…`
 - Added a helper for getting a snapshot array of disk entries
 - Added the ability for the URL creation method to return the identifier it used to construct the URL
 - Fix several smaller issues in the logic and feedback
 - Do evict an existing entry before storage (to keep peak disk usage down), but then perform a cleanup pass wherein all but the latest cache entry for a specific identifier are deleted. This avoids leaving orphaned cache entries if two concurrent processes or threads race in writing the same identifier to disk.
@millenomi
Contributor Author

cc @drodriguez many fixes.

@millenomi
Contributor Author

@swift-ci please test

@millenomi
Contributor Author

@swift-ci please test Linux

@millenomi
Contributor Author

@drodriguez merging today if you see no issue; I want to make sure that our existing URLSession tests are hitting the cache before merging, but I would be satisfied if there are no regressions there.

@drodriguez
Contributor

@millenomi if you need to merge, please do. I haven't had time to go over the new code, but we can always fix things after it lands. This has a lot more code than the original PR; I have only been responding to the comments brought over from the previous PR.

@millenomi
Contributor Author

I will merge as soon as merging is re-enabled after the build fix.

@millenomi
Contributor Author

millenomi commented Jul 15, 2019

I have verified that the URLSession tests are using URLSession.shared, meaning they're exercising URLCache.shared.

@millenomi merged commit abf3278 into swiftlang:master Jul 15, 2019
Contributor

@drodriguez left a comment

I just saw that this was merged while I was adding comments. It is a big change and my comments do not cover everything. I am adding them here just in case someone wants to look at them or talk about them. I might try my best to address them in the merged code.


guard let time = Int64(timeString) else { return nil }

self.date = Date(timeIntervalSinceReferenceDate: TimeInterval(time))
Contributor

I want to insist that 1-second resolution might not be enough for current processing speeds. The change to use subsecond resolution is quite easy: divide that time by 1000 here, and multiply by 1000 before generating the URL later.
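A sketch of that suggestion, assuming the timestamp string in the file name switches to milliseconds (the helper names are illustrative, not the PR's code):

import Foundation

// Encoding: multiply by 1000 so the file name carries millisecond resolution.
func timestampString(for date: Date) -> String {
    return String(Int64((date.timeIntervalSinceReferenceDate * 1000).rounded()))
}

// Decoding: divide by 1000 to recover the Date.
func date(fromTimestampString timeString: String) -> Date? {
    guard let milliseconds = Int64(timeString) else { return nil }
    return Date(timeIntervalSinceReferenceDate: TimeInterval(milliseconds) / 1000)
}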

do {
evictFromDiskCache(maximumSize: diskCapacity - entry.cost)

let locators = diskContentLocators(for: request, forCreationAt: Date())
Contributor

Question: I'm trying to understand if there's a reason for doing let locators, and then checking both locators?.url and locators?.identifier. I was wondering why not if let locators = … { /* use locators.url and locators.identifier without more checks */ }

public func getCachedResponse(for dataTask: URLSessionDataTask, completionHandler: (CachedURLResponse?) -> Void) { NSUnimplemented() }
public func removeCachedResponse(for dataTask: URLSessionDataTask) { NSUnimplemented() }
open func storeCachedResponse(_ cachedResponse: CachedURLResponse, for dataTask: URLSessionDataTask) {
guard let request = dataTask.currentRequest else { return }
Contributor

Not completely sure that using currentRequest here is the right thing to do. In most cases, in the code of getCachedResponse below, currentRequest will be the same as originalRequest, since one will check the request before sending it to the server. However, if this code only saves the response under currentRequest, that would mean that a permanent redirection will always go to the server, instead of using the cached response from earlier. I don't know how to exactly solve this, since it might depend on the kind of response that the data task received. Not sure if there's an easy solution; I just want to point out that this might not work as expected.

guard let urlProtocol = urlProtocolClass as? URLProtocol.Type else { fatalError("A protocol class registered with URLProtocol.register… was not a URLProtocol subclass: \(urlProtocolClass)") }
return urlProtocol
}
}
Contributor

Can we not make the checks for the URLProtocol.Type in the places where those protocols are set, instead of on query? It would probably make the problem easier to find. I think URLProtocol.registerClass() already does it, but I think URLSessionConfiguration.protocolClasses can do it in a willSet or similar. That should remove the checks here and move them earlier, where they will always be hit.

That way, a possible improvement might be doing the search only once, since both searches seem to be the same code to me (please point me to the difference if there's one, I might have missed it), by changing L58 to let protocolClasses = (session.configuration.protocolClasses ?? []) + (URLProtocol.getProtocols() ?? []).
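For illustration, a hedged sketch of the willSet idea above (the class here is a stand-in, not the actual URLSessionConfiguration, and the error message simply mirrors the fatalError in the diff):

import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Stand-in type for illustration only; the real change would live on
// URLSessionConfiguration.protocolClasses.
final class ConfigurationSketch {
    var protocolClasses: [AnyClass]? {
        willSet {
            // Validate on assignment, so a bad class is reported where it is
            // set rather than when the protocol list is queried.
            for cls in newValue ?? [] where !(cls is URLProtocol.Type) {
                fatalError("A protocol class set on protocolClasses is not a URLProtocol subclass: \(cls)")
            }
        }
    }
}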

}

func _getProtocol(_ callback: @escaping (URLProtocol?) -> Void) {
_protocolLock.lock() // Must be balanced below, before we call out ⬇
Contributor

With the performLocked from above, this might be able to be rewritten as follows, which should avoid the problem of missing a point where unlocking is needed.

let continuation: () -> Void = _protocolLock.performLocked {
  // most of the code from below
  return {
    cache.getCachedResponse(for: me) { (response) in … }
  }
  // more code from below
  return {}
  // more code from below
  return { callback(urlProtocol) }
  // etc…
}

continuation()

}

func _satisfyProtocolRequest(with urlProtocol: URLProtocol) {
_protocolLock.lock() // Must be balanced below, before we call out ⬇
Contributor

A similar idea can be applied here, but the return type can be a simple [(URLProtocol?) -> Void], which can be empty for the branches that do not need to execute anything.

self.workQueue.async {
self._protocol?.stopLoading()
self._getProtocol { (urlProtocol) in
self.workQueue.async {
Contributor

It is complicated to determine, but with the workQueue and the protocol being invalidated inside _protocolLock, some of these pieces of code might not have the same behaviour as before, which might not be expected. If _getProtocol here returns a non-nil protocol, it might have already been invalidated by the time the workQueue executes the code. That didn't happen in the previous version.

I don't know if that's important or not, however.

let directives = CacheControlDirectives(headerValue: cacheControl)

if directives.noCache || directives.noStore {
return false
Contributor

My understanding of no-cache is that the caches should revalidate with the origin server, but the cached content can still be used if the server answers “Not modified”.
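For reference, a sketch of the distinction being raised (not the merged logic, and directive arguments are ignored for simplicity): no-store forbids caching entirely, while no-cache allows storing but requires revalidation before reuse.

import Foundation

struct CacheUsage {
    var mayStore: Bool
    var mustRevalidateBeforeUse: Bool
}

// Interpret a Cache-Control header value per RFC 7234: no-store means the
// response must not be cached; no-cache means a stored response must be
// revalidated with the origin (and may be reused on "304 Not Modified").
func cacheUsage(forCacheControl headerValue: String) -> CacheUsage {
    let directives = headerValue
        .lowercased()
        .split(separator: ",")
        .map { $0.trimmingCharacters(in: .whitespaces) }
    if directives.contains("no-store") {
        return CacheUsage(mayStore: false, mustRevalidateBeforeUse: false)
    }
    if directives.contains("no-cache") {
        return CacheUsage(mayStore: true, mustRevalidateBeforeUse: true)
    }
    return CacheUsage(mayStore: true, mustRevalidateBeforeUse: false)
}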
