Parse JSON with value types #2966
Conversation
@swift-ci please test
I will have a deeper look later, but do we currently have any benchmarks for the current implementation? I've also been working on an improved solution for the JSON number parsing to better deal with conversion issues between …
Hey, I took the liberty of benchmarking these changes against nightly, and it's quite an improvement. Parsing a sizable JSON doc (~8k bytes) 10,000 times takes: Nightly: 1.71179s …
Thanks for doing that @fabianfett. I just want to point out that improvements like this are very important for server-side Swift. We use swift-extras-json (previously Pure Swift JSON) in a production environment, and it is much faster than the current version of Foundation. I would love to see at least the same numbers for the Foundation that comes with Swift, so we can drop the redundant third-party dependency.
Force-pushed from 89a6532 to c7319a1
@swift-ci please test
#warning("@Fabian: pretty sure we should throw an error here, if this is invalid data")
let json = String(bytes: ptr, encoding: encoding!)!
return try json.utf8.withContiguousStorageIfAvailable { (utf8) -> JSONValue in
    try JSONParser().parse(bytes: utf8)
}!
On Darwin, this can fail. On Linux it cannot. You can write

var json = String(bytes: ...)
json.makeContiguousUTF8()

and after that you can force-unwrap it.
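A minimal sketch of that pattern, using a hypothetical decodeContiguous helper in place of the surrounding parse code (the function name and throwing behavior are illustrative, not the PR's actual implementation):

```swift
import Foundation

// Convert without force-unwrapping first (external input should throw,
// not crash), then make the UTF-8 storage contiguous so the later
// force-unwrap of withContiguousStorageIfAvailable can no longer fail.
func decodeContiguous(bytes: [UInt8], encoding: String.Encoding) throws -> String {
    guard var json = String(bytes: bytes, encoding: encoding) else {
        throw CocoaError(.propertyListReadCorrupt)
    }
    json.makeContiguousUTF8()
    // The closure is now guaranteed to run, so `!` is safe here.
    let byteCount = json.utf8.withContiguousStorageIfAvailable { $0.count }!
    assert(byteCount == json.utf8.count)
    return json
}
```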
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't JSON always UTF-8 though?
I found a failure input which is UTF-8 with a BOM and bad data. Here is a test case to add to TestJSONSerialization.test_JSONObjectWithData_encodingDetection:

("{} UTF-8 w/BOM w/ trailing illegal data", [0xEF, 0xBB, 0xBF, 0x7B, 0x7D, 0xff, 0x00]),

(It will need a test to confirm that the JSON parsing failed and returned nil.)
I would suggest merging parseBOM and detectEncoding (or at least having one call the other) - I can't see anywhere else they are used. If the encoding is determined to be UTF-8, then any leading BOM can be skipped, as a new String does not need to be created. If it looks like UTF-8, a quick check that the first character is ASCII should be sufficient confirmation, as I don't think any JSON can start with a Unicode code point > 127.

Then the code could be simplified to something like:

guard let (encoding, advanceBy) = parseBOM(ptr) else { throw ... }
let newPtr = ptr[advanceBy..<ptr.count]
if encoding == .utf8 {
    // parse it from newPtr...
} else {
    guard let string = String(newPtr...) else { throw ... }
}
I couldn't find anything on json.org to say that JSON is guaranteed to be UTF-8, but it's unlikely not to be, so just converting to a new String as you have done is fine (even though it will do an extra allocation). I would also avoid any force unwrapping when using String() and just throw, since you are dealing with external input.
From here:

3. Encoding

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.
00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8
For what it's worth, Darwin Foundation checks whether the input is UTF-8 first, and if it's not, it just converts it to UTF-8 for parsing.
@parkera I'm afraid you quoted RFC 4627, which has been obsoleted for quite some time by RFC 8259:
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
Am I right in assuming that the Darwin implementation allows UTF-16 and UTF-32 encodings and further allows byte order marks?
I think the important takeaway from the 2 RFCs is that the input COULD be in any Unicode encoding but the output MUST always be UTF-8. However since the chance of the input being non-UTF8 is low the conversion from other inputs to UTF-8 as a one-time operation at the start is fine. Your code currently does all of this.
Running the test test_JSONObjectWithData_encodingDetection on Darwin using DarwinCompatibilityTests seems to indicate that Darwin doesn't handle some inputs (UTF-16BE/LE without BOM and UTF-32LE w/BOM).
@spevans @millenomi I'm afraid this is ready for review.
} else {
    throw NSError(domain: NSCocoaErrorDomain, code: CocoaError.propertyListReadCorrupt.rawValue,
                  userInfo: [NSDebugDescriptionErrorKey : "Numbers must start with a 1-9 at character \(input)." ])
self.array = Array(bytes)
}
Does DocumentReader need to take the input as <Bytes: Collection>? It looks like the input is always an UnsafeBufferPointer<UInt8>, and simply storing that struct may be faster. Calling Array(bytes) or as? [UInt8] may end up allocating another buffer. I think the generics are just going to add overhead here.
I'm pretty sure the compiler will create two perfectly optimized methods for this, which means we won't have any performance hits here. I don't think we should operate on UnsafeBufferPointer<UInt8> here, since this comes with security implications, given the unsafe memory access in release builds.
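For illustration, here is a minimal generic reader showing the shape under discussion (names are hypothetical, not the PR's actual DocumentReader). Because it stores the bytes generically, no Array(bytes) copy is needed, and within one module the optimizer can emit a specialized copy per concrete Bytes type:

```swift
// A generic byte reader over any Collection of UInt8. The compiler can
// specialize this per concrete Bytes type, so the generic abstraction
// need not cost anything at runtime.
struct ByteReader<Bytes: Collection> where Bytes.Element == UInt8 {
    private let bytes: Bytes
    private var index: Bytes.Index

    init(bytes: Bytes) {
        self.bytes = bytes
        self.index = bytes.startIndex
    }

    // Returns the next byte, or nil at the end of input.
    mutating func next() -> UInt8? {
        guard index < bytes.endIndex else { return nil }
        defer { index = bytes.index(after: index) }
        return bytes[index]
    }
}

// Works with a bounds-checked Array as well as an UnsafeBufferPointer view:
var reader = ByteReader(bytes: [0x7B, 0x7D] as [UInt8]) // "{" "}"
```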
Force-pushed from ace6c73 to 207c5b9
if bytes.starts(with: [0xFE, 0xFF]) {
    return (.utf16BigEndian, 2)
}

return nil
detectEncoding and parseBOM should be merged into one function returning (encoding: String.Encoding, skipLength: Int)?, since they both work on the same inputs.
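As a sketch of that suggestion (the function name and structure are illustrative, not the PR's actual code), combining BOM detection with the RFC 4627 null-byte pattern as a fallback:

```swift
import Foundation

// Hypothetical merged helper: byte order marks take precedence; without
// a BOM, infer the encoding from the pattern of nulls in the first four
// octets (valid JSON starts with at least two ASCII characters).
func detectEncoding(_ bytes: [UInt8]) -> (encoding: String.Encoding, skipLength: Int)? {
    // 1. BOM checks. UTF-32LE must be tested before UTF-16LE,
    //    because FF FE 00 00 starts with the UTF-16LE BOM FF FE.
    if bytes.starts(with: [0x00, 0x00, 0xFE, 0xFF]) { return (.utf32BigEndian, 4) }
    if bytes.starts(with: [0xFF, 0xFE, 0x00, 0x00]) { return (.utf32LittleEndian, 4) }
    if bytes.starts(with: [0xEF, 0xBB, 0xBF])       { return (.utf8, 3) }
    if bytes.starts(with: [0xFE, 0xFF])             { return (.utf16BigEndian, 2) }
    if bytes.starts(with: [0xFF, 0xFE])             { return (.utf16LittleEndian, 2) }

    // 2. No BOM: null-pattern heuristic from RFC 4627, section 3.
    if bytes.count >= 4 {
        switch (bytes[0], bytes[1], bytes[2], bytes[3]) {
        case (0, 0, 0, _): return (.utf32BigEndian, 0)
        case (_, 0, 0, 0): return (.utf32LittleEndian, 0)
        case (0, _, 0, _): return (.utf16BigEndian, 0)
        case (_, 0, _, 0): return (.utf16LittleEndian, 0)
        default: break
        }
    }
    return bytes.isEmpty ? nil : (.utf8, 0)
}
```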
case .utf32LittleEndian where buffer[input+1] == 0 && buffer[input+2] == 0 && buffer[input+3] == 0:
index = input
switch extraCharacter {
case UInt8(ascii: " "), UInt8(ascii: "\r"), UInt8(ascii: "\n"), UInt8(ascii: "\t"):
These should be defined as constants to make the code easier to read and to reduce errors due to a typo in a character literal. Eventually we could merge and reuse the constants across the other parsers in swift-corelibs-foundation.
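A sketch of what such constants could look like (the names and placement are illustrative, not the PR's actual code):

```swift
// Illustrative byte constants for JSON whitespace, so switch cases read
// as names rather than raw character literals.
extension UInt8 {
    static let space: UInt8          = 0x20 // " "
    static let carriageReturn: UInt8 = 0x0D // "\r"
    static let newline: UInt8        = 0x0A // "\n"
    static let horizontalTab: UInt8  = 0x09 // "\t"
}

// The whitespace switch cases then become:
func isJSONWhitespace(_ byte: UInt8) -> Bool {
    switch byte {
    case .space, .carriageReturn, .newline, .horizontalTab:
        return true
    default:
        return false
    }
}
```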
static let QuotationMark: UInt8 = 0x22 // "
static let Escape: UInt8 = 0x5C // \
private struct JSONParser {
I think, given the size of the parser now, it's worth moving it into its own file (JSONSerialization+Parser.swift)? Move any associated helper functions into it as well. JSONSerialization.swift can be renamed to JSONSerialization+Writer.swift in a follow-up PR.
Force-pushed from e262f0c to 13a74c7
@swift-ci please test
@swift-ci test
@swift-ci test linux
@swift-ci test
@swift-ci test linux
@swift-ci please test
We will have follow-ups as bugfix PRs.
Motivation

Increase JSON performance by parsing into a JSONValue type, instead of using reference types (NSNumber, NSString, ...).

Changes