Skip to content

Parse JSON with value types #2966

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Mar 1, 2021
Merged

Conversation

fabianfett
Copy link
Contributor

@fabianfett fabianfett commented Jan 11, 2021

Motivation

Increase JSON performance by parsing into a JSONValue type, instead of using reference types (NSNumber, NSString, ...)

Changes

  • the json parsing has been replaced with a new approach

@MaxDesiatov
Copy link
Contributor

@swift-ci please test

@spevans
Copy link
Contributor

spevans commented Jan 12, 2021

I will have a deeper looker later but do we currently have any benchmarks for the current JSONSerialization? From some testing I was doing I saw that sclf is about 2-5 times slower than native foundation but it would be useful to have some to track how much of an improvement this change is and also how close to native speed it gets.

Ive also been working on an improved solution for the JSON number parsing to better deal with conversion issues between Decimal and Double with some slight speedups with bridging - I'll probably put it up as a separate PR as it will be easier to review and then look to merge it in this.

@drexin
Copy link
Contributor

drexin commented Jan 14, 2021

Hey, I took the liberty to bench these changes against nightly and it's quite an improvement.

Parsing a sizable JSON doc (~8k bytes) 10,000 times takes

Nightly: 1.71179s
This PR: 0.67263s

@3a4oT
Copy link

3a4oT commented Jan 15, 2021

Thanks for doing that @fabianfett . I just want to point out that improvements like this are very important for server-side swift. We use swift-extras-json (previously Pure Swift JSON) in a production environment and it is much faster than the current version of Foundation. I would love to see at least the same numbers for Foundation which come with Swift so we can drop redundant third-party.

@MaxDesiatov
Copy link
Contributor

@swift-ci please test

Comment on lines 204 to 211
#warning("@Fabian: pretty sure we should throw an error here, if this is invalid data")
let json = String(bytes: ptr, encoding: encoding!)!
return try json.utf8.withContiguousStorageIfAvailable { (utf8) -> JSONValue in
try JSONParser().parse(bytes: utf8)
}!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this ever fail? Do we need explicit error handling here? @drexin @weissi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Darwin, this can fail. On Linux it cannot.

You can write

var json = String(bytes:...)
json.makeContiguousUTF8()

and after that you can ! it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't json always utf8 though?

Copy link
Contributor

@spevans spevans Jan 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a failure input which is UTF8 with BOM and bad data, Here is a test case to add to
TestJSONSerialization.test_JSONObjectWithData_encodingDetection:

("{} UTF-8 w/BOM w/ trailing illegal data", [0xEF, 0xBB, 0xBF, 0x7B, 0x7D, 0xff, 0x00]),
(will need a test to confirm the JSON parsing failed and returned nil)

I would suggest merging (or at least have one call the other) parseBOM and detectEncoding - I cant see anywhere else they are used. If the encoding is determined to be UTF-8 then any leading BOM can be skipped as a new String does not need to be created. If if looks like UTF-8 a quick check if the first character is ASCII should be sufficient conformation as I dont think any JSON can start with a unicode point > 127.

Then the code could be simplified to something like:


guard let (encoding, advanceBy) = parseBOM(ptr) else { throw ... }
let newPtr = ptr[advanceBy..<ptr.count]
if encoding == .utf8 {
     parse it from newPtr...
 } else {
  guard let string = String(newPtr..) else { throw .. }
}

I couldnt find anything on json.org to say that JSON was guaranteed to be UTF-8 but its unlikely not to be so just converting to a new String as you have done is fine (even though it will do an extra allocation).

I would also avoid any force unwrapping when using String() and just throw since you are dealing with external input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from here

3.  Encoding

   JSON text SHALL be encoded in Unicode.  The default encoding is
   UTF-8.

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it’s with, Darwin Foundation checks to make sure it’s UTF8 first and if it’s not just converts it to UTF8 for parsing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@parkera I'm afraid you quoted RFC 4672, which has been obsoleted for quite some time by 8258:

   JSON text exchanged between systems that are not part of a closed
   ecosystem MUST be encoded using UTF-8 [RFC3629].

   Previous specifications of JSON have not required the use of UTF-8
   when transmitting JSON text.  However, the vast majority of JSON-
   based software implementations have chosen to use the UTF-8 encoding,
   to the extent that it is the only encoding that achieves
   interoperability.

   Implementations MUST NOT add a byte order mark (U+FEFF) to the
   beginning of a networked-transmitted JSON text.  In the interests of
   interoperability, implementations that parse JSON texts MAY ignore
   the presence of a byte order mark rather than treating it as an
   error.

Am I right in assuming that the Darwin implementation allows UTF16 and UTF32 encoding and further allows byte order marks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the important takeaway from the 2 RFCs is that the input COULD be in any Unicode encoding but the output MUST always be UTF-8. However since the chance of the input being non-UTF8 is low the conversion from other inputs to UTF-8 as a one-time operation at the start is fine. Your code currently does all of this.

Running the test test_JSONObjectWithData_encodingDetection on Darwin using DarwinCompatibilityTests seems to indicate that Darwin doesn't handle some inputs (UTF16BE/LE witout BOM and UTF32LE w/BOM)

@fabianfett fabianfett marked this pull request as ready for review January 21, 2021 20:58
@fabianfett
Copy link
Contributor Author

@spevans @millenomi I'm afraid this is ready for review.

} else {
throw NSError(domain: NSCocoaErrorDomain, code: CocoaError.propertyListReadCorrupt.rawValue,
userInfo: [NSDebugDescriptionErrorKey : "Numbers must start with a 1-9 at character \(input)." ])
self.array = Array(bytes)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does DocumentReader need to take the input as <Bytes: Collection> ? It looks like the input is always an UnsafeBufferPointer<UInt8> and simply storing that struct may be faster. Calling Array(bytes) or as? [UIntt8] may end up allocating another buffer. I think the generics are just going to add overhead here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure the compiler will create two perfectly optimized methods for this. Which means we won't have any performance hits here. I don't think we should operate on UnsafeBufferPointer<UInt8> here, since this comes with Security implications, based on the unsafe memory access in release builds.

if bytes.starts(with: [0xFE, 0xFF]) {
return (.utf16BigEndian, 2)
}

return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

detectEncoding and parseBOM should be merged into one function returning (encoding: String.Encoding, skipLength: Int)? since they both work on the same inputs.

case .utf32LittleEndian where buffer[input+1] == 0 && buffer[input+2] == 0 && buffer[input+3] == 0:
index = input
switch extraCharacter {
case UInt8(ascii: " "), UInt8(ascii: "\r"), UInt8(ascii: "\n"), UInt8(ascii: "\t"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be defined as constants to make it easier to read and reduce errors due to a typo with a characters. Eventually we could merge and reuse the constants across other parsers in sclf.

static let QuotationMark: UInt8 = 0x22 // "
static let Escape: UInt8 = 0x5C // \
private struct JSONParser {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given the size of the parser now its worth moving into its own file (JSONSerialization+Parser.swift)? Move any associated helper functions into it as well. JSONSerialization.swift can be renamed to JSONSerialization+Writer.swift in a followup PR.

@tomerd
Copy link

tomerd commented Feb 25, 2021

@swift-ci please test

@drexin
Copy link
Contributor

drexin commented Feb 26, 2021

@swift-ci test

@spevans
Copy link
Contributor

spevans commented Feb 26, 2021

@swift-ci test linux

@tomerd
Copy link

tomerd commented Feb 26, 2021

@swift-ci test

@spevans
Copy link
Contributor

spevans commented Feb 27, 2021

@swift-ci test linux

@millenomi
Copy link
Contributor

@swift-ci please test

@millenomi
Copy link
Contributor

We will have follow-ups as bugfix PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants