Parse JSON with value types #2966
Conversation
@swift-ci please test
I will have a deeper look later, but do we currently have any benchmarks for the current implementation? I've also been working on an improved solution for the JSON number parsing to better deal with conversion issues between …
Hey, I took the liberty of benchmarking these changes against nightly, and it's quite an improvement. Parsing a sizable JSON doc (~8k bytes) 10,000 times takes: Nightly: 1.71179s …
Thanks for doing that @fabianfett. I just want to point out that improvements like this are very important for server-side Swift. We use swift-extras-json (previously Pure Swift JSON) in a production environment, and it is much faster than the current version of Foundation. I would love to see at least the same numbers for the Foundation that comes with Swift, so we can drop the redundant third-party dependency.
Force-pushed from 89a6532 to c7319a1
@swift-ci please test
#warning("@Fabian: pretty sure we should throw an error here, if this is invalid data")
let json = String(bytes: ptr, encoding: encoding!)!
return try json.utf8.withContiguousStorageIfAvailable { (utf8) -> JSONValue in
    try JSONParser().parse(bytes: utf8)
}!
On Darwin, this can fail. On Linux it cannot. You can write

var json = String(bytes: ...)
json.makeContiguousUTF8()

and after that you can force-unwrap it.
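A minimal sketch of that pattern, using a hypothetical decodeContiguous helper in place of the surrounding parse code (the function name and throwing behavior are illustrative, not the PR's actual implementation):

```swift
import Foundation

// Convert without force-unwrapping first (external input should throw,
// not crash), then make the UTF-8 storage contiguous so the later
// force-unwrap of withContiguousStorageIfAvailable can no longer fail.
func decodeContiguous(bytes: [UInt8], encoding: String.Encoding) throws -> String {
    guard var json = String(bytes: bytes, encoding: encoding) else {
        throw CocoaError(.propertyListReadCorrupt)
    }
    json.makeContiguousUTF8()
    // The closure is now guaranteed to run, so `!` is safe here.
    let byteCount = json.utf8.withContiguousStorageIfAvailable { $0.count }!
    assert(byteCount == json.utf8.count)
    return json
}
```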
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't JSON always UTF-8 though?
I found a failure input which is UTF-8 with a BOM and bad data. Here is a test case to add to TestJSONSerialization.test_JSONObjectWithData_encodingDetection:

("{} UTF-8 w/BOM w/ trailing illegal data", [0xEF, 0xBB, 0xBF, 0x7B, 0x7D, 0xff, 0x00]),

(It will need a test to confirm that the JSON parsing failed and returned nil.)
I would suggest merging parseBOM and detectEncoding (or at least having one call the other) - I can't see anywhere else they are used. If the encoding is determined to be UTF-8, then any leading BOM can be skipped, as a new String does not need to be created. If it looks like UTF-8, a quick check that the first character is ASCII should be sufficient confirmation, as I don't think any JSON can start with a Unicode code point > 127.

Then the code could be simplified to something like:

guard let (encoding, advanceBy) = parseBOM(ptr) else { throw ... }
let newPtr = ptr[advanceBy..<ptr.count]
if encoding == .utf8 {
    // parse it from newPtr...
} else {
    guard let string = String(newPtr...) else { throw ... }
}
I couldn't find anything on json.org to say that JSON is guaranteed to be UTF-8, but it's unlikely not to be, so just converting to a new String as you have done is fine (even though it will do an extra allocation). I would also avoid any force unwrapping when using String() and just throw, since you are dealing with external input.
From here:

3. Encoding

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.
00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8
For what it's worth, Darwin Foundation checks whether the input is UTF-8 first, and if it's not, it just converts it to UTF-8 for parsing.
@parkera I'm afraid you quoted RFC 4627, which has been obsoleted for quite some time by RFC 8259:
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
Am I right in assuming that the Darwin implementation allows UTF-16 and UTF-32 encodings and further allows byte order marks?
I think the important takeaway from the 2 RFCs is that the input COULD be in any Unicode encoding but the output MUST always be UTF-8. However since the chance of the input being non-UTF8 is low the conversion from other inputs to UTF-8 as a one-time operation at the start is fine. Your code currently does all of this.
Running the test test_JSONObjectWithData_encodingDetection on Darwin using DarwinCompatibilityTests seems to indicate that Darwin doesn't handle some inputs (UTF-16BE/LE without BOM and UTF-32LE w/BOM).
@spevans @millenomi I'm afraid this is ready for review.
} else {
    throw NSError(domain: NSCocoaErrorDomain, code: CocoaError.propertyListReadCorrupt.rawValue,
                  userInfo: [NSDebugDescriptionErrorKey : "Numbers must start with a 1-9 at character \(input)." ])
self.array = Array(bytes)
}
Does DocumentReader need to take the input as <Bytes: Collection>? It looks like the input is always an UnsafeBufferPointer<UInt8>, and simply storing that struct may be faster. Calling Array(bytes) or as? [UInt8] may end up allocating another buffer. I think the generics are just going to add overhead here.
I'm pretty sure the compiler will create two perfectly optimized methods for this, which means we won't have any performance hits here. I don't think we should operate on UnsafeBufferPointer<UInt8> here, since this comes with security implications, given the unsafe memory access in release builds.
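For illustration, here is a minimal generic reader showing the shape under discussion (names are hypothetical, not the PR's actual DocumentReader). Because it stores the bytes generically, no Array(bytes) copy is needed, and within one module the optimizer can emit a specialized copy per concrete Bytes type:

```swift
// A generic byte reader over any Collection of UInt8. The compiler can
// specialize this per concrete Bytes type, so the generic abstraction
// need not cost anything at runtime.
struct ByteReader<Bytes: Collection> where Bytes.Element == UInt8 {
    private let bytes: Bytes
    private var index: Bytes.Index

    init(bytes: Bytes) {
        self.bytes = bytes
        self.index = bytes.startIndex
    }

    // Returns the next byte, or nil at the end of input.
    mutating func next() -> UInt8? {
        guard index < bytes.endIndex else { return nil }
        defer { index = bytes.index(after: index) }
        return bytes[index]
    }
}

// Works with a bounds-checked Array as well as an UnsafeBufferPointer view:
var reader = ByteReader(bytes: [0x7B, 0x7D] as [UInt8]) // "{" "}"
```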
Force-pushed from ace6c73 to 207c5b9
if bytes.starts(with: [0xFE, 0xFF]) {
    return (.utf16BigEndian, 2)
}

return nil
detectEncoding and parseBOM should be merged into one function returning (encoding: String.Encoding, skipLength: Int)?, since they both work on the same inputs.
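As a sketch of that suggestion (the function name and structure are illustrative, not the PR's actual code), combining BOM detection with the RFC 4627 null-byte pattern as a fallback:

```swift
import Foundation

// Hypothetical merged helper: byte order marks take precedence; without
// a BOM, infer the encoding from the pattern of nulls in the first four
// octets (valid JSON starts with at least two ASCII characters).
func detectEncoding(_ bytes: [UInt8]) -> (encoding: String.Encoding, skipLength: Int)? {
    // 1. BOM checks. UTF-32LE must be tested before UTF-16LE,
    //    because FF FE 00 00 starts with the UTF-16LE BOM FF FE.
    if bytes.starts(with: [0x00, 0x00, 0xFE, 0xFF]) { return (.utf32BigEndian, 4) }
    if bytes.starts(with: [0xFF, 0xFE, 0x00, 0x00]) { return (.utf32LittleEndian, 4) }
    if bytes.starts(with: [0xEF, 0xBB, 0xBF])       { return (.utf8, 3) }
    if bytes.starts(with: [0xFE, 0xFF])             { return (.utf16BigEndian, 2) }
    if bytes.starts(with: [0xFF, 0xFE])             { return (.utf16LittleEndian, 2) }

    // 2. No BOM: null-pattern heuristic from RFC 4627, section 3.
    if bytes.count >= 4 {
        switch (bytes[0], bytes[1], bytes[2], bytes[3]) {
        case (0, 0, 0, _): return (.utf32BigEndian, 0)
        case (_, 0, 0, 0): return (.utf32LittleEndian, 0)
        case (0, _, 0, _): return (.utf16BigEndian, 0)
        case (_, 0, _, 0): return (.utf16LittleEndian, 0)
        default: break
        }
    }
    return bytes.isEmpty ? nil : (.utf8, 0)
}
```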
case .utf32LittleEndian where buffer[input+1] == 0 && buffer[input+2] == 0 && buffer[input+3] == 0:
index = input
switch extraCharacter {
case UInt8(ascii: " "), UInt8(ascii: "\r"), UInt8(ascii: "\n"), UInt8(ascii: "\t"):
These should be defined as constants to make the code easier to read and to reduce errors due to a typo in a character literal. Eventually we could merge and reuse the constants across the other parsers in swift-corelibs-foundation.
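A sketch of what such constants could look like (the names and placement are illustrative, not the PR's actual code):

```swift
// Illustrative byte constants for JSON whitespace, so switch cases read
// as names rather than raw character literals.
extension UInt8 {
    static let space: UInt8          = 0x20 // " "
    static let carriageReturn: UInt8 = 0x0D // "\r"
    static let newline: UInt8        = 0x0A // "\n"
    static let horizontalTab: UInt8  = 0x09 // "\t"
}

// The whitespace switch cases then become:
func isJSONWhitespace(_ byte: UInt8) -> Bool {
    switch byte {
    case .space, .carriageReturn, .newline, .horizontalTab:
        return true
    default:
        return false
    }
}
```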
static let QuotationMark: UInt8 = 0x22 // "
static let Escape: UInt8 = 0x5C // \
private struct JSONParser {
I think, given the size of the parser now, it's worth moving it into its own file (JSONSerialization+Parser.swift)? Move any associated helper functions into it as well. JSONSerialization.swift can be renamed to JSONSerialization+Writer.swift in a follow-up PR.
Force-pushed from e262f0c to 13a74c7
@swift-ci please test
@swift-ci test
@swift-ci test linux
@swift-ci test
@swift-ci test linux
@swift-ci please test
We will have follow-ups as bugfix PRs.
Motivation

Increase JSON performance by parsing into a JSONValue type, instead of using reference types (NSNumber, NSString, ...).

Changes