Commit 338c18e

[deps] Update is_utf8 to version 1.3.0
1 parent 569049a commit 338c18e

4 files changed: +89 -33 lines changed

deps/is_utf8/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+build/
+src/dependencies/

deps/is_utf8/CMakeLists.txt

Lines changed: 3 additions & 3 deletions
@@ -3,7 +3,7 @@ cmake_minimum_required(VERSION 3.15)
 project(is_utf8
   DESCRIPTION "Fast UTF-8 Validation"
   LANGUAGES CXX
-  VERSION 1.2.1
+  VERSION 1.3.0
 )

 include(GNUInstallDirs)
@@ -25,8 +25,8 @@ set(CMAKE_CXX_STANDARD_REQUIRED ON)
 set(CMAKE_CXX_EXTENSIONS OFF)
 set(CMAKE_MACOSX_RPATH OFF)

-set(IS_UTF8_LIB_VERSION "1.2.1" CACHE STRING "is_utf8 library version")
-set(IS_UTF8_LIB_SOVERSION "1" CACHE STRING "is_utf8 library soversion")
+set(IS_UTF8_LIB_VERSION "1.3.0" CACHE STRING "is_utf8 library version")
+set(IS_UTF8_LIB_SOVERSION "2" CACHE STRING "is_utf8 library soversion")

 set(IS_UTF8_SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/src)
 add_subdirectory(src)

deps/is_utf8/README.md

Lines changed: 63 additions & 13 deletions
@@ -3,12 +3,10 @@
 Most strings online are in unicode using the UTF-8 encoding. Validating strings
 quickly before accepting them is important.

-
-
-
 ## How to use is_utf8

-This is a simple one-source file library to validate UTF-8 strings at high speeds using SIMD instructions. It works on all platforms (ARM, x64).
+This is a simple one-source file library to validate UTF-8 strings at high
+speeds using SIMD instructions. It works on all platforms (ARM, x64).

 Build and link `is_utf8.cpp` with your project. Code usage:

@@ -21,13 +19,66 @@ Build and link `is_utf8.cpp` with your project. Code usage:

 It should be able to validate strings using less than 1 cycle per input byte.

+## Requirements
+
+- C++11 compatible compiler. We support LLVM clang, GCC, Visual Studio. (Our
+  optional benchmark tool requires C++17.)
+- For high speed, you should have a recent 64-bit system (e.g., ARM or x64).
+- If you rely on CMake, you should use a recent CMake (at least 3.15).
+- AVX-512 support require a processor with AVX512-VBMI2 (Ice Lake or better) and
+  a recent compiler (GCC 8 or better, Visual Studio 2019 or better, LLVM clang 6
+  or better). You need a correspondingly recent assembler such as gas (2.30+) or
+  nasm (2.14+): recent compilers usually come with recent assemblers. If you mix
+  a recent compiler with an incompatible/old assembler (e.g., when using a
+  recent compiler with an old Linux distribution), you may get errors at build
+  time because the compiler produces instructions that the assembler does not
+  recognize: you should update your assembler to match your compiler (e.g.,
+  upgrade binutils to version 2.30 or better under Linux) or use an older
+  compiler matching the capabilities of your assembler.
+
+## Build with CMake
+
+```
+cmake -B build
+cmake --build build
+cd build
+ctest .
+```
+
+Visual Studio users must specify whether they want to build the Release or Debug
+version.
+
+To run benchmarks, build and execute the `bench` command.
+
+```
+cmake -B build
+cmake --build build
+./build/benchmarks/bench
+```
+
+Instructions are similar for Visual Studio users.
+
+## Real-word usage
+
+This C++ library is part of the JavaScript package
+[utf-8-validate](https://github.com/websockets/utf-8-validate). The
+utf-8-validate package is routinely downloaded more than
+[a million times per week](https://www.npmjs.com/package/utf-8-validate).
+
+If you are using Node JS (19.4.0 or better), you already have access to this
+function as
+[`buffer.isUtf8(input)`](https://nodejs.org/api/buffer.html#bufferisutf8input).
+
 ## Reference

-- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice & Experience 51 (5), 2021
+- John Keiser, Daniel Lemire,
+  [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090),
+  Software: Practice & Experience 51 (5), 2021

-### Want more?
+## Want more?

-If you want a wide range of fast Unicode function for production use, you can rely on the simdutf library. It is as simple as the following:
+If you want a wide range of fast Unicode function for production use, you can
+rely on the simdutf library. It is as simple as the following:

 ```C++
 #include "simdutf.cpp"
@@ -48,12 +99,11 @@ int main(int argc, char *argv[]) {

 See https://github.com/simdutf/

-
 ## License

-This library is distributed under the terms of any of the following
-licenses, at your option:
+This library is distributed under the terms of any of the following licenses, at
+your option:

-* Apache License (Version 2.0) [LICENSE-APACHE](LICENSE-APACHE),
-* Boost Software License [LICENSE-BOOST](LICENSE-BOOST), or
-* MIT License [LICENSE-MIT](LICENSE-MIT).
+- Apache License (Version 2.0) [LICENSE-APACHE](LICENSE-APACHE),
+- Boost Software License [LICENSE-BOOST](LICENSE-BOOST), or
+- MIT License [LICENSE-MIT](LICENSE-MIT).

deps/is_utf8/src/is_utf8.cpp

Lines changed: 21 additions & 17 deletions
@@ -688,7 +688,7 @@ class implementation {

   virtual uint32_t required_instruction_sets() const {
     return _required_instruction_sets;
-  };
+  }

   /**
    * Validate the UTF-8 string.
@@ -826,17 +826,15 @@ template <typename T> class atomic_ptr {
 /**
  * The list of available implementations compiled into simdutf.
  */
-extern IS_UTF8_DLLIMPORTEXPORT const internal::available_implementation_list
-    available_implementations;
+extern IS_UTF8_DLLIMPORTEXPORT const internal::available_implementation_list& get_available_implementations();

 /**
  * The active implementation.
  *
  * Automatically initialized on first use to the most advanced implementation
  * supported by this hardware.
  */
-extern IS_UTF8_DLLIMPORTEXPORT internal::atomic_ptr<const implementation>
-    active_implementation;
+extern IS_UTF8_DLLIMPORTEXPORT internal::atomic_ptr<const implementation>& get_active_implementation();

 } // namespace is_utf8_internals

@@ -4640,33 +4638,39 @@ detect_best_supported_implementation_on_first_use::set_best() const noexcept {

   if (force_implementation_name) {
     auto force_implementation =
-        available_implementations[force_implementation_name];
+        get_available_implementations()[force_implementation_name];
     if (force_implementation) {
-      return active_implementation = force_implementation;
+      return get_active_implementation() = force_implementation;
     } else {
       // Note: abort() and stderr usage within the library is forbidden.
-      return active_implementation = &unsupported_singleton;
+      return get_active_implementation() = &unsupported_singleton;
     }
   }
-  return active_implementation =
-             available_implementations.detect_best_supported();
+  return get_active_implementation() =
+             get_available_implementations().detect_best_supported();
 }

 } // namespace internal

-IS_UTF8_DLLIMPORTEXPORT const internal::available_implementation_list
-    available_implementations{};
-IS_UTF8_DLLIMPORTEXPORT internal::atomic_ptr<const implementation>
-    active_implementation{
-        &internal::detect_best_supported_implementation_on_first_use_singleton};
+IS_UTF8_DLLIMPORTEXPORT const internal::available_implementation_list& get_available_implementations() {
+  static const internal::available_implementation_list available_implementations{};
+  return available_implementations;
+}
+
+IS_UTF8_DLLIMPORTEXPORT internal::atomic_ptr<const implementation>& get_active_implementation() {
+  static const internal::detect_best_supported_implementation_on_first_use detect_best_supported_implementation_on_first_use_singleton;
+  static internal::atomic_ptr<const implementation> active_implementation{&detect_best_supported_implementation_on_first_use_singleton};
+  return active_implementation;
+}
+

 is_utf8_warn_unused bool validate_utf8(const char *buf, size_t len) noexcept {
-  return active_implementation->validate_utf8(buf, len);
+  return get_active_implementation()->validate_utf8(buf, len);
 }

 const implementation *builtin_implementation() {
   static const implementation *builtin_impl =
-      available_implementations[IS_UTF8_STRINGIFY(
+      get_available_implementations()[IS_UTF8_STRINGIFY(
           IS_UTF8_BUILTIN_IMPLEMENTATION)];
   return builtin_impl;
 }
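
The `is_utf8.cpp` changes replace two namespace-scope globals with accessor functions that return references to function-local statics. The commit does not state its motivation. One common reason for this pattern is that a function-local static is constructed on first call (thread-safely since C++11), so it is ready even when another global's constructor uses it during program start-up. The sketch below illustrates the pattern only; `sketch`, `registry`, `get_registry`, `early_user`, `seen_by_early_user` and `self_check` are hypothetical names, not part of is_utf8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch { // hypothetical namespace, not part of is_utf8

// Hypothetical stand-in for a list of implementations.
class registry {
public:
  registry() : names_{"fallback", "simd"} {}
  std::size_t size() const { return names_.size(); }
  const std::string &operator[](std::size_t i) const { return names_[i]; }

private:
  std::vector<std::string> names_;
};

// Accessor with a function-local static: the object is constructed the first
// time this function is called, and every call returns the same instance.
const registry &get_registry() {
  static const registry instance{};
  return instance;
}

// A global whose constructor depends on the registry. Because it goes through
// the accessor, the registry is constructed on demand before it is read.
struct early_user {
  std::size_t seen;
  early_user() : seen(get_registry().size()) {}
};
static const early_user user{};

// Number of registry entries observed while `user` was being constructed.
std::size_t seen_by_early_user() { return user.seen; }

// Small self-check of the properties described above.
void self_check() {
  assert(seen_by_early_user() == get_registry().size());
  assert(&get_registry() == &get_registry());
}

} // namespace sketch
```

Callers write `sketch::get_registry()` wherever they previously named the global, which mirrors how the diff rewrites `available_implementations` to `get_available_implementations()`.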
