Unicode support cleanup #868
Conversation
There was a problem with compilation on Windows; it is fixed now.
Some suggestions, and one significant problem (checking array bounds). It also sucks that we have to write this sort of stuff instead of just using a library, but I guess there's no getting around that.
src/util/unicode.cpp
Outdated
```cpp
return NULL;
unsigned int b1=x & 0xFF;
unsigned int b2=x & 0xFF00;
return (b1 << 8)+(b2 >> 8);
```
More usual to use …
src/util/unicode.cpp
Outdated
```cpp
\*******************************************************************/

const char **narrow_argv(int argc, const wchar_t **argv_wide)
unsigned int swap_last_bytes(unsigned int x)
```
Rather than write "byte-swap least-significant word in dword", might be clearer to write "byte-swap word" and have the caller cast to the smaller type.
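The suggested interface might look like the sketch below (an illustration of the reviewer's idea, with a made-up name, not code from the PR): byte-swap a 16-bit word, and let a caller holding a wider integer cast down first.

```cpp
#include <cstdint>

// Hypothetical helper: swap the two bytes of a 16-bit word.
// A caller with a wider integer casts down before calling.
static std::uint16_t swap_bytes16(std::uint16_t x)
{
  // x is promoted to int for the shifts; the cast truncates back.
  return static_cast<std::uint16_t>((x << 8) | (x >> 8));
}
```

A caller holding `unsigned int code` would then write `swap_bytes16(static_cast<std::uint16_t>(code))` instead of relying on a "swap the least-significant word of a dword" helper.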
src/util/unicode.cpp
Outdated
```cpp
unsigned int a=(swap_bytes)?swap_last_bytes(code):code;
result+=static_cast<wchar_t>(a);
}
else // code is endoded as two UTF16 characters
```
Typo: "endoded" → "encoded".
src/util/unicode.cpp
Outdated
```cpp
// the following never gets deleted
const char **argv_narrow=new const char *[argc+1];
argv_narrow[argc]=0;
void utf16_append_code(unsigned int code, bool swap_bytes, std::wstring &result)
```
Probably worth noting in a comment that D800-DFFF are also passed through without comment
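The range the reviewer mentions could be documented, and if desired detected, with a tiny helper along these lines (a sketch with an illustrative name, not code from the PR):

```cpp
// U+D800..U+DFFF are UTF-16 surrogate halves and are not valid
// Unicode scalar values; the code under review passes them through
// without complaint. A caller wanting to flag them could use:
static bool is_utf16_surrogate(unsigned int code)
{
  return code >= 0xD800 && code <= 0xDFFF;
}
```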
src/util/unicode.cpp
Outdated
else if(c<=0xDF) | ||
{ // in other cases, we need to read the right number of chars and decode | ||
code=(c & 0x1F) << 6; | ||
c=in[i++]; |
Should check array bounds when trying to continue reading a code-point, in case the input is not really UTF-8
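A bounds-checked way of consuming continuation bytes might look like the following sketch (made-up names, and it assumes the decoder walks the input with an index `i`; the PR's actual fix may differ):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative helper: fetch the next continuation byte of a UTF-8
// sequence, refusing to read past the end of the input and refusing
// bytes that are not of the form 10xxxxxx.
static unsigned int next_continuation_byte(const std::string &in, std::size_t &i)
{
  if(i >= in.size())
    throw std::invalid_argument("truncated UTF-8 sequence");
  const unsigned char c = in[i++];
  if((c & 0xC0) != 0x80)
    throw std::invalid_argument("invalid UTF-8 continuation byte");
  return c & 0x3F;
}
```

For a two-byte sequence such as "\xC3\xA9" (U+00E9), the caller would combine `(0xC3 & 0x1F) << 6` with the returned six bits, and malformed or truncated input raises an exception instead of reading out of bounds.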
I'll refrain from marking this approve/request changes as I've got way too many questions. Either way: thank you to all who have contributed to this patch. I'll happily close #752 now.
src/util/unicode.cpp
Outdated
```cpp
\*******************************************************************/

void utf32_to_utf8(unsigned int c, std::string &result)
void utf8_append_code(unsigned int c, std::string &result)
```
Should this be marked "static"?
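For context, a single-code-point UTF-8 encoder of this shape typically looks roughly like the sketch below; this is a generic illustration marked `static` as the reviewer suggests, not the PR's actual implementation:

```cpp
#include <string>

// Generic sketch: append the UTF-8 encoding of one code point.
// static gives it internal linkage, keeping it out of the header.
static void utf8_append_code(unsigned int c, std::string &result)
{
  if(c<=0x7F)                       // 1 byte: 0xxxxxxx
    result+=static_cast<char>(c);
  else if(c<=0x7FF)                 // 2 bytes: 110xxxxx 10xxxxxx
  {
    result+=static_cast<char>(0xC0 | (c >> 6));
    result+=static_cast<char>(0x80 | (c & 0x3F));
  }
  else if(c<=0xFFFF)                // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  {
    result+=static_cast<char>(0xE0 | (c >> 12));
    result+=static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    result+=static_cast<char>(0x80 | (c & 0x3F));
  }
  else                              // 4 bytes: 11110xxx and three tails
  {
    result+=static_cast<char>(0xF0 | (c >> 18));
    result+=static_cast<char>(0x80 | ((c >> 12) & 0x3F));
    result+=static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    result+=static_cast<char>(0x80 | (c & 0x3F));
  }
}
```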
```diff
@@ -20,7 +20,6 @@ std::string narrow(const std::wstring &s);
std::wstring widen(const std::string &s);

std::string utf32_to_utf8(const std::basic_string<unsigned int> &s);
```
The above needs to be renamed or even removed (see my suggestion for marking it static).
I don't follow this comment. Are you confusing this function, which still exists, with the function that got renamed to utf8_append_code?
I may well be - there is a function that has been renamed and I cannot see a corresponding change in the header file.
Yes, the renamed one was utf32_to_utf8(unsigned int c, std::string &result); this one stayed the same. I have made that one static.
```diff
@@ -8,10 +8,9 @@ Author: Daniel Kroening, [email protected]

#include <cstring>
#include <locale>
#include <codecvt>
```
May I hug you for this change? This obsoletes #752 if I'm right? And thus "COMPILING" should say we can actually build using GCC 4.9 again!? THANK YOU!!!
src/util/unicode.cpp
Outdated
```cpp
#include <iomanip>
#include <sstream>

#include <cstdint>
```
Maybe add a blank line after this one?
src/util/unicode.cpp
Outdated
```cpp
bool is_little_endian_arch()
{
    uint16_t i=1;
```
There are architectures where bytes (or characters?) aren't 8 bits wide (but 16 instead). That said: does this have to be a compile-time or rather a runtime check? Or is it even a target-analysis architecture property?
I think this should not be the target-analysis architecture but the architecture on which cbmc runs, because the host architecture determines the byte order of integers.
I don't follow the comment on bytes/chars: uint8_t should be 8 bits, shouldn't it?
Yes, uint8_t will be 8 bits, but I'm not entirely sure whether a char-is-16-bits architecture would yield the expected answer here. You may have to go for uint32_t to make sure endianness affects byte order?
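The uint32_t variant suggested here might look like the following sketch (using memcpy to inspect the first byte, which sidesteps aliasing concerns; an illustration, not the PR's code):

```cpp
#include <cstdint>
#include <cstring>

// Runtime endianness probe using a 32-bit value, so the answer
// depends on byte order rather than on the width of char.
static bool is_little_endian_arch()
{
  const std::uint32_t one = 1;
  unsigned char first_byte = 0;
  std::memcpy(&first_byte, &one, 1);  // lowest-addressed byte
  return first_byte == 1;             // little-endian stores LSB first
}
```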
src/util/unicode.cpp
Outdated
```cpp
uint16_t do_swap_bytes(uint16_t x)
{
    uint16_t b1=x & 0xFF;
```
This looks like a 4-character indent? Applies below as well.
unit/Makefile
Outdated
```diff
@@ -11,6 +11,7 @@ SRC = cpp_parser.cpp \
  smt2_parser.cpp \
  string_utils.cpp \
  wp.cpp \
  unicode.cpp \
```
Lexicographic order of files, please.
```cpp
\*******************************************************************/

std::wstring utf8_to_utf16_big_endian(const std::string& in)
{
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t> > converter;
  return converter.from_bytes(in);
  bool swap_bytes=is_little_endian_arch();
```
Following up to my own comment above: do we actually have use cases for this function where the operating environment decides endianness, as well as cases where it's the program under analysis that matters (and thus information in config)?
As far as I understand the context (and how the original codecvt-based function was implemented), this has nothing to do with analysis endianness.
But anyway, this particular function never seems to be used; I still implemented it since it was quite easy given the other one, which is used, and I was not sure whether we might need it in the future when we extend string support in cbmc.
git grep tells me:

```
java_bytecode/java_bytecode_typecheck_expr.cpp: utf8_to_utf16_little_endian(id2string(value)));
solvers/refinement/string_constraint_generator_constants.cpp: str=utf8_to_utf16_little_endian(c_str);
util/file_util.cpp: delete_directory_utf16(utf8_to_utf16_little_endian(path));
```

The last one is the runtime environment, which I believe we would thus know at compile time.
The first one processes analysis input into an internal format. I do not think that the internal format should vary across architectures; it should instead always use a single endianness. (I believe internally we're actually big-endian. @kroening?)
The middle case I'm not entirely sure about...
I lack deeper knowledge of the use cases, and the more I look at them the more confused I am. We use the little-endian function in refinement; is there a reason for this (@smowton?)? I thought this had something to do with Java strings, but a quick Google search suggests that Java uses big-endian UTF-16 no matter what the platform is.
The use case I originally wrote this for was to translate UTF-8 strings stored in Java class files into UTF-16 (in analysis-host endianness) string literals. It doesn't matter to us that the JVM would use big-endian UTF-16; rather, arithmetic on codepoints should correspond to arithmetic on the host doing the analysis. I suspect the code in https://github.com/diffblue/cbmc/blob/c8c9085f38317359bf175d8bb99ab16e4605c1e1/src/java_bytecode/java_bytecode_typecheck_expr.cpp only works on little-endian architectures as a result.
@smowton Should this be raised as a bug and dealt with in a separate PR? And is there any additional functionality needed for this in utils/unicode.cpp?
Probably, yes. There's already a to-big-endian function too so no extra work in unicode.cpp to my knowledge.
Do we thus really need the run-time check is_little_endian_arch? It seems we could get away with a compile-time check.
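A compile-time check could be sketched as below, assuming GCC or Clang (these predefined macros are compiler-specific; on C++20 and later, std::endian::native would be the portable route):

```cpp
// Compile-time endianness via GCC/Clang predefined macros.
// (With C++20 one could instead compare std::endian::native
// against std::endian::little.)
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
constexpr bool host_is_little_endian = true;
#else
constexpr bool host_is_little_endian = false;
#endif
```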
A unit test is added to check (on a few instances) equivalence with the original implementation. Decrease the required gcc version.
This should make PR #752 unnecessary.