Unicode support cleanup #868
Conversation
There was a problem with compilation on Windows; it is fixed now.
Some suggestions, and one significant problem (checking array bounds). It also sucks that we have to write this sort of stuff instead of just using a library, but I guess there's no getting around that.
src/util/unicode.cpp
Outdated
```cpp
return NULL;
unsigned int b1=x & 0xFF;
unsigned int b2=x & 0xFF00;
return (b1 << 8)+(b2 >> 8);
```
More usual to use …
src/util/unicode.cpp
Outdated
```cpp
\*******************************************************************/

const char **narrow_argv(int argc, const wchar_t **argv_wide)
unsigned int swap_last_bytes(unsigned int x)
```
Rather than write "byte-swap least-significant word in dword", might be clearer to write "byte-swap word" and have the caller cast to the smaller type.
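The suggested interface might look like the sketch below (an illustration of the reviewer's idea, with a made-up name, not code from the PR): byte-swap a 16-bit word, and let a caller holding a wider integer cast down first.

```cpp
#include <cstdint>

// Hypothetical helper: swap the two bytes of a 16-bit word.
// A caller with a wider integer casts down before calling.
static std::uint16_t swap_bytes16(std::uint16_t x)
{
  // x is promoted to int for the shifts; the cast truncates back.
  return static_cast<std::uint16_t>((x << 8) | (x >> 8));
}
```

A caller holding `unsigned int code` would then write `swap_bytes16(static_cast<std::uint16_t>(code))` instead of relying on a "swap the least-significant word of a dword" helper.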
src/util/unicode.cpp
Outdated
```cpp
unsigned int a=(swap_bytes)?swap_last_bytes(code):code;
result+=static_cast<wchar_t>(a);
}
else // code is endoded as two UTF16 characters
```
Typo: "endoded" → "encoded".
src/util/unicode.cpp
Outdated
```cpp
// the following never gets deleted
const char **argv_narrow=new const char *[argc+1];
argv_narrow[argc]=0;
void utf16_append_code(unsigned int code, bool swap_bytes, std::wstring &result)
```
Probably worth noting in a comment that D800-DFFF are also passed through without comment
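The range the reviewer mentions could be documented, and if desired detected, with a tiny helper along these lines (a sketch with an illustrative name, not code from the PR):

```cpp
// U+D800..U+DFFF are UTF-16 surrogate halves and are not valid
// Unicode scalar values; the code under review passes them through
// without complaint. A caller wanting to flag them could use:
static bool is_utf16_surrogate(unsigned int code)
{
  return code >= 0xD800 && code <= 0xDFFF;
}
```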
src/util/unicode.cpp
Outdated
else if(c<=0xDF) | ||
{ // in other cases, we need to read the right number of chars and decode | ||
code=(c & 0x1F) << 6; | ||
c=in[i++]; |
Should check array bounds when trying to continue reading a code-point, in case the input is not really UTF-8
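A bounds-checked way of consuming continuation bytes might look like the following sketch (made-up names, and it assumes the decoder walks the input with an index `i`; the PR's actual fix may differ):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative helper: fetch the next continuation byte of a UTF-8
// sequence, refusing to read past the end of the input and refusing
// bytes that are not of the form 10xxxxxx.
static unsigned int next_continuation_byte(const std::string &in, std::size_t &i)
{
  if(i >= in.size())
    throw std::invalid_argument("truncated UTF-8 sequence");
  const unsigned char c = in[i++];
  if((c & 0xC0) != 0x80)
    throw std::invalid_argument("invalid UTF-8 continuation byte");
  return c & 0x3F;
}
```

For a two-byte sequence such as "\xC3\xA9" (U+00E9), the caller would combine `(0xC3 & 0x1F) << 6` with the returned six bits, and malformed or truncated input raises an exception instead of reading out of bounds.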
I'll refrain from marking this approve/request changes as I've got way too many questions. Either way: thank you to all who have contributed to this patch. I'll happily close #752 now.
src/util/unicode.cpp
Outdated
```cpp
\*******************************************************************/

void utf32_to_utf8(unsigned int c, std::string &result)
void utf8_append_code(unsigned int c, std::string &result)
```
Should this be marked "static"?
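For context, a single-code-point UTF-8 encoder of this shape typically looks roughly like the sketch below; this is a generic illustration marked `static` as the reviewer suggests, not the PR's actual implementation:

```cpp
#include <string>

// Generic sketch: append the UTF-8 encoding of one code point.
// static gives it internal linkage, keeping it out of the header.
static void utf8_append_code(unsigned int c, std::string &result)
{
  if(c<=0x7F)                       // 1 byte: 0xxxxxxx
    result+=static_cast<char>(c);
  else if(c<=0x7FF)                 // 2 bytes: 110xxxxx 10xxxxxx
  {
    result+=static_cast<char>(0xC0 | (c >> 6));
    result+=static_cast<char>(0x80 | (c & 0x3F));
  }
  else if(c<=0xFFFF)                // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  {
    result+=static_cast<char>(0xE0 | (c >> 12));
    result+=static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    result+=static_cast<char>(0x80 | (c & 0x3F));
  }
  else                              // 4 bytes: 11110xxx and three tails
  {
    result+=static_cast<char>(0xF0 | (c >> 18));
    result+=static_cast<char>(0x80 | ((c >> 12) & 0x3F));
    result+=static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    result+=static_cast<char>(0x80 | (c & 0x3F));
  }
}
```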
```diff
@@ -20,7 +20,6 @@ std::string narrow(const std::wstring &s);
std::wstring widen(const std::string &s);

std::string utf32_to_utf8(const std::basic_string<unsigned int> &s);
```
The above needs to be renamed or even removed (see my suggestion for marking it static).
I don't follow this comment. Are you confusing this function, which still exists, with the function that got renamed to utf8_append_code?
I may well be - there is a function that has been renamed and I cannot see a corresponding change in the header file.
Yes, the renamed one was utf32_to_utf8(unsigned int c, std::string &result); this one stayed the same. I have made that one static.
```diff
@@ -8,10 +8,9 @@ Author: Daniel Kroening, [email protected]

#include <cstring>
#include <locale>
#include <codecvt>
```
May I hug you for this change? This obsoletes #752 if I'm right? And thus "COMPILING" should say we can actually build using GCC 4.9 again!? THANK YOU!!!
src/util/unicode.cpp
Outdated
```cpp
#include <iomanip>
#include <sstream>

#include <cstdint>
```
Maybe add a blank line after this one?
src/util/unicode.cpp
Outdated
```cpp
bool is_little_endian_arch()
{
    uint16_t i=1;
```
There are architectures where bytes (or characters?) aren't 8 bits wide (but 16 instead). That said: does this have to be a compile-time or rather a runtime check? Or is it even a target-analysis architecture property?
I think this should not be the target-analysis architecture but the architecture on which cbmc runs, because the host architecture determines the byte order of integers.
I don't follow the comment on bytes/chars: uint8_t should be 8 bits, shouldn't it?
Yes, uint8_t will be 8 bits, but I'm not entirely sure whether a char-is-16-bits architecture would yield the expected answer here. You may have to go for uint32_t to make sure endianness affects byte order?
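The uint32_t variant suggested here might look like the following sketch (using memcpy to inspect the first byte, which sidesteps aliasing concerns; an illustration, not the PR's code):

```cpp
#include <cstdint>
#include <cstring>

// Runtime endianness probe using a 32-bit value, so the answer
// depends on byte order rather than on the width of char.
static bool is_little_endian_arch()
{
  const std::uint32_t one = 1;
  unsigned char first_byte = 0;
  std::memcpy(&first_byte, &one, 1);  // lowest-addressed byte
  return first_byte == 1;             // little-endian stores LSB first
}
```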
src/util/unicode.cpp
Outdated
```cpp
uint16_t do_swap_bytes(uint16_t x)
{
    uint16_t b1=x & 0xFF;
```
This looks like a 4-character indent? Applies below as well.
unit/Makefile
Outdated
```diff
@@ -11,6 +11,7 @@ SRC = cpp_parser.cpp \
  smt2_parser.cpp \
  string_utils.cpp \
  wp.cpp \
  unicode.cpp \
```
Lexicographic order of files, please.
```cpp
\*******************************************************************/

std::wstring utf8_to_utf16_big_endian(const std::string& in)
{
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t> > converter;
  return converter.from_bytes(in);
  bool swap_bytes=is_little_endian_arch();
```
Following up to my own comment above: do we actually have use cases for this function where the operating environment decides endianness, as well as cases where it's the program under analysis that matters (and thus information in config)?
As far as I understand the context (and how the original codecvt-based function was implemented), this has nothing to do with analysis endianness.
But anyway, this particular function never seems to be used; I still implemented it since it was quite easy given the other one, which is used, and I was not sure whether we might need it in the future when we extend string support in cbmc.
git grep tells me:

```
java_bytecode/java_bytecode_typecheck_expr.cpp: utf8_to_utf16_little_endian(id2string(value)));
solvers/refinement/string_constraint_generator_constants.cpp: str=utf8_to_utf16_little_endian(c_str);
util/file_util.cpp: delete_directory_utf16(utf8_to_utf16_little_endian(path));
```

The last one is the runtime environment, which I believe we would thus know at compile time.
The first one processes analysis input into an internal format. I do not think that the internal format should vary across architectures; it should instead always use a single endianness. (I believe internally we're actually big-endian. @kroening?)
The middle case I'm not entirely sure about...
I lack deeper knowledge of the use cases, and the more I look at them the more confused I am. We use the little-endian function in refinement; is there a reason for this (@smowton?)? I thought this had something to do with Java strings, but a quick Google search suggests that Java uses big-endian UTF-16 no matter what the platform is.
The use case I originally wrote this for was to translate UTF-8 strings stored in Java class files into UTF-16 (in analysis-host endianness) string literals. It doesn't matter to us that the JVM would use big-endian UTF-16; rather, arithmetic on codepoints should correspond to arithmetic on the host doing the analysis. I suspect the code in https://github.com/diffblue/cbmc/blob/c8c9085f38317359bf175d8bb99ab16e4605c1e1/src/java_bytecode/java_bytecode_typecheck_expr.cpp only works on little-endian architectures as a result.
@smowton Should this be raised as a bug and dealt with in a separate PR? And is there any additional functionality needed for this in utils/unicode.cpp?
Probably, yes. There's already a to-big-endian function too so no extra work in unicode.cpp to my knowledge.
Do we thus really need the run-time check is_little_endian_arch? It seems we could get away with a compile-time check.
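A compile-time check could be sketched as below, assuming GCC or Clang (these predefined macros are compiler-specific; on C++20 and later, std::endian::native would be the portable route):

```cpp
// Compile-time endianness via GCC/Clang predefined macros.
// (With C++20 one could instead compare std::endian::native
// against std::endian::little.)
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
constexpr bool host_is_little_endian = true;
#else
constexpr bool host_is_little_endian = false;
#endif
```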
A unit test is added to check (on a few instances) equivalence with the original implementation. Decrease the required gcc version.
This should make PR #752 unnecessary.