Is it possible to make Serial.print() -and "Serial monitor"- compatible with UTF-8? #2519
Comments
I mean, by extension, could the String class be made compatible too? |
I think that C doesn't know about UTF-8 either - a char is always just a single byte. If the Euro sign you typed translates to multiple UTF-8 bytes, then I'm not sure what happens here. It might be that you get a two-byte constant (wchar or something?). In any case, this is unlikely to work. Using I think the serial and String classes just handle byte arrays, so they should handle literal UTF-8 just fine. Things get more complicated when you start to modify strings (e.g. if you want to take the first three characters of a string instead of the first three bytes). Properly supporting UTF-8 for those kinds of modifications is probably way too complex and big to implement on Arduino.
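To make the "first three characters versus first three bytes" point concrete, here is a minimal sketch in portable C++. The helper names `isContinuation`, `utf8Length` and `utf8PrefixBytes` are made up for illustration and are not part of the Arduino core. It assumes the input is valid UTF-8 and counts code points, which is not always the same as user-visible characters (combining marks, for instance).

```cpp
#include <cstddef>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool isContinuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Number of code points in a NUL-terminated UTF-8 string.
// Assumption: the input is valid UTF-8; no validation is done.
inline std::size_t utf8Length(const char* s) {
    std::size_t n = 0;
    for (; *s; ++s) {
        if (!isContinuation(static_cast<unsigned char>(*s))) ++n;
    }
    return n;
}

// Number of bytes occupied by the first `count` code points of `s`
// (or by the whole string if it is shorter than that).
inline std::size_t utf8PrefixBytes(const char* s, std::size_t count) {
    std::size_t i = 0;
    while (s[i] != '\0' && count > 0) {
        ++i;  // consume the lead byte
        while (isContinuation(static_cast<unsigned char>(s[i]))) ++i;
        --count;
    }
    return i;
}
```

With the five-byte string "a€b", `utf8Length` reports three code points, and the first two code points span four bytes, so cutting after byte 2 or 3 would split the Euro sign in half.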
That is totally relevant - A serial line is just a byte stream, it's up to the terminal emulator (minicom in this case) to decode bytes into characters. If minicom is not configured for UTF8, then sending UTF8 characters is not going to work. |
Actually, if you run this on an Arduino Due, it prints 14844588. In hex, that's 0xE282AC. E2 82 AC is the UTF8 encoding for '€'. On Arduino Uno, it prints -32084, which is 0x82AC. It seems the AVR compiler isn't smart enough to know it needs to promote to a 32 bit integer, but it's trying to give an integer constant representing the UTF8 encoded character. One might argue the compiler ought to give 0x20AC (the Unicode value) for the character constant, but clearly the compiler is trying to give UTF8. |
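The arithmetic behind those two numbers can be checked with a small sketch. It assumes the behaviour GCC uses for multi-character constants (the first byte ends up most significant); the standard leaves that value implementation-defined. `packBytes` and `asSigned16` are hypothetical helper names, and the 16-bit case models the AVR `int`.

```cpp
#include <cstdint>

// Pack three bytes the way a multi-character constant is assumed to be
// evaluated here: first byte in the most significant position.
constexpr std::uint32_t packBytes(std::uint8_t a, std::uint8_t b, std::uint8_t c) {
    return (static_cast<std::uint32_t>(a) << 16) |
           (static_cast<std::uint32_t>(b) << 8) |
           static_cast<std::uint32_t>(c);
}

// Keep only the low 16 bits and read them as a two's-complement value,
// which models truncation to a 16-bit int.
constexpr std::int32_t asSigned16(std::uint32_t v) {
    return static_cast<std::int32_t>(v & 0xFFFFu) >= 0x8000
               ? static_cast<std::int32_t>(v & 0xFFFFu) - 0x10000
               : static_cast<std::int32_t>(v & 0xFFFFu);
}

// UTF-8 bytes of U+20AC (the Euro sign) are E2 82 AC.
constexpr std::uint32_t kEuro32 = packBytes(0xE2, 0x82, 0xAC);  // 0xE282AC
constexpr std::int32_t kEuro16 = asSigned16(kEuro32);           // low 16 bits, signed
```

`kEuro32` is 14844588, the value reported for the Due, and `kEuro16` is -32084, the value reported for the Uno, so the Uno result is simply the same constant with the top byte dropped.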
I think the compiler is not doing either - it's the text editor that decided on the encoding to use. I guess the compiler only has support for multi-byte char constants and translates to the smallest fitting integer type. In other words, you could also write |
The serial monitor should now support UTF8, but only for strings. If there is a way to use UTF8 that'd be great. I am trying to implement a multilingual keyboard but this breaks everything.
void setup() {}
void loop() {
Serial.println('€', HEX);
Serial.println(uint32_t('€'), HEX);
Serial.println(long('€'), HEX);
Serial.println(L'€', HEX);
Serial.println('€', DEC);
Serial.println('€');
Serial.println(L'€');
Serial.println("€");
Serial.println("€"[0], HEX);
Serial.println("€"[1], HEX);
Serial.println("€"[2], HEX);
Serial.println("€"[3], HEX);
Serial.println("-----");
delay(1000);
}
/*
FFFF82AC
FFFF82AC
FFFF82AC
20AC
-32084
-32084
8364
€
FFFFFFE2
FFFFFF82
FFFFFFAC
0
-----
*/ |
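The FFFFFFE2, FFFFFF82 and FFFFFFAC lines in the output above look like sign extension rather than corrupted data: on AVR, plain `char` is signed, so a byte of 0x80 or above is negative and gets widened with leading one bits before it is printed in hex. The sketch below models that on a host machine. `widenSigned` and `widenUnsigned` are hypothetical names, and the assumption is that the value is widened to a 32-bit integer before formatting.

```cpp
#include <cstdint>

// Model of printing a byte through a *signed* char: the byte is read as a
// two's-complement value and widened to 32 bits, so 0xE2 becomes 0xFFFFFFE2.
constexpr std::uint32_t widenSigned(std::uint8_t byte) {
    return static_cast<std::uint32_t>(
        byte >= 0x80 ? static_cast<std::int32_t>(byte) - 0x100
                     : static_cast<std::int32_t>(byte));
}

// Model of printing the same byte after a cast to an unsigned 8-bit type:
// no sign bit to extend, so 0xE2 stays 0xE2.
constexpr std::uint32_t widenUnsigned(std::uint8_t byte) {
    return static_cast<std::uint32_t>(byte);
}
```

In a sketch, casting first, as in `Serial.println((uint8_t)"€"[0], HEX);`, should print the plain byte value instead of the sign-extended one.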
UTF8 is supported in strings. Your own example shows this prints correctly:
|
Yup. But I want to do something like Keyboard.press('€'); |
At least on AVR, you're not going to get Keyboard.press('€'), because UTF8 requires 24 bits to encode that character, but the compiler truncates to only 16, even for functions which take a 32 bit parameter. |
using L'€' would work. why not without the L? |
@NicoHood I don't actually think that UTF-8 characters are natively supported by keyboards? What keycode should be sent for €? AFAIK some keyboards use ctrl-alt-5 for that, but on my Linux keyboard I need to type compose+e, compose+= to get it (where compose is right alt). The Keyboard library has a mapping of ASCII to keycodes, which could perhaps be extended to include UTF-8 characters, provided that those actually have a keycode. In any case, @NicoHood is now talking about Keyboard.press handling UTF-8 characters, which is a totally different issue from the original issue (@NicoHood, if you want to continue this discussion, please open a new issue). The original reporter talks about handling UTF-8 characters in Serial printing (which is supported in strings, but not in char types, because char is always one byte) and the IDE serial monitor (which supports UTF-8 now according to NicoHood). The original issue also mentions String support for UTF-8, but I'm afraid that's a whole can of worms with all kinds of corner cases and a lot of overhead, for just very few use cases. I guess this issue can be closed? |
The serial monitor doesn't correctly support UTF-8; this is an unresolved issue. It may seem to work, but it doesn't: a timing issue exists. For details look here: The same comment explains why using UTF-8 characters in C/C++ strings may be risky (for example, extracting a substring from a UTF-8 string may produce undefined results, because char arrays are unaware of UTF-8 characters that may be composed of several bytes). |
USB keyboards work differently. There is no UTF inside the USB protocol. You just need to press more keys, depending on the layout. Therefore the normal Arduino API is not enough, which is what I was working on. If the layout does not support €, for example, it is of course not possible to press it. The issue is still related, since I need to read the data from the API and press the keys. A similar issue arises when trying to print with Serial.write('€'); as in the example above, passing the character as a '' char instead of a "" string will lose some data. So the problem is actually the same, just another use case. |
No, not really. For printing UTF-8 data, you simply have to print the bytes that compose the character. In terms of code, this just means that these bytes have to be passed through unchanged. As mentioned before, the C For passing UTF-8 characters through the keyboard library, they have to be translated to keycodes, which is a totally different process. Anyway, I'll stop talking here - like I said, open a new issue if you want to continue discussing this. |
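As a rough illustration of "just print the bytes that compose the character", here is a minimal sketch that turns a Unicode code point into its UTF-8 bytes and hands each one to a byte sink. `writeCodePoint` and `BufferSink` are hypothetical names, not Arduino API. The only requirement on the sink is a `write(uint8_t)` member, which the Arduino `Serial` object also provides, so passing `Serial` should work the same way (on AVR, include `<stdint.h>` and `<stddef.h>` instead of the C++ headers).

```cpp
#include <cstddef>
#include <cstdint>

// Encode code point `cp` as UTF-8 and pass each byte to sink.write().
// Returns the number of bytes written, or 0 for values that are not
// valid Unicode scalar values (surrogates, or above U+10FFFF).
template <typename Sink>
std::size_t writeCodePoint(Sink& sink, std::uint32_t cp) {
    auto put = [&sink](std::uint32_t b) { sink.write(static_cast<std::uint8_t>(b)); };
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x80) {            // 1 byte: 0xxxxxxx
        put(cp);
        return 1;
    }
    if (cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        put(0xC0 | (cp >> 6));
        put(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        put(0xE0 | (cp >> 12));
        put(0x80 | ((cp >> 6) & 0x3F));
        put(0x80 | (cp & 0x3F));
        return 3;
    }
    put(0xF0 | (cp >> 18));     // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    put(0x80 | ((cp >> 12) & 0x3F));
    put(0x80 | ((cp >> 6) & 0x3F));
    put(0x80 | (cp & 0x3F));
    return 4;
}

// Hypothetical host-side stand-in for a serial port: collects the bytes.
struct BufferSink {
    std::uint8_t data[8] = {};
    std::size_t size = 0;
    void write(std::uint8_t b) {
        if (size < sizeof(data)) data[size++] = b;
    }
};
```

Calling `writeCodePoint(sink, 0x20AC)` produces the three bytes E2 82 AC, the same bytes a `"€"` string literal holds when the source file is saved as UTF-8. Turning a character into keyboard keycodes is a separate lookup that this sketch does not attempt.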
A small comment regarding
In other words:
So
In short, just use |
hahaha bro arduino is C++ |
I mean... a simple code like...:
void setup() { Serial.begin(9600); }
void loop() { Serial.println('€'); }
...gives trash.
I'm using minicom -D /dev/ttyACM0 -b 9600 in Linux, but that's not relevant.
Thanks
P.S.: As a plus, if "Serial monitor" could handle UTF-8 data too, it would be fantastic!!!!