
Is it possible to make Serial.print() -and "Serial monitor"- compatible with UTF-8? #2519


Closed
q2dg opened this issue Dec 28, 2014 · 15 comments

@q2dg

q2dg commented Dec 28, 2014

I mean, a simple sketch like this:

void setup() { Serial.begin(9600); }
void loop() { Serial.println('€'); }

...gives trash.

I'm using minicom -D /dev/ttyACM0 -b 9600 in Linux, but that's not relevant.
Thanks

P.S.: As a plus, if "Serial monitor" could handle UTF-8 data too, it would be fantastic!

@q2dg
Author

q2dg commented Dec 28, 2014

And, by extension, could the String class be made compatible too?

@matthijskooijman
Collaborator

void loop() { Serial.println('€'); }

I think that C doesn't know about UTF-8 either - a char is always just a single byte. If the Euro sign you typed translates to multiple UTF-8 bytes, then I'm not sure what happens here. It might be that you get a two-byte constant (wchar or something?). In any case, this is unlikely to work. Using "€" might work better, as I think that just translates to a multi-byte string, which is sent as-is through serial.

I think the Serial and String classes just handle byte arrays, so they should handle literal UTF-8 just fine. Things get more complicated when you start to modify strings (e.g. if you want to take the first three characters of a string instead of the first three bytes). Properly supporting UTF-8 for that kind of modification is probably way too complex and big to implement on Arduino.
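To make the "just byte arrays" point concrete, here is a minimal host-side C++ sketch (not Arduino code). The helper name `hexDump` and the constant `kEuro` are made up for illustration, and the euro sign is written with explicit escapes so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// UTF-8 encoding of the euro sign, spelled out byte by byte so the
// result does not depend on the encoding of this source file.
const char kEuro[] = "\xE2\x82\xAC";

// Hypothetical helper: format every byte of a C string as two hex digits.
std::string hexDump(const char* s) {
    std::string out;
    const std::size_t n = std::strlen(s);
    for (std::size_t i = 0; i < n; ++i) {
        char buf[4];
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(s[i])));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Here `hexDump(kEuro)` yields the three bytes E2 82 AC: the string holds three bytes plus the terminating NUL, and a byte-oriented print call only has to forward them unchanged.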

I'm using minicom -D /dev/ttyACM0 -b 9600 in Linux, but that's not relevant.

That is totally relevant - A serial line is just a byte stream, it's up to the terminal emulator (minicom in this case) to decode bytes into characters. If minicom is not configured for UTF8, then sending UTF8 characters is not going to work.
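As a rough illustration of why the terminal matters, the host-side C++ sketch below decodes the same three bytes in two ways. The function names are hypothetical, and the UTF-8 decoder is deliberately minimal (it does not reject overlong forms or surrogates).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The three bytes an Arduino would send for the euro sign in UTF-8.
const unsigned char kEuroBytes[] = {0xE2, 0x82, 0xAC};

// What a terminal set to a single-byte charset such as Latin-1 does:
// every byte becomes its own character.
std::vector<uint32_t> decodeLatin1(const unsigned char* p, std::size_t n) {
    return std::vector<uint32_t>(p, p + n);
}

// Minimal UTF-8 decoder: malformed input yields U+FFFD.
std::vector<uint32_t> decodeUtf8(const unsigned char* p, std::size_t n) {
    std::vector<uint32_t> out;
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = p[i++];
        uint32_t cp = 0;
        std::size_t extra = 0;
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); continue; }  // stray continuation byte
        bool ok = true;
        for (; extra > 0; --extra) {
            if (i >= n || (p[i] & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (p[i++] & 0x3F);
        }
        out.push_back(ok ? cp : 0xFFFD);
    }
    return out;
}
```

Decoded as UTF-8, the bytes give the single code point U+20AC; decoded as Latin-1, they give three unrelated characters, which is the "trash" a misconfigured terminal would show.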

@PaulStoffregen
Contributor

Actually, if you run this on an Arduino Due, it prints 14844588. In hex, that's 0xE282AC.

E2 82 AC is the UTF8 encoding for '€'.

On Arduino Uno, it prints -32084, which is 0x82AC. It seems the AVR compiler isn't smart enough to know it needs to promote to a 32-bit integer, but it's trying to give an integer constant representing the UTF8 encoded character.

One might argue the compiler ought to give 0x20AC (the Unicode value) for the character constant, but clearly the compiler is trying to give UTF8.

@matthijskooijman
Collaborator

One might argue the compiler ought to give 0x20AC (the Unicode value) for the character constant, but clearly the compiler is trying to give UTF8.

I think the compiler is not doing either - it's the text editor that decided on the encoding to use. I guess the compiler only has support for multi-byte char constants and translates to the smallest fitting integer type. In other words, you could also write 'ABC' and get 0x414243. To the compiler, both are just three bytes contained in quotes - it's the text editor that displays those three bytes as a single euro sign.
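The numbers reported above can be reproduced with plain arithmetic. The host-side C++ sketch below assumes the byte-by-byte, shift-and-OR scheme that GCC documents for multi-character constants (the standard leaves the value implementation-defined); `packBytes` and `truncateToInt16` are hypothetical helper names.

```cpp
#include <cstddef>
#include <cstdint>

const unsigned char kAbcBytes[]  = {'A', 'B', 'C'};
const unsigned char kEuroUtf8[]  = {0xE2, 0x82, 0xAC};

// Shift the previous value left by 8 bits and OR in the next byte,
// assumed here to mirror how GCC values a multi-character constant.
uint32_t packBytes(const unsigned char* bytes, std::size_t n) {
    uint32_t value = 0;
    for (std::size_t i = 0; i < n; ++i) {
        value = (value << 8) | bytes[i];
    }
    return value;
}

// Keep only the low 16 bits and read them as a signed value, which is
// what a 16-bit int (as on AVR) can hold.
int16_t truncateToInt16(uint32_t value) {
    const uint32_t low = value & 0xFFFFu;
    const int32_t asSigned = low >= 0x8000u
        ? static_cast<int32_t>(low) - 0x10000
        : static_cast<int32_t>(low);
    return static_cast<int16_t>(asSigned);
}
```

With a 32-bit int, the three euro bytes pack to 0xE282AC (14844588, the Due result); cut down to a 16-bit int, they become 0x82AC, which reads as -32084 (the Uno result).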

@NicoHood
Contributor

The serial monitor should now support UTF8, but only for strings. If there is a way to use UTF8 in char literals too, that'd be great. I am trying to implement a multilingual keyboard, but this breaks everything.

void setup() {}

void loop() {
  Serial.println('€', HEX);
  Serial.println(uint32_t('€'), HEX);
  Serial.println(long('€'), HEX);
  Serial.println(L'€', HEX);
  Serial.println('€', DEC);
  Serial.println('€');
  Serial.println(L'€');
  Serial.println("€");
  Serial.println("€"[0], HEX);
  Serial.println("€"[1], HEX);
  Serial.println("€"[2], HEX);
  Serial.println("€"[3], HEX);
  Serial.println("-----");
  delay(1000);
}

/*
  FFFF82AC
  FFFF82AC
  FFFF82AC
  20AC
  -32084
  -32084
  8364
  €
  FFFFFFE2
  FFFFFF82
  FFFFFFAC
  0
  -----
*/
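The FFFFFF prefixes in that output look like sign extension: char is signed on AVR, so a byte such as 0xE2 is negative and keeps its sign when widened before being printed in hex. The host-side C++ sketch below models this with hypothetical helpers; it illustrates the arithmetic and does not call the Arduino Print class.

```cpp
#include <cstdint>

// Widen a byte to 32 bits as if it were a *signed* 8-bit char, then view
// the resulting bit pattern as unsigned (what a hex print would show).
uint32_t widenAsSignedChar(unsigned char byte) {
    const int32_t wide = byte >= 0x80
        ? static_cast<int32_t>(byte) - 256   // negative when the top bit is set
        : static_cast<int32_t>(byte);
    return static_cast<uint32_t>(wide);
}

// Widen the same byte as an unsigned value: no sign bits are copied.
uint32_t widenAsUnsignedChar(unsigned char byte) {
    return byte;
}
```

Under that reading, casting each element to unsigned char before printing, for example `Serial.println((unsigned char)"€"[0], HEX)`, should print E2 rather than FFFFFFE2. This has not been checked on hardware here.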

@PaulStoffregen
Contributor

UTF8 is supported in strings. Your own example shows this prints correctly:

Serial.println("€");

@NicoHood
Contributor

Yup. But I want to do something like Keyboard.press('€');
I am also sick of the damn USB keyboard layouts. It's just an immense amount of work which I don't even personally need, and the IDE makes it even more complicated.

@PaulStoffregen
Contributor

At least on AVR, you're not going to get Keyboard.press('€'), because UTF8 requires 24 bits to encode that character, but the compiler truncates to only 16, even for functions which take a 32 bit parameter.

@NicoHood
Contributor

Using L'€' would work. Why not without the L?

@matthijskooijman
Collaborator

@NicoHood I don't actually think that UTF-8 characters are natively supported by keyboards. What keycode should be sent for €? AFAIK some keyboards use ctrl-alt-5 for that, but on my Linux keyboard I need to type compose+e, compose+= to get it (where compose is right Alt). The Keyboard library has a mapping of ASCII to keycodes, which could perhaps be extended to include UTF-8 characters, provided that those actually have a keycode.

In any case, @NicoHood is now talking about Keyboard.press handling UTF-8 characters, which is a totally different issue from the original one (@NicoHood, if you want to continue this discussion, please open a new issue). The original reporter talks about handling UTF-8 characters in Serial printing (which is supported in strings, but not in char types, because a char is always one byte) and in the IDE serial monitor (which supports UTF-8 now, according to NicoHood). The original issue also mentions String support for UTF-8, but I'm afraid that's a whole can of worms with all kinds of corner cases and a lot of overhead, for very few use cases.

I guess this issue can be closed?

@cmaglie
Member

cmaglie commented Oct 27, 2015

The serial monitor doesn't correctly support UTF-8; this is an unresolved issue. It may seem to work, but it doesn't: a timing issue exists. For details, look here:
#2430 (comment)

The same comment explains why using UTF-8 characters in C/C++ strings may be risky (for example, extracting a substring from a UTF-8 string may produce undefined results, because char arrays are unaware of UTF-8 characters that may be composed of several bytes).
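The substring risk can be shown with a short host-side C++ sketch using std::string rather than the Arduino String class. The helper names are hypothetical, and the code assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// "5€!" with the euro sign written as explicit UTF-8 bytes (5 bytes total).
const std::string kPrice = "5\xE2\x82\xAC!";

// Continuation bytes of a multi-byte sequence have the bit pattern 10xxxxxx.
bool isContinuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Count characters (code points) by counting non-continuation bytes.
std::size_t countCodePoints(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (!isContinuation(static_cast<unsigned char>(c))) ++n;
    }
    return n;
}

// Return the first `count` code points without cutting through a sequence.
std::string utf8Prefix(const std::string& s, std::size_t count) {
    std::size_t i = 0;
    while (i < s.size() && count > 0) {
        ++i;  // consume the lead byte
        while (i < s.size() && isContinuation(static_cast<unsigned char>(s[i]))) {
            ++i;  // consume its continuation bytes
        }
        --count;
    }
    return s.substr(0, i);
}
```

A byte-based `kPrice.substr(0, 2)` keeps "5" plus only the first byte of the euro sign, which is no longer valid UTF-8, while `utf8Prefix(kPrice, 2)` returns the four bytes of "5€". Even this small amount of bookkeeping hints at the overhead mentioned earlier in the thread.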

@cmaglie cmaglie closed this as completed Oct 27, 2015
@NicoHood
Contributor

USB keyboards work differently. There is no UTF inside the USB protocol. You just need to press more keys, depending on the layout. Therefore the normal Arduino API is not enough; that is what I was working on. If the layout does not support €, for example, it is of course not possible to press it.

The issue is still related, since I need to read the data from the API and press the keys. A similar issue arises when trying to print with Serial.write('€'): as the example above shows, passing the character as a '€' char literal instead of a "€" string loses some data. So the problem is the same actually, just another usecase.

@matthijskooijman
Collaborator

So the problem is the same actually, just another usecase.

No, not really. For printing UTF-8 data, you simply have to print the bytes that compose the character. In terms of code, this just means that these bytes have to be passed through unchanged. As mentioned before, the C char type is not suited for UTF-8 characters (since char is really a byte, not a "character" in practice).

For passing UTF-8 characters through the keyboard library, they have to be translated to keycodes, which is a totally different process. Anyway, I'll stop talking here - like I said, open a new issue if you want to continue discussing this.
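To show how different that translation step is, here is a purely hypothetical C++ sketch of a layout-dependent lookup. The struct, enum, and function names are invented and are not part of the Arduino Keyboard library. The usage IDs and the modifier bit follow the USB HID keyboard conventions, and the two layout entries (AltGr+E on a German layout, AltGr+5 on US International) are assumptions about how the host is configured.

```cpp
#include <cstdint>

// One key press: a modifier bitmask plus a HID keyboard usage ID.
// usage == 0 means "this layout cannot type the character".
struct KeyStroke {
    uint8_t modifiers;
    uint8_t usage;
};

constexpr uint8_t kModRightAlt = 0x40;  // Right Alt (AltGr) modifier bit
constexpr uint8_t kUsageE      = 0x08;  // physical "E" key
constexpr uint8_t kUsage5      = 0x22;  // physical "5" key

// Hypothetical set of host layouts the device might be told about.
enum class Layout { German, UsInternational };

// Map a Unicode code point to a key press for the given layout.
KeyStroke lookupKeyStroke(Layout layout, uint32_t codePoint) {
    if (codePoint == 0x20AC) {  // euro sign
        switch (layout) {
            case Layout::German:          return {kModRightAlt, kUsageE};
            case Layout::UsInternational: return {kModRightAlt, kUsage5};
        }
    }
    return {0, 0};  // no mapping known
}
```

The same code point maps to different physical keys depending on the layout, and to none at all if the layout lacks the character, so nothing here resembles passing bytes through unchanged.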

@ffissore ffissore modified the milestone: Release 1.6.6 Oct 28, 2015
@cousteaulecommandant
Contributor

cousteaulecommandant commented Apr 29, 2016

A small comment regarding '€' in C++:

An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. [...] A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.

In other words:

  • A character literal like 'A' is a char (and its value is 65 in ASCII).
  • A character literal like '€' is an int, its implementation is optional, and its value is implementation-defined. That is, 'A' and '€' have different types (char and int respectively).

So Serial.write('A') calls Serial.write() with a char argument, but Serial.write('€') does it with an int argument, so it will be the same as doing Serial.write(-32084); C++ has no way to tell the two apart.
Furthermore, the C++ standard doesn't care at all how the implementation assigns a numeric value to '€', as long as it documents it (apparently this one gives 0x82AC, which happens to be the last 2 bytes of the UTF-8 representation of '€', and also of 'ガ', so maybe '€'=='ガ', although I haven't tried this). In other words, typing '€' in a program is pretty much useless.

L'€', on the other hand, is better defined: it's a wide-character literal, of type wchar_t, not just a char or a malformed int, and its numeric value is meaningful (in this case 0x20AC, the Unicode value of code point '€'). Therefore it would be possible to make Serial.print() print wchar_t arguments as characters rather than integers; it would be a matter of modifying Print.cpp and Print.h to include a Print::print(wchar_t) method. However I don't think this is important; wide characters and wide character strings are probably never going to be used at all.
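For a sense of what such a Print::print(wchar_t) method would have to do, the host-side C++ sketch below turns a code point into its UTF-8 bytes. `encodeUtf8` is a hypothetical standalone helper, not an existing Arduino function, and it does not reject surrogates or values above U+10FFFF.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as 1 to 4 UTF-8 bytes.
std::string encodeUtf8(uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                         // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));           // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));          // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));          // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Feeding it 0x20AC gives the bytes E2 82 AC, the same sequence that the string literal "€" already contains. With a 16-bit wchar_t, as on AVR, only the first three branches could ever be reached.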

In short, just use Serial.print("€") to print a euro symbol. But make sure issue #2430 is fixed first :)

@tedlu-tw

void loop() { Serial.println('€'); }

I think that C doesn't know about UTF-8 either - a char is always just a single byte. If the Euro sign you typed translates to multiple UTF-8 bytes, then I'm not sure what happens here. It might be that you get a two-byte constant (wchar or something?). In any case, this is unlikely to work. Using "€" might work better, as I think that just translates to a multi-byte string, which is sent as-is through serial.

I think the Serial and String classes just handle byte arrays, so they should handle literal UTF-8 just fine. Things get more complicated when you start to modify strings (e.g. if you want to take the first three characters of a string instead of the first three bytes). Properly supporting UTF-8 for that kind of modification is probably way too complex and big to implement on Arduino.

I'm using minicom -D /dev/ttyACM0 -b 9600 in Linux, but that's not relevant.

That is totally relevant - A serial line is just a byte stream, it's up to the terminal emulator (minicom in this case) to decode bytes into characters. If minicom is not configured for UTF8, then sending UTF8 characters is not going to work.

hahaha bro arduino is C++


8 participants