Input Method to compose complex characters #2430

Myriads · 2014-11-09T03:27:34Z

There are 2 bugs when typing Korean characters:

If any ASCII character such as a space, comma, period, or digit is typed while a Korean character is being composed, the character being composed disappears from the editor and Serial Monitor windows, as if it had been deleted.
Korean characters usually consist of 3 components: 1) an initial consonant, 2) a vowel, and 3) a final consonant. Sometimes the final consonant of the previously composed character becomes the initial consonant of the next one. When that happens, the previous character should be redrawn in its final composed form, without that consonant, but it is still displayed with the final consonant that has moved to the next character.

Same in 1.5.x

matthijskooijman · 2014-11-10T08:45:38Z

Hmm, I wonder if this is something that the Arduino code can influence, or if we're just dependent on Java to do the right thing here...

Myriads · 2014-11-10T09:44:38Z

I believe these 2 problems are related to the source code that processes the Java Input Method, so they should not affect the Arduino code.

ffissore · 2015-05-12T08:12:07Z

This should be fixed with the new editor, available with the latest hourly build http://www.arduino.cc/en/Main/Software#hourly

Myriads · 2015-05-12T10:48:30Z

It works great. Thanks a lot.

But there is still a bug in the Serial Monitor.

This sketch prints the Korean characters "한글" directly to the Serial Monitor and then displays the inputString received from the Serial Monitor.

The inputString is displayed correctly, but the directly printed Korean characters "한글" are broken. They are displayed as shown in the attached picture.

cmaglie · 2015-05-12T11:13:25Z

Could you copy and paste your sketch here?

cmaglie · 2015-05-12T13:00:18Z

Never mind, I reproduced (more or less) the issue with this sketch on an Arduino Due:

void setup() {
  Serial.begin(9600);
}

void loop() {
  Serial.println("한글");
  delay(1000);
}

Before raising false expectations, let me say that the string functions in Arduino are designed to work with plain ASCII characters. If you try to use UTF-8 characters it may work in simple cases, but you may encounter random faulty behaviour in more complex sketches, for example if you try to concatenate two strings or extract a substring from a bigger one.

That said, the above sketch works if I connect to the serial port with an external terminal program like PuTTY, but it prints random garbage in the Serial Monitor of the Arduino IDE. So my conclusion is that something weird is happening in the Arduino Serial Monitor.

My guess is that the issue is in how the incoming chars are buffered here:
https://github.com/arduino/Arduino/blob/master/arduino-core/src/processing/app/Serial.java#L156

        byte[] buf = port.readBytes(serialEvent.getEventValue());
        if (buf.length > 0) {
          String msg = new String(buf);
          char[] chars = msg.toCharArray();
          message(chars, chars.length);
        }

A UTF-8 char may be composed of several bytes, and the String constructor can decode it correctly only if the complete char is received in one single read. If a multi-byte UTF-8 char is fragmented across two reads, the two consecutive calls to the String constructor are not able to build the correct character.
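The fragmentation can be reproduced without any serial hardware. The following is a minimal sketch, not taken from the Arduino sources: the class and method names are made up for illustration, the split point (after 4 of the 6 bytes) is an arbitrary assumption, and it passes UTF-8 explicitly to the String constructor, whereas the snippet above relies on the platform default charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the Arduino IDE sources.
class SplitUtf8Demo {
  // The two syllables from the sketch above, written as escapes so the
  // example does not depend on the source file encoding.
  static final String TEXT = "\uD55C\uAE00";

  // Decodes every chunk on its own, like the per-read String constructor
  // call in the snippet above, and joins the results.
  static String decodePerChunk(byte[]... chunks) {
    StringBuilder out = new StringBuilder();
    for (byte[] chunk : chunks) {
      out.append(new String(chunk, StandardCharsets.UTF_8));
    }
    return out.toString();
  }

  public static void main(String[] args) {
    byte[] all = TEXT.getBytes(StandardCharsets.UTF_8); // 6 bytes, 3 per syllable
    byte[] first = Arrays.copyOfRange(all, 0, 4);       // ends inside the 2nd syllable
    byte[] second = Arrays.copyOfRange(all, 4, 6);      // the 2 leftover bytes

    // Same bytes, delivered in one read and then in two reads.
    System.out.println(TEXT.equals(decodePerChunk(all)));
    System.out.println(TEXT.equals(decodePerChunk(first, second)));
  }
}
```

With the bytes delivered in one piece the text round-trips; when the second syllable is split across two chunks, each String constructor call sees an incomplete sequence and substitutes replacement characters, so the comparison fails.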

This is a tricky issue, because JSSC doesn't implement the InputStream interface but, instead, has this weird readBytes() method that returns an array of bytes. See https://github.com/scream3r/java-simple-serial-connector/issues/17

The best fix would be to implement an InputStream interface in JSSC and feed the InputStream into an InputStreamReader or a BufferedReader that will do all the correct buffering and decoding.

An alternative is to write an anonymous InputStream wrapper around JSSC's Serial object to obtain the same result.
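The idea of letting an InputStreamReader do the buffering can be tried out in isolation. The sketch below makes no JSSC calls: ChunkedInputStream is a hypothetical stand-in that hands out pre-recorded chunks, one per read, roughly the way a wrapper around the serial port might. A real wrapper would also have to block until data arrives, which is left out here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical stand-in for an InputStream wrapper around a serial port.
class ChunkedInputStream extends InputStream {
  private final Deque<byte[]> chunks = new ArrayDeque<>();
  private byte[] current = new byte[0];
  private int pos = 0;

  ChunkedInputStream(byte[]... parts) {
    chunks.addAll(Arrays.asList(parts));
  }

  // Moves to the next non-empty chunk; returns false when all data is used up.
  private boolean fill() {
    while (pos == current.length) {
      if (chunks.isEmpty()) return false;
      current = chunks.poll();
      pos = 0;
    }
    return true;
  }

  @Override
  public int read() {
    return fill() ? (current[pos++] & 0xFF) : -1;
  }

  // Returns at most one chunk per call, imitating fragmented serial reads.
  @Override
  public int read(byte[] b, int off, int len) {
    if (len == 0) return 0;
    if (!fill()) return -1;
    int n = Math.min(len, current.length - pos);
    System.arraycopy(current, pos, b, off, n);
    pos += n;
    return n;
  }

  // Reads the stream to its end through a UTF-8 InputStreamReader.
  static String readAll(InputStream in) {
    try {
      Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
      StringBuilder sb = new StringBuilder();
      int c;
      while ((c = reader.read()) != -1) sb.append((char) c);
      return sb.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String text = "\uD55C\uAE00"; // the two syllables from the sketch above
    byte[] all = text.getBytes(StandardCharsets.UTF_8);
    InputStream in = new ChunkedInputStream(
        Arrays.copyOfRange(all, 0, 4), Arrays.copyOfRange(all, 4, 6));
    System.out.println(text.equals(readAll(in)));
  }
}
```

The reader keeps the trailing byte of the first chunk until the rest of the sequence arrives, so the caller never sees a half-decoded character.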

Myriads · 2015-05-12T15:35:35Z

Here is the sketch:

/*
  Serial Event example
 
 When new serial data arrives, this sketch adds it to a String.
 When a newline is received, the loop prints the string and 
 clears it.
 
 A good test for this is to try it with a GPS receiver 
 that sends out NMEA 0183 sentences. 
 
 Created 9 May 2011
 by Tom Igoe
 
 This example code is in the public domain.
 
 http://www.arduino.cc/en/Tutorial/SerialEvent
 
 */

String inputString = "";         // a string to hold incoming data
boolean stringComplete = false;  // whether the string is complete

void setup() {
  // initialize serial:
  Serial.begin(115200);
  // reserve 200 bytes for the inputString:
  inputString.reserve(200);
}

void loop() {
  // print the string when a newline arrives:
  if (stringComplete) {
    Serial.print("한글:"); 
    Serial.print(inputString); 
    // clear the string:
    inputString = "";
    stringComplete = false;
  }
}

/*
  SerialEvent occurs whenever a new data comes in the
 hardware serial RX.  This routine is run between each
 time loop() runs, so using delay inside loop can delay
 response.  Multiple bytes of data may be available.
 */
void serialEvent() {
  while (Serial.available()) {
    // get the new byte:
    char inChar = (char)Serial.read(); 
    // add it to the inputString:
    inputString += inChar;
    // if the incoming character is a newline, set a flag
    // so the main loop can do something about it:
    if (inChar == '\n') {
      stringComplete = true;
    } 
  }
}

cousteaulecommandant · 2016-04-29T13:44:42Z

If the multi-byte UTF8 char is fragmented the two consecutive calls to String constructor are not able to build the correct character.

With UTF-8 it is possible to detect whether a "chunk" of bytes ends in a single-byte (ASCII) character or a multi-byte sequence, and it is relatively easy to manually check whether this multi-byte sequence is complete or not (also the number of bytes that are in this chunk and the number of bytes that are missing). Therefore if a chunk ends in an incomplete multi-byte sequence, this sequence could be stripped and "saved for later", either "pushed back" with something like C's ungetc() if available, or by just saving it in an internal variable that will be prepended to the next chunk.

This involves the serial monitor being a bit smart though; plus the fix I'm mentioning is specific to UTF-8. If the InputStream solution is easy to implement and already takes care of this, it's probably a better solution.
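A rough sketch of the "save it for later" approach, in Java since that is what the serial monitor is written in. Everything here is hypothetical (the class name Utf8ChunkBuffer, the methods incompleteTailLength and feed); it only inspects lead and continuation bytes and leaves any other validation to the String constructor.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of the Arduino IDE sources.
class Utf8ChunkBuffer {
  private byte[] pending = new byte[0]; // incomplete sequence saved for later

  // Number of bytes at the end of buf[0..len) that start a multi-byte
  // UTF-8 sequence without finishing it (0 if the data ends cleanly).
  static int incompleteTailLength(byte[] buf, int len) {
    // A sequence is at most 4 bytes, so an unfinished one is at most 3 bytes.
    for (int back = 1; back <= 3 && back <= len; back++) {
      int b = buf[len - back] & 0xFF;
      if ((b & 0xC0) == 0x80) continue;      // continuation byte, look further back
      int need;
      if ((b & 0x80) == 0x00) need = 1;      // ASCII
      else if ((b & 0xE0) == 0xC0) need = 2; // lead byte of a 2-byte sequence
      else if ((b & 0xF0) == 0xE0) need = 3; // lead byte of a 3-byte sequence
      else if ((b & 0xF8) == 0xF0) need = 4; // lead byte of a 4-byte sequence
      else return 0;                         // invalid lead byte, leave it to the decoder
      return back < need ? back : 0;
    }
    return 0; // no lead byte in the last 3 bytes: nothing to hold back
  }

  // Prepends the saved bytes, decodes the complete part, saves the rest.
  String feed(byte[] chunk) {
    byte[] data = Arrays.copyOf(pending, pending.length + chunk.length);
    System.arraycopy(chunk, 0, data, pending.length, chunk.length);
    int complete = data.length - incompleteTailLength(data, data.length);
    pending = Arrays.copyOfRange(data, complete, data.length);
    return new String(data, 0, complete, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String text = "\uD55C\uAE00"; // the two syllables from the sketch above
    byte[] all = text.getBytes(StandardCharsets.UTF_8);
    Utf8ChunkBuffer buffer = new Utf8ChunkBuffer();
    String a = buffer.feed(Arrays.copyOfRange(all, 0, 4)); // first syllable only
    String b = buffer.feed(Arrays.copyOfRange(all, 4, 6)); // second syllable
    System.out.println(text.equals(a + b));
  }
}
```

As noted, this is tied to UTF-8: the bit masks encode that format's lead-byte layout and would not carry over to other encodings.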

PaulMurrayCbr · 2016-05-12T06:54:06Z

Yeah - the issue is a design one. The serial monitor uses this "message" interface that works with strings, because when sending stuff via the monitor you type something and then hit return. But this doesn't work for receiving bytes. The "message" model is inappropriate for the serial monitor altogether.

aknrdureegaesr · 2017-02-05T18:14:27Z

As noted over at 4452:

The String-constructor documentation advises to use a CharsetDecoder instead, if better control is needed.

I think that's good advice. This would give control over the encoding used, which is the point of arduino/arduino-ide#1728 .

Clean UTF-8 decoding even in the split character case is also a feature included in CharsetDecoder. It has the appropriate buffer that holds back the few bytes that belong to a not-yet completed character. See its documentation.

So using this would be an easy fix, with no need to completely redo the "message" model.
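For illustration, a minimal sketch of what chunk-wise decoding with a CharsetDecoder could look like. This is not the code from the IDE; StreamingUtf8Decoder and feed are hypothetical names and the buffer sizes are arbitrary. As far as I understand the API, when decode() is called with endOfInput set to false, the bytes of an unfinished character are simply left unconsumed in the input ByteBuffer, and the caller carries them over with compact().

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not the code that was merged into the IDE.
class StreamingUtf8Decoder {
  private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPLACE)
      .onUnmappableCharacter(CodingErrorAction.REPLACE);
  // Holds undecoded bytes between calls; stays in "write" mode in between.
  private final ByteBuffer in = ByteBuffer.allocate(64);
  private final CharBuffer out = CharBuffer.allocate(64);

  // Decodes one chunk and returns only the characters that are complete.
  String feed(byte[] chunk) {
    StringBuilder result = new StringBuilder();
    int offset = 0;
    while (offset < chunk.length) {
      // Copy as much of the chunk as fits behind the leftover bytes.
      int n = Math.min(in.remaining(), chunk.length - offset);
      in.put(chunk, offset, n);
      offset += n;
      in.flip();
      CoderResult r;
      do {
        // endOfInput = false: an unfinished sequence is not an error yet.
        r = decoder.decode(in, out, false);
        out.flip();
        result.append(out);
        out.clear();
      } while (r.isOverflow());
      in.compact(); // keep the unconsumed tail for the next chunk
    }
    return result.toString();
  }

  public static void main(String[] args) {
    String text = "\uD55C\uAE00"; // the two syllables from the sketch above
    byte[] all = text.getBytes(StandardCharsets.UTF_8);
    StreamingUtf8Decoder d = new StreamingUtf8Decoder();
    String a = d.feed(Arrays.copyOfRange(all, 0, 4)); // split inside 2nd syllable
    String b = d.feed(Arrays.copyOfRange(all, 4, 6));
    System.out.println(text.equals(a + b));
  }
}
```

Swapping StandardCharsets.UTF_8 for another Charset would be the only change needed to support a different encoding, which is what makes this approach attractive for a configurable serial monitor.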

(FWIW: I think that model is not that bad a choice, actually.)

matthijskooijman added the "Component: IDE user interface" label Nov 10, 2014

ffissore added the "Waiting for feedback" label May 12, 2015

ffissore self-assigned this May 12, 2015

Myriads closed this as completed May 12, 2015

Myriads reopened this May 12, 2015

ffissore assigned cmaglie and unassigned ffissore May 12, 2015

ffissore removed the "Waiting for feedback" label Jun 17, 2015

cmaglie mentioned this issue Oct 27, 2015: Is it possible to make Serial.print() -and "Serial monitor"- compatible with UTF-8? #2519 (Closed)

lmihalkovic mentioned this issue Dec 1, 2022: Serial monitor character encoding option arduino/arduino-ide#1728 (Open)

cmaglie mentioned this issue Apr 28, 2016: Adding encoding support to serial monitor #4801 (Closed)

aknrdureegaesr mentioned this issue Feb 9, 2017: Properly decode UTF-8 characters comming in from serial one byte at a time. #5967 (Merged)

mastrolinux added the "in progress" label Feb 9, 2017

cmaglie closed this as completed in #5967 Feb 20, 2017

ghost removed the "in progress" label Feb 20, 2017

cmaglie added this to the Release 1.8.2 milestone Feb 20, 2017

per1234 added the "Component: IDE Serial monitor" and "Type: Bug" labels Apr 7, 2021
