Reading CESU-8 data in Wireshark

Question

Hi,

I've been doing some work with an instant messaging application, and I captured a few sent messages containing emojis. I filtered the capture, and now I have packets that (according to a coworker) contain only emojis. As I went through the data, I noticed something like this (in ASCII):

0120########################################################..8.

Now, I assume that the ### part is just padding to fill out the length of the message, and that the emoji is at either the start or the end of the message. I know the data is encoded in CESU-8. What I need to know is:

Is there a way to read data encoded in CESU-8? Follow TCP Stream shows me a bunch of encoding options, but none of them works for me (there is UTF-8, which is no good since I'm not in the BMP).

I think what I need is a Lua script that adds a decoder, or maybe some other way to convert the data to UTF-8, but of course any help or suggestions will be appreciated.

Thanks, Joseph

Answer 1

There's currently nothing in Wireshark itself that understands CESU-8, so any dissector you write (in Lua or C) would have to translate it to UTF-8 (the internal character encoding of Wireshark) itself.

We could add CESU-8 as a character encoding, in which case you could write a dissector (in Lua or C) that used the new encoding to extract a string.

However, if you don't know the IM application's packet format, you can't just write a working dissector - you'll have to write something that tries its best to dissect what you think is the string, guessing what its length is.

We could also add CESU-8 support to the Follow TCP Stream dialog; there's no way to add that with a plugin.
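
For the guessing step, one option outside Wireshark is to scan the raw payload for the six-octet shape that every non-BMP character takes in CESU-8. The following is only a sketch under assumptions: the function name `find_surrogate_pairs` is made up for illustration, the payload bytes are assumed to have been exported from the capture first, and a match is merely a candidate, since the packet format is unknown.

```python
import re

# A high surrogate (U+D800..U+DBFF) encodes as ED A0..AF 80..BF and a low
# surrogate (U+DC00..U+DFFF) as ED B0..BF 80..BF, so a supplementary
# character in CESU-8 always has this six-octet shape.
CESU8_PAIR = re.compile(rb"\xed[\xa0-\xaf][\x80-\xbf]\xed[\xb0-\xbf][\x80-\xbf]")


def find_surrogate_pairs(payload: bytes) -> list:
    """Return (offset, six_octets) for every candidate pair in the payload."""
    return [(m.start(), m.group()) for m in CESU8_PAIR.finditer(payload)]


# Synthetic payload for illustration only; it is NOT taken from the capture.
example = b"0120" + b"#" * 8 + b"\xed\xa0\xbd\xed\xb8\x80" + b"8."
for offset, octets in find_surrogate_pairs(example):
    print(offset, octets.hex())
```

Any offsets it reports only suggest where a string might sit; they say nothing about where the string starts or how long it is.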

answered 11 Jul '17, 19:07

Guy Harris ♦♦

Hi, thanks a lot. It makes me a little sad to know that I have to translate the encoding myself, but at least now I know where I stand.

I will keep an eye out for any support in the future. Both suggestions would be equally great.

Thanks again, Joseph

(12 Jul '17, 13:50) josephg
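
As a stopgap outside Wireshark, the translation can lean on Python's standard codecs rather than hand-written bit handling. This is a minimal sketch, assuming the payload bytes have been copied out of the capture; the name `cesu8_to_utf8_via_codecs` is illustrative, not an existing API. It is lenient: it also accepts ordinary 4-octet UTF-8 sequences, which strict CESU-8 does not allow.

```python
def cesu8_to_utf8_via_codecs(data: bytes) -> bytes:
    """Convert CESU-8 bytes to UTF-8 using only the standard codecs."""
    # 'surrogatepass' lets the UTF-8 codec accept the 3-octet encoded
    # surrogates that CESU-8 uses for characters outside the BMP.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Round-tripping through UTF-16 joins each high/low surrogate pair
    # into a single supplementary code point.
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le").encode("utf-8")


# Example with hex pasted by hand (the CESU-8 form of U+1F600):
raw = bytes.fromhex("eda0bdedb880")
print(cesu8_to_utf8_via_codecs(raw).decode("utf-8"))
```

An unpaired surrogate in the input should make the UTF-16 decoding step raise a `UnicodeDecodeError`.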

Encodings tend to be added if there's a protocol that uses them (or, in the case of ISO 8859-1, if support can be added by running a program to generate a translation table and then adding a few lines of code).

There will probably only be CESU-8 support in the future if somebody files an enhancement request on the Wireshark Bugzilla, so as to put the request "on the radar screen".

(12 Jul '17, 13:55) Guy Harris ♦♦

Am I wrong in understanding CESU-8 to be algorithmically translatable into UTF-8, since in general it just requires manipulating groups of bits?

(12 Jul '17, 13:58) sindy

From the specification for CESU-8, it appears that the following would generate a UTF-8 string from a valid CESU-8 string (and report an error for an invalid CESU-8 string):

for (everything in the string) {
    code_point = extract_cesu_8_element();
    if (that failed for any reason, including an octet having the 4 upper bits set)
        fail;
    if (code_point introduces a surrogate pair) {
        rest_of_surrogate_pair = extract_cesu_8_element();
        if (that failed for any reason, including an octet having the 4 upper bits set)
            fail;
        if (rest_of_surrogate_pair isn't the second half of a surrogate pair)
            fail;
        combine code_point and rest_of_surrogate_pair;
        put the resulting Unicode character;
    } else
        put code_point;
}

Here extract_cesu_8_element() is similar to the code that turns a sequence of octets in UTF-8 into a Unicode code point, except that it fails if any octet is of the form 1111xxxx.

(12 Jul '17, 14:15) Guy Harris ♦♦
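
The pseudocode above could be rendered in Python roughly as follows. This is a sketch rather than anything that exists in Wireshark: the names `_extract_element` and `cesu8_to_utf8` are illustrative. It goes slightly beyond the pseudocode in two places, which are assumptions on my part: it rejects overlong encodings and an unpaired low surrogate.

```python
def _extract_element(data: bytes, i: int) -> tuple:
    """Decode one 1- to 3-octet element at offset i; return (code_point, next_i)."""
    b0 = data[i]
    if b0 < 0x80:                       # 0xxxxxxx: single octet
        return b0, i + 1
    if b0 & 0xE0 == 0xC0:               # 110xxxxx: two octets
        need, cp, minimum = 1, b0 & 0x1F, 0x80
    elif b0 & 0xF0 == 0xE0:             # 1110xxxx: three octets
        need, cp, minimum = 2, b0 & 0x0F, 0x800
    else:                               # 1111xxxx or a stray continuation octet
        raise ValueError(f"invalid lead octet 0x{b0:02x} at offset {i}")
    if i + need >= len(data):
        raise ValueError(f"truncated sequence at offset {i}")
    for k in range(1, need + 1):
        b = data[i + k]
        if b & 0xC0 != 0x80:            # continuation octets must be 10xxxxxx
            raise ValueError(f"invalid continuation octet at offset {i + k}")
        cp = (cp << 6) | (b & 0x3F)
    if cp < minimum:                    # extra check: reject overlong forms
        raise ValueError(f"overlong encoding at offset {i}")
    return cp, i + need + 1


def cesu8_to_utf8(data: bytes) -> bytes:
    """Translate a strictly valid CESU-8 byte string into UTF-8."""
    chars = []
    i = 0
    while i < len(data):
        cp, i = _extract_element(data, i)
        if 0xD800 <= cp <= 0xDBFF:      # first half of a surrogate pair
            if i >= len(data):
                raise ValueError("input ends inside a surrogate pair")
            low, i = _extract_element(data, i)
            if not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("high surrogate not followed by a low surrogate")
            # Combine the two 10-bit halves into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= cp <= 0xDFFF:    # extra check: low surrogate on its own
            raise ValueError("unpaired low surrogate")
        chars.append(chr(cp))
    return "".join(chars).encode("utf-8")
```

For example, `cesu8_to_utf8(bytes.fromhex("eda0bdedb880"))` yields the four octets `f0 9f 98 80`, the UTF-8 form of U+1F600, while a 4-octet UTF-8 sequence in the input raises `ValueError` because its lead octet has the form 1111xxxx.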