Hi, I've been doing some work with an instant messaging application, and I captured a few messages it sent that contain emojis. I filtered the capture, and now I have packets that (allegedly, according to a coworker) contain only emojis. As I went through the data, I noticed something like this (in ASCII): 0120########################################################..8. I assume that the ### part is just padding to fill out the length of the message, and that the emoji is at either the start or the end of the message. I know the data is encoded in CESU-8. What I need to know is: is there a way to read data encoded in CESU-8? Follow TCP Stream shows me a bunch of encoding options, but none of them works for me (there is UTF-8, which isn't good enough since I'm not in the BMP). I think what I need is a Lua script that adds a decoder, or maybe some other way to convert the data to UTF-8, but of course any help or suggestions will be appreciated. Thanks, Joseph
asked 11 Jul '17, 01:28 josephg
One Answer:
There's currently nothing in Wireshark itself that understands CESU-8, so any dissector you write (in Lua or C) would have to translate it to UTF-8 (the internal character encoding of Wireshark) itself.
We could add CESU-8 as a character encoding, in which case you could write a dissector (in Lua or C) that used the new encoding to extract a string. However, if you don't know the IM application's packet format, you can't just write a working dissector - you'll have to write something that tries its best to dissect what you think is the string, guessing what its length is.
We could also add CESU-8 support to the Follow TCP Stream dialog; there's no way to add that with a plugin.
answered 11 Jul '17, 19:07 Guy Harris ♦♦
Hi, thanks a lot. It makes me a little sad to know that I have to translate the encoding myself, but at least now I know where I stand.
I will keep watching to see whether support is added in the future. Either suggestion would be equally great.
Thanks again, Joseph
Encodings tend to be added if there's a protocol that uses them (or, in the case of ISO 8859-1, if support can be added by running a program to generate a translation table and then adding a few lines of code).
There will probably only be CESU-8 support in the future if somebody files an enhancement request on the Wireshark Bugzilla, so as to put the request "on the radar screen".
Am I wrong in understanding that CESU-8 can be translated into UTF-8 algorithmically, since in general it requires just manipulating groups of bits?
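To make the bit manipulation concrete, here is a small Python sketch that encodes one code point above U+FFFF the way CESU-8 does (split into a UTF-16 surrogate pair, then three octets per surrogate) and compares it with the UTF-8 form. The function names are made up for this illustration, and U+1F600 is just an example emoji, not something taken from the capture.

```python
def three_octets(unit: int) -> bytes:
    """Encode one 16-bit value (0x0800..0xFFFF) as 1110xxxx 10xxxxxx 10xxxxxx."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def cesu8_encode_supplementary(code_point: int) -> bytes:
    """Encode a code point above U+FFFF as six CESU-8 octets (hypothetical helper)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)      # high (lead) surrogate
    low = 0xDC00 | (offset & 0x3FF)     # low (trail) surrogate
    return three_octets(high) + three_octets(low)


emoji = 0x1F600
print(cesu8_encode_supplementary(emoji).hex())   # eda0bdedb880
print(chr(emoji).encode("utf-8").hex())          # f09f9880
```

So the same character takes six octets in CESU-8 and four in UTF-8, and going from one to the other is only shifting and masking.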
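From the specification for CESU-8, it appears that the following would generate a UTF-8 string from a valid CESU-8 string (and report an error for an invalid CESU-8 string): call extract_cesu_8_element() repeatedly; whenever it returns a high surrogate, call it again, require the result to be a low surrogate, and combine the two into a single supplementary code point; then append the UTF-8 encoding of each resulting code point to the output. Here extract_cesu_8_element() is similar to the code to turn a sequence of octets in UTF-8 into a Unicode code point, except that it fails if any octet is of the form 1111xxxx.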
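A minimal Python sketch of such a conversion, as an illustration only and not Wireshark code: extract_cesu_8_element() keeps the name used above, while its signature, the Cesu8Error class, and cesu8_to_utf8() are assumptions made for this example. It also assumes that overlong forms and unpaired surrogates should be rejected.

```python
from typing import Tuple


class Cesu8Error(ValueError):
    """Raised when the input is not valid CESU-8."""


def extract_cesu_8_element(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode one 1- to 3-octet sequence at pos; return (16-bit value, next pos)."""
    if pos >= len(data):
        raise Cesu8Error("unexpected end of input")
    first = data[pos]
    if first < 0x80:                        # 0xxxxxxx: single octet
        return first, pos + 1
    if first & 0xE0 == 0xC0:                # 110xxxxx: two octets
        length, value, minimum = 2, first & 0x1F, 0x80
    elif first & 0xF0 == 0xE0:              # 1110xxxx: three octets
        length, value, minimum = 3, first & 0x0F, 0x800
    else:                                   # stray 10xxxxxx, or 1111xxxx
        raise Cesu8Error(f"invalid lead octet 0x{first:02X} at offset {pos}")
    if pos + length > len(data):
        raise Cesu8Error("truncated sequence")
    for octet in data[pos + 1:pos + length]:
        if octet & 0xC0 != 0x80:
            raise Cesu8Error(f"invalid continuation octet 0x{octet:02X}")
        value = (value << 6) | (octet & 0x3F)
    if value < minimum:
        raise Cesu8Error("overlong encoding")
    return value, pos + length


def cesu8_to_utf8(data: bytes) -> bytes:
    """Convert CESU-8 octets to UTF-8 octets, raising Cesu8Error on bad input."""
    chars = []
    pos = 0
    while pos < len(data):
        unit, pos = extract_cesu_8_element(data, pos)
        if 0xD800 <= unit <= 0xDBFF:        # high surrogate: needs a partner
            low, pos = extract_cesu_8_element(data, pos)
            if not 0xDC00 <= low <= 0xDFFF:
                raise Cesu8Error("high surrogate not followed by low surrogate")
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= unit <= 0xDFFF:
            raise Cesu8Error("unpaired low surrogate")
        chars.append(chr(unit))
    return "".join(chars).encode("utf-8")


print(cesu8_to_utf8(bytes.fromhex("EDA0BDEDB880")).hex())   # f09f9880
```

Text that stays within the BMP passes through unchanged, since CESU-8 and UTF-8 only differ for code points above U+FFFF.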