Dealing with HTML entities within a SOAP envelope

Question

I'm capturing and decoding traffic within a SOAP envelope. The source application passes an XML payload through WCF, which then converts all the XML reserved characters into HTML entities. So the less-than symbol (<) becomes &lt;, greater-than (>) becomes &gt;, and so on, according to W3C rules.

What I see in Wireshark is something like this:

+Frame....
+Ethernet....
+Internet Protocol....
+Transmission Control....
+[Reassembled TCP Segments....
+Hypertext Transfer Protocol....
-eXtensible Markup Language
 +<?xml
 -<SOAP-ENV:Envelope....
  -<SOAP-ENV:Body>
   -<g2s:g2sRequest>
    -<g2s:g2sRequest>
     [truncated] &lt;?xml version="1.0" encoding="utf-8"?&gt;&#xa;&lt;g2s:g2sMessage....
     </g2s:g2sRequest>
....

The XML in the truncated line can be thousands of bytes long, but it contains fields that I need to filter and gather statistics on. I want to convert the HTML entities back to their original less-than and greater-than symbols, and then filter as I would with any XML document in Wireshark, something like g2sMessage A=10%, g2sMessage B=20%, etc. Of course, this would mean XML inside XML, and I think the parser would have a fit. Why it wasn't put into a CDATA block to begin with, I don't know, but this is what I have to work with.

So can I load it into a CDATA block within Wireshark instead, and then reconstitute the XML for display, filtering, and stats? Wouldn't converting it back to real XML mess up my byte-size statistics? If this is done within Wireshark, would it be with a dissector or a DTD file? Is it even possible to reconstitute the XML payload within Wireshark, or do I have to do it after the fact?

If I go outside of Wireshark, would something like Pilot work? Or do I need to write something custom in, say, Python? At the same time, I still want all the Frame, Ethernet, and TCP/IP info on data sizes for bandwidth and latency analysis. It's just that the filtering fields are inside this locked-up XML.

All advice is welcome.

Cheers NewbieBrian

Answer 1

I have a solution.

It turns out that my application payload data is the line above that says [truncated] &lt;?xml...... If the XML dissector has already executed, this data is held in the field xml.cdata. By using a post-dissector (thereby ensuring that the HTTP and XML dissectors have already run), you can take the payload from xml.cdata, run a series of substitutions to put it back into XML form, and then you're ready to process your application data.

The steps are:

--Create a local field extractor for xml.cdata (must be done outside the dissector function)
local f_xml_cdata = Field.new("xml.cdata")
.....
function trivial_proto.dissector(buffer, pinfo, tree)
     --Put field data into local variable inside
     local l_xml_cdata = f_xml_cdata()
     --Check to see if we have xml payload otherwise return
     if l_xml_cdata == nil then return end
     --Convert to string (declared local so it does not leak into the global scope)
     local s_xml_cdata = tostring(l_xml_cdata)
     .....
     --Substitute HTML entities for real XML reserved characters
     s_xml_cdata = string.gsub(s_xml_cdata, "&lt;", "<")
     s_xml_cdata = string.gsub(s_xml_cdata, "&gt;", ">")
     --Convert encoded line feeds (&#xA; is \n) to nothing; match either letter case
     s_xml_cdata = string.gsub(s_xml_cdata, "&#[xX][aA];", "")
     --Convert encoded carriage returns (&#xD; is \r) to nothing
     s_xml_cdata = string.gsub(s_xml_cdata, "&#[xX][dD];", "")
     --In case some systems leave real line feeds in, convert them to nothing
     s_xml_cdata = string.gsub(s_xml_cdata, "\n", "")
     --Optional statement to get rid of any whitespace between nodes
     s_xml_cdata = string.gsub(s_xml_cdata, "%s*<", "<")
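For anyone who wants the substitutions in one place, here is a rough sketch of how they could be wrapped into a reusable helper and hooked into a post-dissector. It is not the exact script I run. The names decode_entities, flatten, g2sdecoded and g2sdecoded.text are made up for illustration, and the extra handling of &quot;, &apos;, &amp; and decimal references is an assumption about what the payload might contain rather than something seen in the capture above. &amp; is decoded last so that a doubly escaped sequence such as &amp;lt; comes out as &lt; instead of <. The Wireshark part is guarded so the helper functions can also be tried in a plain Lua interpreter.

```lua
-- Named references handled by the sketch; "amp" is deliberately left out
-- and decoded separately at the end.
local NAMED = { lt = "<", gt = ">", quot = '"', apos = "'" }

-- Hypothetical helper: turn character references back into literal characters.
local function decode_entities(s)
    local function from_code(code)
        -- Only ASCII is mapped here. Returning nil makes gsub keep the
        -- original text for anything else.
        if code and code < 128 then return string.char(code) end
    end
    -- Hex references such as &#xa; or &#xD; (either letter case)
    s = s:gsub("&#[xX](%x+);", function(hex) return from_code(tonumber(hex, 16)) end)
    -- Decimal references such as &#10;
    s = s:gsub("&#(%d+);", function(dec) return from_code(tonumber(dec, 10)) end)
    -- Named references; names not in the table are left as they are
    s = s:gsub("&(%a+);", NAMED)
    -- &amp; goes last so "&amp;lt;" becomes "&lt;" and not "<"
    s = s:gsub("&amp;", "&")
    return s
end

-- Hypothetical helper: drop line breaks and whitespace in front of tags,
-- mirroring the optional clean-up steps above.
local function flatten(s)
    s = s:gsub("[\r\n]", "")
    s = s:gsub("%s+<", "<")
    return s
end

-- Wireshark wiring. Skipped when these globals do not exist (plain Lua).
if Proto and ProtoField and Field and register_postdissector then
    local f_xml_cdata = Field.new("xml.cdata")
    local p_decoded = Proto("g2sdecoded", "Decoded embedded XML (sketch)")
    local pf_text = ProtoField.string("g2sdecoded.text", "Decoded XML")
    p_decoded.fields = { pf_text }

    function p_decoded.dissector(buffer, pinfo, tree)
        -- A frame may carry several xml.cdata fields, so collect them all.
        for _, finfo in ipairs({ f_xml_cdata() }) do
            tree:add(pf_text, flatten(decode_entities(tostring(finfo))))
        end
    end

    register_postdissector(p_decoded)
end
```

With something like this loaded, the decoded text should show up as its own string field that a display filter can match against (for example with contains). The frame and TCP byte counts are untouched, because the post-dissector only adds a field to the tree and does not change the captured data.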

Hope this was helpful