This is our old Q&A Site. Please post any new questions and answers at ask.wireshark.org.

I have a .pcap file containing HTTP packets, one of which has text/html written in the Info column in Wireshark. When I double-click this packet, I get a window showing the packet details, and there is a node named Line-based text data, followed by the entire HTML payload.

[screenshot]

When I used the command tshark -r "path\to\pcap\file.pcap" -t ad -V > "path\for\text\file.txt" to convert the packet details from the .pcap file to a .txt file, I found the HTML payload (i.e. the entire page HTML) in the output:

[screenshot]

I want to get this HTML payload using tshark. So I looked in the Display Filter Expressions window of Wireshark to find the field which would get me this HTML payload, and I found that the field is data-text-lines.

[screenshot]

But when I used tshark -r "path\to\pcap\file.pcap" -T fields -e "data-text-lines" > "path\for\txt\file.txt", all I got was the following single line in the middle of an otherwise empty file:

[screenshot]

So what should I do? How do I extract the entire HTML of the page? It is clearly there, but I am unable to extract it using any tshark field.

asked 20 Aug '16, 23:34

Jesss


In this case, there are many data-text-lines in the frame, so in addition to -T fields -e data-text-lines, you also need to use -E occurrence=a to make tshark show all of them. I'm not sure their order will be correct, though, so please try it and give us feedback.
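The option combination above can be assembled programmatically. This is only a minimal sketch: the helper names `build_field_command` and `run_field_extraction` are hypothetical, and it assumes tshark is installed and on the PATH.

```python
import shutil
import subprocess


def build_field_command(pcap_path, field="data-text-lines"):
    """Return the tshark argument list that prints every occurrence of `field`."""
    return [
        "tshark",
        "-r", pcap_path,       # read packets from the capture file
        "-T", "fields",        # print only the selected fields
        "-e", field,           # the field to print
        "-E", "occurrence=a",  # print all occurrences, not just one
    ]


def run_field_extraction(pcap_path, field="data-text-lines"):
    """Run tshark and return its standard output as text."""
    if shutil.which("tshark") is None:
        raise RuntimeError("tshark was not found on PATH")
    result = subprocess.run(
        build_field_command(pcap_path, field),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `run_field_extraction("file.pcap")` returns whatever tshark prints for that field; whether the lines come out in the right order still has to be checked against the capture.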

Just bear in mind that this may not work for other encodings of the content of an HTTP response, because if the http dissector finds a higher-layer dissector to handle the content, it does not create the data-text-lines field. So you may have to disable the higher-level dissectors.

EDIT: the above was a conclusion based on screenshots. The actual situation is described below in the chain of comments.

answered 21 Aug '16, 00:36

sindy

edited 21 Aug '16, 03:34

Thank you. I added -E "occurrence=a" to the command, but the output is the same as that without the -E argument. =(

(21 Aug '16, 01:31) Jesss

In that case, please upload the capture somewhere (cloudshark is preferred here, but any plain file sharing service like Google Drive, Dropbox, Microsoft OneDrive, ... will do) and edit your Question with a login-free link to it. Guessing by screenshots is a pain.

(21 Aug '16, 01:33) sindy

@sindy Can I do that while ensuring that I don't share my IP address? I am a little concerned about sharing my IP or MAC address publicly...

(21 Aug '16, 02:02) Jesss

Look at the anonymization features of Tracewrangler.

(21 Aug '16, 02:04) sindy

@sindy Tracewrangler has removed all my HTTP packets :s

(21 Aug '16, 02:14) Jesss

Try again, but this time uncheck the "Remove all unknown layers and ..." option in the Payload menu, which is on by default.

Tracewrangler is a powerful tool, and these intrinsically tend to be dangerous.

(21 Aug '16, 02:18) sindy

@sindy Here: https://www.cloudshark.org/captures/234d85d617e2 Please apply a display filter, and then see the second packet (It has "text/html" written in info section)

(21 Aug '16, 02:45) Jesss

Hm, so my assumption based on your screenshots was wrong: the "line-based text data" is a single item and shows a size of 1045 bytes, but copying its value provides only the first line.

Worse than that for you, the http dissector could not determine that the text data in frames 10, 11, 13, and 14 should be handled together. The payload in frame 10 is dissected as "line-based text data", and its individual lines are dissected into plain "text" fields. The payload in frames 11 and 13 is handled as just "data", for which an alternative representation "data.text" exists. The payload in frame 14 is dissected as "xml". To make it even more complex, several other items in lower layers of these frames also generate "text" fields.

So all in all, an automated export of http payload from this mess is close to impossible.
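Because the dissector does not tie frames 10, 11, 13, and 14 together, one workaround is to do the reassembly outside tshark. The following is a minimal sketch, not a tshark feature: it assumes the TCP payloads of the response have already been obtained as byte strings in the right order (the `segments` list and the function name `extract_http_body` are hypothetical), and that the body is neither chunked nor compressed.

```python
def extract_http_body(segments):
    """Join TCP payloads of one HTTP response and return only the body bytes."""
    stream = b"".join(segments)

    # The header block ends at the first empty line (CRLF CRLF).
    header_block, separator, body = stream.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("end of HTTP headers not found")

    # Parse "Name: value" header lines, skipping the status line.
    headers = {}
    for line in header_block.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()

    # If Content-Length is present, cut the body to that many bytes.
    length = headers.get(b"content-length")
    if length is not None:
        body = body[:int(length)]
    return body
```

The function does not care where segment boundaries fall, so a header split across two frames is handled the same way as a body split across four.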

(21 Aug '16, 03:28) sindy

@sindy Ooh.. so does that mean there is no way to extract HTTP payload from a .pcap file?

(21 Aug '16, 04:16) Jesss

In general it doesn't, but for this particular payload, the http dissector seems to have an issue with the reassembly. Normally, the last packet of the response (frame 14 in your example capture) would have another tab with a reassembly of the http response as a whole, but this one doesn't. I can only guess that the dissector cannot handle it properly, e.g. because the server uses an old version of html and specifies the payload type as text/html.

(21 Aug '16, 04:52) sindy

@sindy "Normally, the last packet of the response (frame 14 in your example capture) would have another tab with a reassembly of the http response as a whole" - how would I extract the payload through tshark (commandline) if the reassembled http response WAS present there?

(21 Aug '16, 05:19) Jesss

I haven't found any way. The protocol dissectors normally dissect the data they can handle into individual items (= fields) but do not create a field with all the source data as a text or byte string. In GUI Wireshark, File->Export Objects->HTTP can be used, but no equivalent functionality is available in tshark.
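For reference, tshark releases newer than the one discussed in this thread provide, to my knowledge, an --export-objects option that does from the command line what File->Export Objects->HTTP does in the GUI. A minimal sketch, assuming such a release is installed; the helper name `build_export_command` and the output directory are hypothetical, and the option only helps when the dissector actually reassembles the response.

```python
import subprocess


def build_export_command(pcap_path, out_dir):
    """Return the tshark argument list that exports HTTP objects to out_dir."""
    return [
        "tshark",
        "-r", pcap_path,                         # read packets from the capture file
        "-q",                                    # suppress per-packet output
        "--export-objects", f"http,{out_dir}",   # protocol,destination directory
    ]


def export_http_objects(pcap_path, out_dir):
    """Run tshark; exported files, if any, end up in out_dir."""
    subprocess.run(build_export_command(pcap_path, out_dir), check=True)
```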

(21 Aug '16, 06:47) sindy

@sindy OK. Thank you so much for your help.

(21 Aug '16, 09:24) Jesss
question asked: 20 Aug '16, 23:34

question was seen: 1,747 times

last updated: 21 Aug '16, 09:25