How can I filter the HTTP packet exchange of a particular website visit?

Question

Hi. I am trying to extract all packets exchanged during a visit to a particular website (for example, Google). I am considering websites that pull content from several different servers. Is it possible to filter out the whole set of packets belonging to one website visit when the capture also contains packets from visits to other websites? Is there any way to determine when a whole web page has finished loading? Please help me with this issue.

Answer 1

One method of "binding" the individual HTTP requests together into the set needed for building a particular page is to use the HTTP header "Referer:". Whenever you request a page, the requests for all objects that are referenced by the initial HTML page carry a "Referer:" header pointing back to this page.

One drawback of this method is that when you click on a link, the new "initial" html page will have a "Referer:" pointing to the page in which the link was clicked. But you will be able to distinguish this page by the time gap between the other objects and this new html object. Also, you will see other objects with "Referer:" headers pointing back to this URL.

Example:

You open http://www.example.com/index.html:

GET /index.html HTTP/1.1
Host: www.example.com

GET /logo.png HTTP/1.1
Host: www.example.com
Referer: http://www.example.com/index.html

GET /background.jpg HTTP/1.1
Host: www.example.com
Referer: http://www.example.com/index.html

GET /style.css HTTP/1.1
Host: www.example.com
Referer: http://www.example.com/index.html

You click on the link to http://www.example.com/nl/contact.html:

GET /nl/contact.html HTTP/1.1
Host: www.example.com
Referer: http://www.example.com/index.html

GET /background_nl.jpg HTTP/1.1
Host: www.example.com
Referer: http://www.example.com/nl/contact.html

So, if you want to filter on all http requests involved when opening http://www.example.com/index.html, you can use the filter:

(http.host=="www.example.com" and http.request.uri=="/index.html") or http.referer=="http://www.example.com/index.html"

(Take into account that this also matches the request made when you click on a link on the page, because its "Referer:" points back to the page.)
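The same Referer rule can also be applied outside Wireshark. Below is a minimal Python sketch, assuming the requests have already been extracted from the capture into dictionaries. The keys 'host', 'uri' and 'referer' and the function name requests_for_page are illustrative assumptions, not part of Wireshark.

```python
from urllib.parse import urlsplit


def requests_for_page(requests, page_url):
    """Return the requests tied to page_url by the Referer rule.

    Each request is a dict with the hypothetical keys 'host', 'uri'
    and 'referer' ('referer' is None when the header is absent).
    """
    parts = urlsplit(page_url)
    page_host = parts.netloc
    page_path = parts.path or "/"
    if parts.query:
        page_path += "?" + parts.query

    selected = []
    for req in requests:
        # The request for the page itself: same host and same path.
        is_page_itself = req["host"] == page_host and req["uri"] == page_path
        # Any object whose Referer points back to the page.
        is_referred = req.get("referer") == page_url
        if is_page_itself or is_referred:
            selected.append(req)
    return selected


# The requests from the example above, in capture order.
index_url = "http://www.example.com/index.html"
contact_url = "http://www.example.com/nl/contact.html"
example = [
    {"host": "www.example.com", "uri": "/index.html", "referer": None},
    {"host": "www.example.com", "uri": "/logo.png", "referer": index_url},
    {"host": "www.example.com", "uri": "/background.jpg", "referer": index_url},
    {"host": "www.example.com", "uri": "/style.css", "referer": index_url},
    {"host": "www.example.com", "uri": "/nl/contact.html", "referer": index_url},
    {"host": "www.example.com", "uri": "/background_nl.jpg", "referer": contact_url},
]
matched = [req["uri"] for req in requests_for_page(example, index_url)]
```

As with the display filter, the clicked link /nl/contact.html is included in the result, while /background_nl.jpg is not.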

You can also have a look at the HttpFox plugin for Firefox to follow each request and its headers. Firebug is another Firefox extension that might help: it shows you exactly which elements were used in loading a page, including the timings.

Answer 2

Because I need to find a way to tell my Perl code that this packet is the most probable last HTTP response, and then take the whole bunch of packets for further analysis and name the page accordingly.

If you need to do this with Perl, you had better look at one of the tools listed on the following page (especially Chaosreader) instead of using Wireshark/tshark:

https://isc.sans.edu/diary/Tools+for+extracting+files+from+pcaps/6961

Regards
Kurt

Answer 3

You might have to be a little clearer in your question. However, if you are asking how to see all the traffic associated with a particular web page, that is difficult to do directly with Wireshark. If a page starts with, say, a GET for "index.html", then to determine the subsequent GET requests you essentially have to parse the response and act like a browser: running the HTML and JavaScript, looking in the cache, and so forth. Your best bet is either to make sure nothing else is running on your client apart from your web browser, or to use tools like Firebug or the Chrome debugger to determine which requests are made. You can then select those requests with "http.request" filters in Wireshark.
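Once the browser tools have given you the list of requested URLs, the matching display filter can be generated rather than typed by hand. Here is a minimal Python sketch, assuming the URLs are available as plain strings. The function name build_display_filter is hypothetical; http.request.full_uri is the Wireshark field that holds the full request URL.

```python
def build_display_filter(urls):
    """Build a Wireshark display filter that matches any of the given URLs."""
    clauses = []
    for url in urls:
        # Escape backslashes and double quotes so the string literal stays valid.
        escaped = url.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'http.request.full_uri == "{escaped}"')
    return " or ".join(clauses)


# URLs as they might be copied from the browser's network panel.
page_filter = build_display_filter([
    "http://www.example.com/index.html",
    "http://www.example.com/logo.png",
])
print(page_filter)
```

The resulting string can be pasted into the Wireshark filter bar or passed to tshark with the -Y option.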

Answer 4

Is it possible to filter out whole bunch of packets of a particular website visit apart from multiple webvisit packet existence?

Well, that's a hard problem unless you are "sitting" in the browser. Just by looking at the network traffic, it is hard to identify and match all HTTP requests that result from a single "page load" of the browser. The reason: HTTP is stateless. Each page load can trigger further requests, and none of those requests carries any information saying that it belongs to the same "page load". Imagine this: you load cnn.com and there is an embedded image hosted on apple.com (an iCar ad). There is no way to map the access to apple.com to the "page load" of cnn.com. The user could simply have accessed that image manually, possibly even in a second instance of the browser.

So, to answer your question: No, there is no reliable way to match all subsequent HTTP requests that are triggered by a page load, as there is no common criterion to identify those requests. You could approximate it like this: You monitor every HTTP request. Then you parse the HTML code (and the JavaScript code!) of the response. After that, you know the URLs that are linked in that first HTML document. Every subsequent request from the same IP, with the same "User-Agent" (same browser), within a defined time frame (a few hundred ms), is treated as a result of the first "page load". This would work, but only with some uncertainty, as the user could have manually loaded any of the subsequent URLs in a second browser instance.
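The grouping part of this approximation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the requests are already extracted into dictionaries with the hypothetical keys 'time' (seconds), 'src_ip', 'user_agent' and 'uri', the window is 0.5 seconds, and the step of parsing the HTML and JavaScript for linked URLs is left out.

```python
def group_page_loads(requests, window=0.5):
    """Group requests into approximate page loads.

    A request joins the current group of its (src_ip, user_agent) pair
    if it arrives within `window` seconds of that group's first request;
    otherwise it starts a new group.
    """
    groups = []
    current = {}  # (src_ip, user_agent) -> most recent group for that client
    for req in sorted(requests, key=lambda r: r["time"]):
        key = (req["src_ip"], req["user_agent"])
        group = current.get(key)
        if group is not None and req["time"] - group[0]["time"] <= window:
            group.append(req)
        else:
            group = [req]
            groups.append(group)
            current[key] = group
    return groups


# Made-up sample data: one client loading two pages, plus a second browser.
sample = [
    {"time": 0.0, "src_ip": "10.0.0.5", "user_agent": "BrowserA", "uri": "/index.html"},
    {"time": 0.1, "src_ip": "10.0.0.5", "user_agent": "BrowserA", "uri": "/logo.png"},
    {"time": 0.2, "src_ip": "10.0.0.5", "user_agent": "BrowserB", "uri": "/other.html"},
    {"time": 0.3, "src_ip": "10.0.0.5", "user_agent": "BrowserA", "uri": "/style.css"},
    {"time": 2.0, "src_ip": "10.0.0.5", "user_agent": "BrowserA", "uri": "/nl/contact.html"},
]
page_loads = [[req["uri"] for req in group] for group in group_page_loads(sample)]
```

The window is measured from the first request of each group, so a slowly loading page is split into several groups. That is one more source of the uncertainty described above.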

BTW: This kind of approximation is not possible with Wireshark, unless you add some code to do that.

Is there any way to determine the ending of a whole web page loaded?

There is no reliable method, for the reasons I explained above.

Regards
Kurt