Cyberspace geography visualization

A. Information gathering and the World-Wide Web

The World-Wide Web is composed of resources, mostly documents, each identified by a URL, which is composed of four parts:

The protocol scheme, which has to be registered.
The fully qualified domain name of a network host, or its IP address.
The port number. If not specified, the default port number according to the protocol is used.
The path of the resource, given by the rest of the locator. Its form depends on the protocol used.
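
As an illustration, the four parts can be separated with Python's standard urllib.parse module. This is a minimal sketch, not part of the original work: the helper name url_parts and the small DEFAULT_PORTS table are assumptions introduced here, and the table lists only a few well-known schemes.

```python
from urllib.parse import urlsplit

# Default ports for a few registered schemes (illustrative subset, assumed here).
DEFAULT_PORTS = {"http": 80, "ftp": 21, "gopher": 70, "nntp": 119}

def url_parts(url):
    """Return the four parts (scheme, host, port, path) of a URL."""
    parts = urlsplit(url)
    # Fall back to the scheme's default port when none is given explicitly.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port, parts.path

print(url_parts("http://www.eit.com/web/www.guide/"))
```

A scheme missing from the table yields a port of None, which a caller would have to handle explicitly.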

Most of the resources available on the World-Wide Web are transferred with the HTTP scheme. Other protocols are mainly used to provide backward compatibility. A major exception is NNTP (Network News Transfer Protocol), which gives access to Usenet News. Unfortunately, the information available through NNTP is not kept on a long-term basis. Therefore, it is reasonable to restrict the search to the HTTP scheme.

The HTTP URL takes the form

http://<host>:<port>/<path>?<searchpart>

where, if <port> is omitted, the port defaults to 80. URLs containing a <searchpart> element can be discarded, since they typically designate the result of a query rather than a stored document. An example of an HTTP URL is

http://www.w3.org/hypertext/WWW/Protocols/HTTP1.0/draft-ietf-http-spec.html

which is the location of the HTTP Internet draft.
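
The two restrictions stated so far (HTTP scheme only, no searchpart) can be expressed as a small filter. The sketch below is an assumption about how such a test might look. The function name is_candidate is hypothetical and does not come from the original work.

```python
from urllib.parse import urlsplit

def is_candidate(url):
    """Return True for HTTP URLs that carry no searchpart."""
    parts = urlsplit(url)
    # urlsplit lowercases the scheme; the searchpart is exposed as `query`.
    return parts.scheme == "http" and parts.query == ""
```

With this filter, the HTTP Internet draft URL given above is accepted, whereas a URL such as http://www.w3.org/find?topic=http (a made-up example) is rejected.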

According to its MIME (Multipurpose Internet Mail Extensions) type, a resource can be text, hypertext, a picture, a sound, and so on. Because we are interested only in information containing hyperlinks, we can focus our attention on hypertext documents, which are currently available only in HTML. Note that VRML (Virtual Reality Markup Language), although still experimental, should soon bring hyperlink functionality to three-dimensional immersive environments.

HTML is an application of SGML (Standard Generalized Markup Language). It permits the anchoring of parts of documents, in either textual or pictorial form, to other resources by giving their URLs. A URL can be specified in its absolute form or as a relative address. This is an example of a simple HTML document with one hypertext link:

<HTML>
<HEAD>
<TITLE>This is the title</TITLE>
</HEAD>
<BODY>
<P>This is a paragraph with one 
<A HREF="http://www.eit.com/web/www.guide/">hypertext link</A>.</P>
</BODY>
</HTML>
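
The anchors of a document such as the one above can be collected, and relative addresses resolved against the URL of the document, with Python's standard html.parser and urllib.parse modules. This is a minimal sketch under stated assumptions: the names AnchorCollector and extract_links are hypothetical, and only the HREF attribute of A elements is considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AnchorCollector(HTMLParser):
    """Collect the absolute URL of every <A HREF="..."> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative addresses are resolved against the document URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    collector = AnchorCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

The parser is tolerant of malformed markup, which matters in practice because many documents omit closing tags.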

To fetch a resource using HTTP, a connection to the specified host has to be established over TCP (Transmission Control Protocol). The request for a resource is then made by sending a GET command followed by the URL. A response header is returned, which includes the MIME type of the resource. As discussed before, we limit ourselves to the text/html Content-Type. The header is normally followed by the data, in the format of a MIME message body. Because there is no guarantee that a resource exists, the fetching code should be particularly tolerant of errors.
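
The exchange described above can be sketched with Python's standard socket module. This is an illustrative sketch, not the original implementation: the function names build_request, parse_response and fetch_html are hypothetical, the request sends only the path part of the URL in HTTP/1.0 form, and every failure is reduced to a None result.

```python
import socket

def build_request(path):
    # HTTP/1.0 request line followed by the blank line that ends the header.
    return f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii")

def parse_response(raw):
    """Split raw response bytes into (status, content_type, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")
    status = int(lines[0].split()[1])
    content_type = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            # Drop parameters such as "; charset=..." and normalize case.
            content_type = value.split(";")[0].strip().lower()
    return status, content_type, body

def fetch_html(host, port, path, timeout=10.0):
    """Return the body of a text/html resource, or None on any failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(build_request(path))
            chunks = []
            while True:
                chunk = conn.recv(4096)
                if not chunk:  # the server closes the connection when done
                    break
                chunks.append(chunk)
        status, content_type, body = parse_response(b"".join(chunks))
    except (OSError, ValueError, IndexError):
        # Unreachable host, timeout, or malformed response: give up quietly.
        return None
    if status != 200 or content_type != "text/html":
        return None
    return body
```

Present-day servers generally also expect a Host header in the request. It is left out here to keep the sketch close to the minimal exchange described in the text.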

After the HTML document has been successfully fetched, its content can be parsed. Each discovered anchor is put in a queue of URLs to fetch: at the end of the queue for a breadth-first search, or at the front for a depth-first search.

The next URL to fetch is then taken from the front of the queue, and the process continues until a specified number of resources have been fetched or the queue is empty.

Each successfully fetched URL, with all of its anchored links, can then be stored.
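
The queueing and storage steps above can be sketched as a short loop over a double-ended queue. The sketch rests on several assumptions that are not in the original text: the function name crawl is hypothetical, fetch_links stands for any callable that returns the list of links of a URL or None on failure, and a set of already seen URLs is added so that no resource is queued twice.

```python
from collections import deque

def crawl(start_url, fetch_links, limit, breadth_first=True):
    """Return {url: [links]} for up to `limit` successfully fetched URLs."""
    queue = deque([start_url])
    seen = {start_url}   # assumption: each URL is queued at most once
    fetched = {}         # storage: fetched URL -> its anchored links
    while queue and len(fetched) < limit:
        url = queue.popleft()        # next URL comes from the front
        links = fetch_links(url)
        if links is None:            # fetch failed: skip this resource
            continue
        fetched[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                if breadth_first:
                    queue.append(link)       # end of queue: breadth-first
                else:
                    queue.appendleft(link)   # front of queue: depth-first
    return fetched

# Toy in-memory "web" standing in for real fetching and parsing.
web = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(list(crawl("a", web.get, limit=10)))
print(list(crawl("a", web.get, limit=10, breadth_first=False)))
```

The only difference between the two search orders is the end of the queue at which new URLs are inserted, so a single flag is enough to switch between them.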

Cyberspace geography visualization - 15 October 1995

Luc Girardin, The Graduate Institute of International Studies