Supporting Collaboration through Multimedia Digital Document Archives

7 Moving a Document Archive to the World-Wide Web

Moving an existing document archive to the World-Wide Web is a reasonably simple task that is supported by a number of public domain tools. This section explains the protocols involved and illustrates the tools.

7.1 HTML: HyperText Markup Language

As already noted, World-Wide Web documents are represented in HTML, a document type definition in the standard generalized markup language SGML. Figure 7.1 shows the HTML generating the initial section of the document of Figure 6.8. SGML markup uses begin and end tags, such as <html> and </html> respectively. A pair of begin and end tags encapsulates a block of text and ascribes to it the attribute noted in the tag. For example, the <title> and </title> tags encapsulate the title of the document, which is displayed as the title of the window in which the document is shown.
<html>
<head>
<!-- This document was created from RTF source by rtftohtml version 2.7.4 -->
<title>GNOSIS Test Case Final Report</title>
</head>
<body>
<h1><a name="RTFToC1">Final Report of IMS Test Case 7</a></h1>
<IMG SRC="TC7F11.gif">
<h2><a name="RTFToC2">GNOSIS: Knowledge Systematization: Configuration Systems for Design and Manufacturing</a></h2>
<h2><a name="RTFToC3">1 OVERVIEW</a></h2>
<p>This is the final report summarizing the experience and results of IMS Test
Case 7, GNOSIS. Detailed reports of the work packages within GNOSIS are also
available as listed in  <a href="TC7F8.html">Section 8</a>.
<h3><a name="RTFToC4">1.1 Long-Term Goal--A New Manufacturing Paradigm</a></h3>
<p>The one year test case has been an experiment in international collaboration in
pre-competitive intelligent manufacturing systems research, with the objective
of defining a long-term research project. The long-term goal of the GNOSIS
Figure 7.1 HTML representation of GNOSIS World-Wide Web document of Figure 6.8

The entire document is encapsulated in <html></html> tags indicating the DTD used. A header of status information about the document is encapsulated in <head></head> tags. Paragraphs are encapsulated in <p></p> tags where the </p> is generally omitted under an SGML minimization convention whereby some tags may have their presence inferred. Headings are encapsulated in <hn></hn> tags where n is one of 1 through 6 to indicate different levels of headings. Tags are defined in the HTML DTD for numbered and bulleted lists, quotations, blocks of preformatted text, and so on. They are also defined for typographic markup, either logically as in <em></em> indicating emphasized words, or literally as in <b></b> for text in a bold face.

The actual representation of tagged text is the responsibility of the browser which will provide some form of style sheet so that a user can define which text should be emphasized, how a heading level 1 should appear, and so on. Most users leave the style sheet set to the defaults for the browser, giving World-Wide Web documents a similar appearance across machines. However, the local control of appearance can be used to address individual preferences, for example to provide larger fonts for the visually impaired.

The data for the picture of the GNOSIS logo at the top of Figure 6.8 is not embedded in the HTML representation. Instead a reference to it is given in line 8 through the tag <IMG SRC="TC7F11.gif">, and the browser fetches the image data and inserts it where the tag is placed in the text. Additional attributes may be specified in the image tag to specify its placement relative to the text if required.

The hypertext links in Figure 7.1 are particularly interesting. The <a name="RTFToC2"> </a> tags in line 9 encapsulate the text of the heading and give it the attribute of having the name "RTFToC2". This enables it to be used as the target of a hypertext link from another document, such as a table of contents. As already noted, the term "Section 8" in line 13 of Figure 7.1 is a hypertext link to Section 8 of the GNOSIS Final Report. It is encapsulated in the tags <a href="TC7F8.html"></a>, which give it the attribute of being a hypertext reference to the document TC7F8.html in the same directory.
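
The distinction between named anchors (link targets), hypertext references (link sources) and image references can be illustrated with a short program. The following is a minimal sketch in Python, using the standard html.parser module on a fragment abridged from Figure 7.1; the class name AnchorCollector is an illustrative assumption and is not part of any tool discussed in this section.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect named anchors, hypertext references and image references."""

    def __init__(self):
        super().__init__()
        self.names = []   # values of <a name="..."> (link targets)
        self.hrefs = []   # values of <a href="..."> (link sources)
        self.images = []  # values of <img src="..."> (embedded images)

    def handle_starttag(self, tag, attrs):
        # html.parser converts tag and attribute names to lower case,
        # so <IMG SRC=...> arrives here as tag "img" with attribute "src".
        attributes = dict(attrs)
        if tag == "a":
            if "name" in attributes:
                self.names.append(attributes["name"])
            if "href" in attributes:
                self.hrefs.append(attributes["href"])
        elif tag == "img" and "src" in attributes:
            self.images.append(attributes["src"])


# Fragment abridged from Figure 7.1 (paragraph text shortened).
FRAGMENT = """
<h1><a name="RTFToC1">Final Report of IMS Test Case 7</a></h1>
<IMG SRC="TC7F11.gif">
<h2><a name="RTFToC3">1 OVERVIEW</a></h2>
<p>Detailed reports are listed in <a href="TC7F8.html">Section 8</a>.
"""

collector = AnchorCollector()
collector.feed(FRAGMENT)
print(collector.names)   # ['RTFToC1', 'RTFToC3']
print(collector.hrefs)   # ['TC7F8.html']
print(collector.images)  # ['TC7F11.gif']
```

Note that the omitted </p> end tag causes no difficulty for the parser, in keeping with the minimization convention described above.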

HTML markup may appear complex if one is used to WYSIWYG word processors, but in practice it is very simple. There are many basic introductions to HTML available on the Internet. A short overview is available in the HTML Primer (HTML, 1993), and a more extensive one in a tutorial paper (Barry, 1994). The detailed technical specification and DTD are also available (Berners-Lee et al., 1994), and will shortly become an Internet RFC (nominally a "Request for Comments" but de facto an agreed standard). Since automatic conversion tools are available that allow documents prepared in word processors to be converted to HTML format, it is not necessary to have a deep understanding of HTML to prepare material for World-Wide Web. However, it is worth browsing a primer to understand at least the basics of the document format, and the power available through references to embedded images and other material, and through hypertext links. An up-to-date guide to relevant resources on the Internet is maintained in the document World-Wide Web Frequently Asked Questions (Boutell, 1994).

7.2 URL: Uniform Resource Locator

World-Wide Web introduced a protocol for accessing a file on the Internet through a uniform resource locator specifying: the protocol to be used; the Internet address of the machine on which the file was located; the directory path to the file; the file name; and optional parameters such as a named location inside an HTML file. The syntax of a URL is:

protocol :// address / path / file delimiter parameters

where the delimiter for a named location is "#". Thus, the URL for Section 1 of the document of Figures 6.8 and 7.1 is:

http://ksi.cpsc.ucalgary.ca/GNOSIS/TC7F1.html#RTFToC3

indicating that the hypertext transfer protocol should be used to access it on the machine at "ksi.cpsc.ucalgary.ca" in the World-Wide Web sub-directory "GNOSIS" with file name "TC7F1.html" commencing at the text named "RTFToC3".
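
The decomposition of a URL into its components can be checked mechanically. The following minimal sketch uses urlsplit from Python's standard urllib.parse module on the example URL above; the separation of the path into directory and file name at the last "/" is done by hand and is an illustrative choice.

```python
from urllib.parse import urlsplit

url = "http://ksi.cpsc.ucalgary.ca/GNOSIS/TC7F1.html#RTFToC3"
parts = urlsplit(url)

# Split the path into directory and file name at the last "/".
directory, _, filename = parts.path.rpartition("/")

print(parts.scheme)    # http                  (protocol)
print(parts.netloc)    # ksi.cpsc.ucalgary.ca  (Internet address)
print(directory)       # /GNOSIS               (directory path)
print(filename)        # TC7F1.html            (file name)
print(parts.fragment)  # RTFToC3               (named location after "#")
```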

The gopher page in Figure 6.9 is accessed through the URL:

gopher://gopher.tc.umn.edu/

indicating that the Gopher protocol should be used to access it on the machine at "gopher.tc.umn.edu" in the default Gopher sub-directory with the default Gopher file name.

URLs are frequently embedded in documents to reference images or provide hypertext links to other documents. It is convenient, particularly if one wishes to be able to move a set of associated documents to another machine or to a CD-ROM, to be able to omit the protocol, address and path, and specify a relative URL in which the missing items are filled in with those for the document in which the link is embedded. For example, the tag <IMG SRC="TC7F11.gif"> at line 8 of Figure 7.1 specifies a relative URL for the file containing the image that expands to:

http://ksi.cpsc.ucalgary.ca/GNOSIS/TC7F11.gif
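
The expansion of a relative URL against the URL of the enclosing document can be sketched with urljoin from Python's standard urllib.parse module. The "file:///cdrom/..." base below is a hypothetical location, chosen only to illustrate that the same relative references continue to resolve when the document set is moved.

```python
from urllib.parse import urljoin

# URL of the document in which the relative references are embedded.
base = "http://ksi.cpsc.ucalgary.ca/GNOSIS/TC7F1.html"

image_url = urljoin(base, "TC7F11.gif")
section_url = urljoin(base, "TC7F8.html")
print(image_url)    # http://ksi.cpsc.ucalgary.ca/GNOSIS/TC7F11.gif
print(section_url)  # http://ksi.cpsc.ucalgary.ca/GNOSIS/TC7F8.html

# Hypothetical new location of the same document set, e.g. on a CD-ROM.
moved_base = "file:///cdrom/GNOSIS/TC7F1.html"
print(urljoin(moved_base, "TC7F11.gif"))
```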

The URL for a file accessible through FTP specifies the protocol "ftp", and the URL syntax provides for expansion to other protocols as they are defined.

7.3 World-Wide Web Browser Helper Applications

The capability for an HTML document to specify a hypertext link to an arbitrary file on the Internet raises the question of what happens when that link is followed, since the file accessed may not be in a format that the browser can display. The simple, yet very powerful, answer is that World-Wide Web uses file suffixes to define file types and associated helper applications that can open these files. Thus, when a hypertext link is followed to a file "movie.qt", the browser will fetch the file and open it in whatever QuickTime movie viewer is specified as the browser's helper for files with the suffix "qt".
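
The suffix-based dispatch can be sketched as a simple table lookup. In the following Python sketch the table contents, the helper descriptions and the ".cad" suffix are all illustrative assumptions; they are not taken from the preferences file of any particular browser.

```python
import os

# Hypothetical helper table: file suffix -> helper application.
HELPERS = {
    ".qt": "QuickTime movie viewer",
    ".ps": "PostScript viewer",
}


def helper_for(filename, helpers=HELPERS):
    """Return the helper registered for the file's suffix, or None."""
    suffix = os.path.splitext(filename)[1].lower()
    return helpers.get(suffix)


print(helper_for("movie.qt"))     # QuickTime movie viewer
print(helper_for("drawing.cad"))  # None: no helper registered yet

# Adding an entry models a user editing the browser's preferences
# to register a new helper application.
HELPERS[".cad"] = "CAD package"
print(helper_for("drawing.cad"))  # CAD package
```

The lookup is independent of the helper applications themselves, which is why new document types can be supported without modifying either the browser or the helper.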

The capability of World-Wide Web browsers to access files intended for other applications and open them in those applications means that the Web is intrinsically extensible to new types of documents and applications. One can make available files related to an application such as a computer-aided design (CAD) package, issue the user community with the CAD package, and have them modify their browsers' preferences file to specify it as a helper. Users then have a means of accessing and exchanging the CAD files in a collaborative environment without requiring the CAD package itself to be modified in any way.

Additionally, most browsers provide the capability for another application to use an inter-application protocol to request that they fetch a file specified through its URL. Through this mechanism one can develop new applications that themselves specify hypertext access through the Internet without having to develop all the functionality of a World-Wide Web browser.

7.4 HTTP: HyperText Transfer Protocol

HTML and other documents intended for use with World-Wide Web are generally accessed through the HyperText Transfer Protocol (HTTP) (Berners-Lee, 1993a), a generic, stateless, object-oriented protocol that uses the Multipurpose Internet Mail Extensions (MIME) content encoding protocol (Borenstein and Freed, 1993) to transmit arbitrary data. Figure 7.2 shows the request that the browser sends to an HTTP server in order to fetch the document shown in Figure 6.8. The first line specifies a "GET" request to transmit the file with path "/GNOSIS" and name "TC7F1.html" using HTTP version 1.0. The second line specifies that the browser is "Mozilla" version 0.9 beta for the Macintosh, the name of Mosaic Communications Corporation's Netscape browser. The third line specifies that the reference originated in the document with the URL "http://ksi.cpsc.ucalgary.ca/KSI/KSI.html". The fourth line specifies that the user is "gaines@cpsc.ucalgary.ca", an email address obtained from the browser's preference file. The remaining four lines specify the types of data the browser is able to accept.
GET /GNOSIS/TC7F1.html HTTP/1.0
User-Agent: Mozilla/0.9 beta (Macintosh)
Referer: http://ksi.cpsc.ucalgary.ca/KSI/KSI.html
From: gaines@cpsc.ucalgary.ca
Accept: *
Accept: image/gif
Accept: image/x-xbitmap
Accept: image/jpeg
Figure 7.2 HTTP request to fetch the document of Figure 6.8
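
The structure of such a request is simple enough to assemble by hand. The following Python sketch rebuilds the text of Figure 7.2 from its parts; the function name build_request is an illustrative assumption. On the wire each line is terminated by a carriage return and line feed, and a blank line marks the end of the request.

```python
def build_request(path, headers):
    """Return an HTTP/1.0 GET request as text, one header per line."""
    lines = [f"GET {path} HTTP/1.0"]
    lines += [f"{name}: {value}" for name, value in headers]
    # Lines end with CRLF; a blank line terminates the header block.
    return "\r\n".join(lines) + "\r\n\r\n"


request = build_request(
    "/GNOSIS/TC7F1.html",
    [
        ("User-Agent", "Mozilla/0.9 beta (Macintosh)"),
        ("Referer", "http://ksi.cpsc.ucalgary.ca/KSI/KSI.html"),
        ("From", "gaines@cpsc.ucalgary.ca"),
        ("Accept", "*"),
        ("Accept", "image/gif"),
        ("Accept", "image/x-xbitmap"),
        ("Accept", "image/jpeg"),
    ],
)
print(request)
```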

Figure 7.3 shows the data returned by the HTTP server in response to the request. The first line specifies the time of transfer. The second line specifies that the server used is "NCSA" version 1.1. The third line specifies that the content is encoded in MIME version 1.0. The fourth line specifies that the content type is text in HTML format. The fifth line specifies the time at which the file was last modified. The sixth line specifies that the content following the next blank line is 8479 characters long. The remaining lines carry the file content as shown in Figure 7.1.

Date: Sunday, 30-Oct-94 00:38:21 GMT
Server: NCSA/1.1
MIME-version: 1.0
Content-type: text/html
Last-modified: Wednesday, 07-Sep-94 00:42:57 GMT
Content-length: 8479

<html>
<head>
<!-- This document was created from RTF source by rtftohtml version 2.7.4 -->
<title>GNOSIS Test Case Final Report</title>
</head>
...........................
Figure 7.3 HTTP reply transmitting the document of Figure 6.8
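
The reply can be taken apart just as easily: the header lines run up to the first blank line and the content follows it. The Python sketch below parses the header lines shown in Figure 7.3; the function name parse_reply is an illustrative assumption, and the body is truncated here just as it is in the figure.

```python
REPLY = (
    "Date: Sunday, 30-Oct-94 00:38:21 GMT\r\n"
    "Server: NCSA/1.1\r\n"
    "MIME-version: 1.0\r\n"
    "Content-type: text/html\r\n"
    "Last-modified: Wednesday, 07-Sep-94 00:42:57 GMT\r\n"
    "Content-length: 8479\r\n"
    "\r\n"
    "<html>\r\n<head>\r\n"  # content truncated, as in Figure 7.3
)


def parse_reply(text):
    """Split a reply into a dictionary of headers and the content."""
    head, _, body = text.partition("\r\n\r\n")
    headers = {}
    for line in head.split("\r\n"):
        # Split at the first colon only; dates contain further colons.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return headers, body


headers, body = parse_reply(REPLY)
print(headers["content-type"])         # text/html
print(int(headers["content-length"]))  # 8479
```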

The client-server messages shown in Figures 7.2 and 7.3 are normally totally invisible to the user. However, it is helpful to know the nature of the protocol underlying the operation of the World-Wide Web. The protocol is termed stateless because a connection is established, the request and reply are transmitted, and the connection is closed with no record being kept of the transaction (except in a log file used to monitor usage of the server). Thus, a World-Wide Web user does not "log in" to the server and initiate a series of transactions, but rather each transaction is an atomic action with no continuity between them. This stateless mode of operation is appropriate to a system initially defined primarily for information retrieval. However, extensions to HTTP and HTML now allow the World-Wide Web to support more advanced client-server computing, and techniques have been developed to maintain state information across a sequence of transactions (Section 9).

7.5 Automatic Conversion of Word Processor Documents to HTML

The HTML tagged format is simple to understand and World-Wide Web hypertext documents may be prepared in a text editor in which one enters the tags directly to produce a document like that of Figure 6.8. There are also specialist HTML text editors that make it easy to enter matched HTML tags. However, many users already use WYSIWYG word processors, and many documents for the web originate as existing word processor files. Hence, utilities have been developed to convert from word processor files to HTML documents. The basis of the conversion is usually to map the styles specified in a style sheet in the word processor to those available in HTML.

The program rtftohtml (rtf to html) is a public domain conversion utility developed by Chris Hector using a public domain decoder for Microsoft's Rich Text Format (RTF) developed by Paul Dubois. The RTF encoding scheme is able to represent every detail of complex word processing documents and is exported by the majority of mainstream word processors. In particular, it exports the style sheet information used in setting up the typographic style of a document. The converter maps the style names of the document to HTML style names using a user-defined table. It also separates pictures embedded in a document and embeds an HTML reference to them in the converted document. It creates a name for each heading in the document and creates a separate HTML contents list with hypertext links to the section names.
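
The style-mapping idea can be illustrated independently of the RTF decoding. The Python sketch below is not rtftohtml and does not use its table format; the style names, the STYLE_TO_TAG table, the convert function and the default document name are all assumptions made for illustration. It maps word processor styles to HTML tags, names each heading, and builds a separate contents list linking to those names.

```python
from html import escape

# Hypothetical style table: word processor style name -> HTML tag.
STYLE_TO_TAG = {"heading 1": "h1", "heading 2": "h2", "Normal": "p"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def convert(paragraphs, document_name="TC7F1.html", style_to_tag=STYLE_TO_TAG):
    """Return (document_html, contents_html) for (style, text) pairs."""
    body, contents = [], []
    heading_count = 0
    for style, text in paragraphs:
        tag = style_to_tag.get(style, "p")  # unknown styles become <p>
        safe = escape(text)                 # protect <, > and & in the text
        if tag in HEADING_TAGS:
            heading_count += 1
            name = f"RTFToC{heading_count}"
            body.append(f'<{tag}><a name="{name}">{safe}</a></{tag}>')
            contents.append(f'<li><a href="{document_name}#{name}">{safe}</a>')
        else:
            body.append(f"<{tag}>{safe}")
    contents_html = "<ul>\n" + "\n".join(contents) + "\n</ul>"
    return "\n".join(body), contents_html


paragraphs = [
    ("heading 1", "1 OVERVIEW"),
    ("Normal", "This is the final report."),
    ("heading 2", "1.1 Long-Term Goal"),
]
document_html, contents_html = convert(paragraphs)
print(document_html)
print(contents_html)
```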

To convert a document from, say, Microsoft Word to HTML, one either restyles the document using a Word style sheet that corresponds to the HTML tags such as h1, bulleted text, and so on, or defines an rtftohtml table that maps the existing styles in the document to HTML tags. Then one exports the document in RTF format and opens the RTF document in rtftohtml. The output is a corresponding HTML document, a contents list in HTML, and a set of figures in PICT format. One uses a graphic conversion utility to convert the Macintosh PICT files to the CompuServe Graphics Interchange Format (GIF), which is the most commonly used image format for World-Wide Web. The resultant HTML and GIF files may then be transferred to a document archive managed by an HTTP server.

Provision is made in rtftohtml for hypertext links to be embedded in a document using special conventions involving double-underlining, which is not a feature of HTML and hence can be used to indicate a link. Thus, it is possible to do all the hypertext markup in the word processor and avoid having to edit the generated HTML documents in any way. This is desirable for document maintenance. It corresponds to using a high-level language for encoding the document and not interfering with the "binary" HTML format produced by the rtftohtml "compiler".

The KSI has used rtftohtml to manage large World-Wide Web archives created and managed entirely in Microsoft Word. It is a very effective approach to parallel publication in electronic/paper documents and on the web. Similar tools exist for documents in TeX, FrameMaker and other document production formats.

7.6 GIFs and Speed of Communication on World-Wide Web

One of the most common errors in first establishing a World-Wide Web site is to take advantage of the capability of HTML documents to contain large embedded pictures and to put up pages that include pictures that take a long time to transmit yet carry little content. There are commercial sites offering Internet "expertise" that commence with a home page involving a number of very pretty embedded images that take so long to transmit that most users kill the transmission rather than wait to view the page!

The CompuServe GIF format uses Lempel-Ziv-Welch (LZW) compression applied to the pixels in horizontal scan-line order, which on runs of identical pixels behaves much like run-length encoding (Luse, 1993). This scheme is very effective at compressing pictures composed of long horizontal segments of identically colored pixels, and very poor at encoding images involving pixel-by-pixel changes along such a scan line. This means that it is not the absolute size of a picture that determines the data that must be transmitted, but rather the horizontal complexity of the image. Vertically, a graded tint that changes pixel by pixel causes no problems. Thus one may include large colored images provided the horizontal color changes are few. For example, Figure 7.4 shows the Adobe Persuasion colored version of Figure 2.3 embedded in a document on the web. It is 7280 bytes long and takes about 5 seconds to transmit over a 14,400 baud modem line, which is within the acceptable limits of human patience. By way of contrast, the image at the top of Figure 7.5 covers about the same area but occupies 95375 bytes and takes some 67 seconds to transmit over a 14,400 baud modem line, which is far too long. However, the system shown is a university information service that is accessed only over a local area network with a data rate of over 100 Kbytes/sec, so that the image is fetched in a second or so and the user sees no significant delay.
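
The transmission times quoted above can be approximated with a few lines of arithmetic. The Python sketch below assumes 10 bits on the line per byte of data (8 data bits plus start and stop bits on an asynchronous serial link) and ignores modem compression and latency; it also takes "100 Kbytes/sec" to mean 100,000 bytes per second. These are assumptions made for illustration, not figures from the measurements reported here. Under them the estimates come out at roughly 5 and 66 seconds for the modem and about 1 second for the local network, close to the times quoted.

```python
def seconds_to_send(size_bytes, bits_per_second, bits_per_byte=10):
    """Estimate transmission time, ignoring compression and latency.

    bits_per_byte=10 is an assumption: 8 data bits plus one start and
    one stop bit per byte on an asynchronous serial link.
    """
    return size_bytes * bits_per_byte / bits_per_second


small = seconds_to_send(7280, 14400)   # image of Figure 7.4 over the modem
large = seconds_to_send(95375, 14400)  # image of Figure 7.5 over the modem
lan = 95375 / 100_000                  # LAN rate is given in bytes per second

print(f"{small:.1f} s, {large:.1f} s, {lan:.2f} s")
```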

These examples show the importance of understanding the way in which the GIF encoding scheme operates, and of designing World-Wide Web documents for the environment in which they are to be used. For documents to be accessed from locations having poor Internet connectivity it is best to avoid the use of images as much as possible, minimize the size of any that are necessary, and keep the size of the documents themselves small. On the other hand, if a system is intended largely for local use it is possible to use large and complex images and documents with impunity.

Figure 7.6 shows the transaction times for typical data over lines of different speeds, chosen to represent: a modem over a dial-up line; a typical Internet data rate between continents; and a typical one in North America. A V.32bis modem with data compression gives a data rate of about 3.4 Kbytes/sec for text and 1.5 Kbytes/sec for GIFs. The 50 Kbytes/sec figure also typifies local Ethernet communications. The main components of the transaction time are the times to: perform Domain Name System (DNS) lookup; transmit the request; process the request; transmit text; transmit graphics; decode graphics; and lay out text and graphics. The times shown do not take into account processing at the server, because this is application-dependent, or DNS lookup, since this is variable and usually cached both at the local site and within the client.

The document types have been chosen to typify a wide range of applications, and to illustrate some of the features of GIF encoding and caching. The times in the first column represent a typical remote user over a modem, and it can be seen that the transmission of images is the primary factor in determining the response time. There are two factors that affect image transmission apart from communication speed. The first is the encoding of GIFs already discussed. The second is that clients typically cache all material fetched, indexed by its URL, and hence a URL that is reused comes from local storage and incurs no communication delay. The reuse of the same icon within a document and across related documents is conducive to both uniformity of style and interactivity.
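
The effect of caching by URL can be sketched as a dictionary lookup placed in front of the network transfer. In the Python sketch below the class CachingClient, the stand-in function fake_fetch and the icon URL are hypothetical and serve only to count how many real transfers take place; actual browsers add cache expiry and size limits that are not modelled here.

```python
class CachingClient:
    """Fetch each URL at most once and serve repeats from a local cache."""

    def __init__(self, fetch):
        self._fetch = fetch  # function that performs the real transfer
        self._cache = {}     # URL -> previously fetched data
        self.transfers = 0   # number of real transfers performed

    def get(self, url):
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
            self.transfers += 1
        return self._cache[url]


def fake_fetch(url):
    """Stand-in for a network transfer (hypothetical, for illustration)."""
    return ("data for " + url).encode("ascii")


client = CachingClient(fake_fetch)
icon = "http://ksi.cpsc.ucalgary.ca/GNOSIS/icon.gif"  # hypothetical URL

# The same icon referenced three times costs only one transfer.
for _ in range(3):
    client.get(icon)
client.get("http://ksi.cpsc.ucalgary.ca/GNOSIS/TC7F11.gif")
print(client.transfers)  # 2
```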

Figure 7.4 A large embedded image that compresses well and is transmitted rapidly

Figure 7.5 Interface to a university information system

The second and third columns show the merits of a direct connection to the Internet, and the capabilities of WWW on a local network. They also indicate that it is important to specify whether a system is intended to be used from remote sites and through modem connections.

Figure 7.6 Transaction times for different data rates

