OII Standards and Specifications List

OII Guide to Archiving

This guide discusses the major requirements, the applicable standards and specifications, and some of the key standards and specifications currently in use for data archiving.

This guide is intended as a companion to the Archive section of the OII Standards and Specifications List. It also references other sections in the list when dealing with more generic aspects (e.g. graphics formats).

The document Draft Guidelines: Best practices for using Machine Readable Data -- From paper to electronic information, Version 1.0, 1996, produced in the context of the DLM-Forum (Electronic Records) within the European Commission, has been utilised for researching the present guide. A final version of the Guidelines on best practices document is expected to be published by the European Commission.

It should be noted that archiving relates closely to document management. This guide therefore provides considerable coverage of the archival aspects of electronic document management. It is, however, acknowledged that archiving, in principle, covers electronic records of different formats and different media. As more and more electronic records are no longer of a documentary origin, the archiving community will need to pay particular attention to the non-document types of electronic record, including multimedia electronic records.

This guide has the following structure:

Section 1: Introduction 
Section 2: Records
Section 3: Access
Section 4: Manipulation
Section 5: Preservation
Section 6: Compatibility
Section 7: Supporting Environment

1. Introduction

It is generally acknowledged that an organization's intellectual assets, the knowledge and expertise of its people, and the information and data that they have compiled, are the most valuable resources of the organization. Typically, this information is stored in documents. In the past, documents were pieces of paper; today a document can be anything from a computer file to a video clip. With the advances in storage, scanning and optical character recognition technologies, that document (unlike paper which degrades over time) now has the potential to exist indefinitely.

The proliferation of electronic documents over recent years has created a significant document management challenge for organizations of all sizes. According to International Data Corporation, in 1985, the number of documents in the world was doubling every five years. By 1994, they were doubling every nine months. It is also estimated that 10-15 percent of an organization's revenues is spent creating, managing and distributing these documents and that 60 percent of people's time is spent working with them.

In the digital world, information can be organized very flexibly, in many different formats, and with intelligent interconnections. The computer is a tool for creating, accessing, presenting, manipulating and distributing electronic data as a document. In simple cases, the recorded information for a document is encoded in a single computer file. However, electronic documents are rapidly becoming more powerful and complex. A single document may comprise several data files, and many software packages may be involved in generating these files. Moreover, a document's files may contain very different kinds of recorded information: text, charts, voice and video clips, process steps, fonts and more, i.e. multimedia information.

Document Management Systems (DMS's) were created to help manage this explosion of computer-accessible documents. Their purpose is to help organizations better protect and manage documents while also giving people easier access to the information stored in those documents. DMS's provide an extensive range of services, such as storing, tracking, versioning, indexing and searching for documents. They also provide a reliable "audit trail" of a document's use and changes.

A key use of DMS's is for (electronic) archiving. An archive is a place where public records or historical documents are kept. However, while archives are generally associated with public or research institutions, they are also applicable to industry and indeed any other entity that has a requirement to store and retrieve data.

For both public institutions and commercial organizations, records and documents typically need to be archived, and hence managed, due to commercial need or a statutory requirement for such records to be kept and be accessible for a period of time (possibly unlimited). In the case of commercial organizations this typically includes invoices, audit information, and technical data. For the records of public institutions, statutory requirements may range from departmental correspondence through to international treaties and parliamentary activity.

The complexity presented by the requirement for electronic archives is also related to the requirement for changes in procedures for handling and recording electronic data -- as opposed to paper documents -- and the legal basis for such procedures. In many organizations, many procedures are still paper-based. Further, the legal basis for such procedures, including procedures for electronic record keeping, often varies among countries and is seldom precise. It should be noted that records are more than just data or information. They are kept to provide evidence of transactions and are themselves linked with the transactions that they document. These attributes of evidential purpose and transactional context distinguish records from other types of organizational data/information. With electronic communications, the transactions and transactional contexts are far more varied and complex than in the single medium of paper transactions.

The need for a new paradigm in addressing and handling electronic archives is increasingly recognized by the archiving community. The broader legal issues are however outside the scope of the present guide.

The lead international body of the archiving community is the International Council on Archives (ICA). The ICA committee on electronic records defines a record as "a specific piece of recorded information generated, collected or received in the initiation, conduct or completion of an activity and that comprises sufficient content, context and structure to provide proof or evidence of that activity".

In an environment where technologies are rapidly changing, and where the tools for handling electronic data are expanding at an astonishing rate, the need for common standards or specifications in all the areas that relate to records and record-keeping is all the more vital and urgent. A good example of this is the development of an archive related document type definition (the EAD-DTD) which profiles the SGML standard for this community.

Critically, the management of electronic records needs to be comprehensively and precisely formulated, as a basis for assessing the fit between requirements and the standards/specifications for serving the requirements. Secondly, guidance and best practice on the use of standards and specifications is of enormous value in understanding their applicability. However, it needs to be recognized that technology is a means to an end; standards and specifications are only useful in so far as they can be implemented into products which users can buy.

2. Records

Electronic records themselves consist of four main elements:

  1. Content
  2. Structure
  3. Context
  4. Layout

The first three of these must be preserved in record keeping.

These records are then subject to a range of other functions and processes such as access, manipulation, preservation, etc, which are discussed later in this guide.

Although the four main elements are addressed separately, this does not imply that one can be preserved independently of the others. One of the key challenges to the archiving community is that the way in which the information is presented, structured and classified when it is retrieved (in the future) may need to be different from that in the original form because of technological changes. This raises the requirement for preserving the technological means which generate the original information. Secondly, it raises the issue of to what extent our understanding of information in an electronic record (i.e. content) is independent of its structure, context and layout. Debates on these more general issues are continuing in the archiving community.

2.1 Content

There is a wide range of standards and specifications for describing the individual components of archives. The primary components are:

  1. Text
  2. Graphics
  3. Data
  4. Audio/Video

2.1.1 Text

There are many formats for the storage of textual information. The ISO SGML standard has been heavily adopted by prominent organizations such as the US Department of Defense, which contractually stipulates that all 'documentation' is issued in this format. In general, proprietary formats are not recommended for long-term storage, although de facto standards in the word processing field are easily readable by anyone and, due to their wide implementation, are likely to be supported for a considerable period of time.

The following formats are commonly adopted:

  • Plain Text: A low-level file, typically using the ASCII character set, containing the text as a sequence of characters. Very difficult to navigate around because of the lack of structure but very easy to access and manipulate
  • HTML: A simplified document type definition (DTD) for SGML that forms the basis of the World Wide Web. Best used with short, simple documents of limited structure
  • HyTime: SGML with multimedia extensions
  • PDF (Portable Document Format): A proprietary (Adobe) but widely adopted format of similar, but inflexible, functionality to SGML. Used extensively on the WWW for downloaded documents where it competes with Microsoft Winword formats.
  • Postscript: A proprietary format widely used for printing text with embedded layout
  • RTF (Rich Text Format): Proprietary format used by Microsoft Office software
  • SGML (Standard Generalized Markup Language): An international standard which can be used to save text and its structure but without the layout. The structure is defined in a DTD (Document Type Definition), one of which is HTML. Platform independence and flexibility are its main advantages
  • Winword: Proprietary format used by Microsoft's word-processing application Winword. The format changes with each version, but it is generally upwardly compatible.

Other standards related to text and documents are also included within the Document section of the OII Standards and Specifications List.

Another aspect to consider in this area is the character set, particularly for usage and possible retrieval in an international context. The ISO Universal Character Set standard (ISO/IEC 10646, closely aligned with Unicode) is gaining prominence as the standard to address this issue. The Character Set section of the OII Standards and Specifications List provides additional information in this area.
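As a minimal sketch of why the character set matters for later retrieval, the following Python fragment writes the same text in two encodings and then reads one of them back with the wrong assumption. The sample string and the choice of UTF-8 and ISO 8859-1 are assumptions made for illustration only.

```python
# Illustrative sketch: the sample text and the choice of encodings are assumptions.
text = "Überprüfung des Archivs"

utf8_bytes = text.encode("utf-8")      # Unicode text serialized as UTF-8
latin1_bytes = text.encode("latin-1")  # the same text in ISO 8859-1

# Decoding with the encoding that was used for writing restores the text.
restored = utf8_bytes.decode("utf-8")

# Decoding UTF-8 bytes as Latin-1 raises no error but yields garbled characters,
# which is why the character set should be recorded alongside the text.
garbled = utf8_bytes.decode("latin-1")

print(restored == text, garbled == text)
```

The point of the sketch is that a wrong guess fails silently, so the encoding is best treated as part of the record's descriptive information.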

2.1.2 Graphics

These can be divided into raster and vector graphics formats and within both categories there are multiple standardised and proprietary formats. These include:

  • For raster -- TIFF, GIF and JPEG.
  • For vector -- CGM (note that closely related to vector graphics are computer aided design (CAD) applications; however, there is little interoperability, or even import/export, between these applications at present).

Further information is available from the OII Raster Graphic and Vector Graphic sections of the OII Standards and Specifications Lists.

2.1.3 Data

Spreadsheets and databases: At the moment there is no high level standardized format for data files used by spreadsheets or database programs. Thus in order to be sure of being able to read data after a long period of time, users must have a tool that can read the old format or must keep the old software itself. Most formats in this area are proprietary. Generally adopted formats include: Microsoft EXCEL (Spreadsheets), ODBC compliant or SQL interfacing databases, and Comma-Separated-Value (CSV) text files.
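To illustrate why Comma-Separated-Value text files are attractive as a fallback for tabular data, the sketch below writes a small table as CSV and reads it back with nothing more than a generic parser. The invoice fields and values are hypothetical.

```python
import csv
import io

# Hypothetical invoice records; the field names are assumptions for illustration.
rows = [
    {"invoice": "1001", "date": "1997-11-03", "amount": "250.00"},
    {"invoice": "1002", "date": "1997-12-15", "amount": "99.50"},
]

# Write the table as CSV text (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["invoice", "date", "amount"])
writer.writeheader()
writer.writerows(rows)
stored = buffer.getvalue()

# The stored form is plain text, so any later tool can parse it again.
restored = list(csv.DictReader(io.StringIO(stored)))
```

Note that CSV keeps only the values as text: data types, formulas and relationships between tables are lost and would need to be documented separately.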

Programs: There is a similar problem with keeping programs, as the user then has to keep the source program upgraded to run in different environments or maintain a hardware system on which the program can be run. There is little or no standardisation in this field. Platform independent programming languages like Ada and Java are intended to meet the objective of hardware and system independence.

2.1.4 Audio and Video

The practical usage of audio and video information in document management is still limited. The standards/specifications used tend to be varied and purely dependent on the source format. Commonly adopted formats include: M-JPEG, MPEG-1/2, MIDI, AVI (Microsoft) and WAVE (Microsoft/IBM).

Further information is available from the Audio and Video sections of the OII Standards and Specifications List.

2.2 Structure

The methods for preserving structure information fall into the following categories:

  • Embedded layout: Preserving the layout (positions, relationships) of information in the same document as the content. The majority of information is created this way as it is used in word-processing applications
  • Separated layout: Many features of the layout of records, particularly text, are dependent on the platform being used. For instance, it is no use specifying a blinking character for a paper printout. Preserving the layout for long term storage poses problems and the alternative solutions separate the structure of the text (e.g. this is a 1st level heading) from the text style (e.g. 1st level headings are in Arial, Bold). It is left to the program used to view the record to choose a style for the different parts.

However, this is not an area with clear boundaries. For example, HTML pages on the WWW, which were originally style independent, now increasingly carry style information, especially when dealing with tables, which are in turn often used to provide layout control. As such, the standards referenced in the content section of this document should be consulted.
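The separation of structure from style described above can be sketched in a few lines of Python. The document stores only the logical role of each element; the viewing program chooses a style for each role. The role names and style strings below are hypothetical.

```python
# Structure: each element carries only its logical role, not its appearance.
document = [
    ("heading1", "Records"),
    ("paragraph", "Electronic records consist of four main elements."),
]

# Style sheets: hypothetical mappings chosen by the viewing program.
screen_style = {"heading1": "Arial, Bold, 18pt", "paragraph": "Times, 11pt"}
print_style = {"heading1": "Helvetica, Bold, 14pt", "paragraph": "Times, 10pt"}


def render(doc, style):
    """Pair each piece of text with the style chosen for its role."""
    return [(style[role], text) for role, text in doc]


on_screen = render(document, screen_style)
on_paper = render(document, print_style)
```

The stored document is identical in both cases; only the style mapping changes, which is what makes the separated approach attractive for long term storage.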

In the future, documents and database files will increasingly become composite documents or even object orientated documents. In other words, documents will consist of several separately linked elements, which may be of completely different types, i.e. multimedia documents. Further, with the advent of the WWW these elements may be distributed globally and embedded within different sources. Common multimedia standards and specifications are addressed in the Multimedia/Hypermedia section of the OII Standards and Specifications List.

The Open Document Architecture (ODA) was an attempt to define a common architecture for logical and layout structures, as well as provide composite features. Due to its inherent complexity and limitations, ODA has not been widely adopted and, in the meantime, has been largely superseded by de facto standards/specifications - for example, Microsoft's OLE and IBM/Apple's OpenDoc. However, the latter are not yet perceived to be stable enough for usage in long term storage applications.

2.3 Context

Classification to describe the context of records is one of the most important tasks associated with document management, and perhaps the most complex and hence the most difficult to implement. It is the basis on which archives can be retrieved most efficiently and accurately, as opposed to content searches, which often return either too little information or an overload of it. Generic classification issues are more extensively discussed in the OII Guide to Metadata.

The drawing up of a classification scheme, and the mechanisms adopted, will generally depend on the level of detail required. Criteria typically used for document classification include:

  • Type of document
  • Dates
  • Source
  • Author
  • Version
  • Subject
  • Keywords
  • Abstract
  • Status.
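As a rough illustration, the criteria listed above can be held in a simple record structure and used for retrieval. The class name, field names, sample values and the `find` helper are all hypothetical and do not correspond to any particular standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClassificationRecord:
    """Hypothetical record holding the classification criteria listed above."""
    doc_type: str
    date: str          # ISO 8601 date string, an assumption of this sketch
    source: str
    author: str
    version: str
    subject: str
    keywords: List[str] = field(default_factory=list)
    abstract: str = ""
    status: str = "draft"


def find(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [r for r in records
            if all(getattr(r, name) == value for name, value in criteria.items())]


catalogue = [
    ClassificationRecord("invoice", "1997-11-03", "Sales", "A. Smith", "1.0", "Order 1001"),
    ClassificationRecord("report", "1998-01-10", "Research", "B. Jones", "2.1", "Archiving",
                         keywords=["SGML", "EAD"], status="final"),
]

final_reports = find(catalogue, doc_type="report", status="final")
```

Retrieval here works on the classification alone, without reading the content of any document, which is the advantage over content searches noted earlier.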

For record keeping, the Encoded Archival Description (EAD) SGML Document Type Definition is a fairly recent US initiative, which has been implemented by significant sections of the international archiving community, as a standard for encoding archival finding aids. The standard accommodates registers and inventories of any length describing the full range of archival holdings, including textual and electronic documents, visual materials and sound recordings.

Within these basic and example classifications, there are several standards which can be used for providing 'option' lists. Examples are the ISO standards for language and country codes and, for parties (sources), the International Standard Archival Authority Record for Corporate Bodies, Persons and Families (ISAAR).

The term Electronic Imaging is used by the document management community to describe a multitude of aspects but primarily the representation and processing of information in electronic formats -- i.e. much more than just the processing of image (picture) files themselves. There is a significant number of standards and specifications which cover identification, indexing, attributes, relationships, etc, in this field. Further information is available from the Electronic Imaging entry in the Archive section of the OII Standards and Specifications List.

2.4 Layout

The features of document layout, often called 'style', are similar to those of structure -- i.e. style can be embedded or separate -- and thus the same comments on the encompassing text standards within this guide's content section apply. In this area there is only a limited number of de jure standards (e.g. DSSSL) but a large number of de facto ones (e.g. XSL, CSS2). Many also relate to fonts and templates, although the specifications are generally not available. Nevertheless, fonts can typically be downloaded or come as a matter of course with applications and operating systems.

3. Access

Record keeping serves no purpose unless the records can be consulted when required. This implies aspects such as:

  1. Dissemination -- to distribute the information
  2. Query -- to research specific information according to extraction criteria
  3. Access control -- to control the access and maintenance of information.

3.1 Dissemination

Within the open information society it is becoming vital for machine readable data not only to be preserved but also to be made as easily accessible as possible. The recent and continuing exponential growth of the Internet and the World Wide Web (WWW) has had a significant impact in this area as a global and cheap distribution mechanism. Within the context of the Internet, distribution can be within organizations (Intranets), between organizations (extranets) and to the wider world (Internet). In particular, Intranet implementations and the associated field of knowledge management are closely coupled with document management activities in their widest context. However, controlled dissemination, and the distribution of large quantities of information (database, program files, videos…), still generally rely on more traditional mechanisms, e.g. CD-ROM and diskette distribution.

Common network based dissemination standards and specifications include:

  • HTTP (Hypertext Transfer Protocol) for Web servers
  • FTP (File Transfer Protocol) for File servers.

3.2 Query

In order to efficiently access managed information, the data must be queried and the relevant components retrieved. The most widely adopted mechanisms are:

  • SQL 2 (Structured Query Language) for relational databases
  • ISAM (Indexed Sequential Access Method) for indexed sequential files.

SQL, in particular, is widely adopted due to the prolific use of relational databases in corporate, and more recently, desktop applications. SQL is independent of the data, provided that the meta description is broadly compliant to open database architectures such as Open Database Connectivity (ODBC).
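A minimal sketch of an SQL query over archived record descriptions is given below, using Python's built-in `sqlite3` module with an in-memory database. The table layout, column names and sample rows are assumptions made for illustration.

```python
import sqlite3

# An in-memory database stands in for the archive's catalogue.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, doc_type TEXT, author TEXT, created TEXT)"
)

# Hypothetical catalogue entries.
conn.executemany(
    "INSERT INTO records (doc_type, author, created) VALUES (?, ?, ?)",
    [
        ("invoice", "A. Smith", "1997-09-12"),
        ("treaty", "B. Jones", "1996-01-20"),
        ("invoice", "C. Brown", "1997-03-01"),
    ],
)

# Retrieve all invoices in date order; the parameter placeholder keeps the
# query text independent of the value being searched for.
rows = conn.execute(
    "SELECT id, created FROM records WHERE doc_type = ? ORDER BY created",
    ("invoice",),
).fetchall()
conn.close()
```

The same query text would work against any relational database holding a table of this shape, which is the data independence referred to above.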

SQL is not well suited to querying hierarchically structured data, of the type typically found in documents. Where these are stored in object-oriented databases the Object Query Language (OQL) can be used in place of SQL.

Alternatively a language such as the Structured Document Query Language (SDQL) component of HyTime and DSSSL can be used to locate information within electronic documents. Another widely used method of identifying subsets of documents is the X-pointers query specification defined for the Text Encoding Initiative, which has also been adopted for W3C's Extensible Markup Language (XML).

Within the specific arena of document management and information retrieval, the ANSI standard Common Command Language for on-line information retrieval is also being used to achieve this task. The standard specifies the vocabulary, syntax and meanings of command terms to be used with on-line interactive information retrieval systems.

3.3 Access Control

It is often essential that documents are not subject to (un-authorized) tampering or deletion. This is no different for electronic records. Thus access rights need to be established ranging from access to:

  • View
  • Print
  • Update
  • Delete
  • Allow anonymous use.

The security mechanisms adopted can range from preventing hardware access to software access codes, document access codes and base level encryption. There is no standardized mechanism for access control and it is often dependent on the precise level of security needed. However, for each security feature, there are often established standards and specifications. Further information is available from the Information Security section of the OII Standards and Specifications List.
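Since there is no standardized mechanism, the following is only a sketch of how the access rights listed above might be represented and checked in software. The role names and the rights granted to each role are hypothetical.

```python
from enum import Flag, auto


class Right(Flag):
    """Access rights corresponding to the list above."""
    VIEW = auto()
    PRINT = auto()
    UPDATE = auto()
    DELETE = auto()


# Hypothetical roles; "anonymous" models the "allow anonymous use" case.
ROLES = {
    "anonymous": Right.VIEW,
    "reader": Right.VIEW | Right.PRINT,
    "archivist": Right.VIEW | Right.PRINT | Right.UPDATE,
}


def is_allowed(role: str, right: Right) -> bool:
    """Return True if the role has been granted the requested right."""
    granted = ROLES.get(role, Right(0))  # unknown roles get no rights
    return bool(granted & right)
```

In this sketch no role is granted DELETE, reflecting the requirement that archived documents should not be open to deletion in normal use.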

The use of digital signatures can potentially provide the required level of protection against tampering with archived records. The usage of digital signatures for general archiving purposes is currently under investigation within the archiving community.
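The principle of tamper detection can be sketched with a keyed hash from Python's standard library. This is a simpler stand-in, not a true digital signature: a real signature scheme uses a private/public key pair so that anyone can verify the record without being able to re-sign it. The key and the function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held by the archive; a real digital signature would use
# a private key for sealing and a public key for verification instead.
SECRET_KEY = b"archive-demo-key"


def seal(data: bytes) -> str:
    """Compute a keyed SHA-256 tag to be stored alongside the record."""
    return hmac.new(SECRET_KEY, data, hashlib.sha256).hexdigest()


def is_intact(data: bytes, tag: str) -> bool:
    """Check that the record still matches the tag computed at archiving time."""
    return hmac.compare_digest(seal(data), tag)


record = b"Minutes of the meeting held on 12 January 1998."
tag = seal(record)
```

Any change to the record, however small, produces a different tag, so `is_intact` returns False for a modified copy.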

4. Manipulation

Records need to be manipulated in various ways either through pre-processing during their production, or post processing operations once they have been established and need to be used/disseminated. This includes:

  1. Conversion
  2. Compression.

4.1 Conversion

Conversion includes:

  1. Paper to electronic format
  2. Between electronic formats.

4.1.1 Paper to electronic format

The main solutions for converting a document from paper to digital format are:

  • Scanning the document to obtain an image of it
  • Scanning and then digitising it through optical character recognition (OCR) for text and graphics, through vectorisation for line-based images or CAD data, and through fractal transformation for image files. The advantages of digitization are the significant decrease in file size (especially for graphics) and the ability to post process or categorize information (especially for text files).

There is no widely adopted standard or specification for the format of such documents; instead scanning tools tend to allow the user to convert to multiple proprietary application formats used by common word processing packages.

4.1.2 Between electronic formats

There are at least two instances where it may be useful to convert a record from one digital format to another:

  • To make a more durable format to make preservation and consultation easier
  • Adding structure to a document which may then be used in the document and/or associated database.

When an electronic record is converted from one format to another, care has to be taken to avoid accidental loss of data. The features catered for by proprietary and standard formats often do not correspond, which can lead, for example, to footnotes being lost or disconnected from the text.

DSSSL is a candidate solution for conversion between electronic formats. DSSSL contains a transformation language which can be used to:

  • Add structure to unstructured documents
  • Convert documents from one structure to another as technologies change
  • Convert from content storage structure to content presentation structure.

4.2 Compression

Another manipulation process is compression, which reduces the size of files through more compact encoding. There are two main categories:

  • Lossless compression: After being compressed and then decompressed, the original and copy are the same – this gives lower rates of compression. Typically text documents and program files need to use this type of compression since no information must be lost.
  • Lossy compression: In this case, the less useful information in a file is not saved and thus the compression achieved depends on how much deterioration (loss) is acceptable. Compression of images and sound typically use this method since there is often redundant embedded information. For example, there is little point in storing non-audible audio frequencies.
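The defining property of lossless compression, that the decompressed copy is identical to the original, can be checked directly. The sketch below uses Python's standard `zlib` module; the sample data is an assumption and is deliberately repetitive so that it compresses well.

```python
import zlib

# Hypothetical, highly repetitive sample data.
original = b"archival record " * 200

compressed = zlib.compress(original, 9)   # 9 = highest compression level
restored = zlib.decompress(compressed)

# Ratio of compressed size to original size; the value depends on the data.
ratio = len(compressed) / len(original)
```

For text and program files a check of this kind (restored equals original) is what distinguishes an acceptable compression method from a lossy one.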

For generic compression of all documents the most popular formats are also largely consumer orientated and can also be classified as:

  • Software: Typically associated with the web and the transfer of program files. These include the widely used file formats ZIP, TAR, LZH, and ARC (TAR only bundles files and is normally paired with a separate compression tool). Fractal transformation is also used for the compression of art and other image based information.
  • Hardware: Based upon de facto 'backup' formats.

Further information is available from the OII Guide to Image Compression.

5. Preservation

The areas which must be considered in the (long-term) preservation of information include:

  1. Formats -- the structure of data on media
  2. Media -- the physical media.

5.1 Formats

Strictly speaking, the format (structure) of data written on a recording medium is independent of the medium itself -- for example, compact discs become Audio Compact Discs for music and CD-ROMs for data. However, practically, the two often become interlinked -- especially within the context in which they are used. Indeed, often the format used is hidden at the very lowest level, as is the case with most computer media. Thus, when a document manager is choosing the format, they are more likely to simply select a suitable medium. However, for suppliers, the format becomes more important, since this often differentiates the product. Typical formats are presented in the Archiving Interchange Formats entry of the Archive section of the OII Standards and Specifications List.

5.2 Media

Obviously the physical medium on which electronic records are stored should have as long a life span as possible. But so must the technology, since there is no point in physically preserving records if the hardware and software are no longer capable of processing the data that they contain.

Types of media include:

  • Magnetic: This is a well established, cost effective method but is liable to damage and generally requires rewriting about every 2 years. For longer-term storage, this is gradually being replaced by optical methods. Magnetic based mechanisms include: Diskette, Cartridge, Hard Drives, DAT, and Tape
  • Optical: This is the most recent type of storage mechanism. It is widely perceived to be the future for longer term archives and is already replacing other media types for long term storage. A rewrite period of 10-20 years, a high storage capacity per volume and contact-free writing are some of the features that make optical mechanisms so attractive. However, at present, magnetic mechanisms do offer faster access and reduced costs. Mechanisms include: CD-ROM, Digital Video Disk (DVD), and Optical disks. DVD in particular offers very large capacity for long term storage facilities.
  • Paper/Microfilm/Microfiche.
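The rewrite periods quoted above translate into a simple refresh schedule. The sketch below is illustrative only: the function name is hypothetical, and the optical figure uses the lower bound of the 10-20 year range as a cautious assumption.

```python
from datetime import date

# Rewrite periods in years, taken from the figures quoted above; the optical
# value uses the lower bound of the 10-20 year range as a cautious assumption.
REWRITE_YEARS = {"magnetic": 2, "optical": 10}


def next_rewrite(written: date, medium: str) -> date:
    """Return the date by which a volume should be rewritten."""
    target_year = written.year + REWRITE_YEARS[medium]
    try:
        return written.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to the 28th.
        return written.replace(year=target_year, day=28)


tape_due = next_rewrite(date(1998, 1, 15), "magnetic")
disc_due = next_rewrite(date(1998, 1, 15), "optical")
```

A schedule of this kind addresses only the physical medium; whether the hardware and software needed to read the volume will still exist at the due date has to be assessed separately.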

6. Compatibility

Since non-electronic documents have been processed for many years, compatibility with existing systems is paramount. More precisely, the integration of existing archival systems with the power of electronic retrieval systems is high on the document management priority list. Some of these existing systems (e.g. micrographics) lend themselves to this and there is a series of hybrid standards covering anything from the output from computers to form micrographic material through to the retrieval of that information.

7. Supporting Environment

The electronic management and archival of documents is supported by several additional processes:

  1. Procedures and rules
  2. Compliance

7.1 Procedures and rules

As has already been suggested, the archiving community should use existing standards and specifications, optimizing them if necessary for their own environment. Accordingly, there is increasing focus on providing more business orientated procedures and rules which describe best practices for utilizing the technologies and associated standards/specifications, as well as defining how, for example, information should be stored or how it should be integrated at a high level. Many of these guides are specific to certain sectors, companies, etc, and are not generally made available on a more open basis. Open guides include:

  • ISAD - International Standard Archival Description

7.2 Compliance

Various compliance checks can be carried out on electronic records:

  • Compliance with content standards
  • Compliance with formatting standards
  • Readability.

There are limited open standards/specifications in this field since they tend to be specific to the 'house' environment of the managing organization.
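Although the checks themselves tend to be house specific, two very basic ones can be sketched with Python's standard library: whether a file can be decoded as text in the expected character set (readability), and whether a marked-up record is well formed (a minimal formatting check). The function names are hypothetical, and XML well-formedness is used here only as a convenient stand-in for validation against a house DTD.

```python
import xml.etree.ElementTree as ET


def is_readable_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Readability check: can the bytes be decoded in the expected encoding?"""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def is_well_formed(markup: str) -> bool:
    """Formatting check: does the markup parse as well-formed XML?"""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


good = "<record><title>Annual report</title></record>"
bad = "<record><title>Annual report</record>"   # <title> is never closed
```

Checks against content standards, such as required classification fields, would be layered on top of these and depend on the rules of the managing organization.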

This information set on OII standards is maintained by Martin Bryan of The SGML Centre and Man-Sze Li of IC Focus on behalf of European Commission DGXIII/E.

File last updated: January 1998


