OII Standards and Specifications List



OII Guide to Metadata

Metadata is the generic term used to describe data that can be used by searchers to identify features shared by different documents.

Metadata is not something unique to the World Wide Web (WWW) and other forms of electronic data distribution. Metadata has been used for cataloguing and indexing information stored in libraries for over 2000 years. The introduction of computing has, however, vastly increased the speed and range of metadata searching. Instead of slowly searching the catalogue of a single library for data of the type required, it is now possible to consider doing a parallel search of thousands of digital libraries, obtaining a list of possible data sources within seconds rather than days.

The types of metadata that can be associated with a WWW document include:

  • details of the document's author, publisher and publication date
  • details of the ownership of any associated intellectual property rights (IPR)
  • details of any ratings assigned to the data, so that facilities for protection against harmful content can be applied
  • searchable keywords that can be used to classify the document
  • codes used to classify the document's contents with respect to a standardized classification scheme (e.g. Universal Decimal Classification)
  • details of the type of data found in the document, and the relationships between different data components.

Metadata can be stored either as an integral part of the document it describes or as part of a separate file of information. Ideally it should be possible to do both, with data stored as part of a document being automatically extracted into searchable metadata repositories.

Different search tools require the presence of different types of metadata. A set of related metadata fields is referred to as the vocabulary of a particular application. A document that defines the vocabulary used by an application is referred to as a schema.

Metadata is typically defined in terms of name/value pairs in which the name identifies the role of the specific element of metadata and the associated value indicates the searchable term to be used to reference documents exhibiting the required characteristic. For example, the Dublin Core proposal uses CREATOR as the name of the field used to store details of the document's author.
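The relationship between name/value pairs and a vocabulary can be illustrated with a minimal Python sketch. The vocabulary below is a small illustrative subset of Dublin Core names, and the function name check_record is hypothetical rather than part of any published tool.

```python
# Illustrative subset of a vocabulary: the field names an application
# is prepared to search on (assumed here for the sake of the example).
VOCABULARY = {"TITLE", "CREATOR", "SUBJECT", "PUBLISHER", "DATE"}


def check_record(pairs, vocabulary):
    """Split name/value pairs into those the vocabulary defines and the rest."""
    known, unknown = [], []
    for name, value in pairs:
        normalized = name.upper()  # compare names without regard to case
        target = known if normalized in vocabulary else unknown
        target.append((normalized, value))
    return known, unknown


record = [
    ("Creator", "Bryan, Martin"),
    ("Title", "OII Guide to Metadata"),
    ("Learning Level", "Beginner"),  # not part of this vocabulary
]
known, unknown = check_record(record, VOCABULARY)
print("searchable:", known)
print("ignored:", unknown)
```

A search tool built along these lines would index only the pairs it recognizes, which is why tools with different vocabularies need different metadata.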

Commonly used metadata sets

The development of standards and specifications for metadata vocabularies for describing electronic data is in its infancy. Organizations working in this area include:

Note: For details of the MPEG-7 work see the OII Report of the MPEG-7 Seminar held in Bristol in April 1997. For details of the first CEN/ISSS MMI workshop held in Brussels in February 1998 see the OII Report of the workshop.

This section looks at the following proposed metadata vocabularies:

In addition there is a wide range of sector-specific metadata sets which are documented as part of the OII Standards and Specifications List.

The Dublin Core Metadata for Simple Resource Discovery

An invitational workshop held in March of 1995 in Dublin, Ohio, brought together librarians, digital library researchers, and text-markup specialists to address the problem of resource discovery for networked resources. This activity evolved into a series of related workshops and ancillary activities that have become known collectively as the Dublin Core Metadata Workshop Series.

As a result of the workshops, a set of elements judged to form the core elements for cross-disciplinary resource discovery were identified. The term "Dublin Core" applies to this core of descriptive elements, which includes the following elements:

TITLE
The name given to the resource by its creator or publisher.
CREATOR
The person or organization primarily responsible for creating the intellectual content of the resource.
SUBJECT
The topic of the resource.
DESCRIPTION
A textual description (abstract) of the content of the resource.
PUBLISHER
The entity responsible for making the resource available in its present form.
CONTRIBUTOR
A person or organization not specified in a CREATOR element who has made significant intellectual contributions to the resource.
DATE
The date the resource was made available in its present form.
TYPE
The type of resource, such as home page, novel, poem, etc.
FORMAT
The data format of the resource.
IDENTIFIER
String or number used to uniquely identify the resource.
SOURCE
A string or number used to uniquely identify the work from which this resource was derived.
LANGUAGE
Language(s) of the intellectual content of the resource.
RELATION
The relationship of this resource to other resources.
COVERAGE
The spatial and/or temporal characteristics of the resource.
RIGHTS
A link to a copyright notice, to a rights-management statement, or to a service that would provide information about terms of access to the resource.

Metadata conforming to the Dublin Core set can be associated with documents coded in the HyperText Markup Language (HTML) used for WWW documents by adding a set of META elements to the document header. For example, this document includes the following Dublin Core metadata fields in its header:

<META NAME="Title" CONTENT="OII Guide to Metadata">
<META NAME="Publisher" CONTENT="European Commission DGXIII/E">
<META NAME="Creator" CONTENT="Bryan, Martin">
<META NAME="Creator" TYPE="Affiliation" CONTENT="The SGML Centre">
<META NAME="Creator" TYPE="Email"
      CONTENT="mtbryan@sgml.u-net.com">
<META NAME="Contributor" CONTENT="Li, Man-Sze">
<META NAME="Contributor" TYPE="Affiliation" CONTENT="IC Focus">
<META NAME="Description"
      CONTENT="Comprehensive guide to the use of metadata">
<META NAME="Subject" CONTENT="OII, Open Information Interchange,
      Standards, Specifications, metadata, Dublin Core,
      UDC, DDC, LCC, MARC, IMS, PICS, GEDI">
<META NAME="Identifier"
      CONTENT="http://www.echo.lu/oii/en/metadata.html">
<META NAME="Language" CONTENT="ENG">
<META NAME="Rights" CONTENT="http://www.echo.lu/disclaimer.html">
<META NAME="Relation" TYPE="IsChildOf"
      CONTENT="http://www.echo.lu/oii/en/guides.html">

A number of documents proposing qualifications that could be applied to Dublin Core elements have been published. The TYPE attribute in the example above illustrates a typical use of these qualifiers to create subcategories within a main category.
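As a rough illustration of how embedded META elements might be extracted into a separate searchable store, the following Python sketch uses the standard library's html.parser module. The class name MetaExtractor and the function extract_metadata are hypothetical, and the sample header is an abbreviated copy of the example above.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect (name, qualifier, content) triples from META elements."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if "name" in attributes and "content" in attributes:
            self.fields.append(
                (attributes["name"], attributes.get("type"), attributes["content"])
            )


def extract_metadata(html_text):
    parser = MetaExtractor()
    parser.feed(html_text)
    return parser.fields


SAMPLE = """<HEAD>
<META NAME="Title" CONTENT="OII Guide to Metadata">
<META NAME="Creator" CONTENT="Bryan, Martin">
<META NAME="Creator" TYPE="Affiliation" CONTENT="The SGML Centre">
</HEAD>"""

for name, qualifier, content in extract_metadata(SAMPLE):
    label = name if qualifier is None else f"{name} ({qualifier})"
    print(f"{label}: {content}")
```

Keeping the TYPE qualifier as a separate item lets a repository treat a qualified Creator entry as a subcategory of the main Creator field, or ignore the qualifier altogether.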

The IMS Metadata Dictionary

An alternative scheme to the Dublin Core proposal has been developed by the Instructional Management Systems Project, part of the US Educom National Learning Infrastructure Initiative. The IMS Metadata Dictionary defines the following object types according to the rules specified in ISO standard 11179:

Abstract
Author
Catalog ID
Concepts
Container Type
Credits
Expiration Date
Form
Format
GUID
Interactivity Level
Keywords
Language
Learning Level
Location
Metadata Version
Objectives
Pedagogy
Platform
Prerequisites
Presentation
Price Code
Relation
Role
SizeOf
Source
Steward
Structure
Subject
Title
Use Rights
User Support
Use Time
Version Date
Version

Note that many of these object types serve a similar purpose to Dublin Core components with the same or different names, while others are specific to IMS. This commonly found relationship between schemas occurs because there are certain characteristics that are common to all electronic documents stored in repositories. While each repository has its unique characteristics (e.g. Learning Level, Pedagogy and Prerequisites for learning resources) there are many cases where it is possible to consider searching different types of repository for data, provided you can determine the relationship between the names assigned to attributes that serve a common purpose (e.g. a Dublin Core creator and an IMS author).
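One way to picture such a relationship between names is as a simple lookup table, sketched here in Python. Only the pairing of the IMS Author with the Dublin Core CREATOR comes from the text above; the other pairings are assumptions based on the names being identical in the two lists, and the function name to_dublin_core is hypothetical.

```python
# Assumed crosswalk from IMS object types to Dublin Core element names.
# Only Author -> CREATOR is stated in the text; the rest are guesses
# based on the two schemes using the same name.
IMS_TO_DUBLIN_CORE = {
    "Author": "CREATOR",
    "Title": "TITLE",
    "Subject": "SUBJECT",
    "Language": "LANGUAGE",
    "Format": "FORMAT",
    "Relation": "RELATION",
    "Source": "SOURCE",
}


def to_dublin_core(ims_record):
    """Translate the fields that have a counterpart; keep the rest aside."""
    mapped, unmapped = {}, {}
    for name, value in ims_record.items():
        if name in IMS_TO_DUBLIN_CORE:
            mapped[IMS_TO_DUBLIN_CORE[name]] = value
        else:
            unmapped[name] = value
    return mapped, unmapped


mapped, unmapped = to_dublin_core(
    {"Author": "Bryan, Martin", "Learning Level": "Introductory"}
)
print(mapped)    # fields a Dublin Core search could use
print(unmapped)  # fields specific to the learning repository
```

Fields with no counterpart, such as Learning Level, remain available to searches that understand the IMS vocabulary but are invisible to a search expressed purely in Dublin Core terms.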

The IMS project team provides a useful listing of initiatives relating to metadata capture and use on their Credits and References page. (Please note that some of the links on this page are out of date.)

A report of a Metadata Summit organized on 1st July 1997 by the US Research Libraries Group highlights the need for libraries to be able to apply metadata to information sources that are not coded using HTML. It specifically looks at how searches based on the Z39.50 Information Retrieval Application Service Definition and Protocol Specification, and related Library Information Interchange Standards, relate to those available on the Internet.

A report on the Joint Workshop on Metadata Registries held in California during July 1997 highlights the role the W3C Resource Description Framework (RDF) is expected to play in integrating a wide range of metadata description formats.

In November 1997 the US National Institute of Standards and Technology published Version 1.0 of a Learning Object Metadata Framework Requirements Specification. This specification highlights the role RDF, and the XML standard it is described in, are expected to play in formally describing on-line educational resources. The LOMF Preliminary Metadata Specifications include fields relating to Learning Level, Learning Objectives, Learning Style, Learning Assessment, Prerequisites and a unique Learning Object ID (LOID).

Platform for Internet Content Selection (PICS)

The W3C have published the following recommendations concerning the Platform for Internet Content Selection (PICS) document rating services:

A PICS rating service is an individual, group, organization or company that provides content labels for information on the Internet. The labels it provides are based on a rating system. Each rating service must describe itself using a special MIME type, application/pics-service. Selection software that relies on ratings from a PICS rating service can first load the application/pics-service description. This description allows the software to tailor its user interface to reflect the details of a particular rating service, rather than providing a "one design fits all rating services" interface.

A PICS label can be associated with a distributable file to indicate how it has been rated by a specific rating service. A PICS label can be:

  • defined as part of the metadata in the header of an HTML file
  • sent with a document being transferred over the Internet by any protocol that uses RFC 822-conformant message headers (e.g. HTTP)
  • stored in a "label bureau" that can accept HTTP calls for information on documents that have an Internet Uniform Resource Locator (URL).

When embedded within an HTML header the label has the form:

<META http-equiv="PICS-Label"
      content='(PICS-1.1 "http://www.fda.org/v2.5" labels
                on  "1994.11.05T08:15-0500"
                exp "1995.12.31T23:59-0000"
                for "http://www.greatfoods.com/curries.html"
                by  "George Sanderson, Jr."
                ratings (strength 0.5 additives 1))'>

It is more common, however, for labels to be exchanged independently of the data as part of the file transfer negotiations. For example, the request for a file can be extended to include a request that the rating of the file by a specific rating agency be returned as part of the header for the data supplied.

A PICSRule can identify one or more PICS rating services to be used to control access to information, one or more PICS label bureaus to query for labels, and criteria about the contents of labels that would be sufficient to make a decision to accept or reject a particular information resource.

PICSRules consist of parenthesized attribute-value pairs. Values may contain lists and nested attribute pairs. An example of a PICSRule is:

(PicsRule-1.1
  (
    name        (rulename "Example 4"
                 description "Example PICSRules spec")
    source      (sourceURL
                 "http://www.raleigh.com/PICSRules/Example.htm")
    ServiceInfo (name "http://www.cool.org/ratings/V1.html"
                 shortname "Cool"
                 bureauURL "http://labelbureau.cool.org/Ratings")
    ServiceInfo ("http://www.kid-protectors.org/ratingsv01.html"
                 shortname "KP")
    Policy      (RejectByURL ("http://*@www.badnews.com:*/*"
                              "http://*@www.worsenews.com:*/*"
                              "*://*@18.0.0.0!8:*/*"))
    Policy      (AcceptByURL "http://*rated-g.org/movies*")
    Policy      (AcceptIf "(KP.educational = 1)"
                 Explanation "Always allow educational content.")
    Policy      (RejectIf "(KP.violence >= 3)"
                 Explanation "Blood's a %22scary%22 thing.")
  )
)
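The parenthesized structure lends itself to a very small recursive reader. The Python sketch below is a simplification made for illustration, not a conforming PICSRules processor: it handles only double-quoted strings, does not decode %-escapes, and raises IndexError on unbalanced input. The names parse and RULE are hypothetical.

```python
import re

# A token is a parenthesis, a double-quoted string, or a bare word.
# Quoted strings are matched whole, so parentheses inside them are kept.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def parse(text):
    """Turn a parenthesized expression into nested Python lists."""
    tokens = TOKEN.findall(text)
    position = 0

    def read():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            items = []
            while tokens[position] != ")":
                items.append(read())
            position += 1  # step over the closing parenthesis
            return items
        if token.startswith('"'):
            return token[1:-1]  # strip the surrounding quotes
        return token

    return read()


RULE = """(PicsRule-1.1
  (
    Policy (RejectByURL ("http://*@www.badnews.com:*/*"))
    Policy (AcceptIf "(KP.educational = 1)"
            Explanation "Always allow educational content.")
  )
)"""

version, body = parse(RULE)
# In this cut-down rule the body alternates attribute names and values.
pairs = list(zip(body[0::2], body[1::2]))
policies = [value for name, value in pairs if name == "Policy"]
print(version, len(policies))
```

Pairing alternate items works for the clauses at the top level of this cut-down rule; a fuller reader would also need to allow for clauses whose first value has no attribute name, as in the second ServiceInfo clause of the example above.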

MARC and Library Document Classification Schemes

Libraries have for many centuries used catalogues as an aid to finding documents with certain properties. Computerized library catalogues are typically created using one of the standardized Machine Readable Cataloguing (MARC) formats.

Note: For a summary of the development of MARC refer to the entry on MARC in the Library Information Interchange Standards section of the OII Standards and Specifications List.

MARC records are based on standardized classification schemes such as the Universal Decimal Classification (UDC) developed by the International Federation for Information and Documentation (FID), the Library of Congress Classification Scheme and the Dewey Decimal Classification (DDC). As the last of these is updated most frequently it tends to be the most used for computer-based cataloguing. A list of the main DDC categories can be found at http://www.oclc.org/oclc/fp/about/ddc21sm3.htm.

Note: For further information on the schemes used for classifying document collections refer to the Book classification section of the Data Classification Standards section of the OII Standards and Specifications List.

The top-level categories used for the Dewey Decimal Classification are:

000 Generalities
100 Philosophy & psychology
200 Religion
300 Social sciences
400 Language
500 Natural sciences & mathematics
600 Technology (Applied sciences)
700 The arts
800 Literature & rhetoric
900 Geography & history

By contrast the top-level categories used for the Universal Decimal Classification are:

0 Generalities
1 Philosophy (inc. Psychology)
2 Religion, Theology
3 Social Sciences, Law, Government
5 Mathematics and Natural Sciences (inc. Chemistry, Biology, Computing)
6 Applied Sciences, Medicine, Technology
7 The Arts, Recreation, Entertainment, Sport
8 Languages, Linguistics, Literature
9 Geography, Biography, History

It will be noted that there is a wide degree of overlap at the uppermost levels of the DDC and UDC schemes, but as you go down the hierarchies the differences become more noticeable. For example, computing is covered as one of the Generalities in Dewey under the headings of:

004 Data processing, Computer science
005 Computer programming, programs, data
006 Special computer methods

while, under the Universal Decimal Classification, computing is a single subject area within the Mathematics subject area in the Natural Sciences top-level classification.
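Because Dewey class numbers are hierarchical, the top-level category can be read off the leading digit of the three-digit class. The following Python sketch assumes class numbers are supplied as strings such as "004" or "518.5"; the table repeats the top-level DDC categories listed above, and the function name ddc_top_level is hypothetical.

```python
# Top-level Dewey Decimal Classification categories, keyed by leading digit.
DDC_TOP_LEVEL = {
    0: "Generalities",
    1: "Philosophy & psychology",
    2: "Religion",
    3: "Social sciences",
    4: "Language",
    5: "Natural sciences & mathematics",
    6: "Technology (Applied sciences)",
    7: "The arts",
    8: "Literature & rhetoric",
    9: "Geography & history",
}


def ddc_top_level(class_number):
    """Return the top-level category for a class number such as '004'."""
    # Drop any decimal part and pad to three digits, so '4' is read as '004'.
    whole = class_number.split(".")[0].zfill(3)
    return DDC_TOP_LEVEL[int(whole[0])]


print(ddc_top_level("004"))
print(ddc_top_level("518.5"))
```

The same code cannot simply be reused for the other schemes: the same digits or letters lead to different subject areas in each, which is why a classification code needs to be accompanied by the name of the scheme it belongs to.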

The main Library of Congress classifications are:

A General Works
B Philosophy, Psychology, Religion
C Auxiliary Sciences of History
D History: General and Old World
E History: America
F History: America
G Geography, Anthropology, Recreation
H Social Sciences
J Political Science
K Law
L Education
M Music and Books on Music
N Fine Arts
P Language and Literature
Q Science
R Medicine
S Agriculture
T Technology
U Military Science
V Naval Science
Z Library Science

The Library of Congress treats Electronic computers, Computer science, and Computer software as branches of Mathematics under Q Science, but reserves a separate entry under the Electronics section of T Technology for Computer hardware. However, Information theory is seen as part of Cybernetics, which is classed as a General science rather than a branch of Mathematics, while the Information superhighway, Electronic information resources, Computer network resources and Databases are headings under the Information Resources heading under Z Library Science.

Such spreads of information resources are typical of what happens if you try to graft new subjects into already established general-purpose classification schemes, without allowing for the possibility of reclassifying existing data sets.

When the Dublin Core proposals are being used to identify the subject of an electronic publication it is possible to use one of the proposed extensions to classify the data according to more than one classification scheme, as the following example shows:

<META NAME="Subject" SCHEME="UDC" CONTENT="518.5">
<META NAME="Subject" SCHEME="DDC" CONTENT="004">
<META NAME="Subject" SCHEME="LCC" CONTENT="TK7885">
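Elements of this kind are easy to generate from a list of scheme and code pairs. The Python sketch below is one possible approach; the function name subject_meta is hypothetical, and the codes are those used in the example above.

```python
from html import escape


def subject_meta(classifications):
    """Build one Subject META element per (scheme, code) pair."""
    return [
        # escape() protects against quotes or angle brackets in the values
        f'<META NAME="Subject" SCHEME="{escape(scheme)}" CONTENT="{escape(code)}">'
        for scheme, code in classifications
    ]


for line in subject_meta([("UDC", "518.5"), ("DDC", "004"), ("LCC", "TK7885")]):
    print(line)
```

Repeating the Subject element once per scheme, rather than packing several codes into one value, keeps each code unambiguously tied to the scheme that defines it.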

The schemes listed in the Dublin Core extension document are:

  • AAT (Art and Architecture Thesaurus)
  • DDC (Dewey Decimal Classification)
  • LCC (Library of Congress Classification)
  • LCNAF (Library of Congress Name Authority File): for names used as subjects
  • LCSH (Library of Congress Subject Headings)
  • MeSH (Medical Subject Headings)
  • NLM (National Library of Medicine Classification)
  • UDC (Universal Decimal Classification).

Note: At present there is a somewhat US bias to the listing of classification schemes. While there is nothing to stop other schemes being added to the list, it is rather a pity that no classification schemes based on non-English classifications are currently listed.

Internet Subject Indexes

In contrast to the relatively stable world of library subject indexes detailed above, the sorts of categories used for identifying information stored on the World Wide Web tend to be much more business oriented. (They also tend to change very frequently.) For example, in April 1998 the AltaVista subject search used the following top-level categories:

Computers & Internet
Business & Finance
Reference & Education
Society & Politics
Entertainment & Media
People & Chat
Shopping & Services
Travel & Vacations
Sports & Recreation
Hobbies & Interests
Health & Fitness
Home, Family & Auto

while Yahoo used:

Arts and Humanities (Architecture, Photography, Literature...)
Business and Economy (Companies, Finance, Employment...)
Computers and Internet (Internet, WWW, Software, Multimedia...)
Education (Universities, K-12, College Entrance...)
Entertainment (Cool Links, Movies, Music, Humor...)
Government (Military, Politics, Law, Taxes...)
Health (Medicine, Drugs, Diseases, Fitness...)
News and Media (Current Events, Magazines, TV, Newspapers...)
Recreation and Sports (Sports, Games, Travel, Autos, Outdoors...)
Reference (Libraries, Dictionaries, Phone Numbers...)
Regional (Countries, Regions, U.S. States...)
Science (CS, Biology, Astronomy, Engineering...)
Social Science (Anthropology, Sociology, Economics...)
Society and Culture (People, Environment, Religion...)

and the Netscape Internet Guide offered:

Business (Dream Jobs, Ad Reviews, Market Research)
Computers (Downloads, Software Prices, Web Reviews, Intranet Events)
Entertainment (Top 10 Singles, TV Listings, Box Office Figures, Game Reviews)
Finance (Market Quotes, Interest Rates, Tax Tips, Mortgage Calculator)
Local (Local News, Events, Restaurants, Hotels)
Netcenter (Business Journal, Smart Update, Software Depot Specials)
News (Today's Headlines)
Shopping (Classifieds, Prices, Buying Tips, Auto Reviews)
Sports (Player Of The Week, Golf Money Leaders, Scores)
Travel (Flight Info, Travel Bargains, FareFinder, Hotel Finder)

By contrast Lycos does not offer search categories but instead offers its users "guides". Which guides are available depends on which country you are searching in. For example, in April 1998 Lycos US offered the following guides:

Autos (Classifieds, Buy a Car, Parts)
Business (News, Industries, Small Business)
Careers (Job Search, Advice)
Computers (Hardware, Software, Cyberlife)
Education (Financial Aid, Colleges, K-12)
Electronics (Audio, TV/Video, Laptops)
Entertainment (TV/Movies, Humor, Music)
Fashion (Supermodels, Designers, Clothes)
Games (PC Games, Popular Games)
Government (Politics, Services, Issues)
Health (Fitness, Diseases, Diets)
Home/Garden (Gardening, Cooking, Fix-It)
Internet (Just For Fun, Web Design)
Kids (Games, Teens, Sports)
Money (Investments, Resources)
News (U.S., World, Weather)
People (Women, Interests, Romance)
Real Estate (Advice, Properties, Apt/Rentals)
Shopping (Books, Cards, Search)
Space/Sci-Fi (Exploration, X-Files, Planets)
Sports (Basketball, Hockey, Baseball)
Travel (Destinations, Reservations, Cities)

while subscribers to Lycos UK were being offered:

Business (Tax Issues)
Cars (Electric, Offroad)
Career (Employment, CV & Interview)
Entertainment (New Movies, Books)
Finance (Shares, News, Currency Calculator)
Kids (Games, School)
Sports (Formula 1, Football, Cricket)
Technology (CeBIT 98, Online Privacy)
Travel (Ireland, City Breaks)

Note particularly the regional differences between the application of such shared terminology as Sports and Travel, and the use of different categories for the same thing in different regions (e.g. Auto for Car).

It will be seen from this wide variety of classification schemes that developing subject-based search engines that will work across a wide range of web sites, in the way that meta-search engines do for free-text searching, is not currently feasible. Hopefully a new generation of standards, such as ISO 13250, which defines the specification of Topic Navigation Maps that can link together different classification schemes, will make it easier for users to find their way around the vagaries of the different classification schemes.

Another area where improvement is required is in the preparation of vocabularies and acronym lists for use with automated searches. In this area the fast-moving, acronym-ridden world of IT standards is particularly illustrative. Trying to categorize the OII files, which list the latest standards for IT, is something of a nightmare, as no amount of referencing of existing vocabularies or acronym lists will identify standardized terms for referencing something that has only recently been developed.

Many of the standards listed in the OII Standards and Specifications List are most frequently referenced by their acronym (e.g. RTF, RTP, HTTP, HTML, ...). However, acronyms are rarely unique. Unless you know the domain in which the acronym is being used, the chances of finding the correct interpretation of the acronym are often vastly reduced. In addition, acronyms are often used for purposes other than listing standards. Anyone trying to do a free-text search for HTML will be very unlikely to come across details of the specification of HTML: they are most likely to come across references to its use. So unless the role of the document is recorded, as well as its domain and the meaning associated with that acronym in that document, there is only a limited likelihood of a search returning the desired results.

Note: The recent addition of an <ACRONYM> element to Version 4.0 of the HTML specification is to be welcomed in this respect. It is to be hoped that it will be widely used, and that a domain attribute will soon be added to it!

Format for Generic Electronic Document Interchange (GEDI)

The ISO committee responsible for the standardization of computer applications in information and documentation (TC46/SC4) issued a proposal to develop a Format for Generic Electronic Document Interchange (GEDI) in January 1998. The working draft of this proposal suggests that files should be interchanged as images, using the TIFF, JPEG and PDF data interchange formats. The proposal identifies the following categories of metadata that should be associated with interchanged images:

Category: Type 1: Document Interchange Format Information
Fields: Interchange Format ID, Interchange Format Version, Cover Information Length, Document Format ID, Service String Advice

Category: Type 2: Destination and Storage Information
Fields: Consumer Name, Record Name, Supplier Name, Service Date Time, System Service ID, Delivery Service, Confirmation Address

Category: Type 3: Transaction Information
Fields: Priority, General Note, Client Name, Client ID, Client Status, Name of Person or Institution, Extended Postal Delivery Address, Street and Number, Post Office Box, City, Region, Country, Postal Code, Requester ID, Requester Name, Responder ID, Responder Name, Copyright Compliance, ILL Transaction ID, Responder Note, Receive Control

Category: Type 4: Document Description
Fields: Title, Volume/Issue, Author of Article, Title of Article, ISBN, ISSN, Page Numbers, Date Scanned, Number of Pages, Call Number, Publication Date of Component, Publication Date, Place of Publication, Publisher, Edition, Request as Quoted, Copyright Statement, Item ID

Sector-specific metadata sets

In addition to the general-purpose metadata information sets for describing the contents of electronic documents described above, a wide range of specialist metadata schemas are described in the Sectorial Data Interchange sections of the OII Standards and Specifications List. The sectors currently covered include:

Few of the standards described in these sections are specifically designed for use in an Internet environment, and in general these standards do not rely on the use of shared metadata facilities such as RDF and the META element of HTML. Among the many metadata sets described in these sections are:




This information set on OII standards is maintained by Martin Bryan of The SGML Centre and Man-Sze Li of IC Focus on behalf of European Commission DGXIII/E.

File last updated: April 1998
