Metadata is the generic term used to describe data that can be used by searchers to identify features shared by different documents.
Metadata is not something unique to the World Wide Web (WWW) and other forms of electronic data distribution. Metadata has been used for cataloguing and indexing information stored in libraries for over 2000 years. The introduction of computing has, however, vastly increased the speed and range of metadata searching. Instead of slowly searching the catalogue of a single library for data of the type required, it is now possible to consider doing a parallel search of thousands of digital libraries, obtaining a list of possible data sources within seconds rather than days.
The types of metadata that can be associated with a WWW document include:
Metadata can be stored either as an integral part of the document it describes or as part of a separate file of information. Ideally it should be possible to do both, with data stored as part of a document being automatically extracted into searchable metadata repositories.
Different search tools require the presence of different types of metadata. A set of related metadata fields is referred to as the vocabulary of a particular application. A document that defines the vocabulary used by an application is referred to as a schema.
Metadata is typically defined in terms of name/value pairs, in which the name identifies the role of the specific element of metadata and the associated value indicates the searchable term to be used to reference documents exhibiting the required characteristic. For example, the Dublin Core proposal uses CREATOR as the name of the field used to store details of the document's author.
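The name/value model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the document keys doc-a and doc-b and the find_by helper are hypothetical, and the values for doc-b are invented for the example.

```python
# Each document's metadata is a list of (name, value) pairs; a name may repeat.
# The document keys and the find_by helper are illustrative assumptions.
records = {
    "doc-a": [
        ("Creator", "Bryan, Martin"),
        ("Contributor", "Li, Man-Sze"),
        ("Language", "ENG"),
    ],
    "doc-b": [  # invented sample record
        ("Creator", "Li, Man-Sze"),
        ("Language", "ENG"),
    ],
}


def find_by(name, value):
    """Return the documents that carry the given name/value pair."""
    return sorted(
        doc for doc, pairs in records.items() if (name, value) in pairs
    )


print(find_by("Creator", "Bryan, Martin"))
print(find_by("Language", "ENG"))
```

The name selects the role being searched (Creator, Language) and the value is the search term, so the same value in a different role does not produce a match.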
Commonly used metadata sets
The development of standards and specifications for metadata vocabularies for describing electronic data is in its infancy. Organizations working in this area include:
Note: For details of the MPEG-7 work see the OII Report of the MPEG-7 Seminar held in Bristol in April 1997. For details of the first CEN/ISSS MMI workshop held in Brussels in February 1998 see the OII Report of the workshop.
This section looks at the following proposed metadata vocabularies:
An invitational workshop held in March of 1995 in Dublin, Ohio, brought together librarians, digital library researchers, and text-markup specialists to address the problem of resource discovery for networked resources. This activity evolved into a series of related workshops and ancillary activities that have become known collectively as the Dublin Core Metadata Workshop Series.
As a result of the workshops, a set of elements judged to form the core for cross-disciplinary resource discovery was identified. The term "Dublin Core" applies to this core of descriptive elements, which includes the following elements:
Metadata conforming to the Dublin Core set can be associated with documents coded using the HyperText Markup Language (HTML) used for WWW documents by the addition of a set of META elements in the document header. For example, this document includes the following Dublin Core metadata fields in its header:
<META NAME="Title" CONTENT="OII Guide to Metadata">
<META NAME="Publisher" CONTENT="European Commission DGXIII/E">
<META NAME="Creator" CONTENT="Bryan, Martin">
<META NAME="Creator" TYPE="Affiliation" CONTENT="The SGML Centre">
<META NAME="Creator" TYPE="Email" CONTENT="firstname.lastname@example.org">
<META NAME="Contributor" CONTENT="Li, Man-Sze">
<META NAME="Contributor" TYPE="Affiliation" CONTENT="IC Focus">
<META NAME="Description" CONTENT="Comprehesive guide to the use of metadata">
<META NAME="Subject" CONTENT="OII, Open Information Interchange, Standards, Specifications, metadata, Dublin Core, UDC, DDC, LCC, MARC, IMS, PICS, GEDI">
<META NAME="Identifier" CONTENT="http://www.echo.lu/oii/en/metadata.html">
<META NAME="Language" CONTENT="ENG">
<META NAME="Rights" CONTENT="http://www.echo.lu/disclaimer.html">
<META NAME="Relation" TYPE="IsChildOf" CONTENT="http://www.echo.lu/oii/en/guides.html">
A number of documents proposing qualifications that could be applied to Dublin Core elements have been published. The TYPE attribute in the example above illustrates a typical use of these qualifiers to create subcategories within a main category.
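As a rough illustration of how a harvester might pull such fields, including their TYPE qualifiers, out of a document header, the following sketch uses Python's standard html.parser module. The MetaCollector class name and the shortened three-line header are assumptions made for the example; this is not a tool described by the Dublin Core workshops.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, qualifier, content) triples from META elements."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if "name" in attr:
            self.fields.append(
                (attr["name"], attr.get("type"), attr.get("content"))
            )


# Shortened copy of the header shown above.
header = """
<META NAME="Title" CONTENT="OII Guide to Metadata">
<META NAME="Creator" CONTENT="Bryan, Martin">
<META NAME="Creator" TYPE="Affiliation" CONTENT="The SGML Centre">
"""

collector = MetaCollector()
collector.feed(header)
for name, qualifier, content in collector.fields:
    print(name, qualifier, content)
```

Keeping the qualifier as a separate slot (None when absent) lets a repository index both the main category and its subcategories without losing the distinction between them.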
An alternative scheme to the Dublin Core proposal has been developed by the Instructional Management Systems Project, part of the US Educom National Learning Infrastructure Initiative. The IMS Metadata Dictionary defines the following object types according to the rules specified in ISO standard 11179:
Note that many of these object types serve a similar purpose to Dublin Core components with the same or different names, while others are specific to IMS. This commonly found relationship between schemas occurs because there are certain characteristics that are common to all electronic documents stored in repositories. While each repository has its unique characteristics (e.g. Learning Level, Pedagogy and Prerequisites for learning resources) there are many cases where it is possible to consider searching different types of repository for data, provided you can determine the relationship between the names assigned to attributes that serve a common purpose (e.g. a Dublin Core creator and an IMS author).
The IMS project team provide a useful listing of initiatives relating to metadata capture and use on their Credits and References page. (Please note that some of the links on this page are out of date.)
A report of a Metadata Summit organized on 1st July 1997 by the US Research Libraries Group highlights the need for libraries to be able to apply metadata to information sources that are not coded using HTML. It specifically looks at the relationship between searches based on the Z39.50 Information Retrieval Application Service Definition and Protocol Specification, and related Library Information Interchange Standards, and those available on the Internet.
A report on the Joint Workshop on Metadata Registries held in California during July 1997 highlights the role the W3C Resource Description Framework (RDF) is expected to play in integrating a wide range of metadata description formats.
In November 1997 the US National Institute of Standards and Technology published Version 1.0 of a Learning Object Metadata Framework Requirements Specification. This specification highlights the role RDF, and the XML standard it is described in, are expected to play in formally describing on-line educational resources. The LOMF Preliminary Metadata Specifications include fields relating to Learning Level, Learning Objectives, Learning Style, Learning Assessment, Prerequisites and a unique Learning Object ID (LOID).
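The relationship between field names in different schemas can be pictured as a small crosswalk table. The sketch below is hypothetical: only the pairing of a Dublin Core creator with an IMS author comes from the text above, while the CROSSWALK table, the normalise helper, the exact field spellings and the sample values are assumptions made for illustration.

```python
# Hypothetical crosswalk: (schema, field name) -> shared concept name.
# Only the Creator/Author pairing is taken from the text; the rest is assumed.
CROSSWALK = {
    ("DublinCore", "Creator"): "author",
    ("IMS", "Author"): "author",
}


def normalise(schema, record):
    """Rename mapped fields to their shared concept name.

    Fields with no crosswalk entry keep a schema-qualified name so that
    repository-specific metadata is preserved rather than dropped.
    """
    out = {}
    for name, value in record.items():
        key = CROSSWALK.get((schema, name), f"{schema}:{name}")
        out[key] = value
    return out


# Sample records; the Learning Level value is invented for the example.
dc = normalise("DublinCore", {"Creator": "Bryan, Martin", "Language": "ENG"})
ims = normalise("IMS", {"Author": "Bryan, Martin", "Learning Level": "Introductory"})

print(dc)
print(ims)
```

After normalisation a single query on the shared name author reaches both repositories, while schema-specific fields such as Learning Level remain available under their qualified names.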
The W3C have published the following recommendations concerning the Platform for Internet Content Selection (PICS) document rating services:
A PICS rating service is an individual, group, organization or company that provides content labels for information on the Internet. The labels it provides are based on a rating system. Each rating service must describe itself using a special MIME type, application/pics-service. Selection software that relies on ratings from a PICS rating service can first load the application/pics-service description. This description allows the software to tailor its user interface to reflect the details of a particular rating service, rather than providing a "one design fits all rating services" interface.
A PICS label can be associated with a distributable file to indicate how it has been rated by a specific rating service. A PICS label can be:
<META http-equiv="PICS-Label" content='(PICS-1.1 "http://www.fda.org/v2.5" labels on "1994.11.05T08:15-0500" exp "1995.12.31T23:59-0000" for "http://www.greatfoods.com/curries.html" by "George Sanderson, Jr." ratings (strength 0.5 additives 1))' >
It is more common, however, for labels to be exchanged independently of the data as part of the file transfer negotiations. For example, the request for a file can be extended to include a request that the rating of the file by a specific rating agency be returned as part of the header for the data supplied.
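However a label is delivered, selection software needs to read the ratings it carries. The following sketch extracts the rating values from the embedded label shown earlier. It is a deliberately narrow illustration and not a full PICS label parser: the extract_ratings helper is hypothetical and assumes a single, unnested ratings clause made up of alternating category names and numeric values.

```python
import re

# The label text from the META example above.
label = ('(PICS-1.1 "http://www.fda.org/v2.5" labels '
         'on "1994.11.05T08:15-0500" exp "1995.12.31T23:59-0000" '
         'for "http://www.greatfoods.com/curries.html" '
         'by "George Sanderson, Jr." '
         'ratings (strength 0.5 additives 1))')


def extract_ratings(text):
    """Return {category: value} from the first simple 'ratings (...)' clause."""
    match = re.search(r'\bratings\s*\(([^()]*)\)', text)
    if match is None:
        return {}
    tokens = match.group(1).split()
    # Tokens alternate: category name, numeric value.
    return {
        name: float(value)
        for name, value in zip(tokens[0::2], tokens[1::2])
    }


ratings = extract_ratings(label)
print(ratings)
```

A client could then compare these values with the thresholds chosen by the user or supervisor before deciding whether to display the resource.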
A PICSRule can identify one or more PICS rating services to be used to control access to information, one or more PICS label bureaus to query for labels, and criteria about the contents of labels that would be sufficient to make a decision to accept or reject a particular information resource.
PICSRules consist of parenthesized attribute-value pairs. Values may contain lists and nested attribute pairs. An example of a PICSRule is:
(PicsRule-1.1
 (
  name (rulename "Example 4"
        description "Example PICSRules spec")
  source (sourceURL "http://www.raleigh.com/PICSRules/Example.htm")
  ServiceInfo (name "http://www.cool.org/ratings/V1.html"
               shortname "Cool"
               bureauURL "http://labelbureau.cool.org/Ratings")
  ServiceInfo ("http://www.kid-protectors.org/ratingsv01.html"
               shortname "KP")
  Policy (RejectByURL ("http://*@www.badnews.com:*/*"
                       "http://*@www.worsenews.com:*/*"
                       "*://*@220.127.116.11!8:*/*"))
  Policy (AcceptByURL "http://*rated-g.org/movies*")
  Policy (AcceptIf "(KP.educational = 1)"
          Explanation "Always allow educational content.")
  Policy (RejectIf "(KP.violence >= 3)"
          Explanation "Blood's a %22scary%22 thing.")
 )
)
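The way the two label-based Policy clauses in this example might be applied can be sketched as follows. The sketch rests on assumptions that should be checked against the PICSRules recommendation: that clauses are tried in order, that the first clause whose condition holds decides the outcome, and that a resource is accepted when no clause applies. The POLICIES list and the decide function are hypothetical names, and the URL-based clauses are left out.

```python
# Ordered (action, predicate, explanation) triples mirroring the AcceptIf and
# RejectIf clauses of the example; the URL-based clauses are omitted.
POLICIES = [
    ("accept", lambda kp: kp.get("educational") == 1,
     "Always allow educational content."),
    ("reject", lambda kp: kp.get("violence", 0) >= 3,
     'Blood\'s a "scary" thing.'),
]


def decide(kp_ratings):
    """Return (action, explanation) from the first policy whose predicate holds."""
    for action, predicate, explanation in POLICIES:
        if predicate(kp_ratings):
            return action, explanation
    # Assumed default when no clause matches.
    return "accept", "No policy matched."


print(decide({"educational": 1, "violence": 4}))
print(decide({"violence": 3}))
print(decide({"violence": 1}))
```

Under these assumptions the order of the clauses matters: educational content is accepted even when its violence rating would otherwise cause it to be rejected.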
Libraries have for many centuries used catalogues as an aid to finding documents with certain properties. Computerized library catalogues are typically created using one of the standardized Machine Readable Cataloguing (MARC) formats.
MARC records are based on standardized classification schemes such as the Universal Decimal Classification (UDC) developed by the International Federation for Information and Documentation (FID), the Library of Congress Classification Scheme and the Dewey Decimal Classification (DDC). As the last of these is updated most frequently it tends to be most used for computer-based cataloguing. A list of the main DDC categories can be found at http://www.oclc.org/oclc/fp/about/ddc21sm3.htm.
Note: For further information on the schemes used for classifying document collections refer to the Book classification section of the Data Classification Standards section of the OII Standards and Specifications List.
The top-level categories used for the Dewey Decimal Classification are:
000 Generalities
100 Philosophy & psychology
200 Religion
300 Social sciences
400 Language
500 Natural sciences & mathematics
600 Technology (Applied sciences)
700 The arts
800 Literature & rhetoric
900 Geography & history
By contrast the top-level categories used for the Universal Decimal Classification are:
0 Generalities
1 Philosophy (inc. Psychology)
2 Religion, Theology
3 Social Sciences, Law, Government
5 Mathematics and Natural Sciences (inc. Chemistry, Biology, Computing)
6 Applied Sciences, Medicine, Technology
7 The Arts, Recreation, Entertainment, Sport
8 Languages, Linguistics, Literature
9 Geography, Biography, History
It will be noted that there is a wide degree of overlap at the uppermost levels of the DDC and UDC schemes, but as you go down the hierarchies the differences become more noticeable. For example, computing is covered as one of the Generalities in Dewey under the headings of:
004 Data processing, Computer science
005 Computer programming, programs, data
006 Special computer methods
while, under the Universal Decimal Classification, computing is a single subject area within the Mathematics subject area in the Natural Sciences top-level classification.
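Because Dewey numbers are hierarchical, the top-level class of any number can be read from its first digit. The sketch below shows this using the ten DDC classes listed above; the ddc_top_level helper is a hypothetical name, and it assumes codes are written as a three-digit number optionally followed by a decimal part.

```python
# Top-level DDC classes as listed in this guide, keyed by leading digit.
DDC_TOP = {
    "0": "Generalities",
    "1": "Philosophy & psychology",
    "2": "Religion",
    "3": "Social sciences",
    "4": "Language",
    "5": "Natural sciences & mathematics",
    "6": "Technology (Applied sciences)",
    "7": "The arts",
    "8": "Literature & rhetoric",
    "9": "Geography & history",
}


def ddc_top_level(code):
    """Return the top-level class name for a DDC number such as '004' or '005.1'."""
    whole = code.split(".")[0].zfill(3)  # tolerate a dropped leading zero
    return DDC_TOP[whole[0]]


for code in ("004", "005.1", "006"):
    print(code, "->", ddc_top_level(code))
```

All three computing headings resolve to Generalities, whereas a UDC number for the same subject would resolve to a different top-level class, which is why a code from one scheme cannot simply be reused in the other.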
The main Library of Congress classifications are:
A General Works
B Philosophy, Psychology, Religion
C Auxiliary Sciences of History
D History: General and Old World
E History: America
F History: America
G Geography, Anthropology, Recreation
H Social Sciences
J Political Science
K Law
L Education
M Music and Books on Music
N Fine Arts
P Language and Literature
Q Science
R Medicine
S Agriculture
T Technology
U Military Science
V Naval Science
Z Library Science
The Library of Congress treats Electronic computers, Computer science, and Computer software as branches of Mathematics under Q Science, but reserves a separate entry under the Electronics section of T Technology for Computer hardware. However, Information theory is seen as part of Cybernetics, which is classed as a General science rather than a branch of Mathematics, while the Information superhighway, Electronic information resources, Computer network resources and Databases are headings under the Information Resources heading under Z Library Science.
Such spreads of information resources are typical of what happens if you try to graft new subjects into already established general-purpose classification schemes, without allowing for the possibility of reclassifying existing data sets.
When the Dublin Core proposals are being used to identify the subject of an electronic publication it is possible to use one of the proposed extensions to classify the data according to more than one classification scheme, as the following example shows:
<META NAME="Subject" SCHEME="UDC" CONTENT="518.5">
<META NAME="Subject" SCHEME="DDC" CONTENT="004">
<META NAME="Subject" SCHEME="LCC" CONTENT="TK7885">
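Generating such entries from a list of classification codes is straightforward. The following sketch is only an illustration: the subject_meta helper is a hypothetical name, and it assumes the SCHEME attribute form shown in the example above.

```python
from html import escape


def subject_meta(codes):
    """Render one Subject META element per (scheme, code) pair."""
    return [
        f'<META NAME="Subject" SCHEME="{escape(scheme)}" CONTENT="{escape(code)}">'
        for scheme, code in codes
    ]


# The three classifications used in the example above.
lines = subject_meta([("UDC", "518.5"), ("DDC", "004"), ("LCC", "TK7885")])
print("\n".join(lines))
```

Escaping the attribute values guards against codes or scheme names that contain quotation marks or ampersands.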
The schemes listed in the Dublin Core extension document are:
Note: At present there is a somewhat US bias to the listing of classification schemes. While there is nothing to stop other schemes being added to the list, it is rather a pity that no classification schemes based on non-English classifications are currently listed.
By contrast to the relatively stable world of library subject indexes detailed above, the sorts of categories used for identifying information stored on the World Wide Web tend to be much more business oriented. (They also tend to change very frequently.) For example, in April 1998 the AltaVista subject search used the following top-level categories:
Computers & Internet
Business & Finance
Reference & Education
Society & Politics
Entertainment & Media
People & Chat
Shopping & Services
Travel & Vacations
Sports & Recreation
Hobbies & Interests
Health & Fitness
Home, Family & Auto
while Yahoo used:
Arts and Humanities (Architecture, Photography, Literature...)
Business and Economy (Companies, Finance, Employment...)
Computers and Internet (Internet, WWW, Software, Multimedia...)
Education (Universities, K-12, College Entrance...)
Entertainment (Cool Links, Movies, Music, Humor...)
Government (Military, Politics, Law, Taxes...)
Health (Medicine, Drugs, Diseases, Fitness...)
News and Media (Current Events, Magazines, TV, Newspapers...)
Recreation and Sports (Sports, Games, Travel, Autos, Outdoors...)
Reference (Libraries, Dictionaries, Phone Numbers...)
Regional (Countries, Regions, U.S. States...)
Science (CS, Biology, Astronomy, Engineering...)
Social Science (Anthropology, Sociology, Economics...)
Society and Culture (People, Environment, Religion...)
and the Netscape Internet Guide offered:
Business (Dream Jobs, Ad Reviews, Market Research)
Computers (Downloads, Software Prices, Web Reviews, Intranet Events)
Entertainment (Top 10 Singles, TV Listings, Box Office Figures, Game Reviews)
Finance (Market Quotes, Interest Rates, Tax Tips, Mortgage Calculator)
Local (Local News, Events, Restaurants, Hotels)
Netcenter (Business Journal, Smart Update, Software Depot Specials)
News (Today's Headlines)
Shopping (Classifieds, Prices, Buying Tips, Auto Reviews)
Sports (Player Of The Week, Golf Money Leaders, Scores)
Travel (Flight Info, Travel Bargains, FareFinder, Hotel Finder)
By contrast Lycos does not offer search categories but instead offers its users "guides". Which guides are available depends on which country you are searching in. For example, in April 1998 Lycos US offered the following guides:
Autos (Classifieds, Buy a Car, Parts)
Business (News, Industries, Small Business)
Careers (Job Search, Advice)
Computers (Hardware, Software, Cyberlife)
Education (Financial Aid, Colleges, K-12)
Electronics (Audio, TV/Video, Laptops)
Entertainment (TV/Movies, Humor, Music)
Fashion (Supermodels, Designers, Clothes)
Games (PC Games, Popular Games)
Government (Politics, Services, Issues)
Health (Fitness, Diseases, Diets)
Home/Garden (Gardening, Cooking, Fix-It)
Internet (Just For Fun, Web Design)
Kids (Games, Teens, Sports)
Money (Investments, Resources)
News (U.S., World, Weather)
People (Women, Interests, Romance)
Real Estate (Advice, Properties, Apt/Rentals)
Shopping (Books, Cards, Search)
Space/Sci-Fi (Exploration, X-Files, Planets)
Sports (Basketball, Hockey, Baseball)
Travel (Destinations, Reservations, Cities)
while subscribers to Lycos UK were being offered:
Business (Tax Issues)
Cars (Electric, Offroad)
Career (Employment, CV & Interview)
Entertainment (New Movies, Books)
Finance (Shares, News, Currency Calculator)
Kids (Games, School)
Sports (Formula 1, Football, Cricket)
Technology (CeBIT 98, Online Privacy)
Travel (Ireland, City Breaks)
Note particularly the regional differences between the application of such shared terminology as Sports and Travel, and the use of different categories for the same thing in different regions (e.g. Auto for Car).
It will be seen from this wide variety of classification schemes that developing subject-based search engines that will work across a wide range of web sites, in the way that meta-search engines do for free-text searching, is not currently feasible. Hopefully a new generation of standards, such as ISO 13250, which defines the specification of Topic Navigation Maps that can link together different classification schemes, will make it easier for users to find their way around the vagaries of the different classification schemes.
Another area where improvement is required is in the preparation of vocabularies and acronym lists for use with automated searches. In this area the fast-moving, acronym-ridden world of IT standards is particularly illustrative. Trying to categorize the OII files, which list the latest standards for IT, is something of a nightmare, as no amount of referencing of existing vocabularies or acronym lists will identify standardized terms for referencing something that has only recently been developed.
Many of the standards listed in the OII Standards and Specifications List are most frequently referenced by their acronym (e.g. RTF, RTP, HTTP, HTML, ...). However, acronyms are rarely unique. Unless you know the domain in which the acronym is being used, the chances of finding the correct interpretation of the acronym are often vastly reduced. In addition, the acronyms are often used for purposes other than listing standards. Anyone doing a free-text search for HTML is very unlikely to come across details of the specification of HTML: they are most likely to come across references to its use. So unless the role of the document is recorded, as well as its domain and the meaning associated with that acronym in that document, there is only a limited likelihood of a search returning the desired results.
Note: The recent addition of an <ACRONYM> element to Version 4.0 of the HTML specification is to be welcomed in this respect. It is to be hoped that it will be widely used, and that a domain attribute will soon be added to it!
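The value of recording a domain alongside an acronym can be shown with a small lookup table. The sketch is hypothetical: the ACRONYMS table, the domain labels and the expand helper are assumptions made for illustration, and ATM is used simply because it is a familiar acronym with more than one well-known expansion.

```python
# Hypothetical acronym list keyed by (acronym, domain).
ACRONYMS = {
    ("ATM", "networking"): "Asynchronous Transfer Mode",
    ("ATM", "banking"): "Automated Teller Machine",
    ("RTF", "document formats"): "Rich Text Format",
}


def expand(acronym, domain=None):
    """Return the known expansions, narrowed to one domain when it is given."""
    return sorted(
        text
        for (acr, dom), text in ACRONYMS.items()
        if acr == acronym and (domain is None or dom == domain)
    )


print(expand("ATM"))
print(expand("ATM", "banking"))
```

Without a domain the search returns every candidate meaning; with one it returns a single answer, which is the behaviour a domain attribute on an acronym element would make possible.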
The ISO committee responsible for the standardization of computer applications in information and documentation (TC46/SC4) issued a proposal to develop a Format for Generic Electronic Document Interchange (GEDI) in January 1998. The working draft of this proposal suggests that files should be interchanged as images, using the TIFF, JPEG and PDF data interchange formats. The proposal identifies the following categories of metadata that should be associated with interchanged images:
In addition to the general-purpose metadata information sets for describing the contents of electronic documents described above, a wide range of specialist metadata schemas are described in the Sectorial Data Interchange sections of the OII Standards and Specifications List. The sectors currently covered include:
Few of the standards described in these sections are specifically designed for use in an Internet environment, and in general these standards do not rely on the use of shared metadata facilities such as RDF and the META element of HTML. Among the many metadata sets described in these sections are:
This information set on OII standards is maintained by Martin Bryan of The SGML Centre and Man-Sze Li of IC Focus on behalf of European Commission DGXIII/E.
File last updated: April 1998