OII Guide to Internet Searching

Whilst the OII Standards and Specifications List has separate lists for standards relating to classification, searching and directories, for the purposes of this guide the related topics of classification and searching will be discussed as a single subject, with directories being described in terms of how electronic directories can be used to identify service providers.

Indexing, classification and searching

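There are traditionally four main types of search engines:

  • full-text search engines
  • catalogue-based search engines
  • meta-search engines
  • specialist search engines and catalogues.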

Increasingly, however, the margins of these groups are being blurred. By early 1998 most search tools provided at least two, and in some cases all four, of these methods of searching from their home pages.

During the next century a new type of search tool, based on user navigation of trees of information domains, will become available. An overview of such topic navigation maps is provided as part of this OII Guide.

Searching for people, products and services requires a different type of search technique, based on the use of directories and catalogues. While there are currently few standards in this area, this report also suggests techniques for identifying service providers.

Full-text searching

As more and more text becomes available on-line, as part of the World Wide Web (WWW) of electronic information, finding information about a specific topic becomes harder and harder. The ambiguities found in most languages mean that few terms have a single meaning. Whilst some words have the same meaning in a number of languages, many meanings have different expressions in different languages. Full-text searching typically fails to distinguish between the different meanings of a word. It often fails to distinguish between the use of the same word in different languages. Because of this, full-text searching often provides too many 'hits' for users to have time to find the information they need from the morass of irrelevant information. For example, a full-text search for the word "trees" using the AltaVista search engine identified data on the following subjects in its first 20 entries out of more than 400,000:

  • The Festival of Trees - a celebration at an arboretum
  • Welcome to Trees Lounge - a bar
  • A Cottage under Evergreen Trees - a 19th century Chinese painting
  • Dying Trees
  • Tree Trust - Time for Trees School Program
  • Trees Resort - holiday resort in US
  • Trees and Ferns - example of Java applet for drawing trees and ferns using a mathematical grammar
  • Two Trees - Kwakiutl Indian sign
  • Eucalypt Trees
  • Edible Landscaping - a supplier of fruiting trees!
  • Olympic Design List '92 Faction Trees
  • Growing Better Design Trees from Data
  • Homework 5: Trees
  • Fake Plastic Trees
  • City Plants 250 Trees Yearly - Manhattan news cutting
  • Why Plant Trees?
  • Beyond The Tree Limited - Company devoted to sales and marketing in the IT arena
  • The Integral Trees - site dedicated to discussing Larry Niven's science fiction novel on the WWW
  • Across the River and Into the Trees - paper on Hemingway's work of that title.
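
The behaviour described above can be illustrated with a minimal sketch of a full-text index. This is not how AltaVista was implemented; the document ids and the build_index and search names are assumptions made for illustration, and the titles are taken from the list above. It shows only that a word-level index returns every document containing the word, whatever sense the word has there.

```python
import re
from collections import defaultdict

# Hypothetical mini-corpus; titles are taken from the result list above.
DOCUMENTS = {
    1: "The Festival of Trees - a celebration at an arboretum",
    2: "Welcome to Trees Lounge - a bar",
    3: "Fake Plastic Trees",
    4: "Homework 5: Trees",
    5: "Why Plant Trees?",
}


def build_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, word):
    """Return the sorted ids of documents containing the word, in any sense."""
    return sorted(index.get(word.lower(), set()))


index = build_index(DOCUMENTS)
hits = search(index, "trees")
print(hits)
```

Every document in the sample matches "trees", although only some of them concern plants. The index has no way of telling a bar from an arboretum.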

A list of multilingual full-text search engines is provided in the OII Searching Techniques section of the OII Standards and Specifications List. The availability of search request pages and help information in the language of the user is an important factor in encouraging the use of full-text search tools in areas of cultural diversity such as Europe.

It is important that Internet search matches be restricted as far as possible to documents in relevant languages whose source is in appropriate countries. For example, just because a particular European word or phrase occurs in a Japanese document does not mean that that document is relevant to a European searcher, who probably does not have a browser that can display the associated Japanese text, even if he could understand it. Similarly there is a difference in the relevance weighting that needs to be applied to a word such as cricket in the UK, where the word has two meanings, or in America, where there is generally only one.

The development of regionalized search services during 1997 has gone some way to making it easier to provide a higher level of relevant 'hits' to users. It is still, however, difficult for regional sites to identify all sources of information within an area. The main difficulty providers of regionalized services have is that international companies and organizations still prefer to use general-purpose domain identifiers, ending in .com or .org, rather than country-specific equivalents such as .co.uk and .fr based on ISO 3166 country codes.
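
The difficulty can be seen in a small sketch that tries to assign a region to a URL from its top-level domain. The COUNTRY_TLDS table is a tiny illustrative subset, and the host names are placeholder example domains rather than real services.

```python
from urllib.parse import urlparse

# Illustrative subset of country-code top-level domains (an assumption, not a full table).
COUNTRY_TLDS = {"uk": "United Kingdom", "fr": "France", "de": "Germany"}


def region_of(url):
    """Return the country implied by the host's top-level domain, or None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return COUNTRY_TLDS.get(tld)


print(region_of("http://www.example.co.uk/trees.html"))
print(region_of("http://www.example.com/trees.html"))
```

A site registered under .com or .org yields no region at all, so a regional service relying on this test alone would miss it.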

Today techniques are beginning to be deployed for the automatic identification of the language in which a document has been written, through use of ISO 639 language codes in document headers supplemented by automated scanning of text for linguistic characteristics. There are still, however, problems associated with the identification of language segments within multilingual documents, and with the recognition of the significance of embedded text in a different language. These problems can be greatly reduced by the inclusion of metadata within the header of files that identifies, among other things, the language(s) the document is principally written in and the character set used to encode the digital text.
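
A rough sketch of this two-stage approach follows. It assumes the language is declared either in a lang attribute on the html element or in a Content-Language meta element, and it falls back to counting a few common words. The STOPWORDS lists are deliberately tiny and purely illustrative; a real language identifier would use far richer statistics.

```python
from html.parser import HTMLParser

# Tiny illustrative word lists keyed by ISO 639 two-letter code (assumed, far from complete).
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in"},
    "fr": {"le", "la", "et", "les", "des"},
    "de": {"der", "die", "und", "das", "ist"},
}


class LangSniffer(HTMLParser):
    """Collects any declared language and the words of the page text."""

    def __init__(self):
        super().__init__()
        self.declared = None
        self.words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.declared = attrs["lang"].lower()
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "content-language":
            self.declared = (attrs.get("content") or "").lower() or self.declared

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def identify_language(html_text):
    """Prefer the declared language; otherwise guess from common words."""
    sniffer = LangSniffer()
    sniffer.feed(html_text)
    if sniffer.declared:
        return sniffer.declared
    scores = {
        code: sum(word in common for word in sniffer.words)
        for code, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(identify_language("<html><body><p>Les arbres et la foret des Vosges</p></body></html>"))
```

The sketch returns one language per document, so it does nothing for the multilingual documents mentioned above. Handling those would require scoring each segment separately.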

One area where Internet full-text searching has typically lagged behind more traditional services has been in the ability to take the results provided by a search for a generalized term, such as tree, and subset them to find entries that are relevant to the required domain. Whilst this type of searching of result sets has been widely available on traditional database systems for many years, its application to data that has been identified using a full-text search only started in the latter half of 1997, when the US site of AltaVista started to offer a 'refine' option with its query results. (This service was not available as part of the European search engines based on AltaVista at the beginning of 1998.) By categorizing the returned results into some dozen 'topics', AltaVista allows users to exclude entries covering a specific topic, or to restrict entries to those that have been identified as falling into a specified topic area.
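
The general idea of refining a result set by topic can be sketched as follows. This is not AltaVista's mechanism. The Hit class, the topic labels and the refine function are hypothetical, and the sketch assumes each result has already been assigned a single topic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    title: str
    topic: str  # hypothetical topic label assigned by the search engine


def refine(hits, require=None, exclude=()):
    """Keep hits in the required topic (if any) that are not in an excluded topic."""
    excluded = set(exclude)
    return [
        hit
        for hit in hits
        if hit.topic not in excluded and (require is None or hit.topic == require)
    ]


RESULTS = [
    Hit("Eucalypt Trees", "forestry"),
    Hit("Fake Plastic Trees", "music"),
    Hit("Homework 5: Trees", "computing"),
    Hit("Why Plant Trees?", "forestry"),
]

forestry_only = refine(RESULTS, require="forestry")
without_music = refine(RESULTS, exclude=["music"])
print([hit.title for hit in forestry_only])
```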

Early in 1998 the US site for AltaVista also started offering an on-line translation service based on the Systran translation software. You can use this service to enter a phrase in your language and retrieve details of the equivalent term in another language. Unfortunately this service is not connected directly to the search request form, so you cannot simply click on the translation to search for information on the subject in documents in an alternative language. In the next few years it is to be expected that such facilities will become a common feature of full-text search engines.

Catalogue-based search engines

The traditional way of finding information stored in large information sets is via structured catalogues of the type used in libraries for many centuries. Library catalogues have typically relied on the identification of suitable categories by trained librarians, but for specialist works this often leads to anomalies, with different librarians assigning the same document to different categories. To help to reduce these anomalies a Cataloguing in Publication scheme was introduced by the Library of Congress and British Library in the 1970s to allow publishers to identify which categories the books they publish should be catalogued under within libraries.

Cataloguing of data held in libraries has become very sophisticated, but in many cases it is difficult to relate items in catalogues using different schemes. Despite efforts to standardize machine readable cataloguing (MARC) there is still no agreed way of cataloguing general data collections in an internationally acceptable manner. It is not even possible to select a single catalogue for a single language, as the differences between the US and UK forms of MARC show.

For libraries that specialize in particular subject areas, the general classification categories provided through schemes such as the Dewey Decimal Classification scheme used for the British Library catalogue and the Library of Congress classification scheme used for USMARC records do not allow sufficient sub-categorization of data to allow related data sources to be identified. Traditionally such libraries have developed their own categorizations, without considering the relationship between their schemes and those used in other libraries specializing in the same subject. As more and more of these specialist libraries start to make their catalogues available through the WWW, the discrepancies between individual catalogues are becoming more obvious to both creators and users of library catalogues.

Note: Details of the top-level categories used in the general-purpose national library classification schemes listed above can be found in the OII Guide to Metadata.

Many schemes for introducing cataloguing metadata into the WWW environment are being studied, but in general these only rely on assigning searchable keywords to files. This approach does not provide any facilities for defining the relationships that might exist between keywords, or between different sets of keywords. Some schemes, such as those for the Dublin Core Metadata set, seek to add the metadata to the heading of the HTML files that form the bulk of the data on the WWW. Others, such as Netscape's Meta Content Framework using XML and Microsoft's XML-Data proposals, allow the metadata to be prepared outside the referenced document.
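
As a sketch of how metadata embedded in an HTML heading might be read, the following collects meta elements whose name begins with "DC.", which is the usual convention for carrying Dublin Core elements in HTML. The sample page and the extract_dublin_core function are invented for illustration.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collects <meta name="DC.xxx" content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        if name.upper().startswith("DC."):
            element = name[3:].lower()
            self.metadata.setdefault(element, []).append(attrs.get("content") or "")


def extract_dublin_core(html_text):
    """Return a dict mapping Dublin Core element names to lists of values."""
    parser = DublinCoreParser()
    parser.feed(html_text)
    return parser.metadata


# Invented sample page, for illustration only.
SAMPLE_PAGE = """<html><head>
<title>Growing oak trees</title>
<meta name="DC.title" content="Growing oak trees">
<meta name="DC.subject" content="forestry">
<meta name="DC.subject" content="trees">
<meta name="DC.language" content="en">
</head><body><p>Text of the page.</p></body></html>"""

dc = extract_dublin_core(SAMPLE_PAGE)
print(dc["subject"])
```

As the paragraph above notes, the two subject keywords are simply listed side by side. Nothing in this form of metadata says how "forestry" and "trees" relate to one another.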

The World Wide Web Consortium (W3C) has drafted a Resource Description Framework (RDF) that will allow different schema devised for use over the Internet to be described in a consistent manner. Once RDF is widely adopted it may become possible to interconnect different identification schemes into a coherent search tool.

There are a number of search engines on the WWW that provide a catalogue-based approach to finding data. A list of multilingual catalogue-based search engines is provided in the OII Searching Techniques section of the OII Standards and Specifications List. The availability of catalogues and help information in the language of the user is an important factor in encouraging the use of such search tools in areas of cultural diversity such as Europe.

For reasons related to how best to find information in an interactive environment, on-line search catalogues tend to be broad and relatively shallow. Typically such categorized services are designed for use by those who are not familiar with the terms used by a specialist community, and are designed to identify a broad range of sources that provide a starting point for more in-depth study. Because they deliberately adopt such a policy, category-based search engines often make it difficult for experienced users to find information related to specific disciplines, or to specialist areas within disciplines.

To understand the differences between using full-text and catalogue-based searching, consider the following sequences of sub-categories that the popular Yahoo search engine suggests could be used to reach information related to trees:

  • Business and Economy: Companies: Home and Garden: Lawn and Garden: Nurseries: Trees: Christmas Tree Farms
  • Business and Economy: Companies: Outdoors: Hunting: Gear and Equipment: Tree Stands
  • Recreation: Outdoors: Parks: United States: National Parks: Joshua Tree National Park, California
  • Computers and Internet: Operating Systems: GNU: GNU Info Tree
  • Business and Economy: Companies: Computers: Consulting: Information Technology: Binary Tree, Inc.
  • Business and Economy: Companies: Home and Garden: Lawn and Garden: Trees: Tree Relocation
  • Regional: U.S. States: California: Cities: Joshua Tree
  • Regional: Countries: Australia: States and Territories: New South Wales: Cities and Regions: Lemon Tree Passage
  • Regional: Countries: Canada: Business: Companies: Home and Garden: Lawn and Garden: Christmas Tree Farms
  • Regional: Countries: Belize: Districts and Regions: Cayo District: Cities: Bullet Tree Falls
  • Business and Economy: Companies: Home and Garden: Lawn and Garden: Nurseries: Trees
  • Recreation: Home and Garden: Gardening: Trees
  • Business and Economy: Companies: Home and Garden: Lawn and Garden: Trees
  • Society and Culture: Holidays: Christmas: Trees
  • Business and Economy: Companies: Gifts: Christmas: Trees
  • Entertainment: Music: Artists: By Genre: Rock: Screaming Trees
  • Entertainment: Music: Artists: By Genre: Rock: Gothic: And Also the Trees
  • Regional: Countries: Australia: Recreation and Sports: Home and Garden: Gardening: Trees
  • Regional: Countries: Australia: Business: Companies: Home and Garden: Lawn and Garden: Nurseries: Trees
  • Regional: U.S. States: Michigan: Cities: Gaylord: Recreation and Sports: Treetops Sylvan Ski Area and Resort.

In practice, however, users might find the following categories, found using manual search methods, slightly more useful:

  • Science: Agriculture: Forestry
  • Computers and Internet: Computer Science: Algorithms.

This classic example shows the danger of relying on specific keywords for managing searches. If the cataloguer has chosen to use a different set of keywords from those you typically use, identifying the relationship between the keywords you would choose and those chosen by the cataloguer will be a problem.
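
The keyword mismatch can be sketched with a handful of the category paths listed above. The match_paths function performs the kind of literal keyword match that produced the first list. The RELATED_TERMS table is a hypothetical, hand-maintained mapping of the sort a searcher would need in order to reach the more useful categories.

```python
# A few of the category paths quoted above.
CATEGORY_PATHS = [
    "Recreation: Home and Garden: Gardening: Trees",
    "Society and Culture: Holidays: Christmas: Trees",
    "Entertainment: Music: Artists: By Genre: Rock: Screaming Trees",
    "Science: Agriculture: Forestry",
    "Computers and Internet: Computer Science: Algorithms",
]

# Hypothetical mapping from a searcher's keyword to terms a cataloguer might use.
RELATED_TERMS = {"tree": ["forestry", "algorithms"]}


def match_paths(paths, keyword):
    """Return the paths that literally contain the keyword."""
    keyword = keyword.lower()
    return [path for path in paths if keyword in path.lower()]


def match_with_related(paths, keyword):
    """Return the paths containing the keyword or any related cataloguer's term."""
    terms = [keyword.lower()] + RELATED_TERMS.get(keyword.lower(), [])
    return [path for path in paths if any(term in path.lower() for term in terms)]


literal = match_paths(CATEGORY_PATHS, "tree")
expanded = match_with_related(CATEGORY_PATHS, "tree")
print(len(literal), len(expanded))
```

The literal match misses both of the categories found by manual searching, because neither contains the word the searcher typed.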

For example, trying to use the Lycos search engine to identify entries for trees by selecting Science followed by Agriculture and Forestry returned 25 entries, none of which related to trees. The best option seemed to be the entry for the US Department of Agriculture site. Selecting Search: Site Map: Forest Service: Forest at this site eventually found a small amount of mostly irrelevant detail.

This illustrates one of the problems with using general-purpose catalogues. Unless you clearly understand how the domain about which you want to discover information fits into the categorization scheme of the particular search tool you have selected, it is only too easy to spend a long time looking in the wrong place for information.

Meta-search engines

Meta-search engines (also known as multi-search engines) allow users to search for the same keywords using more than one search engine, either sequentially or simultaneously. By combining requests to both catalogues of index terms and full-text searching, the results of the two types of searching can be integrated. In some cases it is possible to choose which type of searching is to be undertaken. In many cases you can separately search documents forming part of the World Wide Web (WWW), those generated by news groups, and the contents of electronic mail archives. In the very best engines it is possible to restrict searching to terms that occur in the header of an HTML file, newsgroup listing or e-mail message.

Some meta-search engines take a query, submit it to a number of search points, concatenate the results and return a single list of relevant documents. Others offer individual entry points for searching using each of the associated engines. Others combine these two techniques by taking a single query, using it to complete individual query requests for each server and then asking the user to select which of the servers to search.
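
The first of these approaches, concatenating results into a single list, might look something like the sketch below. The two "engines" are stand-in functions returning canned results for placeholder URLs, and the ranking rule (documents found by more engines first, then best individual rank) is an assumption made for illustration, not a description of any particular service.

```python
def full_text_engine(query):
    """Stand-in for a full-text engine; returns canned, ranked URLs."""
    return [
        "http://example.org/oak",
        "http://example.org/lounge",
        "http://example.org/forestry",
    ]


def catalogue_engine(query):
    """Stand-in for a catalogue-based engine; returns canned, ranked URLs."""
    return ["http://example.org/forestry", "http://example.org/oak"]


def meta_search(query, engines):
    """Send one query to every engine and merge the results without duplicates."""
    merged = {}
    for name, engine in engines.items():
        for rank, url in enumerate(engine(query), start=1):
            entry = merged.setdefault(url, {"engines": [], "best_rank": rank})
            entry["engines"].append(name)
            entry["best_rank"] = min(entry["best_rank"], rank)
    # Assumed ordering: found by most engines first, then best rank, then URL.
    return sorted(
        merged,
        key=lambda url: (-len(merged[url]["engines"]), merged[url]["best_rank"], url),
    )


ENGINES = {"full-text": full_text_engine, "catalogue": catalogue_engine}
results = meta_search("oak trees", ENGINES)
print(results)
```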

A list of meta-search engines is provided in the OII Searching Techniques section of the OII Standards and Specifications List. Unfortunately very few meta-search home pages are provided in languages other than English at present, despite the fact that such services are a vital factor in encouraging the use of Internet search tools in areas of cultural diversity such as Europe. Whilst this is principally because most of the current services are based in the US, it also reflects the fact that such services only became widely available in 1997.

There are a number of techniques that can be used to improve the efficiency of searches. Use of these techniques is especially important when attempting to send queries to multiple search engines, where you cannot rely on the use of application-specific techniques which may be provided by individual search engines. For example:

  1. Use the most specific search phrase possible -- Longer phrases, such as "Open Information Interchange", are better than acronyms such as OII, which has many different meanings, whilst qualified phrases such as "oak trees" and "binary trees" are better than trees, oak and binary on their own, or as entries in a list of keywords.
  2. Enter phrases between quotes -- This avoids any chance of the string being treated as a set of individual words, which results in far more hits, but a lower proportion of relevant hits.
  3. Use Boolean operators between pairs -- Many search engines (but not all) will treat AND and OR (in caps) as special operators that identify whether both or either of the adjacent words/phrases should be present.
  4. Where supported, use NEAR to indicate words that must be closely related -- This qualified equivalent to AND sets a system-dependent limit on how many words can occur between the adjacent words/phrases.
  5. Use terms found in indexing catalogues -- This will improve your chances of finding entries that will point you on to sub-categorized data, such as terms in topic lists.
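
The first four techniques can be combined in a small helper that builds a query string. The quote and combine functions are hypothetical, and, as noted above, not every search engine interprets AND, OR or NEAR in this way, so the output should be checked against the help pages of the engine being used.

```python
SUPPORTED_OPERATORS = {"AND", "OR", "NEAR"}


def quote(term):
    """Wrap a multi-word term in double quotes so it is treated as a phrase."""
    term = term.strip()
    return f'"{term}"' if " " in term else term


def combine(terms, operator="AND"):
    """Join terms with an upper-case operator, quoting phrases as needed."""
    if operator not in SUPPORTED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(quote(term) for term in terms)


print(combine(["oak trees", "forestry"]))
print(combine(["binary trees", "algorithms"], "NEAR"))
```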

Specialist search engines/catalogues

Search engines that are constrained to searching specific sites are widely available. Some of these search engines, such as the Interactive Movie Database Search engine listed in the Searching Techniques section of the OII Standards and Specifications List, offer searches of comprehensive international databases of information on specific subject areas.

Many user communities have set up WWW pages whose sole purpose is to provide a start point from which interested parties can visit sites that provide information on a particular topic. At present these pages are constrained to storing 'bookmarks' to useful data sources using the limited form of link anchor provided by the HyperText Markup Language (HTML) subset of ISO's Standard Generalized Markup Language (SGML).

A new document markup language, the Extensible Markup Language (XML), will become available during 1998. XML will allow more explicit markup to be applied to documents than is currently possible using HTML. When XML has been deployed it will start to become possible to look for occurrences of terms within named 'elements' in a file. A search in an XML file should be able to distinguish between the use of a number as a part number, a telephone number or a price, something that is not currently possible when carrying out a full-text search of an HTML file.
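
A minimal sketch of such an element-scoped search is given below. The catalogue document and its element names (partnumber, price, telephone) are invented for illustration; a real application would use whatever element names its document type defines.

```python
import xml.etree.ElementTree as ET

# Invented sample document; the element names are assumptions for illustration.
CATALOGUE = """<catalogue>
  <item><partnumber>1998</partnumber><price>45</price><telephone>555 0100</telephone></item>
  <item><partnumber>2001</partnumber><price>1998</price><telephone>555 0199</telephone></item>
</catalogue>"""


def find_in_element(xml_text, element, value):
    """Return the elements with the given name whose text equals the value."""
    root = ET.fromstring(xml_text)
    return [el for el in root.iter(element) if (el.text or "").strip() == value]


as_part_number = find_in_element(CATALOGUE, "partnumber", "1998")
as_price = find_in_element(CATALOGUE, "price", "1998")
full_text_hits = CATALOGUE.count("1998")
print(len(as_part_number), len(as_price), full_text_hits)
```

A plain text search finds the string 1998 twice without distinguishing the two uses, whereas the element-scoped search separates the part number from the price.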

Another problem associated with the use of HTML link anchors today is that HTML anchors can only identify a single document, and a single point within each document. This means that multiple references to a subject in a document have to be represented by multiple links.

A further complication is that HTML link anchors can only identify points within a document that have previously been assigned names by their creator. They do not permit arbitrary points in a document to be identified, nor do they allow users to identify all occurrences of a particular term within a document.

To overcome these problems you need to use languages, such as the proposed XML Linking Language (XLink) or the Hypermedia/Time-based Structuring Language (HyTime), that allow pointers to previously unnamed points in multiple documents, and to sequences of elements, such as all the paragraphs in a section, or everything between one illustration and the next.

In certain environments the number of links to a document is as important as the number of words it contains. For many years journal publishers have used the number of references to a paper to rank its importance. Search engines that base the relevance of their documents on the number of other documents that point to them, or on the number of links they themselves contain, are currently (end-1997) beginning to be tested on the Internet.
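
The simplest form of link-based ranking, counting the documents that point to each page, can be sketched as follows. The LINKS graph is invented, and the sketch makes no claim about how any particular search engine weights links.

```python
# Invented link graph: each page maps to the pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "a"],
}


def rank_by_inbound_links(links):
    """Rank pages by how many distinct other pages link to them."""
    inbound = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                inbound[target] = inbound.get(target, 0) + 1
    return sorted(inbound.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_inbound_links(LINKS)
print(ranking)
```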

One of the key advantages of the Internet on which the WWW is built is that it allows user communities to develop new concepts faster. New concepts typically involve applying new meanings to existing words. As people are introduced to new concepts they need to be made aware of the new meanings assigned to terms they may have thought they already understood if they are to catch the nuances of the debate. For this reason it is vital that the meaning of key terms be clearly identifiable to the user communities using them. This cannot generally be done by simply referring the terms to a dictionary. It should be done by creating a link between the term and the places at which it was first used/defined by the relevant user community.

In the past recording the meaning of words has been the role of specialist lexicographers, whilst devising cataloguing schemes was left to experienced librarians. In today's interactive world, however, such delegation is no longer possible. It must be up to user communities to clearly identify their own definition of terms and to catalogue where they have used these terms in a consistent manner. The role of lexicographers then changes to identifying those points at which terms have been assigned a specific meaning so that a single reference source for all possible meanings of a term can be developed. Similarly the role of library cataloguers becomes that of defining the relationships between terms in such a way that it is possible to identify which terms form a sub-category of a given subject.

For such advanced systems to become possible it is important that the description of the meaning of terms, and the location of references to these terms, are separated from the definition of the relationships between terms. It is also important that the maintenance of the locations at which a term is used is separated from the maintenance of the definition of the term, as a particular user community should only need to define its meaning of a term once, and then to be able to continue to use that term in new documents for a long time.

In addition, the WWW requires us to make other forms of distinction between terms if we are to enable users to distinguish one use of a term from another when doing full-text searching. Users need to know in which universes of discussion (domains) the term is being used, and for which languages this term is relevant. They may also need to know what the equivalent term is in another language, and when the term started to be, or stopped being, employed with the specified meaning in that language.
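
One way to picture the separation described in the last two paragraphs is a data structure that holds definitions, occurrences and relationships in three independent tables, with each definition qualified by domain, language and period of use. The class and field names below are hypothetical and do not come from any standard.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class TermDefinition:
    term: str
    domain: str                       # universe of discussion
    language: str                     # ISO 639 language code
    meaning: str
    used_from: Optional[int] = None   # year the meaning came into use, if known
    used_until: Optional[int] = None  # year the meaning fell out of use, if known


@dataclass
class TermRegistry:
    definitions: dict = field(default_factory=dict)  # key -> TermDefinition
    occurrences: dict = field(default_factory=dict)  # key -> list of locations
    narrower: dict = field(default_factory=dict)     # key -> set of narrower keys

    def define(self, definition):
        """Record a meaning once; the returned key is reused for every later use."""
        key = (definition.term, definition.domain, definition.language)
        self.definitions[key] = definition
        return key

    def record_use(self, key, location):
        """Add a location without touching the definition."""
        self.occurrences.setdefault(key, []).append(location)

    def relate(self, broader_key, narrower_key):
        """State that one term is a sub-category of another."""
        self.narrower.setdefault(broader_key, set()).add(narrower_key)

    def senses(self, term):
        """Return every recorded meaning of a term, across domains and languages."""
        return [d for key, d in self.definitions.items() if key[0] == term]


registry = TermRegistry()
forest_tree = registry.define(
    TermDefinition("tree", "forestry", "en", "a woody perennial plant"))
data_tree = registry.define(
    TermDefinition("tree", "computing", "en", "a hierarchical data structure"))
binary_tree = registry.define(
    TermDefinition("binary tree", "computing", "en", "a tree whose nodes have at most two children"))
registry.relate(data_tree, binary_tree)
registry.record_use(data_tree, "http://example.org/homework5.html#q1")
print([d.domain for d in registry.senses("tree")])
```

Because the three tables are independent, a new occurrence can be recorded, or a new relationship declared, without editing the definition itself.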

Topic Navigation Maps - a solution for the 21st century?

A proposed new ISO Topic Navigation Map standard (ISO 13250) will provide facilities for creating, maintaining and interchanging topic-based navigational aids to large corpora of documents containing interrelated information. The standard makes a distinction between the highly concentrated and independent topic navigation maps -- sets of relations between the topics covered in a given corpus -- and the addresses of relevant information within the corpora themselves, which are typically defined using facilities provided by the Hypermedia/Time-based Structuring Language known as HyTime.

Topic navigation maps should improve the accessibility of information by facilitating, and to some extent automating, the task of providing navigational resources. Topic navigation maps are designed to simplify groupware-supported production of data for which navigational aids such as indexes, glossaries, tables of contents, lists and catalogues need to be generated. Topic navigation maps will also be used to enhance the navigability of very large information bases by providing in-depth sub-categorization of terminology bases.

Topic navigation maps can be considered as a customized view of an information repository. Different views can be developed by different user communities to allow various points of view to be expressed. Several topic navigation maps can be interconnected to form a more general-purpose knowledge base.

The number and complexity of indexable topics, and the relationships between them, greatly exceeds the number and complexity of relations normally represented in traditional databases or, for that matter, in the kinds of indexes normally found in books. The number of topic relationships that might usefully be represented with respect to any reasonably large collection of documents is, in fact, for all practical purposes limitless. Moreover, even in archived documents, new kinds of topic relationships can be expected to appear from time to time. ISO's Topic Navigation Map standard, therefore, is specifically designed to allow multiple topic navigation maps to be created over a period of time for any collection of data, and to allow for different topic navigation maps to be inter-related.

Creating and maintaining indexes can be a difficult and expensive proposition. Many of today's printed indexes are indexes in name only. All too often, even when an index is well thought out, well constructed, and useful, little thought is given to its maintainability. When the time comes to create an updated or corrected index, the original documentation for the topic architecture of the index is no longer available. Indeed, it may never have existed or have been consciously expressed in any abstract way. Even an index on which enormous maintenance effort has been expended can quite easily become self-inconsistent, especially when the size of the indexing task dictates that it must be a cooperative effort, or when there have been changes in the responsible personnel.

The ISO Topic Navigation Map standard will enable:

  • many experts in a given field of knowledge to share in, and jointly contribute to, the evolution of a common map of topic relationships for the field;
  • the merging of maps, whenever multiple fields of knowledge must be used simultaneously, in such a way as to maximize the meaningful cross-connections between them; and
  • the re-use of maps in a variety of ways for a variety of purposes, such as extracting printed and online indexes and glossaries for particular documents.
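
The merging and re-use of maps listed above can be pictured with a deliberately simplified model in which a map is a dictionary of topics, each with a set of related topics and a set of addresses. This is not the interchange syntax defined by ISO 13250; the structure, the function names and the sample addresses are all assumptions. Matching topics purely by name is a further simplification, since the same name can denote different topics in different fields.

```python
def merge_maps(*topic_maps):
    """Combine maps, pooling the relations and addresses of same-named topics."""
    merged = {}
    for topic_map in topic_maps:
        for topic, entry in topic_map.items():
            target = merged.setdefault(topic, {"related": set(), "addresses": set()})
            target["related"].update(entry.get("related", ()))
            target["addresses"].update(entry.get("addresses", ()))
    return merged


def extract_index(topic_map):
    """Produce an alphabetical index of topics with their sorted addresses."""
    return [(topic, sorted(topic_map[topic]["addresses"])) for topic in sorted(topic_map)]


# Invented sample maps maintained by two different user communities.
FORESTRY_MAP = {
    "trees": {"related": {"forestry"}, "addresses": {"forestry.html#oak"}},
    "forestry": {"related": {"trees"}, "addresses": {"forestry.html#intro"}},
}
COMPUTING_MAP = {
    "trees": {"related": {"algorithms"}, "addresses": {"algorithms.html#binary"}},
}

combined = merge_maps(FORESTRY_MAP, COMPUTING_MAP)
print(extract_index(combined))
```

The source maps are left unchanged by the merge, so each community can continue to maintain its own map independently.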

Identifying service providers

An alternative form of searching that is starting to be offered as an adjunct to many search engines is the searching of what are known as yellow and white pages. These services provide the electronic equivalent of a telephone directory, but return either e-mail addresses of individual users or web site URLs for the providers of specific services.

Yellow page search engines are often provided by the same companies that provide the yellow pages in telephone directories. Typically such systems are regionalized, providing information about services in a well defined geographical area. Some Internet-specific services do not require listed companies to pay a fee for an entry, providing freely accessible on-line facilities for users to add and update entries.

One yellow page facility that is noticeably difficult to find at present is a site that makes the various category headings available in different languages. The Yelloweb Europe Directory listed in the Searching Techniques section of the OII Standards and Specifications List is a noteworthy exception to this trend.

White page search engines allow users to find a registered e-mail address. Some search engines are global in extent, and do not offer users the choice of which region or country addresses should be returned from. Others are restricted to a particular country, but do not necessarily make this clear. This latter approach is particularly common among search engines based in the US, which tend to presume that the address you want is one listed in a US white pages directory. As these services tend to be fairly new they do not typically offer users a choice of languages for the search page and use instructions. A noteworthy exception to this is the Bigfoot White Pages.

Details of standards that should be used to develop white page directories are provided in the Directory Standards section of the OII Standards and Specifications List. The recent publication of Version 3 of the Lightweight Directory Access Protocol should make it easier to create meta-search engines that will be able to pass queries on to a wide range of yellow and white page search engines.

The adoption by the Internet community of the functionality provided in the X.500 directory service protocols (known as The Directory) should make it easier to create well integrated distributed directory systems during the early part of the 21st century. One of the pilot projects set up by the G7 initiative to develop a Global Marketplace for SMEs involves the setting up of a global directory network.

This information set on OII standards is maintained by Martin Bryan of The SGML Centre and Man-Sze Li of IC Focus on behalf of European Commission DGXIII/E.

File last updated: April 1998
