
MPEG-7 Seminar
Multimedia Content Description Interface
Bristol, 9th April 1997

Following the decision of ISO/IEC JTC1 SC29 to develop a multimedia content description interface standard it was decided that a seminar to discuss user requirements should be held as part of the April 1997 SC29 meeting. The seminar, which was open to the general public, took the form of a set of presentations followed, on the next day, by a discussion of the impact these presentations would have on user requirements for the new standard.

A set of presentations was arranged to give an overview of current research in this area; these are summarized below.

The seminar was organized and directed by Thomas Sikora.

MPEG-7 - A scenario

To set the scene for the seminar Dr. Leonardo Chiariglione, the chair of SC29, explained how the idea of developing a Content Description Interface for multimedia data encoded using SC29 standards such as MPEG-2 and MPEG-4 was first mooted in Seoul. This led to preliminary discussions in Florence in March 1996, with further internal discussions at subsequent SC29 meetings before it was decided to hold a public seminar to discuss the issue as part of the Bristol meeting.

DVB already has ways of classifying digital television broadcasts into one of 16 categories, which can be associated with textual descriptions and parental rating information to help users identify suitable programmes. What is already clear, however, is that 16 categories are too few. You can, for instance, identify programmes relating to hobbies, but not narrow them down to just those relating to gardening. In addition, the categories apply to whole programmes: you cannot distinguish the characteristics of different parts of a programme. ETSI are developing an API for DVB classification descriptions.

MPEG-4 will provide object descriptors that will cover things such as IPR ownership, parental rating, DVB classification and cost of encoded audiovisual transport streams, but this information will not be adequate for fully indexing audiovisual data. In particular MPEG-7 will need to provide "information search friendly" techniques for identifying audiovisual sequences based on both manually recorded and automatically recognized features.

It is probable that MPEG-7 will be developed in phases, starting with manually added descriptors, with automatically generated descriptors and then natural language querying of image bases following as technology improves.

MPEG-7 - Context and Objectives

Rob Koenen, who has been tasked by SC29 with managing its MPEG-7 work programme, explained the plans for the new group. A web site has already been set up at http://www.cselt.stet.it/mpeg to provide information on MPEG-7 to potential users. The purpose of the Bristol meeting was to draft out an initial set of user requirements that could be sent to interested parties for comment.

MPEG-7 will only be concerned with how to describe and search multimedia data - it will not control how the data will be encoded in the way that other MPEG standards have. At present it is unclear how the information coded within an MPEG-4 sequence could be integrated with the information held outside the data through MPEG-7. It is probable that there will be high- and low-level descriptors, with low-level descriptors being derived directly from the data and high-level descriptors being added by human classifiers. MPEG-7 will not concern itself with how the descriptors are ascertained or how they are searched. It will only concern itself with how they are recorded.

It is probable that descriptors that are not directly ascertainable from the content will be needed. For example, details of authorship, production dates, locations, production team and other ancillary detail are often only available from the storyboard and other paperwork associated with a clip.

It is likely that different users will want different classifications for the same material. For example, a gangster movie may need to be classified according to characters portrayed, clothing worn, decor depicted, etc.

A first working draft of the standard is not anticipated before July 1999, following calls for proposals by November 1998. The final CD for the standard is expected to be published in the first quarter of 2000, with final publication of the approved standard by the end of that year.

Linking text and image retrieval

Martin Bryan, the official liaison between SC29 and SC18, gave a brief overview of what SC18 were doing to ensure that text retrieval systems could take full account of image descriptions. After explaining SC18's current range of standards (SGML, DSSSL, SPDL, HyTime, etc.) he provided an overview of new work in this area. Today most searching for information is text based, using data encoded in HTML. While HTML can reference still images, and provides a method by which a text descriptor can be associated with a call to an image, it does not have any mechanism for attaching arbitrary sets of descriptors to images. For moving images Java applets can be invoked, but these cannot have descriptors attached to them in the source data either. The new eXtensible Markup Language (XML) would allow users to define their own descriptor elements, but would not allow descriptors to be associated with images or applet calls directly, as such descriptors cannot be associated with individual stored objects.

SGML already contains a general purpose mechanism for attaching arbitrary sets of descriptive properties to image files in a way that could be used to search for suitable images using existing search tools for the Internet. The HyTime standard contains comprehensive facilities for locating relevant parts of text and images, including moving images and audio, and for associating descriptive information with these locations.

SC18 has recently started to develop a standardized way of defining Topic Navigation Maps that could be used to structure the way in which users search through complex data sets. By combining HyTime locators with structured sets of descriptors, application-specific ways of describing data sets can be recorded separately from the data.

Mr. Bryan ended with a plea that the MPEG-7 team ensure that the techniques they adopt can be integrated with those already used for text searching. If we end up having to use different techniques for searching for text and for audiovisual information, the concept of setting up integrated information services for the 21st century will have to be realized using proprietary rather than standardized solutions.

A content provider's view of multimedia indexing

Remi Ronfard from INA in Paris started by explaining how INA manages the official archives for all French broadcasting. INA has been indexing programmes for more than a decade, using computer databases to record details of more than 80,000 television and 60,000 radio broadcasts. They are currently indexing some 35,000 programmes a year.

Indexing is currently done manually, based on a combination of thesaurus keywords and secondary descriptions. Different techniques are applied to documentaries and other single-subject programmes on the one hand, and to news and magazine-style programmes on the other. For the latter you need to index both the programme as a whole (to record details of the programme title, time(s) of transmission, the producer, link presenter(s), etc.) and the individual film clips (to record who made them, who added the commentary, what they show, etc.)

The INA archives are used by researchers and by companies wishing to identify clips that can be reused as part of new programmes. Such people typically use the clip descriptors rather than those describing the whole programme to identify suitable sources.

Traditionally clips have been indexed on things such as the names of people and locations shown, organizations mentioned and time recorded, rather than on topics covered. For single subject programmes an abstract is prepared together with a description of the sequences used and key indexing terms.

INA would like to turn their archive into a multimedia knowledge base. To do this they will need to be able to fully record the context in which clips are generated and reused, and the relationship between the clips making up a programme.

The thesaurus currently being used at INA does not contain verbs, so it is very difficult to classify programmes based on actions or relationships between events. A better system of recording time-dependencies is also required.

INA would like to develop a standardized vocabulary for describing film clips. The vocabulary should cover things like shot type, duration, camera motion, location of shot, place/camera direction within location, time at which event was shot, actors and other objects of interest, screen position, gaze direction, etc., as well as a summary of the actions recorded. Some of this information is available from the storyboard and related documentation generated during shooting. Being able to associate this information with the clip, and then query it, would provide a powerful tool for image and sound retrieval.
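As an illustration only (no such vocabulary has yet been standardized, and every field name below is hypothetical), a clip vocabulary of the kind INA describes might be modelled as a simple record:

```python
from dataclasses import dataclass, field

@dataclass
class ClipDescriptor:
    """Hypothetical record for a standardized clip vocabulary.
    Field names are illustrative, not drawn from any standard."""
    shot_type: str                  # e.g. "close-up", "wide", "tracking"
    duration_s: float               # clip duration in seconds
    camera_motion: str              # e.g. "static", "pan-left", "zoom-in"
    location: str                   # where the clip was shot
    shot_time: str                  # when the event was shot
    actors: list = field(default_factory=list)   # people/objects of interest
    action_summary: str = ""        # summary of the actions recorded

# A sample descriptor, as might be transcribed from a storyboard.
clip = ClipDescriptor(
    shot_type="close-up",
    duration_s=12.5,
    camera_motion="static",
    location="studio",
    shot_time="1997-04-09T10:00",
    actors=["link presenter"],
    action_summary="presenter introduces the news item",
)
```

A query tool could then filter an archive on any combination of these fields, which is exactly the kind of retrieval the thesaurus-only approach makes difficult.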

The Video Mail Retrieval project

Martin Brown from ORL used a short video to introduce some of the findings of the Video Mail Retrieval (VMR) project undertaken by ORL for Oracle and Olivetti as part of their experiments at Cambridge University into the use of video mail. Unlike normal e-mail, video mail cannot be searched using standard text searching methods when it is received. The VMR project has been looking into how the speech associated with video mail can be analyzed to produce a searchable record of mail contents.

Whilst speech recognition of limited vocabularies for known users is today well understood, the problem of analyzing general speech from untrained users has still to be solved. Typically success rates of not much greater than 50% can be achieved using standard analysis.

To help overcome this, the VMR team concentrated on the analysis of the individual "phones" that make up the phonemes used to create speech. By building up webs of phone combinations it is possible to identify sets of words that a sound could represent. By matching search requests to "phone sets" that cover the ways a word could be pronounced, the possibility of finding a suitable match within the audio data increases. At present such searches cannot be restricted to correct uses of a single word, but with the addition of "learning algorithms" it is anticipated that they will provide a suitable method for scanning a video mail database to identify clips relating to a particular subject.
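The matching idea can be sketched in a drastically simplified form. Assume the recognizer emits, for each time step, a set of candidate phones (a crude stand-in for a real phone lattice, which is far richer), and that the query word is expanded into its possible pronunciations; the phone labels below are illustrative only:

```python
def spot_keyword(lattice, pronunciations):
    """Return start indices where any pronunciation of the query
    word fits the lattice of per-step candidate-phone sets."""
    hits = []
    for start in range(len(lattice)):
        for phones in pronunciations:
            if start + len(phones) <= len(lattice) and all(
                p in lattice[start + i] for i, p in enumerate(phones)
            ):
                hits.append(start)
                break  # one pronunciation matching is enough
    return hits

# Hypothetical lattice for speech resembling "...video mail...":
# each set holds the candidate phones recognized at that time step.
lattice = [{"v"}, {"ih", "iy"}, {"d"}, {"iy", "ih"},
           {"ow"}, {"m"}, {"ey"}, {"l"}]

# Query "mail", with a single pronunciation variant.
print(spot_keyword(lattice, [["m", "ey", "l"]]))  # → [5]
```

Because several candidate phones survive at each step, a query can match even when the single best recognition hypothesis would have been wrong, which is the source of the improved hit rate described above.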

IBM's QBIC and its relevance to MPEG-7

John Ibbotson from IBM introduced the concepts behind IBM's Query By Image Content (QBIC) methodology. QBIC identifies features of different types of data that can be recorded within a database - for example, the pitch of an audio clip, or the texture and colour of parts of an image. Rather than try to identify images based on a record of their contents, QBIC tries to identify characteristics of an image that can be used for pattern matching.

For moving images QBIC first identifies clip boundaries, and then searches for representative frames from selected clips, the number of frames/clips being analysed depending on the depth to which the data is to be indexed. Each selected frame is then analyzed in detail to identify its recordable features.

The database has a query interface that is attached to a match engine. Searches can be based on selecting the features required, or on drawing a sketch or humming a tune that can be analyzed to identify features that can usefully be matched against the database. The search results are ranked according to their likely relevance.
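The ranking step can be illustrated by comparing stored feature vectors, such as coarse colour histograms, against the features extracted from the query. This is only a minimal sketch of the general feature-matching idea, not IBM's actual algorithm, and the image names and histograms are invented:

```python
import math

def feature_distance(a, b):
    """Euclidean distance between two feature vectors
    (here, 3-bin colour histograms)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_matches(query, database):
    """Rank database entries by closeness to the query features,
    most relevant (smallest distance) first."""
    scored = [(name, feature_distance(query, feats))
              for name, feats in database.items()]
    return sorted(scored, key=lambda item: item[1])

# Hypothetical indexed images with pre-computed colour histograms.
db = {
    "sunset": [0.7, 0.2, 0.1],
    "forest": [0.1, 0.7, 0.2],
    "sea":    [0.1, 0.2, 0.7],
}

# A query sketch dominated by the first colour bin ranks "sunset" first.
print(rank_matches([0.6, 0.3, 0.1], db)[0][0])  # → sunset
```

The same ranking machinery works whether the query features come from a menu of selections, a drawn sketch, or a hummed tune, since all are reduced to feature vectors before matching.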

As has been found in other disciplines, artificial intelligence cannot supply a complete answer. You need help from experts to build up knowledge bases of what constitutes an appropriate set of matching criteria. QBIC is not seen as an automated replacement for traditional data cataloguing techniques. Instead it is seen as a data mining tool that can be used to obtain information about images that was not captured through cataloguing.

John Ibbotson ended his talk with a request that those developing MPEG-7 take a look at the classification schemes already developed by librarians, especially those used in conjunction with the ANSI Z39.50 specification. His final slide read:

Where is the wisdom - lost in the knowledge!
Where is the knowledge - lost in the information!

The SMASH project's perspective on browsing of consumer video archives

Inald Lagendijk of Delft University explained how the EU-funded SMASH project was experimenting with techniques to allow untrained consumers to browse the contents of their own video archives within a typical home environment. The presumption is that all devices used to record incoming digital broadcasts would be connected to a local "database" recording details of which programmes were recorded on which medium by which device. Browsing of the database and retrieval of data would have to be done using simple tools operating through a standard television screen/video interface.

One key difference from existing systems would be the ability to record multiple incoming streams at one time. These streams will need to be separated prior to analysis of their contents.

It is presumed that the MPEG-2/4 coded input will be accompanied by textual descriptors applying to the whole programme. These need to be extracted and placed into the local database, where they will later be supplemented with data obtained by analyzing the recorded data. The retrieved textual information will need to be integrated with other information obtained over the WWW from, for example, programme guides.

When time is available each video needs to be analyzed to identify features that can be used by consumers to request a particular video clip. This requires analysis of shot boundaries, selection of representative frames, and analysis of the data contained therein. The low-resolution DC images provided as part of an MPEG-2 sequence do not provide sufficient information for accurate analysis. You need to combine I and P frames to obtain sufficient detail, but this should be done at a resolution level that is below that of the full stream, otherwise analysis takes too long.
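A minimal sketch of the shot-boundary step, assuming each reduced-resolution frame has been decoded to a flat list of greyscale values and using a simple mean-absolute-difference threshold (real systems use more robust measures, and the threshold here is arbitrary):

```python
def shot_boundaries(frames, threshold=0.3):
    """Flag indices where the mean absolute pixel difference between
    consecutive low-resolution greyscale frames exceeds a threshold,
    suggesting an abrupt shot change."""
    boundaries = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            boundaries.append(i)
    return boundaries

# Three nearly identical frames, then an abrupt scene change:
frames = [
    [0.10, 0.10, 0.10, 0.10],
    [0.12, 0.10, 0.11, 0.10],
    [0.11, 0.10, 0.10, 0.12],
    [0.90, 0.80, 0.90, 0.85],
]
print(shot_boundaries(frames))  # → [3]
```

A representative frame from each detected shot would then be passed on for the more expensive feature analysis described above.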

Content-based video representation/indexing

Murat Tekalp from the University of Rochester, USA, presented details of the work being done at Rochester to analyze news broadcasts based on the meaning and semantics of their contents. In a typical news broadcast effects such as wipes and dissolves can be used to identify clip boundaries. Link presenters and the scenes shown behind them are also good clues to when a subject change has occurred. Analysis based on 2D meshes and motion analysis within MPEG-2 image streams can be used to identify the "birth frame" and "death frame" associated with a particular story. By analyzing these, together with frames taken at other points at which image changes take place, a small subset of significant frames can be identified for detailed analysis.

Content-based visual querying

Dr. Shih-Fu Chang from Columbia University explained the key features of the tools that Columbia have developed for indexing images stored on the WWW. These tools are being used for off-line indexing of images stored at web sites and to provide a means for on-line browsing of the index. The query interface is the critical factor in determining what sorts of analysis need to be performed. It is based on a database of metadata that is specifically designed for ease of retrieval of data that is not stored at the same site as the database.

Columbia's WebSeek tool uses image classification as its basic methodology. Each image is classified according to a fixed set of criteria which form a navigable hierarchy. Currently WebSeek has been used to analyse over 1 million GIF and JPEG images stored on the WWW, and some 10,000 video clips.

Columbia have also developed a WebClip tool for browsing and editing MPEG images on the WWW, and a VideoQ object-based video query system. These tools incorporate a scene change detector that can detect over 90% of scene changes by analyzing the incoming compressed data stream. Using compressed data as the source, recognition of faces can be done in around 30 ms.

Dr. Chang stressed the advantages of identifying the relationships between analysed frames in terms of hierarchies, where each frame is linked to preceding ones through a tree structure. For a typical news broadcast the presenters provide a natural top level for such trees. Grouping of other frames is generally done in a bottom-up manner based on shared features of temporally related sequences.
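Bottom-up grouping of this kind can be sketched as merging temporally adjacent frames that share enough features; the resulting groups would form the leaves of the tree described above. The feature labels and the merge rule below are invented for illustration:

```python
def group_frames(frames, min_shared=2):
    """Group temporally adjacent frames into runs in which each frame
    shares at least `min_shared` features with its predecessor."""
    groups = [[frames[0]]]
    for frame in frames[1:]:
        last = groups[-1][-1]
        if len(frame["features"] & last["features"]) >= min_shared:
            groups[-1].append(frame)   # extend the current run
        else:
            groups.append([frame])     # start a new group
    return groups

# Hypothetical analysed frames from a news broadcast.
frames = [
    {"id": 0, "features": {"presenter", "studio", "logo"}},
    {"id": 1, "features": {"presenter", "studio", "map"}},
    {"id": 2, "features": {"street", "crowd"}},
    {"id": 3, "features": {"street", "crowd", "car"}},
]
print([[f["id"] for f in g] for g in group_frames(frames)])  # → [[0, 1], [2, 3]]
```

The studio frames group together and sit naturally at the top of the tree, with the location footage hanging beneath them.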

For retrieval purposes it is important that an ontological approach is used, as feature-based searching tends to be too broad. The 2000 information classes used for WebSeek have proved to be sufficient for retrieval purposes, but automatic classification only works for some 60% of images.

Current research is hampered by the lack of a suitable test suite against which different systems can be tested and the results compared. Dr. Chang hoped MPEG-7 will lead to the preparation of a suitable test suite.

MPEG-7 related audio research

Eric Scheirer from the MIT Media Lab's Machine Listening Group brought the presentations to a close with illustrations of the effects of applying machine analysis of music to synthesized sound reproduction. Machine listening to music, and to other types of sound effects, requires a multidisciplinary approach based on a combination of signal processing, acoustic analysis, music psychology, pattern recognition, statistical analysis and information theory.

To date most synthesis has been based on musical concepts, but video producers also want sound effects synthesized. MIT have analyzed a wide range of sound tracks of trucks in action to try to work out how to synthesize truck noises. Similar analysis of water-based sound effects has identified features that are shared by all such sounds. Using these features a synthesizer has been produced that will turn the sound of a shower of water into a stream of water from a tap and then into a bubbling bath in a continuous sequence.

MIT are building synthetic listeners that are able to analyse features such as rhythm, tempo, speed, attack, timbre and expression. They can use these listeners to classify recordings and create a database from which music to suit particular moods within film sequences can be selected. By creating synthesizers that use a similar set of features it is possible to synthesize music to suit a particular mood or action sequence.

At present there are still some problems with the techniques used for analyzing music, as an example of trying to use the rhythm detected by the listening software to add a drumbeat to a classical soundtrack showed. The basic problem is that the listener cannot smooth out changes, or anticipate changes based on what it has already heard. It takes a while for it to detect a change in rhythm and generate the revised beat. In classical music, which typically uses variations on a theme based on rhythm changes, this is much more of a problem than in contemporary music.

When music, sound effects and speech are mixed it is much harder to separate the various components, which introduces a greater error rate into any analysis. Most speech analysis work to date has been based on signals with clearly separated speech. With mixed sound sources accurate analysis of speech becomes much harder. Good techniques for recognizing and classifying the background noise accompanying speech will be needed if we are going to achieve accurate analysis of the audio tracks of films and other types of multimedia presentations. Until speech recognition systems can handle what is known as "the cocktail party effect" used by humans to pick out particular conversations in a multi-person environment, the use of speech analysis as a general-purpose tool for classifying audiovisual material will remain problematic.

Querying based on non-speech characteristics presents interesting problems. Many people use musical characteristics to describe the bit of a film they want to retrieve. Phrases such as "when the violins came in" or "when that eerie music started" are often used to qualify descriptions of action which is difficult to detect by visual analysis of individual frames. Most such descriptions relate to describing moods, which is not surprising because most music is added to film tracks to enhance the mood of the film. It makes sense, therefore, to develop ways of detecting the mood introduced by a particular piece of music so that queries based on mood can be used to find the required part of a film.

Before we can hope to automate the analysis of the music in audiovisual material, a number of supporting techniques will still need to be developed.

Details of the conclusions reached as a result of the seminar can be obtained from http://drogo.cselt.it/mpeg/documents/mpeg-7_seminar.htm.

Martin Bryan



File created: April 1997

©ECSC-EC-EAEC, Brussels-Luxembourg, 1997