SELECT Project Overview

Rating and filtering of scientific, technical and other network documents

A research project funded by the European Union, Telematics project

Table of contents

Short abstract
Partners
Summary
User needs and application area
Introduction: Problems with Scientific and Technical information exchange on the Internet
Methods to help users find the best information
User needs to be addressed
Trust
Different requirements from different users
Selection criteria
Filtering action
User interface and the creation of filtering conditions
User modelling
Rating user requirements
Market situation and prospects
Work content
Phase of the project
Project methods
Filtering methods
Phases of the project

Links to more information

Fuller technical description of the SELECT methods and architecture
Freely available filtering systems
Platform for Internet Content Selection: services.
Platform for Internet Content Selection: Syntax and protocols.
PICS: Internet Access Controls Without Censorship
The Kids on the Web: Safety on the Net
Pics Third-Party Rating Services.
The MPAA Rating Systems.
The Voluntary Movie Rating system.
Recreational Software Advisory Council
Firefly Collaborative Filtering Technology.
Net Shepherd.
GroupLens White Paper.
GroupLens: An Open Architecture for Collaborative Filtering of Netnews.
Collaborative Filtering The SEPIA Suggestion Box
Filtering and Collaborative Filtering. Notes from the DELOS workshop.

Short abstract

The Internet started as a network for science and technology, and a large part of the documents available on it are still scientific and technical documents provided by universities and other research and educational organisations. However, the spread of Internet access to commercial organisations and to undergraduate students has meant a decline in the reliability of the documents available on the net. Nowadays, when you use the web to look up a document, or read a message in e-mail or Usenet News, you cannot trust the correctness of the document. This has reduced the usefulness of the network for scientific and technical professionals.
    In normal scientific and technical communication, a number of instruments are available to safeguard the quality of the information. Examples are peer-reviewed journals and conferences and the academic traditions of thesis development and presentation. Other EU-funded projects have tried to remedy this problem by copying the traditional methods onto networked applications, for example through peer-reviewed electronic journals. The disadvantage of this method is that it is expensive and time-consuming, and in practice will only provide quality for a small number of networked documents.
    This project aims at helping to solve this problem in another way. We will develop rating and intelligent filtering tools, which will help users find the information of value. Filtering for one scientific and technical user will be based on ratings provided by other users with similar competence and on other filtering criteria. Rating and filtering will be provided for documents in mailing lists, in Usenet News, in other messaging services like TAP Web4Groups and for Web documents. We have as partners in this project two of the most successful European providers of Internet search services, EuroSeek and Arianna, and they will provide our rating and filtering services to their users.
    Users involved are two major European providers of Internet search services, who will use the results to improve their services, and their customer groups. Users are also two specialised Internet user groups of scientific and technical people.
    Technology approach: Data bases of ratings of documents on the net, filters using these data bases based on a knowledge data base of the interests, values and competence of users and rating providers, and methods to improve this data base with or without explicit user interaction.
    Benefits for people: People within the EU are probably already spending more than a thousand million hours a year reading Internet documents. Even if this is made only 10 % more efficient, thousands of millions of ECUs per year will be saved. SELECT will help people to find more valuable and interesting documents, and to avoid trash and repetition of what they already know.

Partners:

Omega generation s.r.l. Bologna I-40137 Italy
Riverside, Belgium Belgium
OLE, Office Line Engineering Zoonegem, B-9520 Belgium
MediBRIDGE Brussels, B-1070 Belgium
SZTAKI, Hungarian Academy of Sciences Budapest, H-1111 Hungary
KTH/SU, DSV, Department of Computer and Systems Sciences Kista, S-16440 Sweden
Euroseek Sundbyberg, 172 07 Sweden
Socoec, The Austrian Academy of Sciences Wien A-1040 Austria
IFI, University of Zürich Zürich, CH-8057 Switzerland
University of Edinburgh Edinburgh EH9 3JZ Great Britain
University of Lancaster Lancaster, LA1 4YR Great Britain
ISCN, The International Software Consulting Network Wicklow Ireland

Summary

The objective of this project is to help scientific, technical and other professional Internet users to find the most reliable, valuable, important and interesting information, to avoid trash and to reduce information overload. The service we develop will be available to users of the World Wide Web, Usenet News, e-mail mailing lists, and non-simultaneous computer conferencing systems like TAP Web4Groups (which is currently being developed in an ongoing European Union Fourth Framework Telematics for Research project). We are aiming not only at users who search for specific information on the net, but also at users who use the net to keep up to date with what is happening in particular areas.
    The method for achieving this objective is to develop, demonstrate and user-test rating tools and both intelligent and non-intelligent filtering tools. By rating tools we mean tools for users to evaluate Internet documents and resources and to store their ratings (gradings, quality assessments). By filtering we mean tools to automatically scan Internet documents and resources before delivery to the users. The result of filtering can be an ordering of documents and resources with the most interesting first, or a marking up of documents and resources with codes to help the user in manual decisions on what to read; in some cases it can also mean that less interesting items are discarded and not shown to the user. Which of these results is used depends on the user's needs: for areas important to a particular user, the filter should sort items, not discard them, but for less important areas, the user may prefer that the filter automatically discards items. Filtering can also be based on ratings provided by the author of an item or by other readers of that item. In particular, we will develop tools to select items based on ratings made by people with values, competence and interests similar to those of the person for whom the filtering is done.
    The filters will appear to the users as additions or plug-ins to their ordinary reading and browsing tools, but in reality part of the filtering process will be done in servers, since downloading everything to the personal computer of a user before filtering can be time-consuming and inefficient.
    Part of this project is to develop and test different filtering methods (methods of assigning attributes to documents, methods of using these attributes to filter messages, methods of finding the suitable filtering rules for each user). We will test both automatic methods, where the computer derives the filtering conditions from user actions or evaluations of documents, and manual methods, where interaction between the user and the filter is used to establish the filtering conditions. One type of filtering which we will develop is intelligent filtering. By this is meant that the computer will use machine learning techniques to derive a knowledge data base of information about the preferences of a user, to use in filtering for that user. We will also develop filtering which uses interaction with the user to develop this data base. This interaction need not mean that the user has to specify complete Boolean expressions; it can be based on other kinds of interaction, for example that the computer questions the user only about items which the user rated differently than the filter algorithm expected.
    We have in this project partners who are service providers of Internet search services. These partners will enhance their search services with data bases of rating information. These data bases will be used, when users so wish, to enhance the search service. The data bases will also be accessible for rating and filtering of newsgroup postings, e-mail mailing list messages and messages in conferencing systems like TAP Web4Groups. Their customers are the users on which we will test and demonstrate the software we develop, but our project also includes user organisations representing different Internet user groups.
    This project has an unusually large number of partners. We believe the project will still be manageable because of the well-defined split into different tasks. In particular, several different partners will develop different filtering methods. These filtering methods can be compared and tested against each other to find which of them give the best filtering results. To make such comparisons, an important work package (work package U2) in the project is to collect large sets of rated documents from users (see page 25).
    People within the EU are probably already spending about one thousand million (1 000 000 000) hours a year getting information from the Internet, and this can in a few years be expected to increase manyfold. Even if our rating and filtering system only makes this activity 10 percent more efficient, the potential savings will still be of the order of thousands of millions of ECU/year. The commercially most successful information providers on the Internet are the Internet search services, and the systems we will develop and demonstrate are an enhancement to their services.
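    The savings estimate above is simple arithmetic, and a minimal sketch makes the hidden assumption visible. The hours per year and the 10 percent gain are taken from the text; the value of one hour of a reader's time is not stated anywhere in this overview, so the 20 ECU/hour used below is purely a hypothetical figure, and the gain is read as 10 percent of the hours being saved.

```python
def yearly_savings_ecu(hours_per_year, efficiency_gain, ecu_per_hour):
    """Value of the reading time saved per year, in ECU."""
    hours_saved = hours_per_year * efficiency_gain
    return hours_saved * ecu_per_hour

# 1 000 000 000 hours/year and a 10 percent gain come from the text above.
# 20 ECU/hour is a hypothetical value of one hour of a professional's time.
savings = yearly_savings_ecu(1_000_000_000, 0.10, 20)
```

    With these numbers the saving comes out at about two thousand million ECU a year; a different hourly value scales the result proportionally.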

User needs and application area

User needs to be addressed and description of application sites

Introduction: Problems with Scientific and Technical information exchange on the Internet

The Internet has opened up important new opportunities for knowledge exchange between scientific, technical, professional and other users. Some important techniques for knowledge exchange on the Internet are:
    Search-oriented techniques like the World Wide Web, where a vast amount of factual information is available to anyone searching for specific kinds of information using hyperlinks, subject trees or subject structures and web search engines.
    News-oriented techniques like non-simultaneous computer conferencing services such as Usenet News, electronic journals and various e-mail mailing lists. The EU Telematics project TAP Web4Groups is developing systems in this area which provide better facilities than current techniques.
    The difference between these two techniques is that search-oriented techniques are mainly oriented towards a user who wants to search for specific information already stored in computers or browse through knowledge depositories. News-oriented techniques are mainly oriented towards distribution of news within specific topic areas, and for a continuous, ongoing discussion and exchange of ideas between people with special interests in various fields.
    Sometimes, the user need is to find particular information on particular topics, in other cases the need of the user is to update knowledge, keep up-to-date with recent developments, and increase contacts with other people with the same interests and speciality.
    The distinction is not sharp: the WWW is not used only for search-oriented techniques, nor e-mail and conference systems only for news-oriented techniques; the reverse also happens. It is however important to be aware of the difference, since much of the experience in the field known as information retrieval is primarily aimed at search-oriented, not at news-oriented usage. This means that some of the knowledge from the information retrieval area may not always be directly applicable to news-oriented usage.
    In news-oriented usage, a user is looking for something new but of interest, so a difference from what is already known may be a valuable quality. In search-oriented techniques, the user usually knows fairly well what he is looking for and looks for information which matches the question. Thus, similarity to known documents is often a user need in search-oriented techniques, while too much similarity (i.e. too little new) may be undesirable in news-oriented techniques.
    Not all the information available is of high quality. Since anyone is allowed to put up any information they like on the Internet, there is no quality control (like that done by the editors and reviewers of magazines and journals). Another related problem is that there is often too much information, making it difficult for a user to find what is of most interest and relevance to that user. Users need tools to manage their time, so that they can read the most valuable items within the limited time they can spend on reading information.

Methods to help users find the best information

We believe that the free nature of the Internet is very important. Thus, it is not our intent to implement techniques for censoring and forbidding information. Instead, our intention is to develop and implement techniques to aid users of the Internet in finding the information which is of highest quality and relevance for the particular interests and needs of that user.
    Many people and research groups have tried to tackle these issues on the Internet in many different ways. For an overview. SELECT will tackle them using two related techniques:
    Rating: Methods for users of the Internet to input their evaluation of the quality of various messages, articles and web pages. These evaluations are stored and used to aid other users in selecting documents to read. The interest and knowledge profiles of the users, as shown by their evaluations, can be matched with those of other users, so that the evaluations done by one user are used for aiding other users with the same interests, values or knowledge.
    Example 1: A user with a particular religion or political affiliation may prefer to find information which has been highly rated by other people with the same personal values.
    Example 2: A specialist in an area may want to find high quality information. Information which is of high quality for the specialist may be too complex for a beginner or amateur. Information which the specialist finds trivial and unimportant may be very valuable for a non-specialist who wants to learn the basics about a particular topic.
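    The matching of users described above can be sketched as a similarity-weighted average of other users' ratings. This is only an illustration under stated assumptions, not the method SELECT will use: the 0 to 9 scale, the similarity measure (one minus the mean rating difference on shared documents) and the names `similarity` and `predict_rating` are all hypothetical.

```python
def similarity(a, b, scale=9):
    """Agreement between two users' ratings (dicts: document -> 0..scale).

    1.0 means identical ratings on every shared document, 0.0 means
    maximal disagreement or no shared documents at all.
    """
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_diff = sum(abs(a[d] - b[d]) for d in shared) / len(shared)
    return 1.0 - mean_diff / scale

def predict_rating(user, others, doc):
    """Average of other users' ratings of doc, weighted by similarity to user.

    Returns None when nobody similar has rated the document.
    """
    weighted_sum = total_weight = 0.0
    for ratings in others:
        if doc not in ratings:
            continue
        weight = similarity(user, ratings)
        weighted_sum += weight * ratings[doc]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None
```

    In this sketch a specialist's rating of a new document counts heavily for another specialist who has rated earlier documents the same way, and hardly at all for a beginner whose earlier ratings were the opposite.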
    In the Internet, rating may be applied to many kinds of objects, like web pages, messages, electronic journal papers, public domain software.
    The purpose of rating may be to increase the quality of the documents you read, or to avoid certain documents deemed unsuitable in certain communities for certain groups of readers (example: violence, pornography).
    In the world before the Internet, rating was commonly provided by services such as:

  • Newspapers, magazines, books, which are rated by their editors or publishers, selecting information which they think their readers will want.
  • Consumer organisations and trade magazines which evaluate and rate products.
  • Published reviews of books, music, theatre, films, etc.
  • Peer review method of selecting submissions to scientific journals.

Information filtering: Methods to scan through new information arriving in the Internet through group communication tools like computer conferencing, Usenet News, e-mail mailing lists, electronic journals and new web pages. This scanning has some similarities to information retrieval, but is also different in many aspects. Filtering may be based on different criteria like:

  • Keywords in the document.
  • Semantic analysis of the document.
  • Analysis of the stylistic and genre qualities of the document.
  • Analysis of the similarities between the document and other documents which the same user has rated highly.
  • Documents directly related to other documents of high interest to the user, for example by being replies or having hyperlinks to the document of interest.
  • Rating of the document done by other people or by the author him/herself.

Note that filtering is not only a task of dividing all documents into two categories, "good" and "bad", for a particular user. Often, what the user needs is instead a list of documents sorted by a matching index. Also, the user often wants the filter to sort information of interest into different areas or folders representing different interests of that user.
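    A minimal sketch of these two outputs, a list sorted by a matching index and a sorting into folders, using only the simplest criterion from the list above (keywords in the document). The function names, the naive whitespace tokenisation and the "unsorted" folder are assumptions made for illustration.

```python
def matching_index(text, interest_terms):
    """Fraction of the interest terms that occur in the text (0.0 to 1.0)."""
    if not interest_terms:
        return 0.0
    words = set(text.lower().split())  # naive tokenisation, punctuation is ignored
    return sum(1 for term in interest_terms if term in words) / len(interest_terms)

def rank(documents, interest_terms):
    """Return all documents, best match first; nothing is discarded."""
    return sorted(documents,
                  key=lambda doc: matching_index(doc, interest_terms),
                  reverse=True)

def sort_into_folders(documents, folders):
    """Place each document in the folder whose terms it matches best."""
    result = {name: [] for name in folders}
    result["unsorted"] = []
    for doc in documents:
        best = max(folders,
                   key=lambda name: matching_index(doc, folders[name]),
                   default=None)
        if best is not None and matching_index(doc, folders[best]) > 0:
            result[best].append(doc)
        else:
            result["unsorted"].append(doc)
    return result
```

    The other criteria in the list (semantic analysis, genre, ratings by other people) would replace or complement `matching_index` without changing the two output forms.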
    The relation between rating and filtering is shown by the fact that rating is also known under the names "social filtering" and "collective filtering".

User needs to be addressed

Several of the partners in the SELECT consortium have already carried out studies on user requirements:
    DSV has performed a three-year study on information filtering. In this study, investigation of user requirements was an important part. The most important user needs reports in this project were [Lantz 1993, Lantz 1995, Fåhræus 1997]. The DSV study was performed on users of Usenet News, because this is an Internet application where filtering needs are especially large. The methods used in the DSV studies were (i) to gather groups of Usenet News users to discuss user requirements and to take notes from these discussions, (ii) to ask Usenet News users to manually filter articles and explain how they decided which articles to read and not to read, and (iii) to let Usenet News users use a prototype of a newsreader with filtering capabilities and to interview them about their experience of this usage. A full overview of reports from this research project can be found at URL http://dsv.su.se/jpalme/fk/if_Doc/IntFilter.html.
    The TAP Web4Groups project has studied user requirements for rating. These results are reported in [Scmutzer et al 1997, Irmay 1997]. (The Web4Groups project will implement a limited system for collaborative rating in small user groups.)
    One conclusion from these studies is that different users have different filtering requirements. It is thus not possible to specify a single filtering system to satisfy all users. The system must be able to adapt to the needs of different users.

Trust

Important in the design of filtering systems is trust. A user is willing to let a filtering system filter messages only if the user can trust the filter. Users who are afraid that the filter system will delete important messages without asking them are not willing to use the filtering system at all.
    In the case of e-mail, for most users personally addressed messages sent to them are of potentially high importance, and most users do not want a filter to remove such messages automatically. E-mail also contains messages coming from mailing lists. Some of these mailing lists are small closed groups of people doing important work together, and most users do not want such messages to be automatically removed by a filter. Other mailing lists are large lists with many members, who exchange information within the topic of the list. In some such lists, the activity is high, and some users want the filter to select only the most important messages from such lists.
    In cases where the user is not willing to trust the filter to automatically remove messages of less interest, the users prefer that the filter orders the messages, so that the most interesting are shown first. The user can then manually scan the list produced by the filter and make a final decision of which messages to read and which to skip or defer reading of to a later time.
    Users want to be in control. They want to be able to specify that filters should only be applied to messages from some mailing lists, newsgroups and forums, not to messages belonging to other, more important groups. When searching using Internet search engines, many users want to control whether filtering is to be used and what kind of filtering to use.
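    The kind of control described above can be sketched as a per-group policy chosen by the user. The three policy names, the score threshold and the rule that unconfigured groups are never filtered are illustrative assumptions, not a specification of the SELECT tools.

```python
from enum import Enum

class Policy(Enum):
    NO_FILTER = "show everything, untouched"
    SORT_ONLY = "show everything, most interesting first"
    SELECT = "show only items above a threshold"

def policy_for(group, user_settings):
    """Groups the user has not configured are never filtered."""
    return user_settings.get(group, Policy.NO_FILTER)

def apply_policy(items, policy, threshold=0.5):
    """items is a list of (score, message) pairs; returns the messages to show."""
    if policy is Policy.NO_FILTER:
        return [message for _, message in items]
    ranked = sorted(items, key=lambda pair: pair[0], reverse=True)
    if policy is Policy.SORT_ONLY:
        return [message for _, message in ranked]
    return [message for score, message in ranked if score >= threshold]
```

    A user could, for example, leave a small working group unfiltered, have a newsgroup sorted, and let the filter select from a high-volume mailing list.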
    On the other hand, many users do not want to be troubled by the need to specify how filtering is to be done. They just want to perform their searches, and they will prefer to use the search engine which most often gives them the information they are searching for. Whether that search engine is using rating and filtering is something these users are not interested in.

Different requirements from different users

Every user is different. Every user has different needs. One user is interested in medieval religious beliefs, another is interested in particle physics. One user wants an overview of the knowledge on a certain topic, another wants to find the latest news. One user wants to get the maximum amount of information of value in a limited time, another user wants to browse and be entertained at leisure. One user is an expert, another user is a novice, in the subject area they are retrieving information on.
    How, then, can tools be designed to cater for all these differing users with differing needs? Because even though users are different, they are alike in that each user wants to find information of value to him or her. This is the basic user need which this project will address. The goal of the project is not to find "the" good information, according to some particular criterion of goodness.
    Our aim is to develop tools which will make it easier for each Internet user to find the important and interesting information for that particular user.
    Here is a list of potentially conflicting user requirements:

  • A user is getting too much information from the groups (mailing lists / conferences / newsgroups) subscribed to, and much of the information is not of interest. The user wants to see only a selection of the items of highest interest.
  • Another user may prefer to see all messages from a group, but sorted with the most interesting items first in the list.
  • Another user may want to see all messages in the ordinary chronological order of threads, but wants each message shown with the important terms for this user highlighted so that the user can rapidly manually decide what to read and what to skip.
  • One user is not willing to do anything to aid the software in filtering. The software should automatically, by looking at the user's behaviour, deduce what is of interest to this user.
  • Another user is willing to classify items as interesting or uninteresting, or in some other way, to aid the software in knowing what is interesting to this user.
  • Another user is willing to explain in simple terms to the software why a certain item was interesting or not.
  • Another user prefers to see and set filtering conditions in a special language for this.

Asynchronous group communication is an increasingly important tool for the exchange of information and ideas between people in different places and countries. However, there are also problems in this area. Many people feel that too much is written in the discussion areas of interest to them, so that they do not have time to read all the new items [Denning 1982, Palme 1984, Hiltz and Turoff 1985, Malone 1987]. They also feel that the quality of the information provided is sometimes not high enough. Too many uninteresting items are shown to them. These two problems are to some extent two sides of the same coin, and one solution is information filtering.

Selection criteria

Filters should be capable of selecting items based on all attributes of items, including attributes specially defined for special applications. Filters should also be capable of selecting on attributes which are derived from the text of the message, such as style and genre, degree of new information, etc. Examples of basic attributes for filtering are time flow (date and time in reference to other messages), author, recipient, group to which the item was sent, topic, conversation/thread, subject heading, keywords, relations to other items and text. A user may wish to give higher priority to items which are direct or indirect responses to what the user him/herself has written. Other filtering attributes can be special categories of items, such as notifications, questions, replies etc. The degree to which semantic analysis of the text is meaningful for selection will be investigated.
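    A minimal sketch of how some of these attributes could be represented, and of the check for direct or indirect responses to the user's own items. The `Item` record, its field names and the `in_reply_to` link are assumptions made for illustration; real messages carry many more attributes.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Item:
    item_id: str
    author: str
    group: str
    subject: str
    text: str
    in_reply_to: Optional[str] = None      # id of the item this one answers
    keywords: set = field(default_factory=set)

def is_response_to_user(item, items_by_id, user):
    """True if item is a direct or indirect reply to something user wrote."""
    seen = set()                            # guards against cyclic reply links
    parent_id = item.in_reply_to
    while parent_id is not None and parent_id not in seen:
        seen.add(parent_id)
        parent = items_by_id.get(parent_id)
        if parent is None:                  # the thread continues outside our store
            return False
        if parent.author == user:
            return True
        parent_id = parent.in_reply_to
    return False
```

    A filter could add a fixed bonus to the score of every item for which `is_response_to_user` is true.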

Filtering action

Based on selection criteria, the filter should be able to select items into categories such as:

  • Items to be read immediately
  • Items to be saved for later reading
  • Items belonging to different subject areas
  • Items to be forwarded to someone else
  • Items to be automatically processed by special software
  • Items to be listed to the user, but then discarded unless the user specifically overrides the recommendation of the filter
  • Items to be discarded without showing them to the user

The choice of categories should be adjustable to the wishes of each user.
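
The categories above, and the requirement that the choice of categories follows the wishes of each user, can be sketched as follows. The enumeration names, the rule format (a condition paired with an action) and the default action are assumptions for illustration only.

```python
from enum import Enum, auto

class Action(Enum):
    READ_NOW = auto()
    READ_LATER = auto()
    FILE_BY_SUBJECT = auto()
    FORWARD = auto()
    PROCESS_AUTOMATICALLY = auto()
    LIST_THEN_DISCARD = auto()
    DISCARD_SILENTLY = auto()

def choose_action(item, rules, enabled, default=Action.READ_LATER):
    """Return the action of the first matching rule whose action is enabled.

    rules is a list of (condition, action) pairs; enabled is the set of
    actions this particular user allows the filter to take.
    """
    for condition, action in rules:
        if action in enabled and condition(item):
            return action
    return default
```

A cautious user would leave `DISCARD_SILENTLY` out of the enabled set, so that a rule proposing it simply never fires.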

User interface and the creation of filtering conditions

Most existing filtering software, like Elm [Taylor 1987], Procmail [Berg 1993] and MailFilter [Wyle 1992], is not easy to set up and requires the user to prepare special control files in a language that is not easy to understand for non-computer specialists [McGough 1994]. It is easy to make a mistake in specifying the filtering conditions, and such a mistake can have disastrous effects, e.g. automatically throwing away important messages. The increasing usage of messaging by people whose speciality is not in the computer area will create a demand for filtering software that is easier to use. This can be achieved by better user interfaces.
    The risk of disastrous mistakes in specifying filtering conditions might become even more of a problem if the filtering software aids the user in producing filters, giving the user less direct control of the filtering conditions. It is important, if people are to use filters, that they are able to trust the filters. One way of achieving this would be that new filtering conditions do not, in the beginning, automatically trash messages, but rather put the deselected messages into a list which the user can approve manually. Only when the user, after some time of experience, is fully satisfied might the filter trash messages without final user approval.
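    A minimal sketch of such a trial period. The class name, the number of approvals needed before the filter is trusted, and the rule that a single disagreement restarts the trial are all assumptions chosen for illustration.

```python
class TrialFilter:
    """Wraps a reject condition; rejected messages wait for approval while on trial."""

    def __init__(self, rejects, approvals_needed=20):
        self.rejects = rejects                  # function: message -> bool
        self.approvals_needed = approvals_needed
        self.approved = 0                       # consecutive rejections the user agreed with
        self.pending = []                       # rejected messages awaiting review

    @property
    def trusted(self):
        return self.approved >= self.approvals_needed

    def process(self, message):
        """Return 'keep', 'pending' or 'trashed' for one incoming message."""
        if not self.rejects(message):
            return "keep"
        if self.trusted:
            return "trashed"
        self.pending.append(message)
        return "pending"

    def review(self, message, user_agrees):
        """Record the user's verdict on a pending message."""
        self.pending.remove(message)
        if user_agrees:
            self.approved += 1
            return "trashed"
        self.approved = 0                       # a mistake restarts the trial period
        return "keep"
```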
    Studies on user behaviour when using filters show that filter conditions set up by users are generally very simple [Mackay 1989]. This means that advanced ways of specifying filters using complex logical language may be more of a disadvantage than an advantage for most users, unless the complexity can be hidden from the user by the user interface [Karlgren 1994C]. Viewing filter rules by frames is usually easier for users than seeing them as logical rules in some programming-like language.
    Another way to make it easier for a user to specify filtering conditions, would be to ask the user to input an evaluation on those messages which the user believed were filtered wrong (e.g. a low evaluation on messages accepted by the filter but not wanted by the user). To make this easy, the user interface could be designed so that a single digit, input after reading a message, would be interpreted as an evaluation on a scale from 0 to 9. The filtering program could use this digit plus the message it applies to, to deduce a filtering rule, possibly in co-operation with the user. This could be seen as an expert system, where the filtering program in co-operation with the user builds an expert system data base with filtering conditions for that particular user.
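    The single-digit evaluation can be sketched as below. The text does not say how the filtering program should deduce a rule from the digit and the message; the per-word weight update used here is a stand-in chosen for brevity, and all names, the baseline of 5 and the step size are hypothetical.

```python
def parse_evaluation(keystroke):
    """Turn a single typed digit into a rating from 0 to 9; None for anything else."""
    if len(keystroke) == 1 and keystroke in "0123456789":
        return int(keystroke)
    return None

def predicted_rating(weights, message, baseline=5.0):
    """Baseline plus the weights of the words in the message, clamped to 0..9."""
    words = set(message.lower().split())
    score = baseline + sum(weights.get(word, 0.0) for word in words)
    return max(0.0, min(9.0, score))

def learn_from_disagreement(weights, message, rating, step=0.1):
    """Nudge the weight of each word in the message towards the user's rating.

    Meant to be called only when the user corrects the filter, so that
    agreeing with the filter costs the user no effort.
    """
    error = rating - predicted_rating(weights, message)
    for word in set(message.lower().split()):
        weights[word] = weights.get(word, 0.0) + step * error
    return weights
```

    A co-operative variant could show the user the words whose weights changed most and ask which of them should become an explicit filtering rule.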
    Users differ on whether the filter should act automatically, deriving filtering rules and improving its performance on its own, or whether it should regularly communicate with the user, asking for advice, showing prospective new filtering rules etc. Some users want the filter to act automatically, since the reason they want a filter is to make their life easier. Other users prefer regular contact between filter software and user to increase user control of and user confidence in the filter. Some users find it important to be able to inspect and understand the filtering conditions, while others do not bother with this.
    One user study (Fåhræus 1997) indicated that users might prefer a filter which did not sort or reject documents, but which marked up the documents or provided information to make it easy for the user to manually filter the documents. This design choice will also be studied in this project.

User modelling

Important in filtering is to model different categories of users. The system will be able to observe user behaviour and from this infer their needs, so that the system adjusts itself to different user categories. This is especially important for new and inexperienced users, since systems which do not explicitly model such users are often too difficult to use for them.

Rating user requirements

There are many different kinds of rating with different user requirements. Some examples:

  1. In small, closed groups, users want to rate options. This kind of rating could also be named voting or straw voting, and is handled by the TAP Web4Groups project. This project will not develop support specifically for this kind of rating.
  2. People are employed, often also paid, for making ratings. This is very common outside the electronic area: most newspapers and journals have some rating system to decide what to publish and what to omit, even if they do not use the word rating for what they are doing. A special case is the peer review system used for filtering contributions to scholarly scientific and technical journals and conferences. In the electronic publishing area, this kind of rating is applied by some search services, the most well-known of which is Yahoo (see page 25). In Usenet News and e-mail mailing lists, moderated groups publish only contributions which have been approved by one or more moderators. A big disadvantage with such human moderators is the delay they cause in publishing. In newsgroups and mailing lists, the time interval between one message and a reply to it is often only a few hours; in moderated lists, this time is usually lengthened to about a week. It is obvious that the rapid interaction in discussions is severely hindered by this. On the other hand, the aid of moderators is needed by many users who do not have time to read messages like "please remove me from this mailing list".
  3. A variant of this is in education, where the teacher rates the submissions from students.
  4. Instead of having special people who perform the rating, some systems allow anyone or almost anyone to rate any document. They sometimes just use an average of these ratings, but some systems (for example Firefly, see page 25) rate objects based on other people who have similar tastes (views, values, competence) as the person the rating is done for. A variant of this is to put people into different categories, so that a user might specify that he prefers documents rated highly by other people in his own category (political or religious group, scientist, etc.)
  5. The author of a document can provide his own rating. The advantage with this is that more documents get rated, and that the ratings are easily transmitted with the document. The disadvantage is that people may sometimes rate their own documents too high. This disadvantage can be reduced by choosing a rating scale which does not make such misratings easy to do, for example a scale with the values
        9 = Ph.D. thesis or equivalent
        8 = accepted for publication in peer-reviewed scientific or technical journal
        7 = accepted for presentation at peer-reviewed scientific or technical conference
        6 = scientific research report
        5 = other scholarly scientific and technical text
        4 = popular science written by a scientist
        3 = other newspaper or journal article
        2 = discussion item in newsgroup, mailing list or other on-line forum
        1 = document of interest only to very few people.
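
The difference between a plain average and a rating based on people with similar tastes (type 4 above) can be illustrated with a minimal sketch. It is not the method used by Firefly or by this project; the agreement measure, the 1 to 9 scale, and the user and document names are assumptions chosen only for the illustration.

```python
# Hypothetical rating scale; the project does not prescribe one here.
SCALE_MIN, SCALE_MAX = 1, 9

def agreement(a, b):
    """Return 1.0 for identical ratings on shared documents, 0.0 for opposite ones."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    mean_diff = sum(abs(a[d] - b[d]) for d in shared) / len(shared)
    return 1.0 - mean_diff / (SCALE_MAX - SCALE_MIN)

def predict(ratings, user, doc):
    """Average other users' ratings of doc, weighted by their agreement with user."""
    weighted = total = 0.0
    for other, theirs in ratings.items():
        if other == user or doc not in theirs:
            continue
        w = agreement(ratings[user], theirs)
        weighted += w * theirs[doc]
        total += w
    return weighted / total if total else None

def plain_average(ratings, doc):
    """Unweighted average of all ratings given to doc."""
    values = [r[doc] for r in ratings.values() if doc in r]
    return sum(values) / len(values) if values else None

# Invented example data: anna agrees with bo and disagrees with cy.
ratings = {
    "anna": {"d1": 9, "d2": 1},
    "bo": {"d1": 9, "d2": 1, "d3": 8},
    "cy": {"d1": 1, "d2": 9, "d3": 2},
}

print(plain_average(ratings, "d3"))    # 5.0
print(predict(ratings, "anna", "d3"))  # 8.0, because only bo's opinion counts for anna
```

In this invented data set the plain average hides the fact that the one rater who shares anna's views liked the document.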

We intend to develop support primarily for ratings of type 4 (and possibly also type 5) above. Ratings of types 1-3 above are provided by other EU projects or proposals.
    For rating of type 4, here are some user requirements:

  • If anyone is allowed to submit ratings, there is a risk of misuse by people putting high ratings on their own documents, or of collusion between two people putting high ratings on each other's documents. A check on the domain of the rater and the document can stop ratings by people in the same domain, but this is not full protection. People known to misuse the rating system in this way can be identified and put on a stop list. Social codes stating that such misuse is not permitted may also help.
  • A problem with such rating systems is how to get people to provide ratings. A good solution to this problem is the one used by, for example, Firefly, where you have to provide your own ratings to get aid from the ratings data base. A variant of this is that a filtering system may use the ratings by a user as a tool in developing filtering conditions. Either the filtering system can automatically deduce the filtering conditions by looking at the messages which a user rates high or low, or the filtering system can ask the user (when needed, because the user rating did not agree with already known filtering conditions) to specify some property, for example keywords, to identify future messages which the user wants to be filtered in similar ways.
  • It is important that users who so wish can control whether or not rating is used.
  • Some user studies have shown that users prefer not to rate documents on a single scale from interesting to uninteresting, but want to use several scales (for example a scale of agree or disagree) or want to sort messages into folders/mailboxes for different sets of messages. However, a social filtering system where anyone can provide ratings on any document should not use more than one scale, because a major problem is to get people to provide ratings, and this will be even more of an obstacle if they have to provide ratings on several scales.
  • Some users may want to read new documents as soon as they arrive, even if no one has yet rated them. Other users may want to delay reading new documents in less important categories for them, in order to wait for ratings by other people to arrive.
  • We will also investigate whether ratings can be deduced from people's behaviour. If this works, it can help in collecting ratings data.
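
The domain check and stop list mentioned in the first requirement above could look roughly like the following sketch. The function names, the addresses and the rule that a subdomain counts as the same domain are assumptions made for the illustration. As noted above, such a check is not full protection.

```python
from urllib.parse import urlsplit

def domain_of_address(address):
    """Domain part of an e-mail address, lower-cased."""
    return address.rsplit("@", 1)[-1].lower()

def domain_of_url(url):
    """Host name of a document URL, lower-cased (empty string if missing)."""
    return (urlsplit(url).hostname or "").lower()

def accept_rating(rater_address, document_url, stop_list):
    """Reject ratings from stop-listed raters or from the document's own domain."""
    if rater_address.lower() in stop_list:
        return False
    rater_domain = domain_of_address(rater_address)
    doc_domain = domain_of_url(document_url)
    # Assumption: a host such as www.example.org belongs to the domain example.org.
    if doc_domain == rater_domain or doc_domain.endswith("." + rater_domain):
        return False
    return True

# Invented addresses, for illustration only.
stop_list = {"mallory@example.com"}
print(accept_rating("alice@example.org", "http://www.example.org/paper.html", stop_list))  # False
print(accept_rating("alice@example.org", "http://other.example.net/paper.html", stop_list))  # True
print(accept_rating("mallory@example.com", "http://other.example.net/paper.html", stop_list))  # False
```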

Market situation and prospects

The number of documents on the Internet is growing exponentially. The quality and value of the documents vary. The market for tools to aid users in finding what is of highest value to them is very large. If message handling systems (e-mail systems, news readers, non-simultaneous conference systems) and information retrieval systems (like web search services) are equipped with features to help a user find information of high interest, and to skip information of less interest, users can be expected to choose to use such tools.
    Internet users are already spending more than a thousand million hours a year reading Internet documents. If this can be made only 10 % more efficient, the benefit will be thousands of millions of ECU/year.
    This benefit will also open a market for commercial companies that provide this service to users. This market consists of the following products:

  • Search services on the Internet: This market is today dominated by American companies with services such as Alta Vista, Infoseek, Yahoo, etc. The competition is fierce in this market. A few European companies have entered this market, and two of these companies, Euroseek (Euroseek search service) and OTM (Arianna search service), are partners in this proposal. SELECT will give these companies a larger market, because they will be able to give users who so prefer a better selection of documents in the search results.
  • Software tools for information filtering: Good tools of this kind are obviously useful, and there will thus be a market for them, probably as add-ons to existing messaging or web retrieval software.
  • Value-added information services for specific specialist groups: Two of the partners in the SELECT proposal are MediBRIDGE/MGC and ISCN. MediBRIDGE provides information to physicians and other people in the health care area. In Belgium today, there are 34000 general practitioners, of whom 8000 have a PC; many of these physicians are already customers of MediBRIDGE. The market for this kind of information service to physicians is presently increasing by 35 % per year. This is only for service to one particular group of specialists, physicians, in one EU country, Belgium. Similar services can be offered to other groups of specialists in other EU countries, which means that the total future market for these services will be large. A user of the medical discussion groups, which is one of the services of MediBRIDGE, receives on average 35 messages/day. Only 4-5 of these are of primary interest. Even if a filtering system is not able to automatically find exactly these 4-5 messages, i.e. to provide 100 % recall and precision, it will still increase the value to the users, and may also be a factor making the service worthwhile to subscribe to. MGC is a Belgian organisation of physicians which will represent the users. ISCN is an organisation of specialists on software quality in different European countries, and is an example of the many networks of specialists which are expected to benefit from this project.
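
The terms recall and precision used above can be made concrete with a small calculation. The figure of 35 messages/day with about 5 of primary interest comes from the text; the set of messages selected by the filter is invented for the example and does not describe any measured result.

```python
def precision_recall(selected, relevant):
    """Precision: share of selected messages that are relevant.
    Recall: share of relevant messages that were selected."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# One day of 35 messages, numbered 1..35; 5 are of primary interest.
relevant = {3, 11, 19, 27, 34}
# Hypothetical filter output: 10 messages, 4 of them relevant (34 is missed).
selected = {3, 8, 11, 14, 19, 22, 27, 30, 31, 33}

precision, recall = precision_recall(selected, relevant)
print(precision, recall)  # 0.4 0.8
```

In this hypothetical case the filter is far from perfect, yet the user reads 10 messages instead of 35 and still sees 4 of the 5 messages of primary interest.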

More information about the market situation can be found in section 10.1.4 Is there a need for a new project on page 25.
   

Work content

Phase of the project

Project methods

This project will pursue its work in the following ways:

  1. Collect a data base of documents and ratings on them made by a number of users. This data base can be used to develop and test different filtering and rating methods, without having to go back to the users every time a new method is to be tested.
  2. Define an architecture for filtering and rating with well-defined interfaces between modules. This is important, because a good architecture of this kind will allow partners in different countries to develop different modules that work well together. A first draft of such an architecture is provided in chapter 1.4.2 Architecture of the rating and filtering system on page 25.
  3. Develop a basic system, within the defined architecture or a subset of it, which will not provide all advanced functionalities, but which can be used in real usage to gain user experience. The basic system should be ready within 10 months of the project start, and user tests with it should start not later than 12 months after the project start.
  4. Based on the defined architecture, different partners can find existing software, or develop new software, or modify existing software for the different modules. In some cases, different partners in the project can find or develop alternate versions of the same architectural module. For example, one partner might develop a module for filtering based on semantic analysis, another might develop a module for filtering based on similarity to other documents using the so-called cosine method. A competition will thus exist within this project between different filtering methods, and the users' experience will decide which method is best.
  5. Experts on Human-Computer Interaction can make studies, prototypes and user tests to find good user interfaces.
  6. Combined systems of modules developed as described above can then be put up and tested on users, and the user evaluations of different methods can be collected to improve the methods.
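
The so-called cosine method mentioned in step 4 above compares documents as vectors of term counts. The following is a minimal sketch of that idea only; the tokenisation, the absence of term weighting and the example texts are simplifying assumptions, not a description of any partner's module.

```python
from collections import Counter
from math import sqrt
import re

def term_vector(text):
    """Count the lower-cased alphabetic words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented example: a document the user liked and two new candidates.
liked = term_vector("collaborative filtering of netnews")
similar = term_vector("filtering of netnews articles")
unrelated = term_vector("cheap holiday offers")

print(cosine(liked, similar))    # 0.75
print(cosine(liked, unrelated))  # 0.0
```

A filtering module built on this measure would rank new documents by their similarity to documents the user has rated highly.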

Many of the partners in this project are organisations that have already done research in the area of information filtering. An important part of this project is to combine and test different filtering methods and evaluate how well they satisfy users. We have chosen a two-year project period because the market needs these services now, and if we are too late, American services may have achieved dominance on the market. Because of this short project period, the project is not split into phases which succeed each other in time. Instead, the phases overlap, so that some development can start in parallel with collection of user requirements and collection of the test data base. The important phase of filter experiments will also go on in parallel throughout the whole project.

Filtering methods

Here are some filtering methods which will be developed, evaluated and tested on users within the unified architectural framework we develop:

  1. Filters on keywords supplied by the user. New user interfaces will be tested, where users can easily supply keywords in a dialogue with the software. Using this method, the software will, when needed (that is, when the user evaluation did not agree with what the filter expected), ask the user questions like "Why did you like/dislike this document?".
  2. So-called intelligent filters, by which is meant that the user supplies a quality evaluation of documents, and the filter system deduces how to recognise high-quality documents for this user in the future.
  3. Filtering on information in collaborative filtering data bases.
  4. Filtering on computer evaluations of the style and genre of documents.
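
The dialogue described for method 1 above, where the software asks the user only when the user's evaluation disagrees with the filter's expectation, might be sketched as follows. The class name, the simple keyword scoring and the ask callback are hypothetical; in an interactive program the callback could be the built-in input function.

```python
class KeywordFilter:
    """Keyword filter that asks for a new keyword when it guesses wrongly."""

    def __init__(self, liked=(), disliked=()):
        self.liked = set(liked)
        self.disliked = set(disliked)

    def expects_interest(self, text):
        """Expect interest when liked keywords outnumber disliked ones."""
        words = set(text.lower().split())
        return len(words & self.liked) - len(words & self.disliked) > 0

    def feedback(self, text, user_liked, ask):
        """Record the user's evaluation; return True if a question was asked."""
        if self.expects_interest(text) == user_liked:
            return False  # filter and user agree, so no question is needed
        keyword = ask("Why did you like/dislike this document?").strip().lower()
        if keyword:
            (self.liked if user_liked else self.disliked).add(keyword)
        return True

# Invented session: the canned answer "pics" stands in for the user's reply.
f = KeywordFilter(liked={"filtering"})
asked = f.feedback("new pics labels", user_liked=True, ask=lambda question: "pics")
print(asked)                                 # True, the filter had expected no interest
print(f.expects_interest("pics protocol"))   # True, the new keyword is now used
```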

Phases of the project

The major phases of the project can however be said to be:

  1. Collection of user requirements and specification of a base system based on existing user requirements (work packages U1, S1, S2 and S3).
  2. Development of a simpler base system, which can be used by users but also as a tool in testing rating and filtering methods (work packages D1, D2 and D4).
  3. Development of an advanced system, where the major difference from the base system is that filtering is based on ratings submitted by users with similar tastes, interests and competence to the users for whom information is filtered (work packages S4, D5, D6, U4, D7).
  4. Parallel and continuous development and user testing of new filtering methods throughout the project (tasks D2, U3, U4).
  5. External coordination and development of an exploitation plan (work packages X, M and C).
