Abstract

Better ways of finding the most valuable information on the Internet, and of avoiding trash, would greatly enhance the value of the network. This paper gives an overview of methods and problems in this area, including social filtering, where people help each other filter objects on the net.
First published 1 June 1998. Last update: 3 July 1998 by Jacob Palme,
e-mail: jpalme@dsv.su.se, Department of Computer and Systems Sciences,
Stockholm University/KTH. Published in the proceedings of the ITS'98 conference.
URL of this page: http://www.dsv.su.se/select/information-filtering.html
This document is also available in Adobe Acrobat (PDF) format at URL:
http://dsv.su.se/jpalme/select/information-filtering.pdf

Delivery of filtering results

The most common way of delivering filtering results is to file documents into different folders. Users choose to read new items one folder at a time; the filter thus helps users read messages on the same topic at the same time. Users can also set a personal priority on the order in which folders are read. Unwanted messages can be filtered into special "trashcan" folders, which users may choose not to read at all, or to read only very cursorily.

Filtering can also be used to mark messages within a folder: different colors or priority indications can be attached to the messages, or the messages may be sorted with the most interesting first in the list. Most services deliver new documents with a list from which the user can select which items to read or not to read. The act of selecting what to read from such a list can itself be seen as a kind of filtering. The figure below shows an example of such a list, taken from the Web4Groups system [Palme 1997B]:

time, time, time, by Andras Micsik <micsik@sztaki.hu> 16/09/97 09:36 (2)
Re: Advanced Functionalities, by Alain Karsenty <karsenty@eurecom.fr> 16/09/97 10:28 (1)
Re: Web4Groups Technical Forum, by Torgny Tholerus 16/09/97 10:41 (1)
Re: Web4Groups Technical Forum, by Torgny Tholerus 16/09/97 10:42
Re: Web4Groups Technical Forum, by Torgny Tholerus 16/09/97 10:42 (1)
Re: Re: Web4Groups Technical Forum, by <MAILER-DAEMON@dsv.su.se> 16/09/97 11:38
Re: Draft agenda for Sophia-Antipolis, by Jacob Palme <jpalme@dsv.su.se> 20/09/97 04:35
Re: Web4Groups test report, by Jacob Palme <jpalme@dsv.su.se> 24/09/97 13:36
Re: Web4Groups test report, by Jacob Palme <jpalme@dsv.su.se> 24/09/97 13:36
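
Filtering into folders, as described above, can be sketched in a few lines. The message representation, folder names and rules below are hypothetical examples chosen for illustration, not part of any real system:

```python
# A minimal sketch of filtering messages into folders. Messages are plain
# dictionaries; rules pair a folder name with a predicate on the message.

def filter_into_folders(messages, rules, default="inbox"):
    """File each message in the first folder whose rule matches it."""
    folders = {}
    for msg in messages:
        target = default
        for folder, predicate in rules:
            if predicate(msg):
                target = folder
                break
        folders.setdefault(target, []).append(msg)
    return folders

# Rules are tried in order; a "trashcan" folder catches unwanted mail.
rules = [
    ("trashcan", lambda m: "make money fast" in m["subject"].lower()),
    ("web4groups", lambda m: "web4groups" in m["subject"].lower()),
]

messages = [
    {"subject": "Re: Web4Groups Technical Forum"},
    {"subject": "MAKE MONEY FAST!!!"},
    {"subject": "Re: Draft agenda for Sophia-Antipolis"},
]

folders = filter_into_folders(messages, rules)
```

The user would then read the "web4groups" folder in one sitting and skip or skim the "trashcan" folder.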

Intelligent filtering

By intelligent filtering is meant the use of artificial intelligence (AI) methods to enhance filtering. This can be done in different ways: AI software can be used to derive attributes for documents, which are then used for filtering; it can be used to derive filtering rules; or it can be used for the filtering process itself. With the machine learning approach, the filter takes as input information from the user about which documents the user likes, looks at these messages, and tries to derive common characteristics of them to be used in future filtering.

Such filtering can be done in the background, behind the scenes, with little or no interaction with the user, or in a way where the user can interact with the filter and help it understand why certain messages are liked. A disadvantage of much user interaction is that it takes user time, and the whole idea of filtering is to save user time. A disadvantage of very automatic filtering is that users may not trust a filter whose workings they do not understand. If an AI method is used to derive filtering rules, it is valuable if these rules are specified in a way which a human can understand and trust. Certain AI methods, the so-called genetic algorithms, are known to produce very unintelligible rules, and this may be a reason against using them for information filtering.
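
A deliberately simplified sketch of the machine-learning approach described above: from messages the user liked, derive a small set of characteristic words, and use them as a filtering rule the user can inspect and trust. The word-scoring scheme is an assumption made for illustration; real systems use more sophisticated text classification.

```python
from collections import Counter

def derive_keywords(liked_docs, other_docs, top_n=3):
    """Find words frequent in liked documents but rare elsewhere.
    The result is a human-readable rule, unlike a genetic algorithm's."""
    liked = Counter(w for d in liked_docs for w in d.lower().split())
    other = Counter(w for d in other_docs for w in d.lower().split())
    score = {w: c / (1 + other[w]) for w, c in liked.items()}
    return sorted(score, key=score.get, reverse=True)[:top_n]

def keep(doc, keywords):
    """The derived filtering rule: keep documents containing any keyword."""
    return any(k in doc.lower().split() for k in keywords)

liked = ["social filtering of netnews", "collaborative filtering research"]
disliked = ["cheap holiday offers", "buy stocks now"]
keywords = derive_keywords(liked, disliked)
```

Because the derived rule is just a list of words, the user can see why a message was kept or discarded.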

Filtering against spamming

Many people want filters which will remove unsolicited direct marketing e-mail messages, so-called spamming. To do this, the filter has to recognize special properties of spam messages which distinguish them from other messages. Examples of such properties are:

None of these methods is very efficient. A social filtering system might be more efficient; see the next chapter.
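
As an illustration, such a filter might score each message against a few distinguishing properties. The specific heuristics below (unknown sender, all-uppercase subject, very large recipient count) and their weights are common examples chosen for this sketch, not necessarily the properties listed in the original:

```python
def spam_score(msg, known_senders):
    """Score a message against a few assumed spam heuristics.
    Weights and thresholds are illustrative, not tuned values."""
    score = 0
    if msg["from"] not in known_senders:   # sender is a stranger
        score += 1
    if msg["subject"].isupper():           # shouting subject line
        score += 1
    if msg.get("recipients", 1) > 50:      # sent in bulk
        score += 1
    return score

msg = {"from": "bulk@example.com", "subject": "EARN MONEY NOW", "recipients": 5000}
score = spam_score(msg, known_senders={"jpalme@dsv.su.se"})
```

A message scoring above some threshold would be filed in the trashcan folder; as the text notes, spammers can evade any fixed set of such heuristics.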

SOCIAL FILTERING

What is social filtering

By social filtering is meant that ratings of some kind are assigned to documents. The ratings can be compared to the stars which newspapers often assign to films, books and other consumer products, but they can also include categorization into subject areas or according to particular scales. Social filtering has some similarities to the filtering done by editors, journalists and publishers, since in both cases humans select the filtering attributes.

Why use social filtering

It is difficult to design automatic or intelligent filtering algorithms which can really evaluate the content of a document and judge its value. Humans are much more capable of deciding what a document is worth.

Who makes the ratings?

Ratings for use in social filtering can be provided by:

A filter may use an average or median of the ratings given by all who have rated a document. It might be better to use something like the upper quartile, since documents liked very much by a few people may be of particular interest, because they provide new thoughts and ideas. A filter might also base its filtering on the ratings of other people with values, views and knowledge similar to those of the filter user. The filtering system might find such like-minded people automatically.
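
The upper-quartile idea can be made concrete. In the sketch below (nearest-rank percentile, illustrative ratings), a document many find mediocre but a few like very much surfaces with a high aggregate, whereas the mean would hide it:

```python
import math

def upper_quartile(ratings):
    """75th-percentile rating, using the nearest-rank method."""
    s = sorted(ratings)
    return s[math.ceil(0.75 * len(s)) - 1]

# Many lukewarm readers, a few enthusiasts:
ratings = [2, 2, 2, 2, 2, 5, 5, 5]
mean = sum(ratings) / len(ratings)   # 3.125
uq = upper_quartile(ratings)         # 5: the enthusiasts' view surfaces
```

Ranking by `uq` rather than `mean` would promote exactly the kind of document the text argues may carry new thoughts and ideas.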

Rating collection

A rating system must collect ratings from the people who do the rating. This can be done explicitly, where the user gives a rating command after reading a message. It can also be done implicitly, by studying variables like the time a user has spent on a message, whether the user has written a reply to it, printed it on paper, etc. Some studies indicate that such implicit ratings can be as good as explicit ratings. Their advantage, of course, is that they do not depend on people remembering to rate, since people often forget to provide explicit ratings.

Ratings collected in this way can be used for social filtering, but they can also be used as input to intelligent filtering algorithms (see above). This might be a way of getting people to provide ratings, since people then gain personally from providing them: it makes their own intelligent filtering work better.
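
Implicit rating from the behavioral variables mentioned above can be sketched as follows. The particular weights (reading time capped at three points, one point each for replying and printing) are assumptions for illustration, not values from any study:

```python
def implicit_rating(behavior):
    """Infer a rating on a 0-5 scale from observed reading behavior.
    The weighting scheme is an illustrative assumption."""
    rating = min(3.0, behavior["seconds_read"] / 30)  # up to 3 points for time spent
    if behavior["replied"]:
        rating += 1
    if behavior["printed"]:
        rating += 1
    return rating

behavior = {"seconds_read": 90, "replied": True, "printed": False}
rating = implicit_rating(behavior)
```

Such inferred ratings can then feed either the social filtering database or the user's own intelligent filter.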

Spamming of social filtering systems

By spamming is meant ways in which people can cheat a system to force messages on you which you do not want. Most people think of spamming as it is done in e-mail or in Usenet News, but another variant of spamming is performed against Internet search engines. Authors of web documents attach faulty keywords to their documents to cheat the search engine into selecting them, inserting the most popular search terms, which are known to be words like "sex", "naked", "girl", etc., even when these words are unrelated to the actual content of the document. Some search engines will first show you documents which contain the search word many times, so spammers may repeat the same word many times in the keyword set. (Keywords are placed in the meta fields of an HTML document, which are not shown when you read the document with a web browser.) Search engine providers have developed methods to recognize and dismiss documents with such false keywords.

If social filtering systems come into use in the future, there is an obvious risk that spammers will try to cheat them by entering lots of false positive ratings of their own web pages. To stop this, some kind of authentication of raters may be needed.

Privacy issues

If a social filtering database stores information, for individual raters, about which documents they like and dislike, such storage may be used for infringement of privacy. Possibly, some encryption method might be used to make such invasion impossible or difficult. This will of course depend on trust between user and filtering service. Web search engines today have similar privacy issues: they can store information about what you search for on the web. They already use this information to target the selection of banner advertisements; other uses, which you might not like, may also occur.

RESEARCH ON FILTERING

How research on filtering is usually done

There are many research projects on information filtering. Such a project is usually started by some clever computer scientist who has a novel idea of how to do filtering. He or she often finds that the task of developing a complete filtering system is larger than expected. If there were a standardized architecture for filtering systems, with standardized interfaces between modules, a researcher could more easily reuse existing modules, so that a whole new filtering system would not have to be developed when the researcher only wants to try out a new idea for one particular module.

Evaluation of filtering results

To evaluate a new filtering method, or to compare different filtering methods, one might compare the filtering with manual ratings of documents done by users. A filter which is good at predicting the ratings given by a user would then be regarded as a good filter. Of course, an intelligent filter should not derive its filtering rules from one set of messages and then be tested on the same set. In the most extreme case, if a user found messages 1, 3, 17, 32, 36, 53, 55, 58, 72, 76 and 84 best, a genetic algorithm might derive the rule: select all messages with number 1, 3, 17, 32, 36, 53, 55, 58, 72, 76 or 84. Such a filtering rule would of course be totally worthless. Even if filtering is developed and tested on different sets of messages, there is still a risk of developing a filtering method which only suits the test subjects. To avoid this, a large and varied set of test subjects should be used.
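
The overfitting pitfall just described can be made concrete: a rule that simply memorizes the liked message numbers is perfect on the messages it was derived from and useless on a fresh batch. The message numbers and accuracy measure below are illustrative:

```python
# A naive learner "derives" a rule by memorizing the liked message numbers.
train_liked = {1, 3, 17, 32, 36, 53, 55, 58, 72, 76, 84}

def memorized_rule(number):
    """The derived 'rule': just the list of liked numbers."""
    return number in train_liked

def accuracy(rule, numbers, liked):
    """Fraction of messages whose liked/not-liked status the rule predicts."""
    hits = sum(rule(n) == (n in liked) for n in numbers)
    return hits / len(numbers)

# Tested on the very messages it was derived from, the rule looks flawless:
train_msgs = list(range(1, 101))
train_acc = accuracy(memorized_rule, train_msgs, train_liked)

# On a new batch, where different messages happen to be liked, it misses
# every message the user actually wants:
test_liked = {2, 5, 9, 40}
test_acc_on_liked = accuracy(memorized_rule, sorted(test_liked), test_liked)
```

This is why the evaluation described above must use held-out messages, and a varied set of test subjects.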

ARCHITECTURE AND STANDARDS

Architectural issues

To reduce the burden of developing and testing different filtering rules, it would be very valuable to develop a standardized architecture with standardized interfaces between the modules. The SELECT EU project [Palme 1998], which starts in the autumn of 1998, will work on this. Some of the modules which this project will specify are:

The PICS standard

(Picture from Resnick 1996A)

The PICS standard was mainly developed as a tool for teachers and parents to censor the information which children can download from the Internet. But PICS can be useful in other ways: it provides a general-purpose, standardized way of storing and distributing ratings. Users or groups of users of PICS can, within the PICS standard, specify their own rating scales. PICS might thus be useful as a basis for some of the interfaces between the different modules of a filtering infrastructure.
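
For illustration, a PICS label has roughly the following shape. The rating service URL, document URL and the category names "quality" and "relevance" are invented for this example; only the overall syntax follows the pattern defined in [Resnick 1996B]:

```
(PICS-1.1 "http://ratings.example.org/v1.0"
  labels on "1998.06.01T12:00-0000"
  for "http://www.example.org/some-document.html"
  ratings (quality 4 relevance 3))
```

Because the rating categories and scales are declared by the rating service itself, a social filtering system could distribute its ratings in this format without changes to the standard.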

The MTA filtering proposals

Another ongoing standards effort in the filtering area is the IETF work on MTA filtering [IMC 1997]. The IETF is developing a basic standard for controlling server-based filters.

MORE INFORMATION

Overview of research and services

Different approaches

The issue of finding better-quality information on the Internet (in web documents, newsgroup postings, mailing list contributions, etc.; below, the word "document" is used for all of these) has been discussed and tackled in many different ways. A good collection of links on these issues can be found in [Ciolek 1994-1997]. Approaches taken have been:

Existing rating and filtering services and research projects

Many research projects in the area of information filtering are ongoing or have been completed.

References

Bommel 1997: Internet filtering references. http://www.cs.kun.nl/is/research/filter
Denning 1982: Electronic Junk. Communications of the ACM, vol. 25, no. 3, March 1982, pp 163-165.
Firefly 1996: Personalize your Network. http://www.firefly.com/
Hiltz and Turoff 1985: Structuring Computer-mediated Communication Systems to avoid Information Overload, by S.R. Hiltz and M. Turoff. Communications of the ACM, vol. 28, no. 7, July 1985, pp 680-689.
IMC 1997: IETF MTA-filters Mailing List. http://www.imc.org/ietf-mta-filters/
Karlgren 1994: Text genre recognition using discriminant analysis, by Jussi Karlgren. International Conference on Computational Linguistics, 1994. http://www.sics.se/~jussi/cmplglixcol.ps
Kilander 1995: A Brief Comparison of News Filtering Software, by Fredrik Kilander. http://dsv.su.se/jpalme/fk/if_Doc/Comparison.ps.Z
Kilander 1997: Intelligent Information Filtering, by Fredrik Kilander, Eva Fåhræus and Jacob Palme. http://dsv.su.se/jpalme/fk/if_Doc/juni96/ifrpt.ps.Z
Koch 1996A: Internet Search Services, by Traugott Koch, Lund University Library. In German at http://www.ub2.lu.se/tk/demos/DO9603-manus.html, in English at http://www.ub2.lu.se/tk/demos/DO9603-meng.html
Koch 1996B: DESIRE: Development of a European Service for Information on Research and Education. http://www.ub2.lu.se/desire/
Krauskopf 1996: Platform for Internet Content Selection Version 1.1: PICS Label Distribution - Label Syntax and Communication Protocols, by T. Krauskopf, J. Miller, P. Resnick and W. Treese. http://www.w3.org/pub/WWW/PICS/labels.html
Magellan 1997: Magellan Internet Guide. http://www.mckinley.com/
Malone et al 1987: Intelligent Information-sharing Systems, by Malone, Grant, Turbak, Brobst and Cohen. Communications of the ACM, vol. 30, no. 5, May 1987, pp 390-402.
Palme 1981: Experience with the use of the COM computerized conferencing system, by Jacob Palme. DSV, Stockholm University, 1981, re-published 1993.
Palme 1984: You have 134 Unread Mail! Do You Want To Read Them Now? by Jacob Palme. In Proceedings of the IFIP WG 6.5 Working Conference on Computer-Based Message Services, 1984.
Palme 1994: Issues When Designing Filters in Messaging Systems, by Jacob Palme, Jussi Karlgren and Daniel Pargman. Computer Communications 19 (1996) 95-101. http://dsv.su.se/jpalme/fk/if_Doc/JPfilter-filer/IssuesDesFilter.ps.Z
Palme 1997A: Filtering and Collaborative Filtering. Notes from the DELOS Workshop, Budapest, November 1997. http://dsv.su.se/jpalme/select/delos-filtering-notes-nov97.htm
Palme 1997B: Non-Simultaneous Web-based Groupware. http://dsv.su.se/jpalme/w4g/web4groups-summary.html
Palme 1998: Choices in the implementation of rating. http://dsv.su.se/jpalme/select/rating-choices.html
Pargman 1994: How to Create a Humane Information Flow, by Daniel Pargman et al. http://dsv.su.se/jpalme/fk/if_Doc/JPfilter-filer/HumaneInfoFlow.ps.Z
Resnick 1996A: PICS: Internet Access Controls without Censorship, by P. Resnick and J. Miller. Communications of the ACM, and http://www.w3.org/pub/WWW/PICS/iacwcv2.htm
Resnick 1996B: Platform for Internet Content Selection Version 1.1: Rating Services and Rating Systems (and their Machine Readable Descriptions), by J. Miller, P. Resnick and D. Singer. May 1996. http://www.w3.org/pub/WWW/PICS/services.html
Resnick et al 1994A: GroupLens: An Open Architecture for Collaborative Filtering of Netnews. Proceedings of the ACM 1994 Conference on Computer Supported Cooperative Work, Chapel Hill, pp 175-186, and http://ccs.mit.edu/CCSWP165.html
Resnick et al 1994B: Roles for Electronic Brokers, by Paul Resnick, Richard Zeckhauser and Chris Avery. Twenty-Second Annual Telecommunications Policy Research Conference, October 1994. http://ccs.mit.edu/CCSWP179.HTML
Sepia 1995: Collaborative Filtering - The SEPIA Suggestion Box. http://www.sepia.com/suggestion_e.html
Tzolas 1994: Word-statistical categorization of texts for filtering of electronic messages, by I. Tzolas and F.P. Hussain. http://dsv.su.se/jpalme/fk/if_Doc/JPfilter-filer/OrdStatUppsats.ps.Z
Yahoo 1998: Yahoo. http://www.yahoo.com/