The Private Filtering News Agent

The Private Filtering News Agent (PEFNA) is a prototype system for filtering Usenet News. Pefna is developed by the Intelligent Filtering project (IntFilter). This document describes the Pefna system.

Overview

The PEFNA system consists of two pieces of software, a Usenet News browser (the client) and a classifier (the agent). The user runs the client to read news from a standard news server. The client allows the user to create and maintain an ordered list of freely named categories. An article which is judged by the user to be a good example of a particular category is saved under the name of the category. Over time, the user accumulates a number of categories and one or more example articles in each.

The agent runs as a batch program, preferrably in the morning hours. It compares the user's .newsrc file with the contents of the news server and retrieves any new articles. Each article is then compared against the examples in each category and a list of distances is built. The list is stored in a database in the user's home directory.

When the user launches the client the next time and visits a newsgroup, the client builds a list of threads for unread articles in the group. It also examines the user's database for the ratings provided by the agent. In each thread, a single category is selected and used to label the thread. The list of threads presented to the user is then sorted to correspond to the sequence in which the user has arranged the list of categories.

For example, if the user has placed the category 'Call for papers' at the top, then all threads closest to that category are sorted to the top of the list. See here for a more detailed example.

The Agent

The Pefna agent is the program which performs the analysis and classification of unread articles. The agent does not take any destructive action, it only rates each unread article and stores the information in the user's personal database. The agent runs independently and the interaction with the user is indirect, via the categories and examples supplied by the user and the distance estimates stored in the database.

The agent is modularized and can be seen as an assembly of separate parts where each part can be added or disabled without disturbing the operation of other components. Currently only two such modules are implemented; a heuristic for measuring the ratio between cited and original text in the article and the category distance metric.

The major work in the agent consists of establishing the id of each unread article and calling each module. The module then decides if the article is to be fetched and examined. Some modules may choose to ignore articles already analyzed, while others may have to update the article's database entry in response to a changed collection of examples and categories.

Category distance

When the user instructs the Pefna client (the newsreader) to label an article as an example of a category, the article is saved to a file in the user's home directory. The label is used as the name of the directory and while this invites the use of a concept hierarchy the list of categories are only used as a flat, disjoint set. Technically, the user can arrange for the text to be a member of several categories but this works against the idea of having each unread article classified into a single category.

When the agent is run it examines the files in each category (directory) and creates a word frequency list for each file. This file is cached along with the example to avoid redundant reconstruction on each run. For each category, the agent then collapses all the word lists in the category into a concept master. This word list contains all word frquencies in the category and is counted as single document.

When an article is to be examined a temporary word list is created for it. Salton's cosine measure [Salton83] is then used to calculate a distance measure to each of the concept masters. The list of distances is stored in the database under the article's message id.

Development

The Pefna system is still under development and some while not all fascinating possibilities may be adressed here are some of the more interesting.

Wordmatching

Words are compared using the library function strcmp() which require both identical spelling and case. This is fast and efficient and the lack of case insensitivity is not a very big problem since the number of all-capital articles is generally very rare. Capital headings are somewhat more common and generally contain a greater proportion of topical words, but most matches occur in the running text of the article.

The ability to do word stemming or allow for some variation to compensate for misspellings is clearly worth investigating.

The concept masters could be postprocessed to include synonyms and homonyms. While requiring a dictionary, the latter case opens the possibility of extracting sufficient context from other parts of the article to identify the intended usage in words with multiple meanings. For instance, the word sample has a definition which allows it to be used quite differently by statisticians, chemists and digital audio engineers.

Word frequency normalization

Many of the most common words appear in each concept master and are therefore less suitable to discriminate between categories. Using the standard information criterion

   ___
   \
 -  >    P(w|C) log2( P(w|C) )
   /__

where P(w|C) is the probability of the word given concept master C, it is possible to decrease the weight of the less informative words.

The Client

The Pefna client is written in Tcl/Tk. Communication with the NNTP server is provided by code from arTCLs, another newsreader written in Tcl/Tk by Mike Hoegeman (mh@awds.imsd.contel.com). The client provides a modern graphical user interface and all the functionality needed to read Usenet News.

The client has three displays: the list of subscribed groups, the list of threads in the current group and the text of the currently displayed article. Any of these can be placed into view at any time. However, making a selection in the group list will reload the contents of the thread list; making a selection in the thread list reloads the current article.

The filtering action of the Pefna system becomes visible in the thread list, ie the list of threads in the currently selected group. The threads are constructed twice, first by traversing the pointers in the References: field in the article header, and secondly by collapsing articles with identical subject lines.

The client examines each thread for the article with the highest similarity to any category. The name of the winning category is then used to label the thread.

After labelling, the threads are sorted in the order given by the user's list of categories. The user can rearrange the order of the labels at any time. Within each thread, articles are kept in (roughly) chronological order.

The Thread Distance Plot

From the thread selection list the user can command a separate window for any thread. This window contains a plot of the articles in the thread and their relation to their categories. The plot is in form of a 'radar' where the names of the categories are placed around the circumference of a circle and the articles are shown as dots within the circle. The position of an article is determined by the relative distances to the surrounding categories. This is not terribly useful as it does not immediately reveal if an specific article is weakly or strongly associated with any category. It does tell, however, the relationship between the article that provided the label for the thread and the other articles.

Limitations

The following limitations are currently placed on the Pefna client:

Posting news is not implemented.
Subscription editing is not implemented.
Its not very fast.

References

[Salton83]: Salton, G. and McGill, M. J., Introduction to Modern Information Retrieval, McGraw-Hill, 1983.