The Private Filtering News Agent (PEFNA) is a prototype system for filtering Usenet News. Pefna is developed by the Intelligent Filtering project (IntFilter). This document describes the Pefna system.
The agent runs as a batch program, preferrably in the morning hours. It compares the user's .newsrc file with the contents of the news server and retrieves any new articles. Each article is then compared against the examples in each category and a list of distances is built. The list is stored in a database in the user's home directory.
When the user launches the client the next time and visits a newsgroup, the client builds a list of threads for unread articles in the group. It also examines the user's database for the ratings provided by the agent. In each thread, a single category is selected and used to label the thread. The list of threads presented to the user is then sorted to correspond to the sequence in which the user has arranged the list of categories.
For example, if the user has placed the category 'Call for papers' at the top, then all threads closest to that category are sorted to the top of the list. See here for a more detailed example.
The Pefna agent is the program which performs the analysis and classification of unread articles. The agent does not take any destructive action, it only rates each unread article and stores the information in the user's personal database. The agent runs independently and the interaction with the user is indirect, via the categories and examples supplied by the user and the distance estimates stored in the database.
The agent is modularized and can be seen as an assembly of separate parts where each part can be added or disabled without disturbing the operation of other components. Currently only two such modules are implemented; a heuristic for measuring the ratio between cited and original text in the article and the category distance metric.
The major work in the agent consists of establishing the id of each unread article and calling each module. The module then decides if the article is to be fetched and examined. Some modules may choose to ignore articles already analyzed, while others may have to update the article's database entry in response to a changed collection of examples and categories.
When the user instructs the Pefna client (the newsreader) to label an article as an example of a category, the article is saved to a file in the user's home directory. The label is used as the name of the directory and while this invites the use of a concept hierarchy the list of categories are only used as a flat, disjoint set. Technically, the user can arrange for the text to be a member of several categories but this works against the idea of having each unread article classified into a single category.
When the agent is run it examines the files in each category (directory) and creates a word frequency list for each file. This file is cached along with the example to avoid redundant reconstruction on each run. For each category, the agent then collapses all the word lists in the category into a concept master. This word list contains all word frquencies in the category and is counted as single document.
When an article is to be examined a temporary word list is created for it. Salton's cosine measure [Salton83] is then used to calculate a distance measure to each of the concept masters. The list of distances is stored in the database under the article's message id.
Words are compared using the library function strcmp() which require both identical spelling and case. This is fast and efficient and the lack of case insensitivity is not a very big problem since the number of all-capital articles is generally very rare. Capital headings are somewhat more common and generally contain a greater proportion of topical words, but most matches occur in the running text of the article.
The ability to do word stemming or allow for some variation to compensate for misspellings is clearly worth investigating.
The concept masters could be postprocessed to include synonyms and homonyms. While requiring a dictionary, the latter case opens the possibility of extracting sufficient context from other parts of the article to identify the intended usage in words with multiple meanings. For instance, the word sample has a definition which allows it to be used quite differently by statisticians, chemists and digital audio engineers.
___ \ - > P(w|C) log2( P(w|C) ) /__where P(w|C) is the probability of the word given concept master C, it is possible to decrease the weight of the less informative words.
The Pefna client is written in Tcl/Tk. Communication with the NNTP server is provided by code from arTCLs, another newsreader written in Tcl/Tk by Mike Hoegeman (firstname.lastname@example.org). The client provides a modern graphical user interface and all the functionality needed to read Usenet News.
The client has three displays: the list of subscribed groups, the list of threads in the current group and the text of the currently displayed article. Any of these can be placed into view at any time. However, making a selection in the group list will reload the contents of the thread list; making a selection in the thread list reloads the current article.
The filtering action of the Pefna system becomes visible in the thread list, ie the list of threads in the currently selected group. The threads are constructed twice, first by traversing the pointers in the References: field in the article header, and secondly by collapsing articles with identical subject lines.
The client examines each thread for the article with the highest similarity to any category. The name of the winning category is then used to label the thread.
After labelling, the threads are sorted in the order given by the user's list of categories. The user can rearrange the order of the labels at any time. Within each thread, articles are kept in (roughly) chronological order.
The following limitations are currently placed on the Pefna client: