Fifth ScandSum Minutes Åre 3-6 April 2003

Program and Participants of Fifth ScandSum meeting 4-6 April 2003,
Hotell Fjällgården, Åre, Sweden

Participants
Hercules Dalianis KTH
Martin Hassel KTH
Magnus Rosell KTH
Magnus Sahlgren SICS
Dick Stenmark Volvo IT and Viktoriainstitutet
Koenraad de Smedt Univ of Bergen
Anja Liseth Univ of Bergen
Gordana Ilic Holen Univ of Oslo

Presentations
Plan for evaluating Norwegian Summarizer (NorSum) at Bergens Tidende -
Anja Liseth and Koenraad de Smedt Univ of Bergen
Anja will start her investigation (Hovedfag) in the fall of 2003 and work on it for one year.

Koenraad is the supervisor. Bergens Tidende (BT) will participate. BT does very little editing, removing only one or two sentences. Norsk Telegrambyrå (NTB) has raw news that is much more heavily edited. NTB is written in bokmål; BT is written in both bokmål and nynorsk.

The NorSum dictionary (from Scarrie) is in bokmål. Anja would like to evaluate and develop NorSum. The development would consist of incorporating tagging following the XML format described here.
That would include word frequencies and lemmas. The tagger for Norwegian would be the Oslo-Bergen tagger. According to Journal of ACL Dec 2002, there is only 60 percent overlap between different manual extracts. So what is the perfect automatic summary?

A very first approximation of anaphora resolution for Norwegian - Gordana Ilic
Gordana has worked on her hovedoppgave for 3 of a total of 12 months.
Gordana says that she checks two sentences back (as Kari Fraurud did in her PhD thesis) to find anaphors. Martin Hassel checks five sentences back.
Looking at human beings: in one part of the SIMPLE corpus there are only about 1000 words describing humans; this might grow to as many as 3000 words if the whole corpus is used. One problem is that Bokmål mixes grammatical gender (genus) in different situations.
To do anaphora resolution one should first check the NPs, then check number (numerus), and finally check gender (genus). Gordana is using the ideas from Ruslan Mitkov's anaphora resolution system (1996, 1998), where Mitkov found 62 percent precision in English and over 76 percent precision in Bulgarian. Another system is the Resolution of Anaphora Procedure (RAP) by Lappin & Leass, 1994.
Gordana uses LISP for her anaphora resolution program.
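
The filtering order described above (NPs first, then number, then gender, within a window of preceding sentences) can be sketched as follows. This is only an illustration in Python, not Gordana's LISP program; the Mention class, the "any" gender value for nouns whose gender varies, and the toy Norwegian words are assumptions made for the example. For simplicity it also accepts NPs anywhere in the pronoun's own sentence.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    sentence: int  # index of the sentence containing the mention
    number: str    # "sg" or "pl"
    gender: str    # "m", "f", "n", or "any" when the gender is not fixed

def candidate_antecedents(pronoun, noun_phrases, window=2):
    """Return NPs at most `window` sentences back that agree with the pronoun."""
    candidates = []
    for phrase in noun_phrases:
        distance = pronoun.sentence - phrase.sentence
        if distance < 0 or distance > window:
            continue  # outside the search window
        if phrase.number != pronoun.number:
            continue  # number (numerus) check
        if "any" not in (phrase.gender, pronoun.gender) and phrase.gender != pronoun.gender:
            continue  # gender (genus) check
        candidates.append(phrase)
    # Closest candidates first.
    return sorted(candidates, key=lambda phrase: pronoun.sentence - phrase.sentence)

# Hypothetical toy data.
nps = [
    Mention("Kari", 0, "sg", "f"),
    Mention("bøkene", 1, "pl", "any"),
    Mention("Ola", 3, "sg", "m"),
    Mention("huset", 4, "sg", "n"),
]
pronoun = Mention("hun", 4, "sg", "f")
print([m.text for m in candidate_antecedents(pronoun, nps, window=2)])  # []
print([m.text for m in candidate_antecedents(pronoun, nps, window=5)])  # ['Kari']
```

With a window of two sentences the only matching antecedent is out of reach, while a window of five finds it, which is the trade-off between the two window sizes mentioned above.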

Att söka och hitta på ett stort intranät - Dick Stenmark
(To search and find on a large intranet)
AB Volvo has 71 000 employees (that is, everything except the car company Volvo, which belongs to Ford).

Dick is responsible for the search engine, which indexes 750 000 documents on 1600 web servers. There are about 6 000-7 000 queries per day. The search engine is called Ultraseek-Inktomi. Ultraseek was initially developed by Infoseek and was then owned by Inktomi, which was bought by Verity, and Inktomi is now part of Verity's application portfolio.

Dick said that in the beginning, in 1996, the number of web servers was growing very fast, but growth has now slowed to at most 100 per year. 81 percent of the information on the intranet is in English and 8 percent in Swedish, followed by French and Dutch; in total there are 15 different languages.

One search session consists of 2.18 queries on average. Most users (58 percent) ask just one query. Dick cannot see from the log whether a user then investigates the hit list or asks somebody else at the company. On average there are 1.18 words per query. Most queries are in Swedish although most information is in English; the employees are not aware that the bulk of the documents are in English. The most common query is "tjänstebil" (company car): a single word and fairly ambiguous.

Dick and his students did a study, asking 40 engineers at Volvo Power Trains. The result was that 10 percent use the search engine every day and 50 percent use it each week. People do find information using the search function, but they work hard to find it, making many attempts. Many of those who do not use the search engine use bookmarks and navigation menus instead to find information on the intranet.

Meta data is used by the search engine, but mostly the meta data is not filled in, since most information providers do not know how to fill it in and there is no support in the tools for creating meta data, either semi-automatically or automatically. A variety of tools are used for entering information into the intranet.

The summary in the hit list contains the meta data or, if there is no meta data, the first 25 words of the document.
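
The fallback rule for the hit-list summary can be written down in a few lines. This is a sketch of the rule as described, not Ultraseek code; the function and parameter names are made up for the illustration.

```python
def hit_summary(document_text, meta_description=None, max_words=25):
    """Use the meta data if it is filled in, otherwise the first words of the document."""
    if meta_description and meta_description.strip():
        return meta_description.strip()
    return " ".join(document_text.split()[:max_words])

text = " ".join(f"word{i}" for i in range(40))
print(hit_summary(text, meta_description="Rules for company cars"))
print(len(hit_summary(text).split()))  # 25
```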

There is a mismatch between information providers and information users/seekers.

The engineers should inform others, but they are not paid for that, while the information managers should inform, but they do not know the technical details.

These are topics that Dick thinks should be further elaborated.

Cross language information retrieval
Query Expansion
Categorisation / Clustering
Pattern Matching (similar to Autonomy)

A query expansion add-on to Ultraseek gave better results for explorative queries but worse results for precise queries.

The query expansion used LSI (Latent Semantic Indexing) applied to 7000 Volvo documents. For LSI to work well one needs an order of magnitude more documents, that is, at least 70 000 documents.

LSI is time-consuming to run on large document collections; one solution, using Random Indexing, follows below.

Another way to get better synonyms could be to add a manual synonym lexicon or to extend the stemmer. The manual synonyms could be based on RI and/or statistics from the query logs.

Random Indexing och vektorbaserad semantisk analys - Magnus Sahlgren
(Random Indexing and vector-based semantic analysis)

To create context-based vectors one uses a co-occurrence matrix (a non-inverted index; compare the inverted index used in information retrieval). Words with the same distribution share the same contexts. Latent Semantic Indexing, LSI, is the same as Latent Semantic Analysis, LSA.
LSA uses SVD, Singular Value Decomposition, reducing the matrix from 100 000 dimensions to 300-400 dimensions. Co-occurrence matrices are sparse and high-dimensional, so it is necessary to reduce the dimensionality. SVD is one alternative.
LSA and HAL (Hyperspace Analogue to Language, which uses a word-word matrix) have too many dimensions to be easily manageable.
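
The dimensionality reduction step in LSA can be illustrated with NumPy's SVD on a toy matrix. This is a minimal sketch: the counts are invented, and a real matrix would be far larger and sparse, so a truncated sparse solver would be used rather than a full decomposition.

```python
import numpy as np

def reduce_with_svd(matrix, k):
    """Represent each row (word) by its coordinates along the top k singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# Hypothetical co-occurrence counts: rows are words, columns are contexts.
counts = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
])
reduced = reduce_with_svd(counts, k=2)
print(reduced.shape)  # (3, 2)
```

The first two words share contexts and end up close together in the reduced space, while the third word, which occurs in other contexts, ends up orthogonal to them.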

Random Indexing (RI) is the solution. Use high-dimensional sparse random vectors to represent words, and accumulate context vectors by incrementally adding together the sparse random vectors.

The vectors are high-dimensional and sparsely populated. Each document gets a vector that is added incrementally. Context vectors have the same dimensionality as index vectors.

LSA and HAL need 50 000 x 30 000-50 000 dimensions while RI needs only 50 000 x 2000 dimensions, so the dimensionality is much lower in RI. Using many documents gives lower noise than using fewer documents. RI is an approximation of LSA, but RI is easy to update and does not need large, time-consuming calculations. The complexity of RI is only O(NV), where N is the size of the training data and V is the size of the vectors.
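
A minimal Random Indexing sketch in plain Python, following the description above: every document gets a sparse random index vector, and the context vector of a word is the sum of the index vectors of the documents it occurs in. The dimensionality of 2000 is taken from the text; the number of non-zero entries (10), the +1/-1 values and the toy documents are assumptions for the illustration, not Magnus's actual settings.

```python
import math
import random
from collections import defaultdict

def index_vector(dim, nonzeros, rng):
    """Sparse random vector stored as (position, sign) pairs."""
    positions = rng.sample(range(dim), nonzeros)
    return [(pos, 1 if i % 2 == 0 else -1) for i, pos in enumerate(positions)]

def random_indexing(documents, dim=2000, nonzeros=10, seed=0):
    """Accumulate one context vector per word, one document at a time."""
    rng = random.Random(seed)
    context = defaultdict(lambda: [0] * dim)
    for doc in documents:
        doc_vector = index_vector(dim, nonzeros, rng)
        for word in doc.lower().split():
            vector = context[word]
            for pos, sign in doc_vector:
                vector[pos] += sign  # incremental update, no matrix factorisation
    return dict(context)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

docs = ["car engine wheel", "car engine road", "banana fruit apple", "banana fruit juice"]
vectors = random_indexing(docs)
print(round(cosine(vectors["car"], vectors["engine"]), 2))  # 1.0, identical contexts
print(round(cosine(vectors["car"], vectors["banana"]), 2))  # close to 0, unrelated contexts
```

New documents can be added later by simply continuing the loop, which is the sense in which RI is easy to update.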

Applications of Random Indexing:

Automatic generation of association lexicons
Query expansion
Classification problems

Evaluation of RI on the TOEFL synonym test: RI gives 67 percent precision, or 72 percent precision with a lemmatizer, compared with 64 percent precision for LSA.

Magnus S says: remember to expand query vectors, not terms.
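
One possible reading of this advice is that the query is turned into a vector by adding up the context vectors of its terms, and documents are ranked against that vector, instead of adding synonym terms to the query string. The sketch below assumes that reading; the three-dimensional context vectors are invented toy values, not output from a real RI run.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def text_vector(text, context_vectors):
    """Sum the context vectors of all known words in the text."""
    dim = len(next(iter(context_vectors.values())))
    total = [0.0] * dim
    for word in text.lower().split():
        if word in context_vectors:
            total = [x + y for x, y in zip(total, context_vectors[word])]
    return total

def rank(query, documents, context_vectors):
    """Rank documents by cosine similarity to the query vector."""
    query_vector = text_vector(query, context_vectors)
    scored = [(cosine(query_vector, text_vector(doc, context_vectors)), doc) for doc in documents]
    return sorted(scored, reverse=True)

# Hypothetical context vectors.
context_vectors = {
    "car": [1.0, 0.1, 0.0],
    "automobile": [0.9, 0.2, 0.0],
    "engine": [0.7, 0.3, 0.1],
    "lunch": [0.0, 0.1, 1.0],
    "menu": [0.1, 0.2, 0.9],
}
ranking = rank("car", ["lunch menu", "automobile engine"], context_vectors)
print(ranking[0][1])  # automobile engine
```

The top document shares no term with the query; it is found because its words have similar context vectors.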

The next CLEF (2003) will take place in September in Trondheim.

Klustring av svenska tidningsartiklar - Magnus Rosell
(Clustering of Swedish News Articles)
Categorisation, or classification, is different from clustering.
The clustering algorithms described in this presentation are based on the vector space model and use the cosine measure for similarity calculation.

The clusters are represented by vectors, centroids or cluster centres, calculated as the average of all document vectors. These vectors tend to be very long. For efficiency they are truncated. This also improves results.
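
The centroid, its truncation and the cosine measure can be sketched with sparse term-weight dictionaries. The truncation rule used here (keep the highest-weighted terms) is an assumption; the minutes do not say how Magnus truncates the centroids.

```python
import math
from collections import Counter

def centroid(doc_vectors):
    """Average of sparse document vectors given as term -> weight dicts."""
    total = Counter()
    for vector in doc_vectors:
        total.update(vector)  # adds the weights term by term
    return {term: weight / len(doc_vectors) for term, weight in total.items()}

def truncate(vector, max_terms):
    """Keep only the max_terms highest-weighted terms."""
    top = sorted(vector.items(), key=lambda item: item[1], reverse=True)[:max_terms]
    return dict(top)

def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

docs = [
    {"hockey": 3, "match": 1, "is": 1},
    {"hockey": 2, "match": 1, "goal": 1},
]
centre = truncate(centroid(docs), max_terms=2)
print(centre)  # {'hockey': 2.5, 'match': 1.0}
print(round(cosine(docs[0], centre), 2))
```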

For evaluating clustering quality Magnus uses entropy, as defined in information theory. One could also use precision and recall.
Magnus described two simple clustering algorithms: agglomerative clustering, which is deterministic and produces a cluster hierarchy, and K-means, which is nondeterministic and only partitions the document collection. K-means operates on a global level, while agglomerative clustering works from the local level and upwards.
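
Entropy as a clustering quality measure can be sketched as follows, assuming each text has a known category label (for example a newspaper section) and that the entropies of the individual clusters are combined as a size-weighted average. The weighting is a common choice but an assumption here; the minutes only say that entropy is used. Lower values mean purer clusters.

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """Entropy (in bits) of the category distribution inside one cluster."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def clustering_entropy(clusters):
    """Size-weighted average of the cluster entropies."""
    size = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) / size * cluster_entropy(cluster) for cluster in clusters if cluster)

# Hypothetical clustering: each cluster lists the known categories of its texts.
clusters = [
    ["sport", "sport", "sport", "sport"],
    ["economy", "economy", "sport", "culture"],
]
print(clustering_entropy(clusters))  # 0.75
```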

K-means needs k seeds, or initial cluster centres, as input. These are often chosen at random. Each text is compared to all clusters and said to belong to the one most similar. The cluster centroids are then updated. This procedure is repeated until stability is reached or some other criterion is fulfilled.
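
The procedure can be sketched in plain Python with dense vectors and cosine similarity. It is a minimal illustration of the steps just listed, not Magnus's implementation; the seed parameter only makes the random choice of initial centres repeatable, and the two-dimensional vectors are toy data.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def k_means(vectors, k, seed=0, max_iter=100):
    """Return one cluster index per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # k random texts as seeds
    assignment = None
    for _ in range(max_iter):
        # Assign every text to the most similar centroid.
        new_assignment = [
            max(range(k), key=lambda c: cosine(vector, centroids[c]))
            for vector in vectors
        ]
        if new_assignment == assignment:
            break  # stability reached
        assignment = new_assignment
        # Update each centroid to the average of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(k_means(vectors, k=2))
```

The cluster numbers themselves are arbitrary and depend on which seeds were drawn; on larger collections different seeds can also give different partitions, which is why the algorithm is called nondeterministic above.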

Stemming for Swedish improves the clustering results by about 10 percent.
Splitting of compounds also improves results for Swedish. The spell correcting program STAVA by Viggo Kann splits compounds into their parts and checks them for spelling errors one by one. From an information retrieval perspective, however, there are compounds that should not be split since they mean something different from their parts. Magnus has gathered a long list of such words to be used as a stop list. There are also frequently used parts that do not mean much on their own. These are also stopped by a stop list.
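
How the two stop lists interact with compound splitting can be sketched like this. The splitter is a naive two-part lexicon lookup written for the illustration, not STAVA; the lexicon, both stop lists and the choice to keep the whole compound alongside its parts are assumptions.

```python
def split_compound(word, lexicon, min_part=3):
    """Naive split: return (head, tail) if both parts are known words, else None."""
    for i in range(min_part, len(word) - min_part + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon and tail in lexicon:
            return head, tail
    return None

def index_terms(word, lexicon, no_split, stop_parts):
    """Terms to index for a word: the word itself plus its meaningful parts."""
    if word in no_split:
        return [word]  # compound that means something other than its parts
    parts = split_compound(word, lexicon)
    if parts is None:
        return [word]
    return [word] + [part for part in parts if part not in stop_parts]

# Hypothetical word lists.
lexicon = {"bil", "motor", "järn", "väg", "huvud", "kontor"}
no_split = {"järnväg"}   # railway, not "iron" + "road"
stop_parts = {"huvud"}   # frequent part that says little on its own

print(index_terms("bilmotor", lexicon, no_split, stop_parts))     # ['bilmotor', 'bil', 'motor']
print(index_terms("järnväg", lexicon, no_split, stop_parts))      # ['järnväg']
print(index_terms("huvudkontor", lexicon, no_split, stop_parts))  # ['huvudkontor', 'kontor']
```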

Of course one could use lemmatization instead of stemming. Magnus has not yet tried that.

The same representation that has been described can be used for automatic categorization: build cluster representations from typical texts of each genre and use the similarity measure and/or link structure to tell to which category a new text belongs. Read more in Magnus' master's thesis on clustering.

SiteSeeker Voice - A speech controlled search engine - Hercules
Demo and presentation using Åre kommun as the example. One calls SiteSeeker Voice by telephone and asks various questions about Åre kommun. The search engine finds the documents and reads them to the user in summarized (extracted) form. The demo questions were about Bibliotek (library), Liftkort (lift pass) and Åreskutan.

Evaluating automatic text summarizers - Martin Hassel
SweSum and the problem of evaluating automatic text summarizers. High keyword ranking can introduce redundancies in the summary. The summary becomes concentrated around one specific topic.

First evaluation, carried out in 2000

Texts were reduced using the summarizer SweSum, and a subjective evaluation was made of the coherence and content of the resulting text. The results showed good coherence at 30 percent summarization and good content at 24 percent summarization. (30 percent summarization means removing 70 percent of the text.)

Second evaluation, 2001

A question-answering evaluation scheme on 100 texts. 40 percent summarization gives 84 percent correct answers. That means that if one removes 60 percent of the text, one can still find the answer to a given question in 84 percent of the cases.

Third evaluation, in preparation, 2003

Creating man-made extracts to be used as a reference against which the performance of the summarizer can be compared. Martin showed his tool for creating extracts.

You tick the sentences you think should be in the extract, and then you can inspect the result and approve it. The tool collects extracts of various texts and computes statistics on the extract collection, such as the number of extracts, shortest extract, longest extract, average extract length, precision and recall.
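
The statistics listed above can be sketched by treating every extract as a set of sentence numbers. How Martin's tool actually defines precision and recall is not stated in these minutes; the version below compares one extract against one reference extract, and all data are invented.

```python
def extract_statistics(extracts):
    """Basic statistics over a collection of extracts (sets of sentence numbers)."""
    lengths = [len(extract) for extract in extracts]
    return {
        "count": len(extracts),
        "shortest": min(lengths),
        "longest": max(lengths),
        "average": sum(lengths) / len(lengths),
    }

def precision_recall(selected, reference):
    """Precision and recall of the selected sentences against a reference extract."""
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    precision = overlap / len(selected) if selected else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical manual extracts of the same text.
manual_extracts = [{0, 2, 5}, {0, 2}, {0, 3, 5, 7}]
print(extract_statistics(manual_extracts))
# {'count': 3, 'shortest': 2, 'longest': 4, 'average': 3.0}

precision, recall = precision_recall(selected=[0, 2, 4, 6], reference=manual_extracts[0])
print(precision, round(recall, 2))  # 0.5 0.67
```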

What about marking up pronouns? What about marking up order of sentences? What to do about different subtopics in one text, which one should be extracted?

Koenraad argues that each sentence is dependent on the other sentences in the same text.

The tool, however, looks over all texts, so this dependence will disappear if one obtains a large sample.

Martin says that if one gets close to 100 percent precision and recall, then one is close to an ideal extract, since all test persons have voted for the same sentences. This will of course never happen.

We assume that when we have obtained enough answers from our users, the precision will stabilize at around 60 percent, as previous results have shown, for example Journal of ACL Dec 2002. Manual indexing of texts from Riksdagsbiblioteket has also shown only 30 percent overlap (Bäckström 2000).

The evaluation is not carried out automatically but manually at this point. There will be an automatic evaluation tool called Blue, programmed by Chin-Yew Lin at ISI/USC, and a plugin for the SEE evaluation tool. Martin's work is SEE compatible.

Improving Precision in Information Retrieval using Stemming *and* Spell checking - Hercules
Hercules described different stemming approaches. In particular, Tomlinson's stemmers from Hummingbird were discussed: the evaluation at CLEF 2001 showed good results for stemming in 6 languages, while at CLEF 2002, with 8 languages, the results were much worse, except for the newly added language Finnish, where a 67 percent improvement in precision was obtained (Swedish had only a 4 percent improvement in precision). The interesting thing was that they used the Inxight LinguistX tool, and for Finnish, German and Dutch they used word splitting instead of stemming.

Hercules showed how the Euroling SiteSeeker stemmer is built up from rules and lexicons. The stemmer gives a 15 percent improvement in precision and 18 percent in recall for Swedish. Logs show that around 10-12 percent of all search queries are misspelled. Another experiment, by Mansour Sarr, was presented in which the Euroling SiteSeeker spell checker was evaluated; it showed that spell checking improved precision by about 4 percent and recall by 11.5 percent for Swedish.
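
A stemmer built from rules and a lexicon can be sketched as a list of suffix-stripping rules plus an exception lexicon for irregular forms. The rules and lexicon entries below are invented for the illustration and are far smaller than, and not taken from, the Euroling SiteSeeker stemmer.

```python
# Hypothetical Swedish suffix rules and irregular forms.
SUFFIX_RULES = ["arna", "erna", "orna", "ar", "er", "or", "en", "et"]
EXCEPTIONS = {"män": "man", "böcker": "bok"}

def stem(word, min_stem=3):
    """Look the word up in the lexicon first, then strip the longest matching suffix."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for form in ["bil", "bilen", "bilar", "bilarna", "böcker"]:
    print(form, "->", stem(form))
```

All inflected forms of "bil" (car) are mapped to the same stem, so a query for one form also matches documents containing the others.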

Construction of a Danish and a Norwegian stemmer by Hercules & Koenraad
Hercules showed the development and construction of a Danish stemmer based on the original Swedish stemmer. Hercules uses Danish lemmatizers, the summarization keyword dictionaries, a Swedish-Danish book dictionary and the help of one native Danish speaker. He will now continue with the Norwegian stemmer, with assistance from Koenraad, Gordana and Anja.

Magnus S asked Hercules why we did not use lemmatizers instead of stemmers. Hercules answered that stemmers are faster and simpler, especially if one wants to index the text every day, as in a search engine, rather than only once in a while.

Application for funding
Applications with Volvo, KTH and SICS regarding clustering of the 750 000 corporate texts on the Volvo intranet, using standard clustering algorithms such as K-means, and Random Indexing, which can extract groups of related terms.

Next meetings possible dates

27-28 May 2003, short informal meeting in Iceland - Jurgen, Hercules, Koenraad at the Nordoknet meeting and general seminar.
19-20-21 September 2003 Skagen Denmark
November 2003 Bergen Norway (To assist Anja in her work with NorSum)
January 2004 or maybe later at Fefor, Norway, or Åre, Sweden? Final meeting

Tasks before the next meeting in Iceland

Fix Danish and Norwegian Stemmer

Slides from the meeting (in PDF)
Anja Liseth OH-slides
Gordana Ilic OH-slides
Dick Stenmark OH-slides
Magnus Sahlgren OH-slides
Magnus Rosell OH-slides
Hercules Dalianis OH-slides 1 and 2
Martin Hassel OH-slides

Latest change 16 April 2003.