Minutes of the First ScandSum meeting, April 18-21, 2002, Hotel Diplomat, Åre, Sweden.

Participants of the meeting

Hercules Dalianis KTH
Ola Knutsson KTH
Martin Hassel KTH
Johan Carlberger KTH
Johnny Bigert KTH
Koenraad de Smedt Univ in Bergen
Jürgen Wedekind CST at Univ in Copenhagen

Current version of the summarizer

Hercules Dalianis described the current version of the SweSum summarizer and how to evaluate it.

Today the summarizer consists of a language-independent summarization engine and a static demorph-lexicon used to look up keywords and their inflections. The keyword frequency in the text is calculated per lemma: each inflected form of a keyword is counted as one instance of the lemma it belongs to.

The format of the Swedish keyword lexicon

o Keywords are from the news domain
o Also called "open class word lexicon"
o Keywords can be nouns, adjectives or adverbs

Inflected form (700 000 words)    Lemma (40 000 words)
statsminister                     statsminister
statsministern                    statsminister
statsministerns                   statsminister
statsministrarna                  statsminister
statsministrarnas                 statsminister
...                               ...
regeringen                        regeringen
regeringens                       regeringen
regeringarna                      regeringen
regeringarnas                     regeringen
...                               ...
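
The lemma-based keyword counting described above can be sketched in Python as follows. This is only a minimal illustration, not the SweSum implementation: the dictionary holds just the rows shown in the table, and the names `LEXICON` and `lemma_frequencies`, the whitespace tokenization and the punctuation stripping are all assumptions made for the example.

```python
from collections import Counter

# A few rows of the keyword lexicon shown above: inflected form -> lemma.
LEXICON = {
    "statsminister": "statsminister",
    "statsministern": "statsminister",
    "statsministerns": "statsminister",
    "statsministrarna": "statsminister",
    "statsministrarnas": "statsminister",
    "regeringen": "regeringen",
    "regeringens": "regeringen",
    "regeringarna": "regeringen",
    "regeringarnas": "regeringen",
}


def lemma_frequencies(text, lexicon=LEXICON):
    """Count keyword occurrences per lemma; words not in the lexicon are ignored."""
    counts = Counter()
    for token in text.lower().split():
        # Assumed tokenization: split on whitespace, strip surrounding punctuation.
        word = token.strip(".,;:!?\"'()")
        lemma = lexicon.get(word)
        if lemma is not None:
            counts[lemma] += 1
    return counts


if __name__ == "__main__":
    sample = "Statsministern talade. Regeringens förslag stöds av statsministern."
    print(lemma_frequencies(sample))
```

Both occurrences of "statsministern" are counted under the lemma "statsminister", so different inflections of the same keyword add up to a single frequency.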

The text is divided into sentences, and each sentence is then ranked according to its attributes:

o Position in the text gives the weight *1/n, where n is the line number of the sentence. The exception is the first sentence, which obtains the weight *1000 so that it ranks high and is always kept.
o A sentence containing bold text, which is treated the same as being the beginning of a paragraph, obtains the weight *10.
o If a sentence contains a keyword, its ranking is multiplied by the keyword frequency and the weight *0.360.
o User keywords, which give slanted summaries, obtain the weight *500.

Today we also have a pronominalization module for the third person singular pronouns han ("he") and hon ("she"), but not det ("it").

The sentence length is always normalized. The text length is not, since the combination function makes that unnecessary.
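
The weighting scheme above can be illustrated with a short Python sketch. The minutes do not spell out the combination function, so the code assumes that the weights are simply multiplied together, that a sentence without keywords keeps its score unchanged, and it leaves out sentence-length normalization. The function names `sentence_score` and `rank` are hypothetical.

```python
def sentence_score(n, is_bold=False, keyword_freq=0, has_user_keyword=False):
    """Score the n-th sentence (1-based) with the weights quoted in the minutes.

    Assumption: the individual weights are combined by multiplication.
    """
    if n < 1:
        raise ValueError("n is 1-based")
    # Position: the first sentence gets 1000, every other sentence 1/n.
    score = 1000.0 if n == 1 else 1.0 / n
    if is_bold:
        score *= 10.0
    if keyword_freq > 0:
        # Keyword frequency times the keyword weight 0.360.
        score *= keyword_freq * 0.360
    if has_user_keyword:
        score *= 500.0
    return score


def rank(features):
    """Return 1-based sentence numbers ordered from highest to lowest score.

    features: one (is_bold, keyword_freq, has_user_keyword) tuple per
    sentence, in text order.
    """
    scored = [(sentence_score(i, *f), i) for i, f in enumerate(features, start=1)]
    # Sort by descending score; ties keep the original text order.
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]
```

Under these assumptions the first sentence always ranks highest unless a later sentence accumulates very large weights, which matches the stated intention that the first sentence is always kept.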

Evaluation

The current version of the summarizer has been evaluated using 100 annotated Swedish texts with an average length of 181 words, taken from the KTH corpus. A number of users summarized the texts at increasing levels of 20, 30 and 40 percent until they found the answers to a number of questions in the summarized text. The result was that 84 percent of the questions were answered correctly at 40 percent summarization (88 percent if the user utilizes user keywords). This indicates that the summarizer is state of the art.
(One comment was that a control group should be used to see how many correct answers one obtains with the full text as a baseline.)

Improvements

Martin Hassel then described possible improvements of the summarizer. One important improvement is to divide the text at clause level instead of sentence level and to perform operations within a clause. This can be done with the Granska tagger for Swedish, and with other taggers for Norwegian and Danish. Once the text has been tagged at clause and word level, one can use rewrite or deletion rules to compress the text into even smaller units. Tagging will also make pronominal resolution much better and easier, covering more cases. Another improvement is to use named-entity tagging. Martin also mentioned following lexical chains in the summarization. One heuristic rule which was discussed was to keep whole paragraphs intact, that is, in certain cases to summarize at paragraph level instead of sentence level.

Ola discussed deletion rules

Johnny discussed statistical tagging

Johan demonstrated the Siteseeker search engine, with built-in extraction in the hit list and spell checking.

Norwegian language resources

Norwegian universities with language technology activities

* Bergen: UiB, HIT (Humanities Information Technologies)

* Trondheim: NTNU

* Oslo: UiO (Tekstlaboratoriet, Avd. for leksikografi)

Companies with language technology activities on the Norwegian market

* NST, SPNE, CLUE, COGNIT, Telenor, Lingsoft, ...

Lexical resources

* NTNU: NorKompLeks (Bokmål, partly Nynorsk)

* Oslo: Bokmålsordboken and derived lexicons

* Bergen: SCARRIE-lexicon (HIT Bergen): Bokmål

* Closed class lexicon (covers most stop words)

* Open class lexicon (noun, verb, adjective, adverb) consists of 360933 wordform entries organised in 72626 lemmas. No genitives are included due to high regularity.

Corpora

* Bergens Tidende (HIT Bergen): 10 M words collected by robot, not carefully checked, contains mixed Bokmål/Nynorsk

* Norwegian newspaper corpus (HIT Bergen): 2 M words, mostly Bokmål

* Tagged corpus (Tekstlaboratoriet Oslo), newspaper and magazine part: 9.6 M words, Bokmål

Other tools

* rule-based lemmatizer and tagger (Oslo-Bergen); this tagger can also be used over the net: send a sentence "Hun kjøpte nye maskiner." and get the result "hun/pron kjøpe/verb ny/adj maskin/subst ./clb"

* grammar (LFG: ParGram project in Bergen)

* Wordnet (in the future; current work going on in Bergen)

* spelling and grammar checkers (SCARRIE, Bergen; grammar checker in Oslo)
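
The tagger result quoted in the list above ("hun/pron kjøpe/verb ny/adj maskin/subst ./clb") is easy to take apart programmatically. The following Python sketch only parses that textual lemma/tag format; it does not call the tagger, and the function names `parse_tagged` and `content_lemmas` as well as the default choice of tags are assumptions for illustration.

```python
def parse_tagged(line):
    """Split 'lemma/tag lemma/tag ...' into a list of (lemma, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash so that a lemma containing '/' stays intact.
        lemma, sep, tag = item.rpartition("/")
        if not sep:
            raise ValueError("expected lemma/tag, got %r" % item)
        pairs.append((lemma, tag))
    return pairs


def content_lemmas(pairs, tags=("subst", "adj")):
    """Keep lemmas whose tag is in `tags` (assumed keyword classes)."""
    return [lemma for lemma, tag in pairs if tag in tags]


if __name__ == "__main__":
    result = parse_tagged("hun/pron kjøpe/verb ny/adj maskin/subst ./clb")
    print(result)
    print(content_lemmas(result))
```

Output of this kind could feed keyword counting directly, since the tagger already returns lemmas rather than inflected forms.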

Danish Language Resources

Danish institutions with language technology: see

http://www.cst.dk/dandokcenter/inst/index.html

Lexical resources:

- STO (incl. the SIMPLE lexicon)
Infos at:
http://www.cst.dk/sto/index.html
ppt presentation: A Lexical Database of Danish for Language Technology Applications
Examples at:
http://www.cst.dk/sto/leksikalskindgang/index.html
- The Scarrie lexicon
- Lemma/wordform list from "Retskrivningsordbogen"

Corpora

- The Parole Corpus (written balanced corpus, 250 000 words, manually POS-tagged)
- Bergenholtz Corpus (4.5 mill words from books, newspapers, magazines, +??)
- Berlingske 99 (20 mill words, newspaper from 1999, tagged by the Brill tagger)
- Berlingske (20 mill words, from 1990-92)
- Korpus2000 (written, 25 mill words, tagged by CG-tagger, ready summer 2002)

Other tools

- tokenizer
- The Brill tagger (trained on the manually tagged Parole Corpus)
- Scarrie (spelling and grammar checker)
- NP-chunker (S. Abney's Cass)
- small LFG Grammar

Coming up (this year)

- Lemmatizer
- Named entity Recognizer

Notes from the discussion at the meeting

Questions related to current version of ScandSum/SweSum:

* use of verbs as keywords (in addition to nouns and adjectives)?

* use of frequency?

* use of syntactic information?

* should genitives be in lexicon or analyzed by the program?

* how should language depend

* One heuristic rule which was discussed was to keep whole paragraphs intact, that is, in certain cases to summarize at paragraph level instead of sentence level.

Possible architecture for next version of ScandSum:

* do not use dictionary, but preprocessing for language dependent lemmatizing (or stemming) and tagging

+ format based on standard corpus tagging, e.g.:

+ format may include text structure labels as needed (e.g. clause) and markup.

* program will also consult frequencies, either a language-dependent external frequency list or an appendix to the preprocessing with a list of frequencies of the words in the text.
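
The proposed architecture could be sketched roughly as follows in Python. This is a hypothetical illustration only: the minutes do not fix a data format, so the `Token` class, its fields and the function `frequencies` are assumptions. The sketch shows the two frequency sources mentioned above, an external language-dependent list or counts taken from the preprocessed text itself.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    word: str   # surface form as it appears in the text
    lemma: str  # supplied by the language-dependent preprocessing
    tag: str    # part-of-speech tag from the same preprocessing


def frequencies(tokens: List[Token],
                external: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Return a frequency for every lemma that occurs in the text.

    If an external frequency list is given it is consulted (lemmas missing
    from the list get 0); otherwise the lemmas are counted in the text.
    """
    if external is not None:
        return {t.lemma: external.get(t.lemma, 0) for t in tokens}
    return dict(Counter(t.lemma for t in tokens))
```

Because the summarization engine would only see lemmas and tags, the language-dependent work stays in the preprocessing step and no dictionary lookup is needed inside the engine.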

Proposal for tagset

http://www.nada.kth.se/~johnny/corpus/format.html

Planning of the next meeting

Persons who could contribute at future meetings

* Victoria Rosén (UiB, Universitet i Bergen) - grammar
* Kalervo Järvelin (University in Tampere)
* Dorte Haltrup (CST, at University in Copenhagen)
* Viggo Kann (KTH)
* Jon Atle Gulla (NTNU or Sintef ?)
* Janne Bondi Johannessen (UiO), Paul Meurer (UiB) - tagger and lemmatizer
* Tiit Roosmaa (University in Tartu)
* SICS people?

Possible dates

* 10 June 2002, Oslo (Nordic LT meeting 8-9 June 2002)
* [26 June 2002, Stockholm (NLDB 2002 27-28 June 2002)]
* 13-15 or 20-22 Sept 2002, Skagen, Denmark
* 25-28 Jan 2003, Geilo or Voss, Norway
* 5-8 April 2003, Åre

Tasks by next meeting

* Norwegian:
+ find out about using Norwegian tagger (Oslo-Bergen tagger) and grammar
+ provide list of function words
+ invite Victoria, Paul, and/or Janne to next meeting

* Danish: find out about using STO

* KTH
+ expression of interest to EU
+ specify new summarization engine architecture
