Minutes of the Third ScandSum network meeting Sept 13-15, 2002, Color Hotel Skagen, Denmark

Participants of the meeting

Hercules Dalianis NADA-KTH
Martin Hassel NADA-KTH
Jürgen Wedekind CST-Copenhagen
Dorte Haltrup CST-Copenhagen
Bart Jongejan CST-Copenhagen
Kristin Bjarnadottir Inst. of Lexicography, University of Iceland
Till Christopher Lech, Cognit, Norway

Current version of the Swedish summarizer

Hercules Dalianis described and demonstrated the current version of the SweSum summarizer and how to evaluate it (see also First ScandSum meeting, Åre).

Danish resources: tagger and lemmatizer

Dorte Haltrup described the various Danish resources, such as the tokenizer, tagger, lemmatizer and named entity recognizer. The tokenizer is written in Perl. The tagger is the Brill tagger and the lemmatizer is written in C++. Both the tagger and the lemmatizer are language independent and can be trained for any language. The named entity recognizer is currently under development.

Bart Jongejan continued by describing the lemmatizer in depth. The Danish lemmatizer uses both statistical methods and the STO dictionary, which has about 0.5 million words. By training the system on a dictionary, the lemmatizer has found 44205 lemmatization rules. The tagger must be run before the lemmatizer is executed.
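As an illustration of how lemmatization rules can be learned from a dictionary and then applied to tagged words, here is a minimal Python sketch. It is not the CST lemmatizer (which is written in C++ and was not shown in code at the meeting): the suffix-replacement rule format, the function names and the five-entry toy dictionary are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Toy training data (hypothetical): (full form, part-of-speech tag, lemma).
TOY_ENTRIES = [
    ("biler", "N", "bil"),
    ("stole", "N", "stol"),
    ("huse", "N", "hus"),
    ("kører", "V", "køre"),
    ("spiser", "V", "spise"),
]


def derive_rule(form, lemma):
    """Strip the common prefix and return (form_suffix, lemma_suffix)."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


def train(entries):
    """Build a full-form lexicon and tag-specific suffix rules."""
    lexicon = {}
    votes = defaultdict(Counter)
    for form, tag, lemma in entries:
        lexicon[(form, tag)] = lemma
        form_suffix, lemma_suffix = derive_rule(form, lemma)
        votes[(tag, form_suffix)][lemma_suffix] += 1
    # Keep the most frequent replacement for each (tag, suffix) pair.
    rules = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    return lexicon, rules


def lemmatize(form, tag, lexicon, rules):
    """Look the word up first; otherwise apply the longest matching rule."""
    if (form, tag) in lexicon:
        return lexicon[(form, tag)]
    for start in range(len(form) + 1):  # longest suffix first
        suffix = form[start:]
        if (tag, suffix) in rules:
            return form[:start] + rules[(tag, suffix)]
    return form  # no rule applies: return the word unchanged


lexicon, rules = train(TOY_ENTRIES)
print(lemmatize("borde", "N", lexicon, rules))  # unseen noun
print(lemmatize("læser", "V", lexicon, rules))  # unseen verb
```

Because the rules are keyed on the tag, the same word form can receive different lemmas depending on its part of speech, which is one reason a tagger would need to run before a lemmatizer of this kind.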

Icelandic resources

Kristin Bjarnadottir described the Icelandic resources. There is one tagged corpus, the Icelandic frequency dictionary corpus (IFP), containing 500 000 words, and one untagged corpus, the Institute of Lexicography text corpora, containing 25 million words. There is also an Icelandic synonym lexicon. See Icelandic Resources (HTML).

Norwegian company Cognit AS: document analysis and summarization

Till Lech described Cognit's resources. The Corporum product contains taggers which can extract micro-ontologies, i.e. create relations between concepts. As a spin-off of this, the Corporum Summarizer creates summaries (extracts) of the text. The summarizer seems fast and efficient. The languages treated are Swedish, Norwegian (Bokmål), German and English. Corporum is used by a large number of customers, such as Norsk Hydro and Statoil. Corporum is aimed at the same group of customers that might use Autonomy.

Finnish resources

Ari Pirkola from the University of Tampere sent me a list of Finnish resources.

From SweSum to ScandSum

Martin Hassel described various possible improvements to SweSum and how to adapt it to other languages (see also First ScandSum meeting, Åre).

Nordoknet and SiteSeeker

Jürgen described the architecture of the Nordic documentation center and how to make information accessible from all websites by using the SiteSeeker search engine with built-in human language technology. Hercules demonstrated SiteSeeker applied to all of the Nordoknet websites.

One thing that Jürgen has to convey to his documentation center partners is that when JavaScript is used for navigation on a website, one also needs to provide plain-text links to all web pages. Otherwise no search engine, including SiteSeeker, will manage to index some of the pages (see here for instructions).
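The point about plain-text links can be illustrated with a small Python sketch of a link extractor that, like a crawler that does not execute scripts, only follows ordinary href targets. This is not SiteSeeker's code; the class name, the helper function and the sample page are hypothetical and only show the general behaviour.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags without running any scripts."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # A "javascript:" pseudo-link gives the crawler no URL to fetch.
            if name == "href" and value and not value.lower().startswith("javascript:"):
                self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: two script-driven navigation items, one plain link.
PAGE = """
<html><body>
<script>
function go(page) { window.location = page + ".html"; }
</script>
<a href="javascript:go('reports')">Reports</a>
<span onclick="go('archive')">Archive</span>
<a href="contact.html">Contact</a>
</body></html>
"""

print(extract_links(PAGE))  # prints ['contact.html']
```

Only the page with a plain link is found. The reports and archive pages would stay invisible to such a crawler unless plain links to them are added somewhere on the site.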

New applications for funding

It is only possible to apply for a new Norfa network. Might we apply for a working money network?

EuroSum application at the end of the year. There is an EU meeting in Luxembourg on 23-24 October, to which we should send some representatives from the EuroSum network. There is also a general EU-IST meeting in Copenhagen on 4-6 Nov 2002.

Check with the Swedish Vinnova språkteknologi program and look for possible upcoming calls for international projects from Norges forskningsråd KUNSTI.

Writing in Årbog for Nordisk Sprogteknologisk forskningsprogram 2003

We should write at most 10 pages in English about SweSum, text summarization techniques, application areas, ScandSum and EuroSum.
Hercules writes the introduction and coordinates.
Till describes mobile computing and context awareness (Ambiesense).
Martin writes about the new architectures.
Dorte and Jürgen write about the Danish approach, DanSum.
Koenraad writes about NorSum.
Internal deadline for this: Sept 30, 2002.

Possible meeting dates

* 25-28 Jan 2003, Fefor, Norway
* 3-6 April 2003, Åre, Sweden (changed from 5-8 April 2003)
* 30 May-1 June 2003, Reykjavik, Iceland in conjunction with NODALIDA 2003.

Tasks by next meeting Voss

Describe the experiences from the work on the Danish summarizer
Describe the new architecture of the Swedish / Danish summarizer
Continue work on the Norwegian summarizer

Slides from meeting (in PDF)

Dorte Haltrup OH-slides in PDF Danish CST Resources
Bart Jongejan OH-slides in PDF Danish Lemmatizer
Till Lech OH-slides in PDF Corporum tools

Master's thesis jobs defined (Stemmer and Spellchecker)

Latest change Feb 25, 2003