Automatically detecting the language of a message

Date: Tue Oct 13 05:33:33 1998
Author: Jacob Palme
To: Discussions about KOM 2000 (26)
In-Reply-To: Re: Greeting
Language: English


filtyp.sim

Many years ago, I wrote a program to check if a message was written in Swedish or English. The same method can probably be used to test for more than just these two languages.

My algorithm was very simple. I checked the first 300 words in the message and counted the number of occurrences of a few of the most common English and Swedish words. I then computed:

quotient:= swedishcount/(swedishcount+englishcount);
findlanguage:=
IF swedishcount > 1 AND englishcount = 0 THEN swedish
ELSE IF swedishcount = 0 AND englishcount > 1 THEN english
ELSE IF swedishcount = 0 AND englishcount = 0 THEN donotknow
ELSE IF quotient > 0.7 THEN swedish
ELSE IF quotient < 0.05 THEN english
ELSE donotknow;
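
As an illustration, the decision rule above could be sketched in Python roughly as follows. The function name `find_language` and the string return values are assumptions made for this sketch and do not come from my program; the thresholds 0.7 and 0.05 are the ones given above. The sketch tests the zero/zero case first, so that the quotient is never computed with a zero denominator.

```python
def find_language(swedish_count, english_count):
    """Classify a text from its counts of common Swedish and English words."""
    if swedish_count == 0 and english_count == 0:
        return "unknown"  # no evidence at all; also avoids dividing by zero
    if swedish_count > 1 and english_count == 0:
        return "swedish"
    if swedish_count == 0 and english_count > 1:
        return "english"
    quotient = swedish_count / (swedish_count + english_count)
    if quotient > 0.7:
        return "swedish"
    if quotient < 0.05:
        return "english"
    return "unknown"  # mixed text, or too little evidence
```

Note that the rule is deliberately asymmetric: a text needs more than 70 % Swedish hits to count as Swedish, but fewer than 5 % Swedish hits to count as English.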
 

I tested my algorithm on a number of messages, and it worked perfectly. The only cases it had problems with were texts written in neither Swedish nor English, and texts mixing Swedish and English.

The lists of common Swedish and English words which I used were:

Typical Swedish words

"OCH ELLER DEN EN JAG DU HAN HON DE DEM "
"FÖR KAN SOM TILL MEN ÄR HUR MED FRÅN TVÅ "
"PÅ LÄSA ATT AV SPÅR FINNA SÖKA UPP NER BRA FEL SÄTT SÖK "
 

Typical English words

"AND OR THE ONE YOU HE SHE THEY FOR CAN THAT THIS WHO "
"BUT HOW WITH TWO OF AT READ ON WHILE REPEAT BEGIN END PROCEDURE "
"TO CHAR ARE MAIN MOVE DEFINE INCLUDE INT SET CONTINUE CONNECT "
"CONT CONN COMP COMPILE TRACK TAPE SAVE RUN HELP EXIT "
"LET SET ERROR GOTO GOOD CANNOT FIND UP LAST LOGIN USED MEANS "
 

These lists may seem a little funny, but I wanted the algorithm to also work for texts in artificial languages, such as programs written in English-based programming languages, and to recognize such texts as English.
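
The counting step could be sketched in Python along the following lines, using the two word lists above. The names `SWEDISH_WORDS`, `ENGLISH_WORDS` and `count_common_words`, and the choice of a regular expression to split the text into words, are assumptions made for this sketch; my original program reads the file byte by byte instead.

```python
import re

SWEDISH_WORDS = set("""
    OCH ELLER DEN EN JAG DU HAN HON DE DEM
    FÖR KAN SOM TILL MEN ÄR HUR MED FRÅN TVÅ
    PÅ LÄSA ATT AV SPÅR FINNA SÖKA UPP NER BRA FEL SÄTT SÖK
""".split())

ENGLISH_WORDS = set("""
    AND OR THE ONE YOU HE SHE THEY FOR CAN THAT THIS WHO
    BUT HOW WITH TWO OF AT READ ON WHILE REPEAT BEGIN END PROCEDURE
    TO CHAR ARE MAIN MOVE DEFINE INCLUDE INT SET CONTINUE CONNECT
    CONT CONN COMP COMPILE TRACK TAPE SAVE RUN HELP EXIT
    LET SET ERROR GOTO GOOD CANNOT FIND UP LAST LOGIN USED MEANS
""".split())

# A word is taken to be a run of letters (including Å, Ä and Ö).
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def count_common_words(text, limit=300):
    """Return (swedish_count, english_count) for the first `limit` words."""
    swedish = english = 0
    for index, match in enumerate(WORD_PATTERN.finditer(text)):
        if index >= limit:
            break
        word = match.group().upper()  # compare case-insensitively
        if word in SWEDISH_WORDS:
            swedish += 1
        elif word in ENGLISH_WORDS:
            english += 1
    return swedish, english
```

For example, `count_common_words("Jag kan läsa och du kan söka")` finds seven Swedish words and no English ones.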

By using this algorithm, KOM 2000 could easily check if a user has given the wrong language on his message, and either correct it automatically or ask the user if the language is really right.

Note 1: If this algorithm is to be used in KOM 2000, note that the special Swedish characters can occur in HTML in two forms, either as the literal character or as a character entity:

Å or &Aring;
Ä or &Auml;
Ö or &Ouml;
å or &aring;
ä or &auml;
ö or &ouml;
 

Also note that the checking should be case-insensitive: it should recognize "login", "LOGIN" and "Login" as English words, and why not also "lOgIn"? At least that is what I did in my program.
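
Both points in Note 1 can be handled by normalizing each word before it is looked up. A minimal Python sketch, assuming the standard library function `html.unescape` is used to decode the entities; the function name `normalize` is made up for this example:

```python
import html


def normalize(word):
    """Decode HTML character entities, then fold the word to upper case."""
    return html.unescape(word).upper()
```

With this, `normalize("p&aring;")` and `normalize("på")` both give "PÅ", and "login", "LOGIN" and "lOgIn" all become "LOGIN".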

Note 2: Write the program so that it can easily be extended to test for more than two languages. When testing for N languages, one should omit from the word lists any word which occurs in more than one of the languages tested.
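
The pruning described in Note 2 could be sketched in Python as follows. The function name `drop_shared_words` and the tiny word lists in the usage example are hypothetical and only serve to show the idea.

```python
from collections import Counter


def drop_shared_words(word_lists):
    """Remove every word that occurs in the list of more than one language.

    `word_lists` maps a language name to a set of upper-case words.
    """
    occurrences = Counter(
        word for words in word_lists.values() for word in words
    )
    return {
        language: {word for word in words if occurrences[word] == 1}
        for language, words in word_lists.items()
    }


# Hypothetical miniature lists: "MEN" and "DEN" are each shared by two languages.
pruned = drop_shared_words({
    "swedish": {"OCH", "ATT", "MEN", "DEN"},
    "english": {"AND", "THE", "MEN"},
    "german": {"UND", "DER", "DEN"},
})
```

After the call, "MEN" and "DEN" are gone from all three lists, so a hit on any remaining word points to exactly one language.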

I enclose the full source code of my program as an attachment.



Last modified: Tue Oct 13 05:33:33 1998

Attachment

(Note: This program also detects whether the text is stored with 8-bit or 7-bit bytes; that is why the program seems more complex than really needed.)

BEGIN OPTIONS(/l);

% (952) 86-01-08 11.18 /1 line/ Lars Enderin QZ
% Recipient: Lars Enderin QZ <10> -- Received: 86-01-08 11.18
% Recipient: Jacob Palme QZ <74> -- Received: 86-01-08 15.51
% Comment to: (Text 899) by Jacob Palme QZ <1>
% Subject: filtyp in Simula
%
% Try LEQ:FILMOD.*, test in TFILMO.SIM.
% (Text 952)
%
% (884) 86-01-07 18.16 /3 lines/ Jacob Palme QZ
% Recipient: Lars Enderin QZ <7> -- Received: 86-01-07 18.18
% Subject: Filtyp in Simula
%
% How can I, in a SIMULA program, (a) read whether the RIB says
% that the file is 7-bit ASCII or binary, (b) change this information
% in the RIB?
% (Text 884)
% (Comment in (Text 897) by Lars Enderin QZ)

EXTERNAL INTEGER PROCEDURE inbyte, filesize;
EXTERNAL TEXT PROCEDURE conc, lastmsg, frontstrip, upto, rest;
EXTERNAL TEXT PROCEDURE tagord, front, storbokstav;
EXTERNAL REF (infile) PROCEDURE findinfile;
EXTERNAL REF (outfile) PROCEDURE findoutfile;
EXTERNAL BOOLEAN PROCEDURE bitget, bokstav;
EXTERNAL CHARACTER PROCEDURE upc, findtrigger, fetchar;
EXTERNAL PROCEDURE forceout, enterdebug;

CHARACTER ARRAY charclass[0:255];

TEXT ARRAY wordsarray[65:93,1:10];
BOOLEAN ARRAY englisharray[65:93,1:15];
INTEGER ARRAY wordscount[65:93];

BOOLEAN debug;

CHARACTER tab;

INTEGER lastfilesize;

PROCEDURE store(isenglish,data);
BOOLEAN isenglish;
TEXT data;
BEGIN
    TEXT word; INTEGER firstchar;
    WHILE data.more DO BEGIN
        word:- tagord(data);
        IF word =/= NOTEXT THEN BEGIN
            ! Index each word under the character code of its first letter;
            firstchar:= rank(fetchar(word,1));
            wordscount[firstchar]:= wordscount[firstchar]+1;
            wordsarray[firstchar,wordscount[firstchar]]:- copy(word);
            englisharray[firstchar,wordscount[firstchar]]:= isenglish;
        END;
    END;
END;

PROCEDURE initialize; BEGIN
    INTEGER i;

    tab:= char(9);

    ! Store Swedish and English words;
    ! Swedish letters use the 7-bit Swedish character set: ] is Å, [ is Ä and \ is Ö;
    store(FALSE,
        "OCH ELLER DEN EN JAG DU HAN HON DE DEM "
        "F\R KAN SOM TILL MEN [R HUR MED FR]N TV] "
        "P] L[SA ATT AV SP]R FINNA S\KA UPP NER BRA FEL S[TT S\K ");
    store(TRUE,
        "AND OR THE ONE YOU HE SHE THEY FOR CAN THAT THIS WHO "
        "BUT HOW WITH TWO OF AT READ ON WHILE REPEAT BEGIN END PROCEDURE "
        "TO CHAR ARE MAIN MOVE DEFINE INCLUDE INT SET CONTINUE CONNECT "
        "CONT CONN COMP COMPILE TRACK TAPE SAVE RUN HELP EXIT "
        "LET SET ERROR GOTO GOOD CANNOT FIND UP LAST LOGIN USED MEANS ");
    ! Mark typical chars for text and binary files;
    FOR i:= 0 STEP 1 UNTIL 255 DO charclass[i]:= 'T';
    FOR i:= 0 STEP 1 UNTIL 31, 127 DO BEGIN
        charclass[i]:= 'B';
    END;
    FOR i:= 0,7,8,9,10,11,12,13, 128 STEP 1 UNTIL 159 DO BEGIN
        charclass[i]:= '?';
    END;
END of PROCEDURE initialize;
OPTIONS(/-a);
INTEGER PROCEDURE findbytesize(filename);
TEXT filename;
! Returns 7 for 7-bit file,
! 8 for 8-bit file,
! 36 for other = binary file
! 0 for unknown type
! -1 for cannot find this file;
BEGIN
    INTEGER ARRAY binarycount[7:8], textcount[7:8], unknowncount[7:8];
    INTEGER bytesize, counter, gotchar;
    INTEGER ARRAY variation[0:31,7:8];
    INTEGER ARRAY variacount[7:8];
    CHARACTER cclass;
    REF (infile) thefile;
    REAL sevenq, eightq;

    lastfilesize:= 0;
    FOR bytesize:= 7,8 DO BEGIN
        thefile:- findinfile(IF bytesize = 7
            THEN filename
            ELSE conc(filename,"/bytesize:8"));
        IF thefile == NONE THEN GOTO notfound;
        thefile.open(NOTEXT);
        IF bytesize= 7 THEN lastfilesize:= filesize(thefile,1);
        counter:= 0;
        WHILE NOT thefile.endfile AND counter < 1000 DO BEGIN
            gotchar:= inbyte(thefile);
            IF NOT thefile.endfile THEN BEGIN
                cclass:= charclass[gotchar];
                IF gotchar < 32 AND cclass = 'B'
                THEN variation[gotchar,bytesize]
                    := variation[gotchar,bytesize]+1;
                IF cclass <> '?' THEN counter:= counter+1;
                IF cclass = 'T'
                THEN textcount[bytesize]:= textcount[bytesize]+1
                ELSE IF cclass = 'B'
                THEN binarycount[bytesize]:= binarycount[bytesize]+1
                ELSE unknowncount[bytesize]:= unknowncount[bytesize]+1;
            END;
        END of WHILE NOT thefile.endfile;

        FOR gotchar:= 0 STEP 1 UNTIL 31
        DO IF variation[gotchar,bytesize]>0
        THEN variacount[bytesize]:= variacount[bytesize]+1;
        thefile.close;

    END of FOR bytesize;

    IF debug THEN BEGIN
        outtext("Variation for 7/8 bit:");
        outint(variacount[7],10); outint(variacount[8],10);
        outtext(" Sample size:");
        outint(counter,10); outimage;
    END;
    IF counter < 10 THEN findbytesize:= 0 ELSE BEGIN
        IF textcount[7]+binarycount[7] = 0
        OR textcount[8]+binarycount[8]=0 THEN findbytesize:= 0
        ELSE BEGIN
            sevenq:= binarycount[7]/(textcount[7]+binarycount[7]);
            eightq:= binarycount[8]/(textcount[8]+binarycount[8]);
            IF debug THEN BEGIN
                outtext("Percent funny for 7/8 bit:");
                outfix(sevenq,3,20); outfix(eightq,3,20); outimage;
            END;
            IF counter > 100 THEN BEGIN
                ! Safer prediction with large sample;
                IF sevenq < 0.01 AND eightq > 0.03
                THEN findbytesize:= 7
                ELSE IF eightq < 0.01 AND sevenq > 0.03
                THEN findbytesize:= 8
                ELSE IF sevenq > 0.03 AND eightq > 0.03
                AND variacount[7] > 3 AND variacount[8] > 3
                THEN findbytesize:= 36
                ELSE findbytesize:= 0;
            END ELSE BEGIN
                ! Less safe prediction with small sample;
                IF sevenq = 0.00 AND eightq > 0.03
                THEN findbytesize:= 7
                ELSE IF eightq = 0.00 AND sevenq > 0.03
                THEN findbytesize:= 8
                ELSE IF sevenq > 0.08 AND eightq > 0.08
                AND variacount[7] > 3 AND variacount[8] > 3
                THEN findbytesize:= 36
                ELSE findbytesize:= 0;
            END;
        END;
    END;

    IF FALSE THEN notfound: findbytesize:= -1;

END of PROCEDURE findbytesize;
CHARACTER PROCEDURE findlanguage(filename,bytesize);
TEXT filename; INTEGER bytesize;
! Returns
! 'E' for English,
! 'S' for Swedish,
! 'O' for other or unknown language
! 'N' if the file could not be opened;
BEGIN
    REF (infile) fil;
    TEXT wordbuf, word;
    INTEGER wordcount, englishc, swedishc, index;
    INTEGER gotbyte, firstchar, max; CHARACTER gotchar;
    REAL quotient;

    PROCEDURE testword; IF wordbuf.pos > 1 THEN BEGIN
        word:- storbokstav(front(wordbuf));
        wordcount:= wordcount+1;
        IF word.length > 1 THEN BEGIN
            ! Look the word up among the stored words with the same first letter;
            firstchar:= rank(fetchar(word,1));
            max:= wordscount[firstchar];
            FOR index:= 1 STEP 1 UNTIL max DO BEGIN
                IF word = wordsarray[firstchar,index]
                THEN BEGIN
                    IF englisharray[firstchar,index] THEN englishc:= englishc+1
                    ELSE swedishc:= swedishc+1;
                    GOTO return;
                END;
            END;
        END;
        return:
        wordbuf.setpos(1);
    END of PROCEDURE testword;

    IF bytesize = 8 THEN filename:- conc(filename,"/BYTESIZE:8");
    fil:- findinfile(filename);
    IF fil == NONE THEN findlanguage:= 'N' ELSE BEGIN
        fil.open(NOTEXT);
        wordbuf:- blanks(15);
        ! Examine at most the first 300 words;
        WHILE NOT fil.endfile AND wordcount < 300 DO BEGIN
            gotbyte:= inbyte(fil);
            IF NOT fil.endfile THEN BEGIN
                ! Strip the eighth bit so that 8-bit text is compared as 7-bit text;
                IF gotbyte > 127 THEN gotbyte:= gotbyte-128;
                gotchar:= char(gotbyte);
                IF NOT bokstav(gotchar) THEN testword ELSE BEGIN
                    IF wordbuf.more THEN wordbuf.putchar(gotchar);
                END;
            END of NOT fil.endfile;
        END of WHILE NOT fil.endfile;
        testword;
        fil.close;

        IF debug THEN BEGIN
            outtext("English words:"); outint(englishc,5);
            outtext(" Swedish words:"); outint(swedishc,5);
            outimage;
        END;
        IF swedishc = 0 AND englishc = 0 THEN findlanguage:= 'O'
        ELSE BEGIN
            quotient:= swedishc/(swedishc+englishc);
            IF debug THEN BEGIN
                outtext("Swedish coefficient:"); outfix(quotient,2,6);
                outimage;
            END;
            findlanguage:= IF swedishc > 1 AND englishc = 0 THEN 'S'
                ELSE IF swedishc = 0 AND englishc > 1 THEN 'E'
                ELSE IF swedishc = 0 AND englishc = 0 THEN 'O'
                ELSE IF quotient > 0.7 THEN 'S'
                ELSE IF quotient < 0.05 THEN 'E'
                ELSE 'O';
        END;
    END;
END of PROCEDURE findlanguage;
PROCEDURE testsinglefiles; BEGIN
    BOOLEAN empty;
    WHILE NOT sysin.endfile AND NOT empty DO BEGIN
        INTEGER bytesize; CHARACTER language; TEXT filename;

        outtext("Give file name: "); breakoutimage; inimage;
        IF NOT sysin.endfile THEN BEGIN
            filename:- copy(sysin.image.strip);
            empty:= filename == NOTEXT;
            IF NOT empty THEN BEGIN
                bytesize:= findbytesize(filename);
                IF lastfilesize > 0 THEN BEGIN
                    IF bytesize=7 THEN lastfilesize:= lastfilesize*5
                    ELSE IF bytesize=8 THEN lastfilesize:= lastfilesize*4;
                    outint(lastfilesize,7);
                    outtext(IF bytesize > 0 THEN " bytes. "
                        ELSE " words. ");
                    outimage;
                END;
                outtext(IF bytesize = -1 THEN "Cannot open the file"
                    ELSE IF bytesize = 0 THEN "Unknown byte size"
                    ELSE IF bytesize = 7 THEN "7-bit byte file"
                    ELSE IF bytesize = 8 THEN "8-bit byte file"
                    ELSE "Binary file"); outtext(". ");
                IF bytesize = -1 THEN BEGIN
                    outtext(lastmsg); outchar('.');
                END;
                outimage;
                IF bytesize = 7 OR bytesize = 8 THEN BEGIN
                    language:= findlanguage(filename,bytesize);
                    outtext("Language: ");
                    outtext(IF language = 'E' THEN "English"
                        ELSE IF language = 'S' THEN "Swedish"
                        ELSE "Unknown");
                    outchar('.'); outimage;
                END of bytesize <= 8;
            END of NOT empty;
        END of NOT sysin.endfile;
        outimage;

    END of WHILE NOT sysin.endfile;

END of PROCEDURE testsinglefiles;
PROCEDURE testlist(inf,outf);
REF (infile) inf;
REF (outfile) outf;
INSPECT inf DO BEGIN
    TEXT bef, ext, stripim, filename;
    CHARACTER trigger, language;
    INTEGER bytesize;

    inimage;
    WHILE NOT endfile DO BEGIN
        stripim:- upto(frontstrip(image),12);
        stripim.setpos(1);
        trigger:= findtrigger(stripim,". !9!");
        IF debug THEN INSPECT outf DO BEGIN
            TEXT stripped;
            stripped:- image.strip;
            IF stripped.length > outf.image.length
            THEN outf.image:- copy(stripped);
            outtext(image.strip); outimage;
        END;
        INSPECT outf DO IF trigger <> char(0) THEN BEGIN
            IF trigger = '.' THEN BEGIN
                filename:- image.strip;
                outtext(filename);
            END ELSE BEGIN
                bef:- upto(stripim,stripim.pos-1);
                ext:- upto(frontstrip(rest(stripim)),4);
                IF fetchar(ext,1) = tab THEN ext:- NOTEXT;
                filename:- conc(bef,".",ext);
                outtext(bef); outchar(tab); outtext(ext);
            END;
            outchar(tab);

            bytesize:= findbytesize(filename);
            IF lastfilesize = 0 THEN outtext(" ")
            ELSE BEGIN
                IF bytesize=7 THEN lastfilesize:= lastfilesize*5
                ELSE IF bytesize=8 THEN lastfilesize:= lastfilesize*4;
                outint(lastfilesize,7);
                outtext(IF bytesize > 0 THEN " bytes. "
                    ELSE " words. ");
            END;
            outtext(IF bytesize = -1 THEN "Cannot open the file"
                ELSE IF bytesize = 0 THEN "Unknown byte size"
                ELSE IF bytesize = 7 THEN "7-bit byte file"
                ELSE IF bytesize = 8 THEN "8-bit byte file"
                ELSE "Binary file"); outtext(". ");
            IF bytesize = -1 THEN BEGIN
                outtext(lastmsg); outchar('.');
            END;
            outchar(' ');

            IF bytesize = 7 OR bytesize = 8 THEN BEGIN
                language:= findlanguage(filename,bytesize);
                outtext("Language: ");
                outtext(IF language = 'E' THEN "English"
                    ELSE IF language = 'S' THEN "Swedish"
                    ELSE "Unknown");
                outchar('.');
            END of bytesize <= 8;
            outimage;
        END of findtrigger <> 0;
        IF debug THEN forceout(outf);
        inimage;
    END;
END;
PROCEDURE testlistoffiles; BEGIN
    TEXT infilnamn, outfilnamn;

    REF (infile) inf;
    REF (outfile) outf;

    outtext("Name of input file with list of file names: ");
    outimage;
    outtext("> "); breakoutimage; inimage;
    infilnamn:- copy(sysin.image.strip);

    outtext("Name of output file: ");
    outimage;
    outtext("> "); breakoutimage; inimage;
    outfilnamn:- copy(sysin.image.strip);

    inf:- findinfile(infilnamn);
    IF inf == NONE THEN BEGIN
        outtext("Cannot open input file. ");
        outtext(lastmsg); outchar('.');
    END;

    outf:- findoutfile(outfilnamn);
    IF outf == NONE THEN BEGIN
        outtext("Cannot open output file. ");
        outtext(lastmsg); outchar('.');
    END;

    ! Only process the list if both files were found;
    IF inf =/= NONE AND outf =/= NONE THEN BEGIN
        inf.open(blanks(132));
        outf.open(blanks(132));

        testlist(inf,outf);
        inf.close; outf.close;
    END;

END;
! Main program execution starts here;

outtext("FILTYP - Finds bytesize and language of file."); outimage;
outimage;

initialize;

WHILE NOT sysin.endfile DO BEGIN
    CHARACTER command;

    outtext("What do you want to do? (Test) single files, ");
    outtext("(Test) list of files,"); outimage;
    outtext("(Set) debug mode?"); outimage;
    outtext("FILTYP>"); breakoutimage; inimage;
    command:= upc(inchar);
    IF command = 'S' THEN testsinglefiles
    ELSE IF command = 'L' THEN testlistoffiles
    ELSE IF command = 'D' THEN BEGIN
        debug:= TRUE;
        outtext("Type ""PROCEED"" to continue execution in debug mode.");
        outimage;
        enterdebug(TRUE);
    END;
END;

END of the whole program;
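
For readers who do not know Simula, the byte classification that the program uses to tell text from binary data can be sketched in Python. This is only an illustration of the `charclass` table set up in `initialize` and of the "funny" quotient computed in `findbytesize`; the names `classify_byte` and `funny_fraction` are made up for this sketch, and it does not reproduce the comparison between the 7-bit and 8-bit readings of the same file.

```python
# Byte values that are ignored when sampling: NUL, the common layout
# controls (7 to 13) and the range 128 to 159.
IGNORED = {0, 7, 8, 9, 10, 11, 12, 13} | set(range(128, 160))


def classify_byte(value):
    """Return 'T' (text), 'B' (binary) or '?' (ignored) for a byte value."""
    if value in IGNORED:
        return "?"
    if value < 32 or value == 127:
        return "B"  # remaining control characters look like binary data
    return "T"


def funny_fraction(data, sample_size=1000):
    """Fraction of binary-looking bytes among the first counted bytes.

    Ignored bytes are not counted towards the sample. Returns None when
    no byte was counted at all.
    """
    text = binary = 0
    for value in data:
        if text + binary >= sample_size:
            break
        kind = classify_byte(value)
        if kind == "T":
            text += 1
        elif kind == "B":
            binary += 1
    if text + binary == 0:
        return None
    return binary / (text + binary)
```

A plain text such as `b"Hello, world!\r\n"` gives a fraction of 0.0, while data full of unusual control characters gives a high fraction.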