Date: Tue Oct 13 05:33:33 1998
Author: Jacob Palme
To: Discussions about KOM 2000 (26)
In-Reply-To: Re: Greeting
Language: English
Many years ago, I wrote a program to check if a message was written in Swedish or English. The same method can probably be used to test for more than just these two languages.
My algorithm was very simple. I checked the first 300 words of the message and counted the number of occurrences of a few of the most common English and Swedish words. I then computed:

quotient:= swedishcount/(swedishcount+englishcount);
findlanguage:=
   IF swedishcount > 1 AND englishcount = 0 THEN swedish
   ELSE IF swedishcount = 0 AND englishcount > 1 THEN english
   ELSE IF swedishcount = 0 AND englishcount = 0 THEN donotknow
   ELSE IF quotient > 0.7 THEN swedish
   ELSE IF quotient < 0.05 THEN english
   ELSE donotknow;
I tested my algorithm on a number of messages, and it worked perfectly. The only cases it had problems with were texts that were in neither Swedish nor English, and texts that mixed Swedish and English.
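As an illustration, the decision rule above could be written roughly as follows in Python. This is only a sketch: the function name and the string return values are chosen for the example, and the zero check is moved ahead of the division (as the attached Simula program also does) so that the quotient is never computed with a zero denominator.

```python
def find_language(swedish_count: int, english_count: int) -> str:
    """Classify a text from its counts of common Swedish and English words."""
    total = swedish_count + english_count
    if total == 0:
        # No common word of either language was found.
        return "unknown"
    if swedish_count > 1 and english_count == 0:
        return "swedish"
    if english_count > 1 and swedish_count == 0:
        return "english"
    quotient = swedish_count / total
    if quotient > 0.7:
        return "swedish"
    if quotient < 0.05:
        return "english"
    return "unknown"


print(find_language(8, 2))   # mostly Swedish hits
print(find_language(3, 3))   # evenly mixed text
```

Note that the thresholds are asymmetric as given in the message: a text with a mixture of hits is called Swedish above 0.7, but English only below 0.05.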
The lists of common Swedish and English words which I used were:
Typical Swedish words:
"OCH ELLER DEN EN JAG DU HAN HON DE DEM "
"FÖR KAN SOM TILL MEN ÄR HUR MED FRÅN TVÅ "
"PÅ LÄSA ATT AV SPÅR FINNA SÖKA UPP NER BRA FEL SÄTT SÖK "

Typical English words:
"AND OR THE ONE YOU HE SHE THEY FOR CAN THAT THIS WHO "
"BUT HOW WITH TWO OF AT READ ON WHILE REPEAT BEGIN END PROCEDURE "
"TO CHAR ARE MAIN MOVE DEFINE INCLUDE INT SET CONTINUE CONNECT "
"CONT CONN COMP COMPILE TRACK TAPE SAVE RUN HELP EXIT "
"LET SET ERROR GOTO GOOD CANNOT FIND UP LAST LOGIN USED MEANS "
These lists may seem a little funny, but I wanted the algorithm to work for texts in artificial languages too, such as programming-language text based on English, and to recognize such texts as English.
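The counting step could be sketched in Python as below. The word lists are the ones given above; the function name `count_hits`, the regular expression used to split the text into words, and the `limit` parameter are assumptions made for this example and are not taken from the Simula program.

```python
import re

SWEDISH = set(
    "OCH ELLER DEN EN JAG DU HAN HON DE DEM "
    "FÖR KAN SOM TILL MEN ÄR HUR MED FRÅN TVÅ "
    "PÅ LÄSA ATT AV SPÅR FINNA SÖKA UPP NER BRA FEL SÄTT SÖK ".split()
)
ENGLISH = set(
    "AND OR THE ONE YOU HE SHE THEY FOR CAN THAT THIS WHO "
    "BUT HOW WITH TWO OF AT READ ON WHILE REPEAT BEGIN END PROCEDURE "
    "TO CHAR ARE MAIN MOVE DEFINE INCLUDE INT SET CONTINUE CONNECT "
    "CONT CONN COMP COMPILE TRACK TAPE SAVE RUN HELP EXIT "
    "LET SET ERROR GOTO GOOD CANNOT FIND UP LAST LOGIN USED MEANS ".split()
)


def count_hits(text: str, limit: int = 300) -> tuple:
    """Return (swedish_count, english_count) for the first `limit` words."""
    # A word is taken to be a run of letters; digits and punctuation separate words.
    words = re.findall(r"[^\W\d_]+", text)[:limit]
    swedish = english = 0
    for word in words:
        upper = word.upper()  # case-insensitive comparison
        if upper in SWEDISH:
            swedish += 1
        elif upper in ENGLISH:
            english += 1
    return swedish, english


print(count_hits("Jag kan läsa och söka"))       # five Swedish hits
print(count_hits("You can read this and that"))  # six English hits
```

The two counts can then be fed to whatever decision rule is used to pick the language.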
By using this algorithm, KOM 2000 could easily check if a user has given the wrong language on his message, and either correct it automatically or ask the user if the language is really right.
Note 1: If this algorithm is to be used in KOM 2000, note that the special Swedish characters can occur in HTML in two formats, either as the character itself or as a character entity:
Å or &Aring;   Ä or &Auml;   Ö or &Ouml;
å or &aring;   ä or &auml;   ö or &ouml;
Also note that the checking should be case-insensitive: it should recognize "login", "LOGIN" and "Login" as English words, and why not also "lOgIn"? At least that is what I did in my program.
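One possible way to handle both points in Python is to decode HTML character references before splitting the text into words, and then fold everything to upper case. This is a sketch only; the function name `normalized_words` is made up for the example. It relies on the standard library function `html.unescape`, which decodes both named references such as `&ouml;` and numeric ones such as `&#246;`.

```python
import html
import re


def normalized_words(text: str) -> list:
    """Decode HTML character references, then return the words in upper case."""
    decoded = html.unescape(text)  # "&ouml;" and "&#246;" both become "ö"
    return [word.upper() for word in re.findall(r"[^\W\d_]+", decoded)]


print(normalized_words("F&ouml;r p&#229; lOgIn"))
```

Decoding has to come before the word splitting, since otherwise the `&` and `;` of a reference would cut a word such as "för" into pieces.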
Note 2: Write the program so that it can easily be extended to test for more than two languages. When testing for N languages, one should omit from the word lists any word which occurs in more than one of the languages tested.
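The filtering described in Note 2 might look like this in Python. The function name and the three tiny word lists in the usage example are hypothetical and only serve to show the effect; real lists would be much longer.

```python
from collections import Counter


def exclusive_word_lists(word_lists: dict) -> dict:
    """Drop every word that occurs in the list of more than one language.

    `word_lists` maps a language name to an iterable of words.
    """
    as_sets = {lang: {w.upper() for w in words} for lang, words in word_lists.items()}
    # Count in how many languages each word appears.
    seen = Counter(word for words in as_sets.values() for word in words)
    return {
        lang: {word for word in words if seen[word] == 1}
        for lang, words in as_sets.items()
    }


# Hypothetical mini-lists: "MEN" is shared by all three, "HAN" by two.
lists = {
    "swedish": ["OCH", "MEN", "HAN"],
    "english": ["AND", "MEN", "HE"],
    "danish": ["OG", "MEN", "HAN"],
}
print(exclusive_word_lists(lists))
```

With the lists stored per language in a dictionary, adding a language only means adding an entry, without touching the counting code.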
I enclose the full source code of my program as an attachment.
(Note: This program also detects whether the text is stored with 8-bit or 7-bit bytes, which is why the program seems more complex than really needed.)
BEGIN OPTIONS(/l);

% (952) 86-01-08 11.18 /1 line/ Lars Enderin QZ
% Recipient: Lars Enderin QZ <10> -- Received: 86-01-08 11.18
% Recipient: Jacob Palme QZ <74> -- Received: 86-01-08 15.51
% Comment on: (Text 899) by Jacob Palme QZ <1>
% Subject: file type in Simula
%
% Try LEQ:FILMOD.*, test in TFILMO.SIM.
% (TEXT 952)
%
% (884) 86-01-07 18.16 /3 lines/ Jacob Palme QZ
% Recipient: Lars Enderin QZ <7> -- Received: 86-01-07 18.18
% Subject: File type in Simula
%
% How can I, in a SIMULA program, (a) read whether the RIB says
% that the file is 7-bit ASCII or binary, (b) change this
% information in the RIB.
% (Text 884)
% (Comment in (Text 897) by Lars Enderin QZ)

! Swedish identifiers: tagord = "take word", storbokstav = "capital letter", bokstav = "letter";
EXTERNAL INTEGER PROCEDURE inbyte, filesize;
EXTERNAL TEXT PROCEDURE conc, lastmsg, frontstrip, upto, rest;
EXTERNAL TEXT PROCEDURE tagord, front, storbokstav;
EXTERNAL REF (infile) PROCEDURE findinfile;
EXTERNAL REF (outfile) PROCEDURE findoutfile;
EXTERNAL BOOLEAN PROCEDURE bitget, bokstav;
EXTERNAL CHARACTER PROCEDURE upc, findtrigger, fetchar;
EXTERNAL PROCEDURE forceout, enterdebug;

CHARACTER ARRAY charclass[0:255];

TEXT ARRAY wordsarray[65:93,1:10];
BOOLEAN ARRAY englisharray[65:93,1:15];
INTEGER ARRAY wordscount[65:93];

BOOLEAN debug;

CHARACTER tab;

INTEGER lastfilesize;

PROCEDURE store(isenglish,data);
BOOLEAN isenglish;
TEXT data;
BEGIN
  TEXT word; INTEGER firstchar;
  WHILE data.more DO BEGIN
    word:- tagord(data);
    IF word =/= NOTEXT THEN BEGIN
      firstchar:= rank(fetchar(word,1));
      wordscount[firstchar]:= wordscount[firstchar]+1;
      wordsarray[firstchar,wordscount[firstchar]]:- copy(word);
      englisharray[firstchar,wordscount[firstchar]]:= isenglish;
    END;
  END;
END;

PROCEDURE initialize; BEGIN
  INTEGER i;

  tab:= char(9);

  ! Store Swedish and English words. In the 7-bit Swedish character set,
    [ \ ] stand for the letters Ä Ö Å (codes 91 to 93);
  store(FALSE,
  "OCH ELLER DEN EN JAG DU HAN HON DE DEM "
  "F\R KAN SOM TILL MEN [R HUR MED FR]N TV] "
  "P] L[SA ATT AV SP]R FINNA S\KA UPP NER BRA FEL S[TT S\K ");
  store(TRUE,
  "AND OR THE ONE YOU HE SHE THEY FOR CAN THAT THIS WHO "
  "BUT HOW WITH TWO OF AT READ ON WHILE REPEAT BEGIN END PROCEDURE "
  "TO CHAR ARE MAIN MOVE DEFINE INCLUDE INT SET CONTINUE CONNECT "
  "CONT CONN COMP COMPILE TRACK TAPE SAVE RUN HELP EXIT "
  "LET SET ERROR GOTO GOOD CANNOT FIND UP LAST LOGIN USED MEANS ");
  ! Mark typical chars for text and binary files;
  FOR i:= 0 STEP 1 UNTIL 255 DO charclass[i]:= 'T';
  FOR i:= 0 STEP 1 UNTIL 31, 127 DO BEGIN
    charclass[i]:= 'B';
  END;
  FOR i:= 0,7,8,9,10,11,12,13, 128 STEP 1 UNTIL 159 DO BEGIN
    charclass[i]:= '?';
  END;
END of PROCEDURE initialize;
OPTIONS(/-a);
INTEGER PROCEDURE findbytesize(filename);
TEXT filename;
! Returns 7 for 7-bit file,
! 8 for 8-bit file,
! 36 for other = binary file
! 0 for unknown type
! -1 for cannot find this file;
BEGIN
  INTEGER ARRAY binarycount[7:8], textcount[7:8], unknowncount[7:8];
  INTEGER bytesize, counter, gotchar;
  INTEGER ARRAY variation[0:31,7:8];
  INTEGER ARRAY variacount[7:8];
  CHARACTER cclass;
  REF (infile) thefile;
  REAL sevenq, eightq;

  lastfilesize:= 0;
  FOR bytesize:= 7,8 DO BEGIN
    thefile:- findinfile(IF bytesize = 7
    THEN filename
    ELSE conc(filename,"/bytesize:8"));
    IF thefile == NONE THEN GOTO notfound;
    thefile.open(NOTEXT);
    IF bytesize= 7 THEN lastfilesize:= filesize(thefile,1);
    counter:= 0;
    WHILE NOT thefile.endfile AND counter < 1000 DO BEGIN
      gotchar:= inbyte(thefile);
      IF NOT thefile.endfile THEN BEGIN
        cclass:= charclass[gotchar];
        IF gotchar < 32 AND cclass = 'B'
        THEN variation[gotchar,bytesize]
          := variation[gotchar,bytesize]+1;
        IF cclass <> '?' THEN counter:= counter+1;
        IF cclass = 'T'
        THEN textcount[bytesize]:= textcount[bytesize]+1
        ELSE IF cclass = 'B'
        THEN binarycount[bytesize]:= binarycount[bytesize]+1
        ELSE unknowncount[bytesize]:= unknowncount[bytesize]+1;
      END;
    END of WHILE NOT thefile.endfile;

    FOR gotchar:= 0 STEP 1 UNTIL 31
    DO IF variation[gotchar,bytesize]>0
    THEN variacount[bytesize]:= variacount[bytesize]+1;
    thefile.close;

  END of FOR bytesize;

  IF debug THEN BEGIN
    outtext("Variation for 7/8 bit:");
    outint(variacount[7],10); outint(variacount[8],10);
    outtext(" Sample size:");
    outint(counter,10); outimage;
  END;
  IF counter < 10 THEN findbytesize:= 0 ELSE BEGIN
    IF textcount[7]+binarycount[7] = 0
    OR textcount[8]+binarycount[8]=0 THEN findbytesize:= 0
    ELSE BEGIN
      sevenq:= binarycount[7]/(textcount[7]+binarycount[7]);
      eightq:= binarycount[8]/(textcount[8]+binarycount[8]);
      IF debug THEN BEGIN
        outtext("Percent funny for 7/8 bit:");
        outfix(sevenq,3,20); outfix(eightq,3,20); outimage;
      END;
      IF counter > 100 THEN BEGIN
        ! Safer prediction with large sample;
        IF sevenq < 0.01 AND eightq > 0.03
        THEN findbytesize:= 7
        ELSE IF eightq < 0.01 AND sevenq > 0.03
        THEN findbytesize:= 8
        ELSE IF sevenq > 0.03 AND eightq > 0.03
        AND variacount[7] > 3 AND variacount[8] > 3
        THEN findbytesize:= 36
        ELSE findbytesize:= 0;
      END ELSE BEGIN
        ! Less safe prediction with small sample;
        IF sevenq = 0.00 AND eightq > 0.03
        THEN findbytesize:= 7
        ELSE IF eightq = 0.00 AND sevenq > 0.03
        THEN findbytesize:= 8
        ELSE IF sevenq > 0.08 AND eightq > 0.08
        AND variacount[7] > 3 AND variacount[8] > 3
        THEN findbytesize:= 36
        ELSE findbytesize:= 0;
      END;
    END;
  END;

  IF FALSE THEN notfound: findbytesize:= -1;

END of PROCEDURE findbytesize;
CHARACTER PROCEDURE findlanguage(filename,bytesize);
TEXT filename; INTEGER bytesize;
! Returns
! 'E' for English,
! 'S' for Swedish,
! 'O' for other or unknown language
! 'N' if the file could not be opened;
BEGIN
  REF (infile) fil;
  TEXT wordbuf, word;
  INTEGER wordcount, englishc, swedishc, index;
  INTEGER gotbyte, firstchar, max; CHARACTER gotchar;
  REAL quotient;

  PROCEDURE testword; IF wordbuf.pos > 1 THEN BEGIN
    word:- storbokstav(front(wordbuf));
    wordcount:= wordcount+1;
    IF word.length > 1 THEN BEGIN
      firstchar:= rank(fetchar(word,1));
      max:= wordscount[firstchar];
      FOR index:= 1 STEP 1 UNTIL max DO BEGIN
        IF word = wordsarray[firstchar,index]
        THEN BEGIN
          IF englisharray[firstchar,index] THEN englishc:= englishc+1
          ELSE swedishc:= swedishc+1;
          GOTO return;
        END;
      END;
    END;
    return:
    wordbuf.setpos(1);
  END of PROCEDURE testword;

  IF bytesize = 8 THEN filename:- conc(filename,"/BYTESIZE:8");
  fil:- findinfile(filename);
  IF fil == NONE THEN findlanguage:= 'N' ELSE BEGIN
    fil.open(NOTEXT);
    wordbuf:- blanks(15);
    WHILE NOT fil.endfile AND wordcount < 300 DO BEGIN
      gotbyte:= inbyte(fil);
      IF NOT fil.endfile THEN BEGIN
        IF gotbyte > 127 THEN gotbyte:= gotbyte-128;
        gotchar:= char(gotbyte);
        IF NOT bokstav(gotchar) THEN testword ELSE BEGIN
          IF wordbuf.more THEN wordbuf.putchar(gotchar);
        END;
      END of NOT fil.endfile;
    END of WHILE NOT fil.endfile;
    testword;
    fil.close;

    IF debug THEN BEGIN
      outtext("English words:"); outint(englishc,5);
      outtext(" Swedish words:"); outint(swedishc,5);
      outimage;
    END;
    IF swedishc = 0 AND englishc = 0 THEN findlanguage:= 'O'
    ELSE BEGIN
      quotient:= swedishc/(swedishc+englishc);
      IF debug THEN BEGIN
        outtext("Swedish coefficient:"); outfix(quotient,2,6);
        outimage;
      END;
      findlanguage:= IF swedishc > 1 AND englishc = 0 THEN 'S'
      ELSE IF swedishc = 0 AND englishc > 1 THEN 'E'
      ELSE IF swedishc = 0 AND englishc = 0 THEN 'O'
      ELSE IF quotient > 0.7 THEN 'S'
      ELSE IF quotient < 0.05 THEN 'E'
      ELSE 'O';
    END;
  END;
END of PROCEDURE findlanguage;
PROCEDURE testsinglefiles; BEGIN
  BOOLEAN empty;
  WHILE NOT sysin.endfile AND NOT empty DO BEGIN
    INTEGER bytesize; CHARACTER language; TEXT filename;

    outtext("Give file name: "); breakoutimage; inimage;
    IF NOT sysin.endfile THEN BEGIN
      filename:- copy(sysin.image.strip);
      empty:= filename == NOTEXT;
      IF NOT empty THEN BEGIN
        bytesize:= findbytesize(filename);
        IF lastfilesize > 0 THEN BEGIN
          IF bytesize=7 THEN lastfilesize:= lastfilesize*5
          ELSE IF bytesize=8 THEN lastfilesize:= lastfilesize*4;
          outint(lastfilesize,7);
          outtext(IF bytesize > 0 THEN " bytes. "
          ELSE " words. ");
          outimage;
        END;
        outtext(IF bytesize = -1 THEN "Cannot open the file"
        ELSE IF bytesize = 0 THEN "Unknown byte size"
        ELSE IF bytesize = 7 THEN "7-bit byte file"
        ELSE IF bytesize = 8 THEN "8-bit byte file"
        ELSE "Binary file"); outtext(". ");
        IF bytesize = -1 THEN BEGIN
          outtext(lastmsg); outchar('.');
        END;
        outimage;
        IF bytesize = 7 OR bytesize = 8 THEN BEGIN
          language:= findlanguage(filename,bytesize);
          outtext("Language: ");
          outtext(IF language = 'E' THEN "English"
          ELSE IF language = 'S' THEN "Swedish"
          ELSE "Unknown");
          outchar('.'); outimage;
        END of bytesize <= 8;
      END of NOT empty;
    END of NOT sysin.endfile;
    outimage;

  END of WHILE NOT sysin.endfile;

END of PROCEDURE testsinglefiles;
PROCEDURE testlist(inf,outf);
REF (infile) inf;
REF (outfile) outf;
INSPECT inf DO BEGIN
  TEXT bef, ext, stripim, filename;
  CHARACTER trigger, language;
  INTEGER bytesize;

  inimage;
  WHILE NOT endfile DO BEGIN
    stripim:- upto(frontstrip(image),12);
    stripim.setpos(1);
    trigger:= findtrigger(stripim,". !9!");
    IF debug THEN INSPECT outf DO BEGIN
      TEXT stripped;
      stripped:- image.strip;
      IF stripped.length > outf.image.length
      THEN outf.image:- copy(stripped);
      outtext(image.strip); outimage;
    END;
    INSPECT outf DO IF trigger <> char(0) THEN BEGIN
      IF trigger = '.' THEN BEGIN
        filename:- image.strip;
        outtext(filename);
      END ELSE BEGIN
        bef:- upto(stripim,stripim.pos-1);
        ext:- upto(frontstrip(rest(stripim)),4);
        IF fetchar(ext,1) = tab THEN ext:- NOTEXT;
        filename:- conc(bef,".",ext);
        outtext(bef); outchar(tab); outtext(ext);
      END;
      outchar(tab);

      bytesize:= findbytesize(filename);
      IF lastfilesize = 0 THEN outtext(" ")
      ELSE BEGIN
        IF bytesize=7 THEN lastfilesize:= lastfilesize*5
        ELSE IF bytesize=8 THEN lastfilesize:= lastfilesize*4;
        outint(lastfilesize,7);
        outtext(IF bytesize > 0 THEN " bytes. "
        ELSE " words. ");
      END;
      outtext(IF bytesize = -1 THEN "Cannot open the file"
      ELSE IF bytesize = 0 THEN "Unknown byte size"
      ELSE IF bytesize = 7 THEN "7-bit byte file"
      ELSE IF bytesize = 8 THEN "8-bit byte file"
      ELSE "Binary file"); outtext(". ");
      IF bytesize = -1 THEN BEGIN
        outtext(lastmsg); outchar('.');
      END;
      outchar(' ');

      IF bytesize = 7 OR bytesize = 8 THEN BEGIN
        language:= findlanguage(filename,bytesize);
        outtext("Language: ");
        outtext(IF language = 'E' THEN "English"
        ELSE IF language = 'S' THEN "Swedish"
        ELSE "Unknown");
        outchar('.');
      END of bytesize <= 8;
      outimage;
    END of findtrigger <> 0;
    IF debug THEN forceout(outf);
    inimage;
  END;
END;
PROCEDURE testlistoffiles; BEGIN
  TEXT infilnamn, outfilnamn;

  REF (infile) inf;
  REF (outfile) outf;

  outtext("Name of input file with list of file names: ");
  outimage;
  outtext("> "); breakoutimage; inimage;
  infilnamn:- copy(sysin.image.strip);

  outtext("Name of output file: ");
  outimage;
  outtext("> "); breakoutimage; inimage;
  outfilnamn:- copy(sysin.image.strip);

  inf:- findinfile(infilnamn);
  IF inf == NONE THEN BEGIN
    outtext("Cannot open input file. ");
    outtext(lastmsg); outchar('.');
  END;

  outf:- findoutfile(outfilnamn);
  IF outf == NONE THEN BEGIN
    outtext("Cannot open output file. ");
    outtext(lastmsg); outchar('.');
  END;

  inf.open(blanks(132));
  outf.open(blanks(132));

  testlist(inf,outf);
  inf.close; outf.close;

END;
! Main program execution starts here;

outtext("FILTYP - Finds bytesize and language of file."); outimage;
outimage;

initialize;

WHILE NOT sysin.endfile DO BEGIN
  CHARACTER command;

  outtext("What do you want to do? (Test) single files, ");
  outtext("(Test) list of files,"); outimage;
  outtext("(Set) debug mode?"); outimage;
  outtext("FILTYP>"); breakoutimage; inimage;
  command:= upc(inchar);
  IF command = 'S' THEN testsinglefiles
  ELSE IF command = 'L' THEN testlistoffiles
  ELSE IF command = 'D' THEN BEGIN
    debug:= TRUE;
    outtext("Type ""PROCEED"" to continue execution in debug mode.");
    outimage;
    enterdebug(TRUE);
  END;
END;

END of the whole program;
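
For readers who do not know Simula, the character classification that the byte-size test in the program rests on can be sketched in Python as follows. This is a simplified illustration, not a port: it covers only the "fraction of funny characters" computed for one reading of a file, while the program computes this fraction twice (reading the same file as 7-bit and as 8-bit bytes) and compares the two. The function names are chosen for the example.

```python
def char_class(byte: int) -> str:
    """Classify a byte as 'T' (text), 'B' (binary) or '?' (says nothing)."""
    # NUL, BEL..CR and 128..159 are ignored, as in procedure initialize.
    if byte in (0, 7, 8, 9, 10, 11, 12, 13) or 128 <= byte <= 159:
        return "?"
    # The remaining control characters and DEL are typical of binary data.
    if byte < 32 or byte == 127:
        return "B"
    return "T"


def funny_fraction(data: bytes, sample: int = 1000):
    """Return binary/(text+binary) over at most `sample` classified bytes.

    Returns None when no byte could be classified as text or binary.
    """
    text = binary = 0
    for byte in data:
        if text + binary >= sample:
            break
        cls = char_class(byte)
        if cls == "T":
            text += 1
        elif cls == "B":
            binary += 1
    if text + binary == 0:
        return None
    return binary / (text + binary)


print(funny_fraction(b"Hello, world\n"))      # plain text: no funny characters
print(funny_fraction(bytes([1, 2, 65, 66])))  # half of the bytes are control codes
```

In the program, a reading with a fraction below 0.01 while the other reading is above 0.03 is taken as the correct byte size when more than 100 characters were sampled.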