ICT/KTH 05-sep-2010/FK id1006 Java Programming Assignment 1 - Parser and SimpleMetric It is recommended that you submit this work no later than Tuesday, 12 October 2010. Solution examples will be presented on 13 October. This assignment is part of the individual student examination on the course id1006 Java Programming. The assignment is to be done by an individual student. When the assignment has been approved, it corresponds to 1/5th credit of the 7.5 credits (hp) given for the completed course. HOW TO SUBMIT THE COMPLETED ASSIGNMENT The completed work is delivered electronically in the form of an email addressed to java.assignments@fc.dsv.su.se In an emergency, send the email to fki@kth.se. Send one email per assignment. Since there are three programming tasks and one essay in this course, a student is expected to submit a total of four emails. The subject of the email must contain the text 'id1006 Assignment xxx' or 'id1006 Essay' etc. The body of the email must contain the submitting student's name and optional civic registration number (Sv. 'personnummer'). The sender id on the email is NOT sufficient identification. The body of the email must also contain all additional information needed to identify the submitted work and the context in which it is being submitted. For example, a re-submission. Submitted files (e.g. program sources) should be adjoined to the email as one or more attachments. SOURCE CODE is to be submitted as PLAIN TEXT, ie files that can be compiled by the Java standard development kit (javac). All files necessary to build the program or programs must be submitted together. If you send an archive, let it be ZIP. Typeset documents (e.g. Ms Word, Open Office, LaTex etc) MUST be submitted in PDF, the Portable Document Format by Adobe. This is currently the optimal way to guarantee cross-platform readability of electronic documents. Submitted work is expected to be carefully prepared, annotated, commented and above all original. Where it is not, quotes, citations, and references are to be CLEARLY indicated. Images, graphics and other multimedia products can only be incorporated into the submitted work with the permission of the copyright holder, and the permission must be expressed in the submitted work. Submitted work will be tested for originality. --- The three programming assignments for the fall 2010 instance of the course are all related. Together, they create a simple and extensible application for estimating the readability of english text. A readability index (of which there are several) is language dependent, statistical and computable. They are usually constructed by computing a ratio between the average number of words per sentence, and the proportion of complicated words to simple words. As a result, they usually return a single figure, like 30, or 14.2. There is for example the LIX (Sv. "läsbarhetsindex", Eng. "readability index") which is computed thus: O L * 100 lix = - + ------- P O where O = the number of words L = the number of long words (longer than six characters) P = the number of sentences in the text. LIX is primarily for swedish texts, but can of course be computed on english too. As can be seen, it is the sum of two values: the average number of words per sentence, and the percentage of long words in the text. This means that long sentences and long words will give a higher value, indicating a text that is harder to read. Another famous readability metric for english is the Flesch index: O Y Flesch = 206.876 - (1.015 * -) - (84.6 * -) P O where O and P are as for Lix, while Y = the number of syllables in the text. This is a weighted sum of (again) the average number of words per sentence, and the average number of syllables per word. Since more syllables requires more letters in a word, it is quite similar to the Lix metric. The major difference, however, is that Flesch is continuous over word length, while Lix uses a threshold for long words at six characters. In order to complete the three assignments, you are given certain Java classes in source code form. You will also create other Java classes to specification, like a parser, readability meters, a syllable counter and main programs. Whenever possible, you should strive to reuse your classes and code. For example, the parser written for assignment 1 is to be reused without copying or modification in assignments 2 and 3. For the readability meters and the main programs, it is better to copy the source code to a new class name and modify it. With the first assignment you receive five source files. These are to be used in all assignments (except for ParserTest which is the main program for the first assignment): Token.java Superclass for WorkToken and PunctToken. WordToken.java Subclass of Token, represents a word. PunctToken.java Subclass of Token, represents punctuation. TextMeter.java The interface definition implemented by metric modules. ParserTest.java A example main program. Assignment 1 - Parser.java and SimpleMeter.java The first assignment is to write a parser for english text (like the one you are reading now). The parser only needs to recognise two kinds of structure: words and punctuation. The readability of the text is measured by a plug-in given to the parser when it is created. This plug-in is a class that implements the interface TextMeter (given as source-code). The parser informs the TextMeter what tokens it has parsed from the text, and the TextMeter measures and remembers the statistics it needs. The communication between the parser and the TextMeter works like this: for each word and punctuation found, an instance of class WordToken or PunctToken is created, and passed on to the TextMeter. The TextMeter inspects each token and extracts the information it needs from it. For the parser, you should name the source file Parser.java and the skeleton for the class is: public class Parser { public Parser (TextMeter tm) { // constructor with argument ... } public TextMeter getMeter () { ... } public void parse (String s) { ... } } The constructor takes an argument, which is a reference to a TextMeter implementation. All the parser needs to know about it, is that it implements the TextMeter interface. The parser should use a local variable to remember the reference, so that it can call it later when method parse(String) is called. The method TextMeter getMeter() should return the reference that was given in the constructor. This is because the main program that calls the parser should be able to get the reference back without having to remember it itself. The method void parse (String) is called with a string to be parsed. The parser should scan the string, generate instances of WordToken or PunctToken, and send these to the TextMeter, using TextMeter.add (tkn). For example, if the local reference to the TextMeter is a variable called 'meter': WordToken wtk; PunctToken ptk; ... meter.add (wtk); ... meter.add (ptk); Simple parsers can be defined and described by a Deterministic Finite State Automaton (DFSA). This is a simple state machine and for the parser needed here only four states are required: start, space, word, and punct. The following table describes the whole automata, including the next character and the action to take: State Next character Actions Next state ============================================================= 0:start space - start ------------------------------------------------------------- 0:start punctuation - start ------------------------------------------------------------- 0:start letter start a WordToken word ------------------------------------------------------------- 1:word letter add char to token word ------------------------------------------------------------- 1:word punctuation send token to meter punct start a PunctToken ------------------------------------------------------------- 1:word space send token to meter space ------------------------------------------------------------- 2:space space - space ------------------------------------------------------------- 2:space letter start a WordToken word ------------------------------------------------------------- 2:space punctuation start a PunctToken punct ------------------------------------------------------------- 3:punct punctuation add char to token punct ------------------------------------------------------------- 3:punct letter send token to meter word start a WordToken ------------------------------------------------------------- 3:punct space send token to meter space ------------------------------------------------------------- A traditional and efficient way to iterate over all characters in a string is this: String s = ...; for (int i = 0; i < s.length (); i++) { char c = s.charAt(i); ... } However, while it is NOT possible to do: for (char c : s) { ... } it IS possible to write: for (char c : s.toCharArray ()) { ... } BUT! this will require the creation of an array object each time the loop is executed. If this is done rarely, that is no problem. But if it is done for every word in a large text, then the memory requirements per word will more than double, since there will be both the string for the word and the array object holding a copy of each character in the string. A word in the parser is defined as a string of letters, as indicated by public static boolean Character.isLetter(c) which is a method in the java.lang package. The java.lang package is always available and does not need to be imported. Space between words and punctuation is defined by public static boolean Character.isSpaceChar(c) However, for punctuation there is no predicate defined, you must write one yourself. Punctuation is defined any of the characters from the set ".!?:;" (period, exclamation mark, question mark, colon, and semicolon). Such a predicate can be written like: protected boolean isPunctuation (char c) { if (c == '.') { return true; } else if (c == '!') { return true; } ... else { return false; } } However, that requires several lines of code. A shorter version would be to use a boolean expression directly in the return statement: protected boolean isPunctuation (char c) { return (c == '.') || (c == '!') || ... || (c == ';'); } In fact, there is an even shorter way of doing it, by clever use of one of the methods in java.lang.String. Finding it out is left as an excercise. Finally, any character which is not a letter, a space, or punctuation, is regarded as invisible and skipped over. Important note: when you reach the end of the string being parsed, make sure that any token you have been building also is sent to the TextMeter. Otherwise you will not get correct results. The SimpleMeter In order to test the parser, you will need to write an implementation of a TextMeter, called SimpleMeter. The SimpleMeter counts two things, the number of words, and the number of sentences (the number of punctuations). The file should be named SimpleMeter.java and the skeleton looks like: public class SimpleMeter implements TextMeter { ... } // SimpleMeter Since the SimpleMeter class implements the TextMeter interface, all methods in that interface must be given implementations. For example: int nofWords = 0; ... public void add (WordToken wt) { // Increment the count of words. } ... The two methods TextMeter.add(WordToken) and TextMeter.add(PunctToken) are used to put information into the meter. The other two methods, TextMeter.properties() and TextMeter.get(String) requires some explanation. Since readability metrics can differ a lot, and we want to be as general as possible, we let the TextMeter implementation return several values, expressed as a map from keys to values. That is why the return value of properties has that type: import java.util.Map; ... public Map properties() { ... } The Map type is an interface defined in the standard library package java.util. Using the interface as a return type, allows the implementation to choose any class which implements that interface to hold the map. A Map, then, is a lookup table where values are stored under keys. The general definition of a Map is Map meaning that the key (K) is the first type and the value (V) is the second type. When we declare that something is a Map, we can choose what the types should be for the key and the value. The most general type of Map would be: Map Because all Java objects automatically extends class Object, this declaration says that any type can be a key, and any type can be a value. However, that is rarely useful. It is much more common that we want to restrict the allowed types to something specific, so that we can be sure of what types are expected. This is especially important when retrieving values from the map. So, when we declare a type to be Map we are telling the Java compiler that instances of this type may only have strings as keys, and strings as values. The compiler will then check our code to make sure we do not try to use something else (although null is usually allowed as a special kind of value). Having the Map type, we also need an implementation for it, something for the SimpleMeter to use to actually store the data. In java.util there is such an implementation: HashTable As can be seen, it also can be restricted in terms of which types it accepts (again, it is the Java compiler which enforces this), so a suitable declaration in class SimpleMeter would be: import java.util.Map; import java.util.HashTable; ... protected Map stats = new HashTable(); The import clauses at the beginning of the source file are needed so that the compiler understands which Map and HashTable we want. Then we declare a variable called stats (statistics), saying it is protected (it can only be accessed from within class SimpleMeter), and that is has the type Map. In the same line, we also initialize it, by creating an instance of a HashTable, restricted in type to match the type of the variable. Since the HashTable class implements the Map interface, it IS a Map, and this is exactly what we want. It is now quite easy to implement the TextMeter.properties() method, and it simply returns the value of the stats variable: public Map properties () { // Load the stats table with its properties ... return stats; } However, before we return the property map (stats), we need to put some properties into it. The Map interface contains a suitable method: Map.put(K key, V value) so in our case all we need to do is to decide on names for the properties and put them in. For example: ... stats.put ("words", /* the count of words */); // More properties... ... The Map.put() method will overwrite any previous value put in using the same key, so there is no need to remove any old values. The expected property names for the SimpleMeter are "words" and "sentences". As for the values, you must make sure the expression you put in evaluates to a String type, or the compiler will complain if it cannot resolve the conversion automatically. For all the primitive data types, like int, float, double, char etc, their wrapper classes in package java.lang (Integer, Float, Double, etc) have static toString() methods that will do this. For example, the statement: String s = Integer.toString (4711); would give s the value "4711", which is a string. (Note: the wrapper classes exist in order to be able to treat the primitive data types as objects. While you can put a String in a Map, you cannot put an int in there. It is a primitive data type, and not an object, and collections like maps, sets and lists can only contain objects. For that reason instances of e.g. Integer can be created to 'wrap' around the primitive data type value. In addition to that, the wrapper classes also contain useful functions for converting the values to and from strings of text.) There are some more things to consider here. Since we do not know when someone will call the properties() method, we must make sure that the stats map is updated before we return it. Likewise, we also have the String TextMeter.get(String key) method to implement, and it should of course return exactly the same up-to-date value as can be found in the properties. There are several ways of doing so: one way is to have a separate method which stuffs the current properties into the stats Map, and both properties() and get(String key) calls it. Another is to load the map only in the implementation of the properties() method, and then implement get(String key) by calling properties() first. The main program - ParserTest. The source code for this provided. It is really nothing complicated; a new parser is created and given a new SimpleMeter instance. The strings taken from the commandline (the argument to main, an array of String) are each sent to the parser. Then, finally, the meter and the property map are retrieved from the parser and all the properties printed. That last operation is rather involved, so it will be explained in detail. This is what the code looks like at the end of the main() method: for (Map.Entry M : parser.getMeter ().properties ().entrySet ()) { System.out.printf ("%10s : %8s%n", M.getKey (), M.getValue ()); } First of all there is a for-loop, using the rather recent (Java 5) foreach construct: for ( type variable : collection-or-array) { ... } which iterates over all elements in the collection or array (collections are found int the java.util package; Map and HashTable are examples of collections). So the type in the statement above is Map.Entry the variable is M and the collection-or-array is a series of method calls that in the end returns a collection. Broken down the method calls look like this: parser is the reference to the parser instance. .getMeter () returns the reference to the SimpleMeter given to the parser when it was constructed. .properties () returns the Map from the SimpleMeter .entrySet () is a method in the Map interface, and it returns a set of Entry objects. Since a Set is a collection (interface java.util.Set) this works well with the foreach loop. The general declaration of the value returned by Map.entrySet() is Map.Entry but since we know that the property map contains keys and values that are all strings, we can specialize this type in the declaration of the iteration variable M, which is why it is declared Map.Entry Inside the for-loop, is a call to System.out.printf(String, expression ...) There are other ways of printing to the standard output, like System.out.print( ) System.out.println( ) but the printf method offers formatted printing. The first argument to printf is a formatting string, similar in spirit to the printf function in the C language. The printf method is documented in the java.io.PrintStream class and the complete syntax for the formatting string can be found in java.util.Formatter. The formatting string used here looks like this: "%10s : %8s%n" Each percent sign starts a formatting field. After the field comes the width of the field and a conversion indicator, in this case 's', for string. The %n conversion generates a newline at the end of the formatted string. After the formatting string follows two arguments, since we have specified two string conversions in the formatting string. These reference the iteration variable M (which will be referring to the current Map.EntrySet instance in the foreach loop), and from that the key and the value or that pair is retrieved and can be formatted and printed. Final important points Do not use packages! We do not need them. Put all your files for assignment 1 in the same folder. When you do assignment 2 and 3, put them in their own folders and only with the new files you get or create. Set the CLASSPATH environment variable so that the javac and java commands can find your classes. For example, if your assignment folders are named a1, a2 and a3, then in a Windows environment: SET CLASSPATH=.;..\a1;..\a2;..\a3 The first dot includes the current directory, which probably is one of the three folders. The .. syntax means the parent directory to the current directory, and then down again into a1, a2 or a3. If you do this you will be able to compile and run assignemnts 2 and 3 without having to copy the TextMeter, Parser, and token classes into each folder. Example output: >java ParserTest "A white fox smiled. Why? Because the hens were out..." words : 10 sentences : 3 > >java ParserTest "1.2.3.4.5." words : 0 sentences : 0 > Grades for this assignment: For the E, D and C grades the general grading criteria apply. For the C grade it is important that you use the proper javadoc syntax in the source code you write. You must document the class, using the @author tag. You must document every method, explaining what it is for. If the method takes arguments, use the @param tag. If the method returns values, use the @return tag. If you have programmed the method to throw exceptions, use the @throws tag. Remember that a javadoc comment starts with /** and ends with */, and it goes immediately before that which is commented. For the B and A grades, your solution fulfills the requirement for a C, and you modify the property listing in ParserTest.java so that the properties are printed sorted alphabetically by the key string. B - an unambiguous algorithmic outline A - you code it, document it, and it works. -fk