CS2040S: Data Structures and Algorithms

Problem Set 6

Collaboration Policy. You are encouraged to work with other students on solving these problems. However, you must write up your solution by yourself. We may, randomly, ask you

questions about your solution, and if you cannot answer them, we will assume you have cheated.

In addition, when you write up your solution, you must list the names of every collaborator, that

is, every other person that you talked to about the problem (even if you only discussed it briefly).

You can do so by leaving a comment at the start of your .java file. Any deviation from this policy

will be considered cheating, and will be punished severely, including referral to the NUS Board of

Discipline.

1

Problem 6. (Automatic Writing)

Writing-intensive modules can be hard: so many 10-page essays, and not nearly enough time to

catch up on the latest e-lectures (please watch them!). CS2040S is here to help. For this problem

set, you will develop an automatic writing program that can easily produce pages and pages of new

text. And it will adapt to your chosen style. If you use an old essay as input, your new essay will

sound just like it was written by you! If you use Shakespeare as input, your new essay will sound

as if it was written by the Bard himself.

The basic idea is to take an input text and calculate its statistical properties. For example, given

a specific string “prope”, what is the probability that the next letter is an ‘r’? What is the

probability that the next letter is a ‘q’? Your program will take a text as input, calculate this

statistical information, and then use it to produce a reasonable output text.

Claude Shannon first suggested this technique in a seminal paper A Mathematical Theory of Communication (1948). This paper contained many revolutionary ideas, but one of them was to use

a Markov Chain to model the statistical properties of a particular text. Markov Chains are now

used everywhere; for example, the Google PageRank algorithm is built on ideas related to Markov

Chains.

Markov Models

Given a text, you can build up the Markov Model. A Markov Model captures the frequency of

a specific letter/character appearing after a specific preceding string (which can be of varying

length). The order of the Markov model is the length of that preceding string.

For example, if we have the following text:

a b d a c a b d a c b d a b d a c d a

We can build the following Markov Model of order 1:

a b 1/2

a c 1/2

b d 1

c a 1/3

c b 1/3

c d 1/3

d a 1

This implies the following:

• After the string ‘a’, half the time you find a ‘b’, and half the time you find a ‘c’.

2

• After the string ‘b’, you always find a ‘d’.

• After the string ‘c’, one-third of the time you find letters ‘a’, ‘b’, or ‘d’ (i.e., they are equally

common after a ‘c’).

• After the string ‘d’, you always find an ‘a’.

You can think of these as probabilities (though so far, there is no randomness at all). Notice that

in the text above the table, there are three instances when the character ‘a’ is followed by a ‘b’, and

there are three instances when ‘a’ is followed by a ‘c’. Similarly, ‘b’ is always followed by a ‘d’, and

‘d’ is always followed by an ‘a’. The character ‘c’ is followed by an ‘a’ once, a ‘b’ once, and a ‘d’ once.

A Markov Model of order 2 captures how likely a given letter is to follow a string of length 2.

Suppose we have the following text:

a b c d a b d d a b c d d a b d

Here we have an example of Markov Model of order 2 built by the text above.

ab c 1/2

ab d 1/2

bc d 1

bd d 1

cd a 1/2

cd d 1/2

da b 1

dd a 1

Notice that in the text above, there are two instances when the string ‘ab’ is followed by the letter

‘c’ and two instances when the string ‘ab’ is followed by the letter ‘d’. After the string ‘bc’, you

always get the letter ‘d’, and after the string ‘bd’, you always get the letter ‘d’, etc.

Producing a New Text

Once you have your Markov Model, you can go about generating a new text. You need to start

with a seed string of the same length as the order of the Markov Model. For example, if the Markov

Model is of order 6, you need to start with a string of length 6.

We use the term ‘kgrams’ to refer to the k-character strings, where k is the order. In

order to generate the next character, you look back at the previous k characters (inclusive of the

current last character). Look up that kgram in your Markov Model, and find the frequency that

each character appears after that kgram. If the kgram never appeared in your Markov Model, then

your newly-generated text is completed. Otherwise, you randomly choose the next character based

on the probability distribution indicated by the Markov Model.

3

Once you have found the next character that way, you add it to the end of your string, and repeat

the process as many times as you want!

Problem Details

For this problem, you have been provided with the TextGenerator Java class, more information

will be provided in the section below. You will submit one Java class: MarkovModel.

Problem 6.a. Implement only the following 4 methods:

MarkovModel(int order, long seed): Constructs a Markov Model of the specified order.

You can assume that the order will be at least 1. The seed should be used to initialize the

pseudorandom number generator that your class will use (the template code already does

this). A pseudorandom number generator’s number sequence is completely determined by

the seed. So, if a pseudorandom number generator is reinitialized with the same seed, it will

produce the same sequence of numbers.

void initializeText(String text): This builds your Markov Model based on the specified

text. When this routine is complete, your class would have computed the table that maps

each kgram to the frequency that each ASCII character follows that kgram (where k is

the order of the Markov Model). This method may be called multiple times on the same

MarkovModel object. Each repeated call is expected to build the MarkovModel again.

int getFrequency(String kgram): Returns the number of times the specified string kgram

appears in the input text. The behaviour of this method is only defined if the length of

the kgram is equal to the order of the Markov Model. This should return a number zero or

greater. For example, in the Figure 1 below, the frequency of kgram ‘aa’ is 1. Notice we do

not count the kgram which appears at the end of the text, where nothing may follow it. Be

ensure your code handles invalid cases explicitly.

int getFrequency(String kgram, char c): Returns the number of times the specified

character c appears immediately after the string kgram in the input text. The behaviour

of this method is only defined if the length of the kgram is equal to the order of the Markov

Model. For example, in the Figure 1 below, the frequency of ‘g’ appearing after the kgram

‘ga’ is 4. If the string kgram never appears in the original text, then you should return 0. Be

ensure your code handles invalid cases explicitly.

Note: both getFrequency methods are expected to have a constant run time.

Problem 6.b. Next, implement the nextCharacter method:

char nextCharacter(String kgram): Returns a random character chosen according to the

Markov Model. The probability of a character ‘c’ should be equal to getF requency(kgram,c)

getF requency(kgram)

.

4

That is, the probability of character ‘c’ should be equal to the frequency that ‘c’ follows the

string kgram in the text. If there is no possible next character (e.g., because the kgram does

not appear in the text, or only appears at the very end of the text), then return the specially

defined token final char NOCHARACTER = 0 (defined in the template file). The kgram must

be the length specified by the order of the Markov model. To generate the random choice,

you must use the pseudorandom number generator with the specified seed. You must use

the process described in ”Random character generation” below to generate the

random choice.

You may assume that every character in the text is an ASCII character (inclusive of the extended

ASCII character), except for 0 (the NULL character), which does not count as a character. Figure 1

has an example of the information computed by your Markov Model for a specific example string.

frequency probability

a c g a c g

aa 1 1 0 0 1 0 0

ag 4 2 0 2 2/4 0 2/4

cg 1 1 0 0 1 0 0

ga 5 1 0 4 1/5 0 4/5

gc 1 0 0 1 0 0 1

gg 3 1 1 1 1/3 1/3 1/3

Figure 1: Markov Model produced by the string gagggagaggcgagaaa.

Implementation advice

There are several approaches to designing the MarkovModel class. Here we provide some tips for

how you might implement it.

Basic structure.

There are two standard ways you might store the information about the kgram. You might use a

symbol table (i.e., a hash table) that maps strings of length k (where k is the order of the Markov

model) to an array containing 255 integers, one representing each possible ASCII character. The

array records the number of time each character follows the given string. For example, the character

‘a’ is 97, in ASCII. Hence, if k = 2, given an input string ‘xya’, you would add to your hash table an

entry with the key equal to ‘xy’ and the value equal to an array of integers where value[97] == 1.

Alternatively, you might use a symbol table that maps strings of length k to another symbol table;

the second symbol table maps ASCII characters to integers. Again, these integers represent the

frequency that a given character follows the initial string. You may also use an alternate solution,

as long as it efficiently supports the required operations.

5

For the purposes of this Problem Set, you should use a Hash Table and NOT use Tries or

TreeSet or TreeMap to store information about the kgram. If you are unsure whether you can

use a particular data structure, please post in the forums to seek clarifications.

Random character generation

Presumably you have now built your Markov Model, and now know the proper frequencies that

each character is followed by each kgram. In order to generate the next character in the text, you

need to make a random choice. For testing purposes, you must use the random number generator

specified in the template code: java.util.Random, and you must use the seed specified in the

constructor (via a setSeed call on the random number generator object which is already in the

template code). If you run the same test twice with the same seed, you will get the same answer!

This is very useful for testing.

A second requirement (for testing purposes) is that you choose the next character randomly in the

following way. Here is an example. Imagine that for the given kgram, there are four characters that

might follow it: ‘j’, ‘m’, ‘p’, ‘z’ (all the other characters appear zero times after this kgram). The

‘j’ appears 2 times, the ‘m’ appears 5 times, the ‘p’ appears 3 times, and the ‘z’ appears 1 time.

Thus this particular kgram appears 11 times (not counting the end of the text, where nothing may

follow it).

To choose a random character, you first select a random integer from [0, 10], i.e., a range containing

11 integers. You can do this by calling generator.nextInt(11). You can then partition this range

of random numbers:

• if you get {0, 1} then you return ‘j’;

• if you get {2, 3, 4, 5, 6}, then you return ‘m’;

• if you get {7, 8, 9}, then you return ‘p’;

• if you get {10}, then you return ‘z’.

You must process the possible letters in alphabetic order (or by order of their ASCII character

values).

In general, if there is a set of C possible next characters which appear a total of N times after the

k-letter prefix, you should choose a random number from [0, N − 1], and then go through the set

in alphabetical order to determine which character was chosen by the random selection, weighting

each character by the number of times it appears.

The reason we ask you to choose the random character in this way is twofold: first, it is a reasonably

efficient way to sample from a distribution, and second, it will allow us to ensure that every solution

produces the same random sequence (which makes it easier to test).

6

Other tips

Here are a few other things that may be useful:

• You can treat an ASCII character either as a char or an int. If you have a char ch, you can

store it in an integer as follows: int iChar = (int) ch. Similarly, you can do the opposite:

char newChar = (char) iChar. In general, it is not recommended that you force Java to

ignore types like this. However, in this case, as long as you are careful to not use any values

above 255, this is a safe thing to do. (If your integer is bigger than 255 and you put it in

a char, you may not get what you expect.) This is convenient, for example, if you want to

enumerate all the ASCII characters using a for loop, since you can just count up to 255.

• A Java HashMap is a parameterized data type. That means that when you create it, you have

to specify what the key and value types are. For example, if you want to use a key type K

and a value type V, you would use the class HashMap<K, V>. In your case today, if you want

a key type of String and a value type of Integer, you use a HashMap<String, Integer>.

(Whenever you need to use the name of the class, whether to declare the variable or to create

a new object, you can just use that full name including the String and Integer types.)

• Notice, though, that for your key and value types, you cannot use primitive types like int or

char. Instead, you have to use the wrapper class version of these types: Integer, Character,

etc. That is why, above, we used Integer as the value type. You can mostly use int and

Integer interchangeably and everything just works, with Java automatically converting back

and forth between the two as needed. (An Integer is just a class that contains inside it an

int.)

• Once you have declared your hashmap with the key and value types as String and Integer,

then when you use the put and get methods, they act just like they should, e.g., get takes

a String as input and returns an Integer as output.

• You might find a variety of the methods in the String class useful. If you have not already,

look them up in the Java documentation. For example, you can use the charAt method to

retrieve a particular character in a string; you can use substring(int j) to retrieve the

suffix of a string (from j onwards), and you can use substring(int i, int j) to retrieve

the substring between character i (inclusive) and j (exclusive). Do look up the details.

Text Generator

We have provided you with a text generator class that you can use with your Markov Model class

to generate text. The text generator class takes three input parameters, i.e., the main method has

argument (arg[0], arg[1], arg[2]):

• k, the order of the Markov model;

• n, the number of characters to generate;

7

• the filename of the text to use as a model. (This file should be in the project root directory,

above the src folder.)

You can set the input parameters in IntelliJ under Run → Edit Configurations, where you can

create a new configuration (of type Application). There you can set the program arguments. For

example: ”3 15 PS6Test.in” as the program arguments.

The text generator reads in the file as a long text String, creates the MarkovModel, and calls

initializeString on your Markov Model class using the text.

It then generates a new text, using the first k characters of the original input text as a seed (and

as the first k characters of the output text). In more detail, it begins with a String kgram equal

to the first k letters from your text file. It generates the next character by calling nextCharacter

on your Markov Model, using the initial kgram as your input string. Then it updates the kgram,

adding the new character to the end. It continues in this way, at every step using the most recent

k characters to query the Markov Model for the next character.

If it ever reaches a point where there is no possible next character, then it stops (outputting a string

shorter than desired).

We will test the functionality of your nextCharacter method in Problem 6.b. by making use of

this TextGenerator class that we have provided for you.

Optional Experiments

You may want to experiment with different values of k to determine which values yield reasonable

sounding texts. If k is too large, then the text output will be identical to the input text. On the

other hand, if k is too small, then the output text is quite garbled and ungrammatical English.

There is a small range of k for which the output text both sounds like real English, but also sounds

like a new and unique text.

Problem 6.c. (BONUS) Word-based Markov Model:

You might also experiment with using words instead of characters. For example, instead of looking

at the probability that character ‘c’ comes after the string ‘cra’, you could look at the probability

that the word cat comes after the word yellow (i.e., order 1, where order is defined based on the

number of words), or the probability that cat comes after the phrase the vicious yellow (i.e., order

3). If you develop your Markov Model based on words, you might get a more interesting text, as

long as you begin with a sufficiently long text.

In order to use words instead of characters, you will have to design your MarkovModel class more

carefully. You are now not restricted to the same scheme as the Random character generation, and

you may wish to explore other schemes for random word generation.

Submit your well-commented MarkovModel class to Coursemology. You are also to upload your

(potentially) modified TextGenerator class, OR any other class which makes use of your modified

MarkovModel class to generate text. This is to allow your tutors to test your implementation for

8

this problem.

Note: This bonus problem is an extension to the basic character-based Markov Model. You must

still submit your basic version, and it must work reasonably well before your submission for

this word-based Markov Model is considered for bonus marks.

Creative Competition

Submit the best, most interesting text that your program produces. Post your best, most creative

work to the forum (in the specified location). The one(s) with the most upvotes win(s)! You could

aim for several goals: (i) plausibility (i.e., does it read like a real text), (ii) novelty, and (iii) humor.

All who post a valid submission in the forum will attain the “Born to Write” achievement.

In order to generate an interesting text, you will need to find a good (long) source text. In the

past, people have used legal documents (e.g., the text of a specific law), Shakespeare, news reports,

and/or arbitrary texts from Project Gutenberg (gutenberg.org). You may splice together 2-3

input texts. You may also experiment with modifying the initial kgram. Instead of using the first k

letters of the text as an initial kgram, you may choose an alternate choice of k characters. (Specify

whether your entry was generated using characters, as in the assignment, or using words, as in the

extension, or using something else.)

On a closing note:

“Half the fire a funny little man began to stay at heavens. ‘How beautiful this kingdom’

said their new emperor.”

9