ECE220: Computer Systems & Programming

Machine Problem 3

Using Dynamic Programming to Compute Levenshtein Distance

Your task this week is to write subroutines that use dynamic

programming to compute the Levenshtein distance between

two strings, then integrate these subroutines into your MP2

code to complete the application. Given two strings and the

costs for Insertion, Deletion, and Substitution, your first

subroutine must initialize a table, and the second subroutine must use dynamic programming to compute

the correct substring Levenshtein distances, predecessor offsets, and predecessor types for each entry in the

table. Essentially, your subroutine builds the input table to MP2.

The objective for this week is to give you additional experience with understanding and manipulating arrays

of data in memory.

As explained in MP2, the Levenshtein distance represents the minimum number of insertions, deletions,

and substitutions necessary to transform one string into another, and is widely used to measure the

difference between two strings, with applications ranging from correction of typos to genetic sequence

alignment. We have generalized the idea slightly to assign costs for each operation: insertion, deletion, and

substitution. In the example shown, the word “possibility” is transformed into “capitalization” by inserting

the first two letters (“ca”), deleting the “oss,” substituting “ta” in place of “bi,” inserting “za” and “io,” and

finally substituting “n” for “y.” The cost of each operation in the example was 2 for insertions and deletions

and 3 for substitutions. In total, the change required six insertions, three deletions, and three substitutions,

for a total Levenshtein distance of 27.

The Task

Computing the Levenshtein distance is fundamentally a recursive operation: any distance can be computed

based on the distances for shorter (prefix) strings. Given the costs CI for insertion, CD for deletion, and CS

for substitution, let’s say that we want to know the Levenshtein distance between “PARTY” and “TALL”,

which we denote as [PARTY,TALL]. All three of the following inequalities must hold:

[PARTY,TALL] ≤ [PARTY,TAL] + CI (1)

[PARTY,TALL] ≤ [PART,TALL] + CD (2)

[PARTY,TALL] ≤ [PART,TAL] + CS (3)

For (1), we transform “PARTY” to “TAL” then insert “L” to obtain “TALL”.

For (2), we transform “PART” to “TALL” then delete “Y” (from “PARTY”).

For (3), we transform “PART” to “TAL” then substitute “L” for “Y”.

Since the final characters of “PARTY” and “TALL” do no match, the last operation in transforming

“PARTY” to “TALL” must be one of the three shown, and thus

[PARTY,TALL] = MIN ([PARTY,TAL] + CI , [PART,TALL] + CD , [PART,TAL] + CS)

Rather than computing Levenshtein distances repeatedly for substrings, however, we use a technique called

dynamic programming. Dynamic programming uses additional` memory to reduce computation. In this

case, we use a two-dimensional table to record the Levenshtein distances for all substrings and eventually

build up the desired answer. Dynamic programming is an important algorithmic technique and is used

frequently in both software and hardware.

In MP3, your program is again given two NUL-terminated ASCII strings starting at memory addresses

x3800 and x3840, but now you must use your new subroutines to fill in the table at x4000 and to compute

the Levenshtein distance between the strings. The insertion cost CI is given to your program at memory

address x3880; the deletion cost CD is at memory address x3881; and the substitution cost CS is at memory

address x3882. Note that your program is no longer given the final Levenshtein distance, as it was in MP2.

Your first subroutine, INIT_WORK, must fill in column 0 and row 0 of the table. Recall that the table has

M columns and N rows, where M is the value stored at memory address x38E0, and N is the value stored

at memory address x38E1, both written by your FIND_M_N subroutine. Recall also that the table uses

row-major order, so address of entry (R,C) at Row R and Column C is then given by the expression

x4000 + 3(RM+C).

To understand how INIT_WORK should fill in the necessary entries, let’s start by thinking about the

meaning of the entry (0,0). This entry corresponds to the base case, the Levenshtein distance [,] (between

two empty strings). Since the two strings are the same, the Levenshtein distance is 0, and there is no

preceding operation. As a result, entry (0,0) is always the same: 0 for the first memory address (x4000),

0 for the offset in the second memory address (x4001), and #-1 for the predecessor type in the third memory

address (x4002). Be sure that INIT_WORK sets these three values—in particular, while the simulator may

initially contain the value 0 at addresses x4000 and x4001, failing to store 0s into both addresses in your

subroutine is considered a bug and will result in lost points.

Column 0—entries (R,0) for R > 0—corresponds to the Levenshtein distance between an

empty string and all prefixes of the second string. Since the first string is empty, the only

possible preceding operation is Insertion, and thus we can fill in each of the entries with R

times the insertion cost, as shown to the right for the example second string “ZAPPY”. In this

case, entry (1,0) corresponds to [,Z], entry (2,0) corresponds to [,ZA], entry (3,0)

corresponds to [,ZAP], and so forth. For each entry (R,0)—again, for R > 0—the

predecessor offset should point to entry (R – 1,0), which means a value of -3M, and the

predecessor type should be 0 for insertion. Again, your subroutine must fill these entries

explicitly—relying on memory already having a 0 in it will result in lost points.

Row 0—entries (0,C) for C > 0—corresponds to the Levenshtein distance between all prefixes

of the first string and an empty string. Since the second string is empty, the only possible

preceding operation is Deletion, and thus we can fill in each of the entries with C times the

deletion cost, as shown below for the example first string “APPLE”. In this case, entry (0,1)

corresponds to [A,], entry (0,2) corresponds to [AP,], entry (3,0) corresponds to [APP,],

and so forth. For each entry (0,C)—again,

for C > 0—the predecessor offset should

point to entry (0,C – 1), which means a value

of -3, and the predecessor type should be 1

for deletion.

For simplicity, all registers are caller-saved for the INIT_WORK subroutine.

The second subroutine, CALC_DISTANCE, must fill in the rest of the table. You may assume that your

INIT_WORK subroutine has been called before CALC_DISTANCE is called.

As you saw earlier, any given entry in the table depends only on the entries above it (insertion), to the left

of it (deletion), and diagonally up and to the left (substitution/match) of it. CALC_DISTANCE can thus

proceed to fill in the table row by row, starting with row 1, and to fill each row from left to right, starting

with column 1. Your subroutine should use a doubly-nested loop for this purpose.

This part of the code will need more values than the LC-3 has registers, so think carefully about where you

plan to put all the values you need as your write your code, and fill in comments (a register table, as well

as notes about values that don’t fit into registers) as you go.

For each entry, start by comparing the appropriate characters from the strings to see if they match. If they

do, the cost for a match is exactly the same as the cost with one less character in each string (the entry at

offset -3(M+1) from the entry being computed)—in other words, drop the CS term from the right side of

inequality (3). If the characters do not match, the substitution cost CS must be added to the potential

predecessor entry’s distance (the first memory location in that entry).

We suggest that you mark the entry with the substitution/match predecessor offset (as -3(M+1)) and

value (as 2) regardless, allowing your later code to simply overwrite these values if a shorter distance is

found, or leave them intact if no shorter distance is found.

Next, compare the distance just calculated (for substitution/match) with the distance possible using deletion.

Read the Levenshtein distance from the entry at offset -3 from the entry being computed and add the

deletion cost CD to find the distance with deletion as a predecessor. If it is better (smaller) than that possible

with substitution/match, adjust the entry appropriately (offset -3 and predecessor type 1).

Finally, compare the best distance found so far with the distance possible using insertion. Read the

Levenshtein distance from the entry at offset -3M from the entry being computed and add the insertion cost

CI to find the distance with insertion as a predecessor. If it is better (smaller) than that found previously,

adjust the entry appropriately (offset -3M and predecessor type 1).

Be sure that you also record the best distance found in the entry before moving on to fill in the next entry.

Once you have filled in the entire table, return the Levenshtein distance of the final entry (in the lower right

corner of the table) in R1.

Note that ties are possible between the three

possible predecessor operations, and must be

broken as implied by the preceding description.

Specifically, substitution/match must be

preferred over deletion and insertion, and

deletion must be preferred over insertion. You

are welcome to write your code in any way that

works (do pay attention to the style guidelines, of

course), but you must match the tie-breaking

rules as well as all other parts of the specification.

An example in which ties occur (here

CS = CD = CI = 1) appears to the right, with all

variants of tie-breaking illustrated by the two

green circles. The orange arrows represent other

ties. The blue arrows represent the correct

solution.

For simplicity, all registers are caller-saved for

the CALC_DISTANCE subroutine.

Note that you will need to make minor adjustments to your main program to make use of your new

subroutines. We suggest that you make these adjustments first so that you can test your subroutines more

easily as you develop them. In your main program from MP2, you used PRINT_DECIMAL to print the

Levenshtein distance. To do so, you loaded the distance from memory address x38C0 into R1. Replace

that part of the code with two subroutine calls, first to INIT_WORK, then to CALC_DISTANCE.

You may, of course, want to make other adjustments as you develop the subroutines in order to focus more

directly on whether they are working, but the changes mentioned above are needed for the final program to

work completely.

Specifics

Your code must be written in LC-3 assembly language and must be contained in a single file called

mp3.asm in the mp/mp3 subdirectory of your repository. We will not grade any other files.

You are given the following:

o a NUL-terminated ASCII string starting at memory address x3800,

o a second NUL-terminated ASCII string starting at memory address x3840,

o a positive insertion cost CI at memory address x3880,

o a positive deletion cost CD at memory address x3881, and

o a positive substitution cost CS at memory address x3882.

You may make the following assumptions about the values provided to you (and no other

assumptions):

o Both strings are valid, but either or both could be empty (0-length).

o All three costs are positive.

Your program must start at x3000, and must produce the appropriate output exactly. Our testing

and the rubric will focus on your new subroutines, but you will lose points if you do not integrate

them properly into the main program.

You must write the INIT_WORK subroutine to use the stored values of M and N and the costs

given (see above) to fill in row 0 and column 0 of the table starting at address x4000. The table

has N rows and M columns. Table entries consist of three memory locations each and are stored

consecutively in row-major order. Each entry consists of a computed Levenshtein distance, a

predecessor offset, and a predecessor type. Entry (0,0) must have distance 0, offset 0, and type -1.

Entries (R,0) for R > 0 must have distance R∙ CI, offset -3M, and type 0 (insertion). Entries (0,C)

for C > 0 must have distance C∙ CD, offset -3, and type 1 (deletion). All registers are caller-saved.

You must write the CALC_DISTANCE subroutine to complete the rest of the table based on the

stored values of M and N, the strings at x3800 and x3840, the costs given in memory (see above)

and the initialized table at address x4000. All remaining entries must be filled in to indicate the

minimum predecessor operation between substitution/match, deletion, and insertion. If ties occur,

substitution/match must be preferred over deletion and insertion, and deletion must be preferred

over insertion. All registers are caller-saved.

Both of your subroutines must be able to execute more than once according to the specifications

given, and both of your subroutines must be independent of any other code EXCEPT that you may

assume that we call INIT_WORK before we call CALC_DISTANCE. Note, for example, that you

may NOT assume that we call FIND_M_N before calling either—we may instead fill M and N

in memory directly, and your code must work in this case.

You may write and use additional subroutines, but your interfaces to INIT_WORK and

CALC_DISTANCE must match the specification exactly, and your main program must call both

subroutines and make use of the results appropriately.

Your code must not access the contents of memory locations other than those required for this MP

or declared within your program (using .FILL or .BLKW).

Your code must be well-commented and must include a table describing how registers and

additional memory locations are used within each subroutine. Follow the style of examples

provided to you in class and in the textbook.

Do not leave any additional code in your program when you submit it for grading.

Testing

Remember that testing is your responsibility. We have provided several test files and scripts to help you

debug your code, and suggest that you adopt the following strategy when developing your program:

1. Start by adjusting the main program, which should be easy, and adding placeholders for your new

subroutines. Note that PRETTY_PRINT is likely to loop infinitely until you finish

CALC_DISTANCE, so you may want to turn it off (comment out the call) at first. Don’t forget to

turn it back on!

2. Develop INIT_WORK, then use the example test files provided to test and debug this subroutine.

Inspect the table by hand to make sure that entry (0,0) as well as the first row and column are filled

in appropriately for different values of M, N, CI, and CD.

3. Next, develop CALC_DISTANCE, and test with small examples first. The scripts that we have

provided are designed for use after your program is complete, so be sure to put your main program

in order before using them directly.

4. Once your code seems to be working with the given examples, leverage the tools made available

to you (courtesy of the ZJUI alumni) to help you to improve your coding style and programming

ability. As in MP1 and MP2, while you are not required to use the VSCode extension, you will

lose points if the extension reports any warnings or errors in your program. Please try out the

Webtool interface when debugging your code, too. When you submit a copy of your code to

Github, we will automatically test it against our solution and give you feedback about errors in your

code as well as differences between your code and our solution. For MP3, you need only make up

new strings and costs to create new examples, but verifying the results by hand may be slow, so

think carefully. Please note, as always, that while our tools are fairly thorough, they do not

guarantee that your code is free of bugs, nor that you will earn full points.

Grading Rubric

Functionality (70%)

5% – main program produces correct output

5% – INIT_WORK fills in entry (0,0) appropriately

10% – INIT_WORK fills in column 0 appropriately

10% – INIT_WORK fills in row 0 appropriately

10% – CALC_DISTANCE finds the correct Levenshtein distance

20% – CALC_DISTANCE fills in the table correctly

10% – CALC_DISTANCE breaks ties correctly

Style (5%)

5% – entry pointer is never calculated based on M nor N in CALC_DISTANCE (note: offsets

MUST use M, but think about how your subroutine fills in entries in the table)

-10% PENALTY – VSCode extension reports any errors or warnings; note that even a single

warning incurs the full 10% penalty

Comments, Clarity, and Write-up (25%)

5% – a paragraph appears at the top of the program explaining what it does (this is given to you;

you just need to document your work)

10% – each subroutine has a register/memory table (comments) explaining how registers/memory

are used in that part of the code

10% – code is clear and well-commented

Note that some categories in the rubric may depend on other categories and/or criteria. For example, if

your code does not assemble, you will receive no functionality points. You will also be penalized heavily

if your code executes data or modifies itself (do not write self-modifying code).

Sharing MP2 Solutions

To help you in testing your code, you may make use of another student’s MP2 solution as part of your MP3,

provided that you strictly obey the following:

You may not obtain another student’s MP2 solution until after class on Tuesday 12 October.

Violation of this rule is an academic integrity violation, will result in BOTH students receiving 0

for MP2, and may have additional consequences.

You must clearly mark the other student’s code in your own MP3, and must include the student’s

name in comments indicating that you are using their code. Failure to mark their code

appropriately will result in your receiving a 0 for MP3.

As you should know already, you may not share any additional code beyond the solution to MP2.

Please also note that if the other student’s code has bugs that lead to your introducing bugs into your MP3

code, you may lose points as a result. We in no way guarantee the accuracy of any student’s MP2 solution.