ROB311: Artificial Intelligence

Project #4: Reinforcement Learning

Overview

In this project, you will dabble with reinforcement learning algorithms. First, you will work with with two

closely-related techniques for sequential decision making: value iteration and policy iteration for Markov decision problems. Thereafter, you will take a step back to understand the exploration-exploitation tradeoff by

solving the multi-armed bandit problem. The goals are to:

• understand value iteration and the use of the Bellman update equations;

• compare value iteration with policy iteration for the same problem domains; and

• solve a classic reinforcement learning problem, referred to as the multi-arm bandit problem.

The project has three parts, worth a total of 50 points. All submissions will be via Autolab; to prevent the

usage of Autolab as a debugging tool, the maximum submissions are limited to 14 . To complete the project,

you will need to review some material that goes beyond that discussed in the lectures—more details are

provided below. The due date for project submission is Friday, April 14, 2022, by 23:55 EDT.

Similar to Project #3, each part already has some basic code in place for you to start with. There are three

types of files provided to you in the handout package: support files (named mdp_*.py or mab_*.py),

template files (named part*.py) and test files (named test*.py). The students are only required to

complete and submit the template files. To minimize the number of submissions, please use the test files to

run preliminary tests on your function implementations before submitting on Autolab.

Part 1: Markov Decision Processes via Value Iteration

Value iteration is a well known method for solving Markov decision problems (processes). This iterative

technique relies on what is known as the ‘Bellman update’ (see original work by Bellman in 1957), which you

will code up as part of the project. Your tasks are to:

1. Write a short function, get_transition_model(), that generates the state transition model (matrix)

for the simple cleaning robot problem described by Figure 1. This transition model will be needed to solve

an instance of the cleaning robot MDP (see next bullet).

2. Implement the value iteration algorithm given in AIMA(4ed) on pg. 653, which accepts an MDP description

as input (states, possible actions, transition matrix, rewards, discount factor) and produces an epsilonoptimal policy (and a utility function). The function, value_iteration(), will also accept a threshold

(ϵ) and a maximum number of iterations to run.

You will submit the completed template files part1_1.py and part1_2.py through Autolab. We will test

your code on several problems, including on the grid world example given in AIMA(4ed) on pg. 646! We

have included an environment file (mdp_grid_env.py) in the handout package that defines the variables

which are required for the AIMA problem. The function init_stochstic_model() (i.e., the function that

1 of 4

ROB311: Artificial Intelligence Project #4: Reinforcement Learning

Figure 1: A simple cleaning robot problem. There are six states; the robot wants to put rubbish in the bin

(State 5) and to recharge its batteries (State 0). Both of these states are terminal states (i.e., once reached,

the robot does not leave the state). The robot may choose to move left or right to an adjacent cell. Due to

environment uncertainties, such as a slippery floor, for example, state transitions are not deterministic: when

trying to move in a certain direction, the robot succeeds with a probability of 0.8; with a probability of 0.15 it

remains in the same state, and with a probability of 0.05 it may move in the opposite direction. The reward

in State 0 is 1, in State 5 is 5, and is zero otherwise.

generates the transition model) has deliberately not been implemented, so as not to give away all the test

cases. After implementing this function, you should be able to test your solution on more complex problem

instances, as done in Autolab. Thus, you have the option to create additional tests for the grid environment

using the provided test files—this should help with debugging, etc.

Part 2: Markov Decision Processes via Policy Iteration

As discussed in the lectures, value iteration is not the only way to solve an MDP; another popular alternative

is policy iteration. The policy iteration framework is different than that of value iteration: we begin with an

initial, sub-optimal policy (possibly random), and then refine that policy. Your task is to:

1. Implement the policy iteration algorithm given in AIMA(4ed) on pg. 657, which accepts an MDP description as input (states, possible actions, transition matrix, rewards, discount factor) and produces an

optimal policy (and a utility function). The function, policy_iteration(), will also accept a variable

that specifies the maximum number of iterations to run.

You will submit the completed template file part2.py through Autolab. We will test your code on several

problems, including on the grid world example given in AIMA(4ed) on pg. 646, as above.

Part 3: Multi-Armed Bandit

First mentioned in the foundation book, Reinforcement Learning: An Introduction, by Sutton and Barto, the

multi-armed bandit is a simple and classic RL problem. An agent is free to choose any action in a state independent environment, where the reward for each action is given according to an underlying probability

distribution. This situation can be best compared to a casino with multiple slot machines such that each one

has its own probability distribution of success. The fact that we do not have access to this probability distribution is what makes this problem non-trivial. This problem brings up the classic exploration vs. exploitation

tradeoff. Your task is to:

2 of 4

ROB311: Artificial Intelligence Project #4: Reinforcement Learning

Figure 2: The received reward averaged over 500 experiments for each execution step of a RL policy on the

MAB environment.

1. Implement an agent in part3.py to solve the multi-arm bandit (MAB) problem. The provided class

methods update_state and get_action are used by the testing scripts, as shown in test_part3.py,

and hence their method signatures must be kept compatible. You are allowed to implement additional class

methods for better code organization.

Each arm in this problem provides a reward of +1 with an independent probability, and 0 otherwise. The

goal of the RL agent is to maximize the received reward as fast as possible while it picks a slot machine arm

one at a time. Figure 2 shows a plot of the received reward as an efficient RL agent executes its policy across

400 episodes.

Grading

We would like to reiterate that only the functions implemented in the template files will affect your marks. In

order to make this task more challenging, a submission threshold of 14 has been set on Autolab. Given the

stochastic nature of the problem setup, the RL policy in part 3 will give a different result on each execution

— we recommend that you save the last 4 submission to maximize marks on your final 10th submission. For

this project, Autolab has a timeout of 600 seconds per submission. Points for each portion of the project will

be assigned as follows:

• Value Iteration – 20 points (4 test × 5 points each)

The first test (Part 1A) will evaluate your state transition model, to ensure that you understand how to

write down such models from a high-level description (i.e., that given in the caption to Figure 1). The

next three tests (Part 1B) will evaluate your value-based MDP solver, first for the simple 1-D cleaning

robot world, and then for the grid world defined in AIMA (with some tweaks).

Time limit: 1 second per test.

3 of 4

ROB311: Artificial Intelligence Project #4: Reinforcement Learning

• Policy Iteration – 10 points (2 tests × 5 points each)

The two tests will evaluate your policy-based MDP solver on different types of grid worlds.

Time limit: 1 second per test.

• Multi-Armed Bandit – 20 points (2 tests × 10 points each)

Your RL agent will be run on a random MAB environment for 400 episodes. This experiment will be

run 500 times and the average received reward over the episodes will be evaluated. The goal of this

task is to receive a final reward that is as close as possible to the maximum possible reward, and reach

this value with the fewest episodes. For example, the RL agent in figure 2 is able to achieve more than

90% of maximum reward within 100 episodes. Two tests will evaluate your episode execution time as

well as your reward learning curve. The marking scheme for the first test on Autolab is provided as a

comment in test_part3.py for reference and further clarity.

Time limit: to ensure that the experiments finish running within the Autolab time limit, each episode

must take less than 0.001s to run. See test_part3.py for details on how this is enforced.

Total: 50 points

Grading criteria include: correctness and succinctness of the implementation of support functions, proper

overall program operation, and code commenting. Please note that we will test your code and it must run successfully. Code that is not properly commented or that looks like ‘spaghetti’ may result in an overall deduction

of up to 10%.

4 of 4