ECE368: Probabilistic Reasoning

Lab 1: Classification with Multinomial and Gaussian Models

1 Naïve Bayes Classifier for Spam Filtering

In the first part of the lab, we use a Naïve Bayes Classifier to build a spam email filter based on whether

and how many times each word in a fixed vocabulary occurs in the email. Suppose that we need to classify

a set of N emails, and each email n is represented by {xn, yn}, n = 1, 2, . . . , N, where yn is the class label

which takes the value

yn = 1 if email n is spam, and yn = 0 if email n is non-spam (also called ham),   (1)

and xn is a feature vector of the email n. We use a multinomial model to construct the feature vector xn.

Let W = {w1, w2, . . . , wD} be the set of the words (called the vocabulary) that appear at least once in the

training set. The feature vector xn is defined as a D-dimensional vector xn = [xn1, xn2, . . . , xnD], where

each entry xnd, d = 1, 2, . . . , D is the number of occurrences of word wd in email n. Thus the total number

of words in email n can be expressed as ln = xn1 + xn2 + . . . + xnD.

We assume that each email n of length ln is generated by a sequence of ln independent events that randomly

draw words from the vocabulary W. (This is known as the naïve Bayes assumption.) For each event,

let p(wd | yn = 1) be the probability that word wd is picked, given that the email belongs to spam; let

p(wd | yn = 0) be the probability that word wd is picked, given that the email belongs to ham. Note that

p(wd | yn = 1) and p(wd | yn = 0) are different, which gives us a way to classify spam vs. ham. For example,

words like “dollar”, “winner” would be more likely to occur in spam than in ham. Also, note that both

p(wd | yn = 1), d = 1, 2, . . . , D and p(wd | yn = 0), d = 1, 2, . . . , D should sum to one, i.e.,

sum_{d=1}^{D} p(wd | yn = 1) = 1,   (2)

sum_{d=1}^{D} p(wd | yn = 0) = 1.   (3)

The probabilities p(wd | yn = 1), p(wd | yn = 0), d = 1, . . . , D should be learned from the training data.

We make use of the word frequencies to model each email n probabilistically. Since each word in the email

is seen as independently drawn from the vocabulary W, the distribution of the feature vector xn given label

yn can be seen as a multinomial distribution as follows,

p(xn | yn) = [(xn1 + xn2 + . . . + xnD)! / ((xn1)! (xn2)! . . . (xnD)!)] prod_{d=1}^{D} p(wd | yn)^{xnd}.   (4)

We assume that the prior class distribution p(yn) is modeled as

p(yn = 1) = π, (5)

p(yn = 0) = 1 − π, (6)

where π is a fixed parameter (e.g., 0.5).
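As a sanity check, the multinomial likelihood in Eq. (4) is best evaluated in log space, since the factorials and the product of small probabilities overflow and underflow quickly. A minimal sketch using only the standard library (the helper name is ours, not a function from classifier.py):

```python
import math

def log_multinomial_pmf(x, p):
    """log p(x | y) from Eq. (4): x holds the word counts x_d, and p the
    per-word probabilities p(w_d | y) for one class.
    Uses lgamma(n + 1) = log(n!)."""
    n = sum(x)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(xd + 1) for xd in x)
    # Skip zero counts so words with x_d = 0 contribute nothing (p_d^0 = 1).
    log_prob = sum(xd * math.log(pd) for xd, pd in zip(x, p) if xd > 0)
    return log_coeff + log_prob
```

For example, with x = [2, 1] and p = [0.5, 0.5], exponentiating the result gives 3 · (1/2)^3 = 0.375.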

In the following, we first estimate the probabilities p(wd | yn = 1), p(wd | yn = 0), d = 1, . . . , D using the

training set; we then build a classifier based on Bayes’ rule and make predictions on the testing set.

Download classifier.zip under Modules/Lab1/ on Quercus and unzip the file. The spam emails for training are

in the subfolder /data/spam/. The ham emails for training are in the subfolder /data/ham/. The unlabeled

emails for testing are in the subfolder /data/testing/.

Please answer the questions below and complete the script classifier.py. File util.py contains a few functions/classes that will be helpful in writing the code for the classifier.

Questions

1. Training. We estimate the conditional probability distribution of the D-ary random variable as specified

by p(wd | yn = 1) and p(wd | yn = 0), d = 1, . . . , D, from the training data using a bag-of-words model

as follows. For notational simplicity, we define pd = p(wd | yn = 1) and qd = p(wd | yn = 0).

(a) We put all the words from the spam emails in the training set in a bag and simply count the

number of occurrences of each word wd, d = 1, · · · , D. We do the same for ham emails. The

maximum likelihood estimates of pd and qd based on these counts are not the most appropriate to

use when the probabilities are very close to 0 or to 1. For example, some words that occur in one

class may not occur at all in the other class. In this problem, we use the technique of “Laplace

smoothing” to deal with this problem. Please write down such an estimator for pd and qd as

functions of the training data {xn, yn}, n = 1, 2, . . . , N using Laplace smoothing for the D-ary

random variable.

(b) Complete the function learn_distributions in file classifier.py. In learn_distributions, you first build

the vocabulary {w1, . . . , wD} by collecting all the words that appear in the training set at

least once; you then estimate pd and qd, d = 1, 2, . . . , D using your expressions in part (a).
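For reference, a standard add-one (Laplace) smoother for a D-ary distribution looks like the sketch below; check that it agrees with the estimator you derive in part (a). The function name and calling pattern are illustrative, not taken from classifier.py:

```python
def laplace_estimate(counts, D):
    """Add-one smoothed estimate of a D-ary distribution from raw counts.
    counts[d] is the number of occurrences of word w_d in one class's
    training emails; every resulting probability is strictly positive."""
    total = sum(counts)
    return [(c + 1) / (total + D) for c in counts]
```

For example, laplace_estimate([0, 3], 2) gives [0.2, 0.8]: the unseen word receives a small but nonzero probability, and the estimates still sum to one.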

2. Testing. We classify the unlabeled emails in /data/testing/ using the trained classifier.

(a) Let {x, y} be a data point from the testing set whose class label y is unknown. Write down the

maximum a posteriori (MAP) rule to decide whether y = 1 or y = 0 based on the feature vector

x. The d-th entry of x is denoted by xd. Please incorporate pd and qd in your expression. Please

think carefully about how to treat words that do not appear in either the ham or the spam training set. Please

assume that π = 0.5.

(b) Complete the function classify_new_email in file classifier.py to implement the MAP rule, and run

it on the testing set. There are two types of errors in classifying unlabeled emails: Type 1 error

is defined as the event that a spam email is misclassified as ham; Type 2 error is defined as the

event that a ham email is misclassified as spam. Report the number of errors of each type

made by your classifier on the testing data. To avoid numerical underflow, work with the log

posterior log p(y | x) in your code.
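In log space the MAP comparison reduces to comparing two sums; the multinomial coefficient in Eq. (4) is common to both classes and cancels. A minimal sketch (the function below is an illustration, not the required signature in classifier.py):

```python
import math

def classify_map(x, p, q, prior=0.5):
    """Return 1 (spam) or 0 (ham) by comparing log posteriors.
    x: word counts of the email; p, q: smoothed estimates of
    p(w_d | spam) and p(w_d | ham); prior: pi = p(y = 1)."""
    log_spam = math.log(prior) + sum(xd * math.log(pd) for xd, pd in zip(x, p))
    log_ham = math.log(1 - prior) + sum(xd * math.log(qd) for xd, qd in zip(x, q))
    return 1 if log_spam > log_ham else 0
```

With smoothed probabilities every log term is finite, so the comparison is well defined even for words seen in only one class during training.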

(c) In practice, Type 1 error and Type 2 error lead to different consequences (or costs). Therefore, we

may wish to trade off one type of error against the other in designing the classifier. For example,

we usually want to achieve a very low Type 2 error since the cost of missing a useful email can

be severe, while we can tolerate a relatively high Type 1 error as it merely causes inconvenience.

Please provide a way to modify the decision rule in the classifier such that these two types of

error can be traded off. In other words, change the decision rule in a way such that Type 2 error

would decrease at the cost of an increased Type 1 error, and vice versa. Test your method on the testing set and

provide the following plot: Let the x-axis be the number of Type 1 errors and the y-axis be the

number of Type 2 errors in the testing data set. Plot at least 10 points corresponding to different

pairs of Type 1 and Type 2 errors, as a result of adjusting the classification rule. The two end

points of the plot should be: 1) the one with zero Type 1 error; and 2) the one with zero Type 2

error. The code should be included in file classifier.py.
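One common way to realize such a trade-off (stated here as one possibility, not the required answer) is to shift the decision threshold on the log-posterior ratio: decide spam only when the ratio exceeds an offset c, and sweep c over a range of values to trace out the error curve. Sketch:

```python
import math

def classify_with_offset(x, p, q, c, prior=0.5):
    """Decide spam (1) iff log p(y=1|x) - log p(y=0|x) > c.
    c = 0 recovers the plain MAP rule; a large positive c makes the
    classifier reluctant to declare spam, driving Type 2 errors down
    and Type 1 errors up, and a large negative c does the opposite."""
    log_ratio = (math.log(prior) - math.log(1 - prior)
                 + sum(xd * (math.log(pd) - math.log(qd))
                       for xd, pd, qd in zip(x, p, q)))
    return 1 if log_ratio > c else 0
```

Sweeping c from a large negative to a large positive value moves the classifier between the two required end points (zero Type 1 errors and zero Type 2 errors).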

(d) Why do we need Laplace smoothing? Briefly explain what would go wrong if we use maximum

likelihood estimation in the training process, by considering a scenario in which a testing email

contains both a word w1 that appears only in the ham training set (but not in the spam training

set), and a word w2 that appears only in the spam training set (but not in the ham training set).

How does Laplace smoothing resolve this issue?

The training and test data for this problem are taken from V. Metsis, I. Androutsopoulos and G. Paliouras,

“Spam Filtering with Naive Bayes – Which Naive Bayes?” Proceedings of the 3rd Conference on Email and

Anti-Spam (CEAS 2006), Mountain View, CA, USA, 2006.

2 Linear/Quadratic Discriminant Analysis for Height/Weight Data

When the feature vector is real-valued (instead of binary), a Gaussian vector model is appropriate. In this

part of the lab, we use linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) for

the height/weight data of people, and visualize the classification results of male and female persons based

on height and weight.

Suppose that the data set contains N samples. Let xn = [hn, wn] be the feature vector, where hn denotes

the height and wn denotes the weight of a person indexed by n. Let yn denote the class label. Here yn = 1

is male, and yn = 2 is female. We model the class prior as p(yn = 1) = π and p(yn = 2) = 1 − π. For this

problem, let π = 0.5.

For the class conditional distributions, let µm be the mean of xn if class label yn is male, and let µf be the

mean of xn if class label yn is female. For LDA, a common covariance matrix is shared by both classes, which

is denoted by Σ; for QDA, different covariance matrices are used for male and female, which are denoted by

Σm and Σf , respectively.

Download ldaqda.zip under Modules/Lab1/ on Quercus and unzip the file. The data set for training is in file

trainHeightWeight.txt, whereas the data set for testing is in file testHeightWeight.txt. Each file uses the same

format to represent the data: the first column corresponds to the class labels, the second column corresponds

to the heights, and the third column corresponds to the weights.

Please answer the questions below and complete the script ldaqda.py. File util.py contains a few functions/classes that will be useful in writing the code.

Questions

1. Training and visualization. We estimate the parameters in LDA and QDA from the training data in

trainHeightWeight.txt and visualize the LDA/QDA model.

(a) Please write down the maximum likelihood estimates of the parameters µm, µf, Σ, Σm, and Σf

as functions of the training data {xn, yn}, n = 1, 2, . . . , N. The indicator function I(·) may be

useful in your expressions.
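As a numerical cross-check for your closed-form answers, the estimates can be computed directly with NumPy. This is a sketch under the lab's model; the function name and the (N, 2) data layout are our assumptions, not the interface of ldaqda.py:

```python
import numpy as np

def gaussian_mle(X, y):
    """MLE of mu_m, mu_f, the shared Sigma (LDA), and the per-class
    Sigma_m, Sigma_f (QDA). X: (N, 2) array of [height, weight] rows;
    y: labels with 1 = male, 2 = female. Covariances use the 1/N
    (maximum likelihood) normalization, not the unbiased 1/(N-1)."""
    Xm, Xf = X[y == 1], X[y == 2]
    mu_m, mu_f = Xm.mean(axis=0), Xf.mean(axis=0)
    Dm, Df = Xm - mu_m, Xf - mu_f                # per-class deviations
    Sigma_m = Dm.T @ Dm / len(Xm)                # QDA covariance, male
    Sigma_f = Df.T @ Df / len(Xf)                # QDA covariance, female
    Sigma = (Dm.T @ Dm + Df.T @ Df) / len(X)     # shared covariance, LDA
    return mu_m, mu_f, Sigma, Sigma_m, Sigma_f
```

Note that the shared LDA covariance pools the deviations of both classes about their own means before dividing by the total sample size N.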

(b) Once the above parameters are obtained, you can design a classifier to make a decision on the

class label y of the new data x. The decision boundary can be written as a linear equation of x

in the case of LDA, and a quadratic equation of x in the case of QDA. Please write down the

expressions of these two boundaries.

(c) Complete function discrimAnalysis in file ldaqda.py to visualize LDA and QDA. Please plot one

figure for LDA and one figure for QDA. In both plots, the horizontal axis is the height with range

[50, 80] and the vertical axis is the weight with range [80, 280]. Each figure should contain: 1) N

colored data points {xn, n = 1, 2, . . . , N} with the color indicating the corresponding class labels

(e.g., blue represents male and red represents female); 2) the contours of the conditional Gaussian distribution for each class (to create a contour plot, you first need to build a two-dimensional

grid for the range [50, 80] × [80, 280] using function np.meshgrid; you then compute the conditional Gaussian density at each point in the grid for each class; finally, use function plt.contour,

which takes the two-dimensional grid and the conditional Gaussian density on the grid as inputs,

to automatically produce the contours); 3) the decision boundary, which can also be created

using plt.contour with an appropriate contour level.
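The plotting steps above can be sketched as follows. The Gaussian parameters here are made-up placeholders purely to exercise np.meshgrid and plt.contour; in your code they must come from the estimates computed in part (a):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # render off-screen; drop this when viewing interactively
import matplotlib.pyplot as plt

def gaussian_density(H, W, mu, Sigma):
    """2-D Gaussian density evaluated at every point of a meshgrid (H, W)."""
    dev = np.dstack([H, W]) - mu                       # (..., 2) deviations
    quad = np.einsum('...i,ij,...j->...', dev, np.linalg.inv(Sigma), dev)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

h = np.linspace(50, 80, 100)    # height range from the lab
w = np.linspace(80, 280, 100)   # weight range from the lab
H, W = np.meshgrid(h, w)

# Placeholder parameters (NOT estimated from trainHeightWeight.txt)
mu_m, mu_f = np.array([70.0, 180.0]), np.array([64.0, 130.0])
Sigma = np.array([[9.0, 30.0], [30.0, 400.0]])

Zm = gaussian_density(H, W, mu_m, Sigma)
Zf = gaussian_density(H, W, mu_f, Sigma)
plt.contour(H, W, Zm, colors='blue')    # male class contours
plt.contour(H, W, Zf, colors='red')     # female class contours
# With equal priors and a shared Sigma, the LDA boundary is the zero
# level set of the log-density difference:
plt.contour(H, W, np.log(Zm) - np.log(Zf), levels=[0.0], colors='black')
plt.savefig('lda_sketch.png')
```

For QDA, each class uses its own covariance matrix in gaussian_density, and the same levels=[0.0] trick on the log-density difference produces the quadratic boundary.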

2. Testing. We test the obtained LDA/QDA model on the testing data in testHeightWeight.txt. Complete

function misRate in file ldaqda.py to compute the misclassification rates for LDA and QDA, defined as

the total percentage of the misclassified samples (both male and female) over all samples.
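The rate itself is a one-liner once predictions are in hand (a sketch; misRate in ldaqda.py may take different arguments):

```python
import numpy as np

def mis_rate(y_true, y_pred):
    """Fraction of samples (both classes pooled) whose predicted label
    differs from the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))
```

For example, one wrong prediction out of four samples gives a misclassification rate of 0.25.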

The data for this problem are taken from: K. Murphy, Machine Learning: A Probabilistic Perspective, MIT

Press, 2012.
