Lab 3: Some more Python, Bernoulli processes, Poisson distribution

Like the previous lab, we want to put all of our imported packages towards the top of the lab in a cell that’s easy to run as needed. This way we have access to all the methods we need right from the start.

%matplotlib inline

import matplotlib.pyplot as plt

import matplotlib.image as img

import numpy as np

import scipy as sp

import scipy.stats as st

import pickle as pkl

import csv as csv

print (“Modules Imported!”)

Some More on Python:

Dictionaries and Classes:

In the first lab we learned about lists, arrays, and tuples. There is yet another sort of grouping of terms and that is a dictionary. It is denoted with curly brackets { } instead of parenthesis ( ) for tuples and brackets [ ] for lists. It is like a list or array but instead of being indexed by the integers 0,1,2,3,4…, a dictionary has a key followed by a value colon followed by a value. So that each value is associated to a given key. Below is a dictionary that has the names of fast food chains as the keys, and the ratings out of 10 as the values.

Rating = {‘Burger King’: 4, ‘Five Guys’:7, ‘Chipotle’:6, ‘Panda Express’:5, ‘Subway’:4} #Creates a dictionary

print (Rating.keys()) #Returns an array of the keys

print (Rating[‘Burger King’]) #Returns the value associated with the key ‘Burger King’

There should be two questions that come to your mind when first using the dictionary: What happens if we try to retrieve a value from a key that is not in the dictionary? What happens if the same key appears in the dictionary twice? In response to the first question, if there is no key, python will throw an error. Thus, it is always good to check whether the key is in the dictionary before trying to retrieve a value.

Rating = {‘Burger King’: 4, ‘Five Guys’:7, ‘Chipotle’:6, ‘Panda Express’:5, ‘Subway’:4} #Creates a dictionary

for i in [‘Burger King’, ‘Five Guys’, ‘Chick-Fil-A’] :

print (i,Rating[i]) #Will give an error since ‘Chick-Fil-A is not an actual key

Rating = {‘Burger King’: 4, ‘Five Guys’:7, ‘Chipotle’:6, ‘Panda Express’:5, ‘Subway’:4} #Creates a dictionary

for i in [‘Burger King’, ‘Five Guys’, ‘Chick-Fil-A’] :

if i in Rating: #First checks if i is a key in the dictionary

print (i,Rating[i])

In response to the second question, when we try it below, we find that it takes on the most recent value given to the keyword.

Rating = {‘Burger King’: 4, ‘Five Guys’:7, ‘Chipotle’:6, ‘Panda Express’:5, ‘Subway’:4, ‘Chipotle’: 9} #Creates a dictionary

print (Rating.keys())

print ([Rating[i] for i in Rating.keys()])

print (Rating)

We can declare classes in python similarly to that of JAVA. We use the keyword “class” followed by the name of the class and then a colon. Tab indentation remains the same as before so that anything included within the tab of the class is contained within the class. We can include class variables or use the “def” keyword to create class functions. Below is an example of a class.

class Student:

def __init__(self, name, ID):

self.n = name

self.i = ID

def getName(self):

return self.n

def getID(self):

return self.i

The above code is just an example and won’t return anything, but make sure you run it anyway. Like the modules that we imported, if we create a custom class and run it once, then all the other cells in our Python notebook will have access to it. There are a few things that should have stood out to you in the code we just ran. The first is the “init” function. It is a version of a constructor method common to object oriented programming languages such as Java, and is what you would use to declare a new instance of your class. Second is the “self” keyword that appears in all of the methods. In order to have access to methods and variables within the class itself, you need to reference the class by using the keyword “self”. It’s kind of like the “this” keyword in JAVA, but is more explicitly expressed here. Finally, the “init” function indicates that in our class we pass two parameters (other than self) which will become instance variables for the instances of the class that we will create. The code below creates an instance of the Student class.

s = Student(“Kevin”, “4123”)

print (s.getName())

print (s.getID())

print (s.n)

print (s.i)

Notice how the instance variables we created were not in fact private, so our get methods are not needed (other than to illustrate how things work, of course).

Reading and Writing Files

It is very useful to know how to read and write files in python. So below we will go over some of the basics with I/O. When loading and saving files you can specify the entire filepath, but it is probably much easier to keep the files coordinating to each lab in the same folder and just use relative filepaths. We can write to a text file very easily using the code below. If you were to look in the folder where this ipython notebook file is held, you would see the file below.

#Writes a simple statement to a text file

filepath = ‘lab3_simple.txt’

f = open(filepath, ‘w’) #Opens file. ‘w’ signifies we want to write to it.

#’w’ erases existing file; use ‘a’ to append to an existing file

f.write(‘This is a simple example’) #Writes to the text file

f.close()

print (‘The file has been written’)

Likewise we can load the text file back using the following:

filepath = ‘lab3_simple.txt’

f = open(filepath) #Opens the file, default behavior is to read (not write)

print (f.read()) #Reads the text file

f.close()

This is fairly easy yet, since it’s a text file everything we store in it needs to be a string. This becomes a bit of a pain if we would want to store things like a dictionary that describes a random variable. This has a mix of strings, floats, and possibly others. While it’s easy to get the string of each of these and save them in a text file, it’s much harder to load back and then parse through to convert everything into the variables we want. Instead we can use the Python Pickle module. Let’s use it to save the dictionary we created above.

grades = {‘Bart’:75, ‘Lisa’:98, ‘Milhouse’:80, ‘Nelson’:65}

import pickle # import module first

f = open(‘gradesdict.pkl’, ‘wb’) # Pickle file is newly created where foo1.py is

pickle.dump(grades, f) # dump data to f

f.close()

filepath = ‘lab3_dictionary.pkl’

d = {‘one’:(1./6,-1),’two’:(1./6,5),’three’:(1./6,-5),’four’:(1./6,1),’five’:(1./6,-5),’six’:(1./6,1)}

f = open(filepath,’wb’) # The ‘wb’ is for openning file to be written to in binary mode

pkl.dump(d,f)

f.close()

print (‘The file has been written’)

Now you should see a .pkl file in the same folder which represents our dictionary. It’s a bit less conveniant than a text file however, because it’s not exactly readable by an outside program. However, we can load it back and manipulate our dictionary just as before. (Note: Due to the way files are written using pickel, a pickel file written using a Windows computer will be hard to open with a computer using Linux and vice versa)

filepath = ‘lab3_dictionary.pkl’

f = open(filepath, ‘rb’) # The ‘rb’ is for openning file to be read in binary mode

d = pkl.load(f)

f.close()

print (d[‘one’])

print (d[‘five’][1])

It would be nice if we could load in files from csv formats to be able to manipulate them. This can be done through the “csv” module. Along with this lab notebook, there should also be a csv file called SacramentoCrime. This is just a random set of data I found on the internet but is fine for our purposes. It has over 7000 crime logs and each one of those logs has 9 different bits of information. We can load the data in and manipulate it with the following.

filepath = ‘SacramentoCrime.csv’

data = [] #Creates an empty list

f = open(filepath) #Opens the file path in the default ‘r’ mode

reader = csv.reader(f)

for row in reader:

data.append(row)

f.close() # data is now a list of lists

data = np.array(data) #Converts our list to a numpy array to make it a little easier to work with

print (‘Data size:’, np.size(data), ‘, Data shape:’, np.shape(data),’\n’)

print (‘The following is the list of headers:’)

print (data[0],’\n’)

print (‘The following is some random data corresponding to the headers’)

print (data[77])

N_row = np.shape(data)[0] # the number of rows in the data matrix

x = [float(a) for a in data[1:N_row, 8]] # Loads column 8 of data (numbering begins at zero) into x

y = [float(a) for a in data[1:N_row, 7]] # Loads column 7 of data (numbering begins at zero) into y

# convert string to float fot plotting plot later

plt.scatter(x,y, color = ‘red’, edgecolor = ‘black’)

plt.title(‘Location of Crimes in Sacremento’)

plt.xlabel(‘Longitude’)

plt.ylabel(‘Latitude’)

plt.axis([-121.7,-121.2,38.4,38.7])

Finally we can also load in image files. You should have a file along with this lab called SacramentoMap.png. Make sure that this is also in the same folder as the ipython notebook. We can load and plot the image with the following code. It should look similar to the outline given by our crime map.

filepath = ‘SacramentoMap.png’

sac = img.imread(filepath)

image = plt.imshow(sac)

These were just the basics of file loading and saving. Depending on formatting and other issues, it may be necessary to dive into these modules a bit deeper to better suit your circumstances. However, this is a very good start to being able to use I/O.

The Lambda Keyword:

Finally, I use it in one of the topics below so I figured it may be good to go over it first here. “lambda” is a reserved keyword in Python. This may frustrate you when trying to simulate a Poisson process or random variable because in the literature the parameter for a Poisson or exponential distribution is often lambda, λ , but it’s just the way it is. In python, you can pass functions the same as variables. You can set functions equal to variables. The keyword lambda signals the creation of an anonymous function (it’s not bound to a name). It allows functions to be written in a single line and to be passed with relative ease. The best way to understand it is just to look at some examples. So here are a few.

# Simple function as we would normally define it

def f(x):

return x**3

print (f(3))

g = lambda x:x**3 #Same exact function using the lambda keyword

print (g(3))

# Function that returns a value that is itself a function defined by lambda

def f(n):

return lambda x:x**n

g = f(3) #g is the function x^3

h = f(2) #h is the function x^2

print (g(3))

print (h(3))

n = range(20) #Creates a list from 0 to 19

y = filter(lambda x:x%2==0,n) #Filters out everything in n, not satisfying the given function

print (y)

Hopefully this gives you a basic idea of what the lambda function is and does. We will not use it very extensively in this lab, but it’s still good to know and may come in handy.

Bernoulli Processes:

In the first lab, you were introduced to both the Bernoulli distribution and the binomial distribution. A random process is simply a collection of random variables indexed by time. A Bernoulli process is given by X=(X1,X2,…) where Xt∼Bernoulli(p) for each t and the X ‘s are mutually independent. It is a sequence of Bernoulli RVs. We can calculate probabilities involving the process at multiple times fairly easily, e.g. P{X3=1,X6=0,X11=1,X13=1}=p(1−p)pp=p3(1−p) . When considering a random process, it is helpful to visualize, or produce by computer simulation, a typical sample path. A sample path of a random process is the deterministic function of time that results by performing the probability experiment for the underlying probability space, and selecting a realization, or variate, for each of the random variables invovled. Generating a sample path of a random process by computer simulation is particularly simple in case the random variables of the process are mutually independent, such as for Bernoulli processes. For such processes, variates of the individual random variables can be generated separately. Below is a sample path of a Bernoulli process X=(X1,X2,…) with p=1/7. Run the code several times to see different sample paths.

p = 1./7 #Probability

T = 30 #Number of time steps

X = [] #Creates a list for the values of the random variables

for i in range(1,T+1): #range(1,T+1) is the list of numbers 1 through T

X.append(st.bernoulli.rvs(p)) #Fills the list with Bernoulli(p) variates

plt.plot(range(1,T+1),X, ‘o’)

plt.title(‘Sample Path of Bernoulli process with p=1/7’)

plt.ylim((0,2))

plt.ylabel(‘$X(\omega)$’) #You can use LaTex in the Python code

plt.xlabel(‘Time’)

The same Bernoulli process can be described in four different ways.

Using X=(X1,X2,…) as above.

Using L=(L1,L2,…), where Li is the number of trials after the i−1th count up to and including the time of the ith count.

Using S=(S1,S2,…), where Si is the time the ith count occurs.

Using C=(C1,C2,…) where Ct is the number of counts up to and including time t

(A diagram of each of these representations can be found in your ECE 313 textbook section 2.6)

For example, if

X=0,1,0,1,0,0,1,1,1,0,1 , then

L=2,2,3,1,1,2

S=2,4,7,8,9,11

C=0,1,1,2,2,2,3,4,5,5,6 .

Problem 1: Write an expanded version of the code above to display the sample paths of X,L,S, and C all for the same realization of the experiment. To do so, plot the sample paths of X and C up to time 30 as before, and print the first ten values of L and of S. You don’t need to plot L and S. You may need to generate more than 30 X values to determine the first ten values of L and S. To reiterate, your values of L,S and C should be determined by X.

(If you just generate a large number of trials assuming it will produce at least 10 values of L and S, you may lose a few points. To prevent this way of generation, consider using a while loop.)

# Your code here

End of Problem 1

The equivalent descriptions above suggest another method to simulate a Bernoulli random process. Each Li has a geometric distribution with parameter p, and the L ‘s are independent. The geometric distribution is given by its pmf: p(i)=(1−p)i−1p for i≥1. For example, the probability that the first count occurs on the third trial is P{L1=3}=P{X1=0,X2=0,X3=1}=(1−p)(1−p)p=(1−p)2p which we determined before.

Problem 2: Write new code for simulation of a Bernoulli random process by first generating L=(L1,⋯,L30) according to a geometric distribution and then generating X,S, and C from L. Print all values in sequences L , X , S and C .

# Your code here

End of Problem 2

Poisson distribution as limit of binomial distribution

There is yet another important piece to this puzzle, and that is the Poisson distribution. The Poisson distribution has a single parameter λ and a probability mass function given by: p(k)=e−λλkk! for k≥0. The parameter λ represents a mean such as the number of hits of a website in one minute, or the number of mispelled words in a document. Thus p(k) represents the probability the number of events occuring is k given that the average number events that occur is λ . The Poisson distribution is frequently used because it is a good approximation for the binomial distribution when n is large, p is small, and np≈λ . It is simpler than the binomial; it only has one parameter and it doesn’t involve binomial coefficients. Let’s say you create a website and that your website gets on average of 1200 hits per day. This is set up as a Poisson distribution where λ=1200 , but we can also model this as a binomial. If we were to break down the day into minute increments then the probability that a hit occurs in any given minute is p=120024∗60=56 and there are n=24∗60=1440 minutes in a day. Below is a graph of this binomial approximation of the Poisson.

lamb =1200 #Average number of hits per day

n = 60*24. #Number of minutes in a day

p = lamb/n #Probability of a hit occuring in a minute

print (‘p =’, p)

k = range(2*lamb)

plt.plot(k,st.binom.pmf(k,n,p), ‘b’, label = ‘Binomial’)

plt.plot(k,st.poisson.pmf(k,lamb), ‘r’, label = ‘Poisson’)

plt.title(‘PMF of Hits Per Day’)

plt.legend()

x = np.linspace(0,2*lamb,10000)

plt.figure()

plt.plot(x,st.binom.cdf(x,n,p), ‘b’, label = ‘Binomial’)

plt.plot(x,st.poisson.cdf(x,lamb), ‘r’, label = ‘Poisson’)

plt.ylim(0,1.2)

plt.title(‘CDF of Hits Per Day’)

plt.legend()

These two distributions don’t really look that close to each other. Why is that? In order for this approximation to be accurate, we require that n be large, p be small, and np≈λ . Here n is fairly large but p is not close to zero at all. The variance of the Poisson(1200) distribution is 1200, while the variance of the Binom(1440,5/6) distribution is only 1440(5/6)(1/6)=200. Clearly, we haven’t broken the day up into small enough increments. So let’s now break it up into seconds.

lamb = 1200 #Average number of hits per day

n = 60*60*24. #Number of seconds in a day

p = lamb/n #Probability of a hit occuring in a minute

print (‘p =’, p)

X = st.binom(n,p)

Y = st.poisson(lamb)

k = range(2*lamb)

plt.plot(k,X.pmf(k), ‘b’, label = ‘Binomial’)

plt.plot(k,Y.pmf(k), ‘r’, label = ‘Poisson’)

plt.title(‘PMF of Hits Per Day’)

plt.legend()

x = np.linspace(0,2*lamb,10000)

plt.figure()

plt.plot(x,X.cdf(x), ‘b’, label = ‘Binomial’)

plt.plot(x,Y.cdf(x), ‘r’, label = ‘Poisson’)

plt.ylim(0,1.2)

plt.title(‘CDF of Hits Per Day’)

plt.legend()

Now our approximation is so close that the two distributions are almost indistinguishable from each other. If we kept increasing n and decreasing p we would find that the approximation continues to improve. So, symbolically, limn→∞,p→0,np→λBinom(n,p)=Pois(λ). If you encounter a binomial variable with large n and small p, it may be easier to calculate probabilities based on the Poisson distribution.

Problem 3: While working on this lab course, I have a probability of p=.014 of finishing a section during any given minute. Let’s say that there are 300 sections that need to be completed and I have 8 weeks to create the lab (assume I work 40 hours/week). What’s the probability that I complete the lab before the start of the semester? Equivalently what is the probability that I finish at least 300 sections? In order to answer this question, do the following:

Create a binomial variable X to represent the number of sections I complete (for this and other parts of the problem, assume I keep working at the same rate if I finish completing 300 sections).

Create a Poisson variable Y to represent the same number, using the Poisson approximation. Make sure to print out what λ is.

Find the probability of my success (i.e. completing at least 300 sections) using the CDFs of each RV. Do they agree?

Find the probability that I finish exactly 300 sections using the pmf of each RV. Do they agree?

# Your code here

Answer: (Your answer here) Suppose X has the pdf: fλ(u)={λe−λx0u≥0u<0 with parameter λ>0. Suppose the value of λ is unknown and it is observed that X=8 . Find the maximum likelihood estimate, λˆML .\

Let X∼N(3,100) (in other words, X is a gaussian random variable with mean 3 and variance 100 ). Find the numerical values of the probabilities for the following events. Use the following approximations when needed. Q(0.5)=0.309,Q(1.0)=0.159,Q(1.5)=0.067 Q(2.0)=0.023,Q(2.5)=0.006,Q(3.0)=0.001

\ Suppose X has the pdf: fθ(u)={(uθ)e−u22θ0u≥0u<0 with parameter θ>0. Suppose the value of θ is unknown and it is observed that X=2 . Find the maximum likelihood estimate, θˆML, of θ.

Suppose Y is uniformly distributed over the interval [−5a,5a], where a is an unknown positive parameter. Find the maximum likelihood estimate a^ML for the observation Y=5 .

Suppose that for each i , link i fails at time Ti , where T1 , T2 and T3 are independent, exponentially distributed with parameters λ1=3 , λ2=8 and λ3=1 , respectively. The network fails at time T , where T=min{T1,T2,T3} . Hence \[P{T>t}=P{T1>t,T2>t,T3>t},fort≥0.\]

Express FcT(t)=P{T>t} for t≥0 in terms of t

Find the pdf of T in terms of t for t≥0 .

Suppose X=Y+N where Y and N are uncorrelated with mean zero, Var(Y)=2 , and Var(N)=1 . Find the minimum MSE for linear estimation, E[(Y−E^[Y|X])2] , and find E^[Y|X=5] .

Reminder: the minimum MSE for linear estimation =σ2Y−(Cov(X,Y))2Var(X) . And the minimum MSE linear estimator E^[Y|X=u]=μY+(Cov(X,Y)Var(X))(u−μX) .

Consider two random variables X and Y . Select the options where X and Y must be uncorrelated

Suppose X and Y have a bivariate Gaussian joint distribution with E[X]=E[Y]=1 and Var( X ) = 2. The variance of Y and the correlation coeffcient are not given. Finally, suppose X is independent of X+Y . Answer the following questions.

Find Cov( X , Y ).-2

Find conditional mean E[Y|X=3] . -1

Find conditional mean E[X|X+Y=3] . 1

Find conditional mean E[X|X+Y=3,Y=3] . 0

Reminder: The formula for MMSE linear estimator is:

E^[Y|X]=E[Y]+Cov(X,Y)Var(X)(X−E[X]).

A gambler repeatedly plays a game, each time winning 2 units of money with probability 0.1 and losing 3 units of money with probability 0.9. Let A be the event that after 100 plays, the gambler has at least as much money as at the beginning. Please answer the following two questions. Round your answer to the third decimal. For example, write 0.653 if you get 0.6528.

(a) Find an upper bound on P(A) using the Chebychev inequality: P{X−E[X]≥t}≤P{|X−E[X]|≥t}≤Var[X]t2.

(b) Find the approximate value of P(A) based on the Gaussian approximation. Use Q function to express your answer. (To be definite, use the continuity correction.)

End of Problem 3

Academic Integrity Statement

By submitting the lab with this statement, you declare you have written up the lab entirely by yourself, including both code and markdown cells. You also agree that you should not share your code with anyone else. Any violation of the academic integrity requirement may cause an academic integrity report to be filed that could go into your student record. See Students’ Quick Reference Guide to Academic Integrity for more information.