CSDS 313: Introduction to Data Analysis

Assignment 2: Data and Distributions

Problem 1

The purpose of this exercise is to investigate how different distributions can have similar statistics

and/or visualizations. Suppose you are given a normal distribution N (µ, σ). We would like to

estimate a uniform distribution U(a, b) (i.e., the range of the distribution is [a, b]) with identical

statistics to the given normal distribution. These statistics are specified as follows:

(i) Find the parameters (a and b) of a uniform distribution in terms of µ and σ such that the mean

and standard deviation of the uniform distribution are the same as those of the given normal distribution.

(ii) Find the parameters (a and b) of a uniform distribution in terms of µ and σ such that the

25th and 75th percentile points of the uniform distribution and the given normal distribution

are the same. Assume you can compute the inverse cumulative distribution function Φ⁻¹(p, µ, σ)

of a normal distribution N(µ, σ) for any 0 ≤ p ≤ 1. See probit function for more information.

Hint: You should estimate the parameters a and b of the uniform distribution by simply using

Φ⁻¹(p, µ, σ).

For parts (i) and (ii) separately, obtain a uniform distribution U(a, b) as a function of µ and σ, i.e.,

find a = fa(µ, σ) and b = fb(µ, σ). Then, estimate the parameters of uniform distributions U1(a1, b1)

and U2(a2, b2) corresponding to parts (i) and (ii) for the normal distribution N (µ = 2, σ = 5). Simulate 10 000 data points from each of the U1(a1, b1), U2(a2, b2) and N (2, 5) distributions separately.

Visualize the 3 simulated distributions using histograms, error bars, and boxplots. Compare and

comment on how the obtained uniform distributions are similar or dissimilar to the given normal

distribution. Also, compare and comment on how they are similar or dissimilar to each other.
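
A minimal sketch of how the simulated samples could be compared numerically before plotting them. It uses only the Python standard library. The helper name `summarize`, the seed, and the placeholder range `a, b = 0.0, 1.0` are assumptions for illustration and should be replaced with the parameters you derive. For reference, U(a, b) has mean (a + b)/2 and standard deviation (b − a)/√12.

```python
import random
import statistics


def summarize(samples):
    """Return the mean, standard deviation, and 25th/75th percentiles."""
    q1, _, q3 = statistics.quantiles(samples, n=4)  # three quartile cut points
    return {
        "mean": statistics.fmean(samples),
        "std": statistics.pstdev(samples),
        "p25": q1,
        "p75": q3,
    }


rng = random.Random(313)  # fixed seed (arbitrary) so the run is repeatable
N_POINTS = 10_000

# N(mu = 2, sigma = 5), as given in the problem statement.
normal_samples = [rng.gauss(2, 5) for _ in range(N_POINTS)]

# Placeholder range (an assumption): substitute your derived a and b.
a, b = 0.0, 1.0
uniform_samples = [rng.uniform(a, b) for _ in range(N_POINTS)]

for name, data in [("normal", normal_samples), ("uniform", uniform_samples)]:
    print(name, summarize(data))
```

If the derived parameters are correct, the statistics matched in part (i) or part (ii) should agree between the two printed summaries up to sampling noise.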

Note that you can compute the probit function Φ⁻¹(p, µ, σ) as follows:

MATLAB: norminv function.

Python: norm.ppf function in scipy.stats package.

R: qnorm function.
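
As a hedged illustration of what the inverse CDF returns, the sketch below uses `statistics.NormalDist.inv_cdf` from the Python standard library, which plays the same role as `norm.ppf(p, loc=mu, scale=sigma)` in scipy.stats. The choice of the standard-library class instead of scipy is an assumption made only to keep the example free of dependencies.

```python
from statistics import NormalDist

mu, sigma = 2, 5  # the normal distribution used in Problem 1
dist = NormalDist(mu, sigma)

# Percentile points of N(mu, sigma): inv_cdf(p) returns x with P(X <= x) = p.
q25 = dist.inv_cdf(0.25)
q50 = dist.inv_cdf(0.50)
q75 = dist.inv_cdf(0.75)

print("25th percentile:", q25)
print("median:", q50)
print("75th percentile:", q75)
```

Note that `inv_cdf` raises an error at exactly p = 0 or p = 1, where the normal quantile is infinite.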

Problem 2

For this exercise, we will use two datasets that are provided with the assignment:

The file “airport routes.csv” contains the number of available routes of 3409 airports all

around the world (as of February 2017). Each row indicates an airport (identified with a

3-letter code) and the number of routes. For example, “CLE, 81” indicates that Cleveland

Hopkins International Airport has outgoing flights to 81 different airports. See data source

for more information.

The file “movie votes.csv” contains the average rating (between 1 and 10) of 4392 movies in

the TMDb database, sorted in descending order. Each row contains a movie name and the average

TMDb vote of that movie. For example, “The Godfather”, 8.4; “Interstellar”, 8.1; etc.

See data source for more information.

For each of these datasets, consider the following models:

(a) Suppose the given data points follow a power law distribution. Estimate the corresponding α

parameter. You can use the maximum likelihood estimator given in Newman’s notes on power laws.

(b) Suppose the given data points follow an exponential distribution.

Estimate the corresponding λ parameter.

(c) Suppose the given data points follow a uniform distribution.

Estimate the corresponding range parameters [a, b] of the uniform distribution.

(d) Suppose the given data points follow a normal distribution.

Estimate the corresponding µ and σ parameters.
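
The four estimates in parts (a) to (d) can be sketched as below. This is an illustration under stated assumptions, not a reference solution: the function names are hypothetical, the power-law formula is the continuous maximum likelihood estimator α = 1 + n / Σ ln(x_i / x_min) from Newman's notes, x_min defaults to the sample minimum (an assumption, since choosing x_min is a modeling decision), and all data values are assumed positive. The `sample` list is made up for illustration.

```python
import math


def fit_power_law(data, x_min=None):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x_i / x_min))."""
    if x_min is None:
        x_min = min(data)  # assumption: the power law holds over the whole sample
    tail = [x for x in data if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1 + len(tail) / log_sum  # undefined if every value equals x_min


def fit_exponential(data):
    """Exponential MLE: lambda = 1 / sample mean."""
    return len(data) / sum(data)


def fit_uniform(data):
    """Uniform MLE: the range is [min, max] of the sample."""
    return min(data), max(data)


def fit_normal(data):
    """Normal MLE: sample mean and the (1/n) standard deviation."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma


sample = [1, 2, 2, 3, 5, 8, 13, 81]  # made-up values, not from the datasets
print("alpha:", fit_power_law(sample))
print("lambda:", fit_exponential(sample))
print("[a, b]:", fit_uniform(sample))
print("(mu, sigma):", fit_normal(sample))
```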

For each of these datasets separately, compare the models you estimated in parts (a) to (d). Which

distribution do you think the data follows and why? Explain. For each model, generate random

data samples drawn from the respective distribution. Use visualizations of the empirical data and

the data you generate to support your conclusions.
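
One possible way to generate the random samples is sketched below using the standard library's `random` module. The parameter values are hypothetical placeholders, not estimates from the provided datasets. The power-law sampler uses inverse-transform sampling for a continuous power law with α > 1, x = x_min (1 − u)^(−1/(α−1)) with u uniform on [0, 1); the helper name `sample_power_law` is an assumption.

```python
import random
import statistics


def sample_power_law(alpha, x_min, size, rng):
    """Inverse-transform sampling for p(x) ~ x^(-alpha), x >= x_min, alpha > 1."""
    return [x_min * (1 - rng.random()) ** (-1 / (alpha - 1)) for _ in range(size)]


rng = random.Random(313)  # fixed seed (arbitrary) for repeatability
SIZE = 10_000

# Hypothetical fitted parameters; replace with your own estimates.
alpha, x_min = 2.5, 1.0
lam = 0.2
a, b = 1.0, 10.0
mu, sigma = 6.0, 1.0

power_law_data = sample_power_law(alpha, x_min, SIZE, rng)
exponential_data = [rng.expovariate(lam) for _ in range(SIZE)]
uniform_data = [rng.uniform(a, b) for _ in range(SIZE)]
normal_data = [rng.gauss(mu, sigma) for _ in range(SIZE)]

for name, data in [
    ("power law", power_law_data),
    ("exponential", exponential_data),
    ("uniform", uniform_data),
    ("normal", normal_data),
]:
    print(name, "median:", statistics.median(data))
```

Plotting each generated sample next to the empirical data (for example, histograms on a shared axis, or on log-log axes for the power law) makes the comparison between models easier to judge.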

Problem 3

Recall the rocket problem from exercise 3: You are working as chief data scientist at a rocket

production company. You know that your company’s competitor is assigning integer IDs to their

rockets. In other words, if the competitor produced M rockets, there is a rocket with ID i for all

1 ≤ i ≤ M. Your company’s intelligence was able to collect the IDs of n rockets produced by the

competitor and these IDs are 1 ≤ x1 ≤ x2 ≤…≤ xn. You can assume that the IDs collected by the

intelligence represent a uniform sampling of the M IDs.

(i) What is the maximum likelihood estimator for M? Simulate the rockets and intelligence reports

to show whether the maximum likelihood estimator is an unbiased estimator. (Hint: make sure to

choose a large M and a large number of trials for your simulation.)
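
A possible setup for the simulation in part (i) is sketched below. It assumes the collected IDs are distinct (sampling without replacement), and it uses the sample maximum as the candidate estimator, which is the commonly cited maximum likelihood estimator in this setting; check this against your own derivation. The function names, seed, and the values of M, n, and the number of trials are hypothetical.

```python
import random


def simulate_reports(M, n, rng):
    """Draw n distinct rocket IDs uniformly from 1..M, sorted ascending."""
    return sorted(rng.sample(range(1, M + 1), n))


def mean_estimate(estimator, M, n, trials, rng):
    """Average an estimator over many simulated intelligence reports."""
    total = 0.0
    for _ in range(trials):
        total += estimator(simulate_reports(M, n, rng))
    return total / trials


def sample_max(ids):
    """Candidate estimator (assumption): the largest observed ID."""
    return ids[-1]


rng = random.Random(313)  # fixed seed (arbitrary) for repeatability
M_TRUE, N_IDS, TRIALS = 10_000, 20, 5_000

avg = mean_estimate(sample_max, M_TRUE, N_IDS, TRIALS, rng)
print("average estimate:", avg)
print("true M:", M_TRUE)
```

An estimator is unbiased if its average over many trials approaches the true M, so comparing the two printed numbers indicates whether the candidate is biased.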

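(ii) Let $\hat{M}_{\mathrm{MVU}} = x_n \left(\frac{n+1}{n}\right) - 1$ and let $\hat{M}_{\mathrm{MEAN}} = 2\left(\sum_{i=1}^{n} x_i / n\right) - 1$. Simulate the rockets and

intelligence reports to show which of the above unbiased estimators ($\hat{M}_{\mathrm{MVU}}$ or $\hat{M}_{\mathrm{MEAN}}$) has

the lower variance.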
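
The comparison in part (ii) could be set up along the following lines. This is a sketch under assumptions: IDs are sampled without replacement, and the function names, seed, and the values of M, n, and the number of trials are hypothetical. The two estimators follow the formulas stated above, x_n (n+1)/n − 1 and 2 (Σ x_i / n) − 1.

```python
import random
import statistics


def mvu_estimate(ids):
    """x_n * (n + 1) / n - 1, where x_n is the largest observed ID."""
    n = len(ids)
    return max(ids) * (n + 1) / n - 1


def mean_based_estimate(ids):
    """2 * (sample mean of the IDs) - 1."""
    return 2 * (sum(ids) / len(ids)) - 1


rng = random.Random(313)  # fixed seed (arbitrary) for repeatability
M_TRUE, N_IDS, TRIALS = 10_000, 20, 5_000

mvu_values, mean_values = [], []
for _ in range(TRIALS):
    # One simulated intelligence report: n distinct IDs from 1..M.
    ids = rng.sample(range(1, M_TRUE + 1), N_IDS)
    mvu_values.append(mvu_estimate(ids))
    mean_values.append(mean_based_estimate(ids))

for name, values in [("MVU", mvu_values), ("MEAN", mean_values)]:
    print(name, "mean:", statistics.fmean(values),
          "variance:", statistics.variance(values))
```

Both averages should land near the true M if the estimators are unbiased, and the printed variances can then be compared directly.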