CSDS 313: Introduction to Data Analysis
Assignment 3: Associations and Statistical Inference
1 Problem 1
In this exercise, our aim is to quantify the association between genomic variants in the plant
Arabidopsis Thaliana while assessing the statistical significance of their association. In this context,
the variables are genomic variants and the samples are individual plants. We will use the two
datasets that are provided with the assignment:
The file “p1a.csv”contains a binary matrix of size 199 x 2 which indicates the presence of
two genomic variants for 199 individuals. A value of 1 indicates that the genomic variant is
present for the corresponding individual, and 0 indicates it is not.
The file “p1b.csv”contains a binary matrix of size 199 x 15 which has 199 samples (individuals)
and 15 variables (genomic variants). Each column represents a variable, and each entry (1
or 0) in the column indicates the presence of the corresponding variable in each of the 199
1.1 Part (a)
For the two variables provided in “p1a.csv”, assess the association between them by computing
each of the following statistics:
Pearson’s chi-squared χ
For each of these statistics, assess the statistical significance by computing a p-value for the null
hypothesis of ”there is no association” (see Note (1) below). Select a significance level α that
we can use to reject the null hypothesis and accept the alternate hypothesis ”there is a non-zero
association.” if a p-value is less than α (see Note (2) below for the selection of α). For each of the
three specified statistics, explain, in complete sentences, your findings: Is there a statististically
significant association between the given variables? How strong does the association seem? Do
the results of different statistics agree? If not, what does this mean (i.e., how do you explain this
discrepancy)? Make sure to specify the values of the test statistics and the p-values as well as the
selected α level and the number of permutations N you use for the permutation tests.
Note (1): The p-value in this context indicates the probability of observing a ”more significant”
result (for a statistic of interest like mutual information) than the observed result when there is no
association between the genomic variants (which is our null hypothesis).
Note (2): You are welcome to select the commonly used 0.05 as your α threshold. However, you are
encouraged to try different α values and observe how the conclusions of a researcher might change
dramatically if they are not careful with the interpretations of the hypothesis test results.
Hint (1): For mutual information and Jaccard index, you need to do a Monte Carlo simulation
(e.g., a permutation test) to assess statistical significance. For Pearson’s χ
, you can use the parametric distribution (i.e., Pearson’s chi-squared test) instead of doing a Monte-Carlo simulation to
generate the null distribution.
Hint (2): The idea of the permutation test is to randomly permute the entries in each column
(variable) separately, in order to break any possible associations between the variables without
modifying their distributions. This way, we can generate a null distribution for the test statistic
(like mutual information) that represents our null hypothesis. Performing the permutation test is
simple: First, generate N random permutations of the data and for each of these permuted data,
compute the test statistic of interest. Then, count the number of times permuted data has more
significant values (e.g., higher values for mutual information) than the actual data, let’s denote
this count c. Then, the p-value is simply equal to c + 1
N + 1
. Notice that, a p-value obtained from a
permutation test cannot be lower than 1
N + 1
Hint (3): An alternative to a permutation test (or Monte-Carlo simulation in general) is to obtain
a null distribution analytically by fitting a known parametric distribution. For example, it is known
that Pearson’s χ
follows a well-studied distribution called chi-squared distribution when the null
hypothesis of ”no association” is correct. Thus, it is possible to obtain a p-value analytically from
the chi-squared distribution (indeed, this is what the Pearson’s chi-squared test does).
Hint (4): To visualize the results of a permutation test, you can draw a sorted scatter plot while
marking the observed value (as shown in ’example-figure1.jpg’).
1.2 Part (b)
For each of the 105 (15×14/2) pairs of variables provided in “p1b.csv”, repeat the steps in part
(a) to compute the values of the test statistics as well as the p-values separately for: Mutual
information, Jaccard Index and Pearson’s chi-squared χ
statistics. Select an α level to reject the
null hypothesis of no association. For each of the three test statistics, determine the pairs with
significant association at α level while adjusting for false discovery rate (see Note (3) below). Draw
scatter plots (or other appropriate visualizations) to compare the results of mutual information
(MI), Jaccard Index (JI) and Pearson’s chi-squared χ
. How much do the results of these three
test statistics agree to one another? Which two of these three statistics are most similar (MI-JI,
, or JI-χ
)? If you were to use only one of these three statistics, which one would you prefer
to use and why? If you were to consider only your preferred test statistic among the given three,
would your conclusions (about the associations of the provided variable pairs) change a lot? Make
sure to specify the number of significantly associated pairs and the number of overlap between the
three test statistics. Also, specify the selected α level and the number of permutations N you use
for the permutation tests.
Note (3): Here, differently from part (a), we are testing for multiple hypotheses (one for each
variable pair). Therefore, if we were to reject a null hypothesis as suggested in part (a) (when
p-value is less than α), we would have a considerably higher false discovery rate (FDR) than α
(Think: If each test has a α chance of failure, what is the probability that there is at least one test
that fails? For α = 0.05 and n = 105 tests, that would be 99.54% chance of failure!). Thus, we need
to use an appropriate procedure that take into account the number of hypotheses to make sure the
FDR is indeed less than α. For this purpose, you can apply one of the following two approaches:
(1) Benjamini-Hochberg procedure (which is, in many fields, the go-to way of adjusting for multiple
hypotheses) to determine the statistically significant pairs, (2) A Monte Carlo simulation specifically
designed to control the FDR based on the distributions generated by the permutations (e.g., to
compute the p-value of the top-scoring pair, we can use the distribution of the test statistics of
the top-scoring pairs across all permutations, which would be a ”multiple-hypthesis-corrected” null
Note (4): Using the Benjamini-Hochberg procedure, essentially a stricter threshold is used to
reject a null hypothesis (e.g., the threshold of α
for rejecting the null hypothesis of the most
significant pair). Therefore, for estimating the p-values using permutation test, you need to use
more permutations compared to part (a) to be able find a significant association. Thus, make sure
that your number of permutations N is high enough to be able to find at least one significant pair.
2 Problem 2
In this exercise, our aim is to quantify the associations between continuous variables and assess the
statistical significance of these associations. For this purpose, we will use the three datasets that
are provided with the assignment:
The file “p2a.csv”contains a matrix of size 2400 x 2 which has 2400 samples and 2 variables.
The file “p2b.csv”contains a matrix of size 110 x 2 which has 110 samples and 2 variables.
The file “p2c.csv”contains a matrix of size 2100 x 2 which has 2100 samples and 2 variables.
2.1 Part (a)
For the two variables provided in “p2a.csv”, assess the association between them by computing
Pearson correlation ra and computing a p-value pa for the null hypothesis of no association. Select a
significance level α and reject the null-hypothesis if the p-value is less than α. Explain, in complete
sentences, your findings: Is there a statistically significant association (at α level) between the
provided variables? What is the magnitude and the direction of the association?
2.2 Part (b)
Repeat part (a) for the variable pair provided in “p2b.csv”and compute Pearson correlation rb and
p-value pb. Compare the Pearson correlations ra and rb as well as the p-values pa and pb. Explain
your findings: Which variable pair (in part a or b) has a stronger association according to the
comparison of the correlations? Which variable pair has a stronger association according to the
comparison of the p-values? Do the comparsions according to correlation coefficients and p-values
agree on which variable pair indicate the same stronger association? If not, why is there such a
discrepancy? Next, draw scatter plots (variable 1 vs. variable 2) to visualize the data for both
part (a) and part (b). Which variable pair (in part a or b) has a stronger association do you think
according to the scatter plots? Does your conclusion agree with the comparisons of the p-values
and correlation coefficients? If not, explain why.
2.3 Part (c)
Repeat part (b). But, this time compare the associations in the datasets “p2a.csv”and “p2c.csv”.
Make sure to answer the questions in part (b), this time for comparing the variable pairs in parts
(a) and (c).