Sale!

Computational Biology (Homework #1) Problem 1

$30.00

Category:
5/5 - (2 votes)

Comp790-Computational Biology (Homework #1) Problem 1

• There are 3 data files provided for the following questions, including, Levine_matrix.csv, cell_
graph.edgelist, and population_names_Levine_13dim.txt. Instructions for how to use these data
will be provided in each homework problem.
• Unless explicitly asked to write code from scratch, you can use publicly available code. Please reference
any code or libraries that you use.
• You are welcome to consult with other colleagues, but please write up your own independent solution.
• You are welcome to use Python, Julia, or R here, though all hints are given for Python, and we will
use Node2Vec (Python)
• You are welcome to write up your assignment using the HW1_790-166.tex template, or write up the
solutions in the method of your choice.
• This homework is worth 82 points total.
Problem 1
(23 Points Total) (Adjacency Matrix Math)
Consider an undirected, unweighted graph with N nodes and its corresponding adjacency matrix, A.
• (5 points) Let 1 be the column vector of only 1s, and in this case, N 1s. Using A and 1, write an
expression for a vector of node degrees, k, such that the ith entry of k, ki
, represents the number of
edges incident on node i.
• (3 points) Again, using A and 1 write an expression for the total number of edges in the graph.
• (1 point) Verify the two expressions that you just defined for the following matrix, A, and show that
you get k = [1, 2, 1] and that the number of edges in the graph is 2. You can show this by drawing the
graph.
A =


0 1 0
1 0 1
0 1 0


• (10 points) (Graph Laplacian) Recall that the the un-normalized Graph Laplacian, L is defined as
L = D − A. Use the provided graph (encoded as an edgelist) cell_graph.edgelist and write a
function that computes the Graph Laplacian. As a reminder, D is an N × N matrix with node degrees
on the diagonal, and A is the adjacency matrix. You can copy and paste your code or take a screen
shot, since it should only be a few lines of code.
– (3 points) Use the function you just wrote to calculate the Graph Laplacian of the Graph stored in
cell_graph.edgelist. Make a histogram to visualize the distribution of eigenvalues of L. What
is the smallest eigenvalue of L? (hint: you can use https://numpy.org/doc/stable/reference/
generated/numpy.linalg.eig.html#numpy.linalg.eig, or equivalent.)
– (1 point) How does the observation of this smallest eigenvalue relate to the structure of the graph?
1
Your Name Here Comp790-Computational Biology (Homework #1) Problem 2
Figure 1: A visualization of the graph in homework problem 2.
Problem 2
(29 Points Total) Playing with Single Cell Data
We will consider data from a mass cytometry experiment obtained from http://flowrepository.org/id/
FR-FCM-ZZPH. Here, we are considering the expression of 13 different protein markers across a set of cells.
It has already been pre-processed for you. From the entire dataset, 5,000 cells were sampled for further
analysis. You can use the following accompanying data as follows.
• Levine_matrix.csv is the cell × marker matrix. Note that the last column is labels for the cells. Let’s
call this matrix, X. You should not use this column for any kind of clustering. Some cells are
not labeled (hence are called NaN).
• population_names_Levine_13dim.txt maps the cell labels from the last column of X (number labels)
to biologically-interpretable cell-type names.
• cell_graph.edgelist is an edgelist for a between-cell graph. We will call this, G. Note that the
nodes in G correspond to the rows in X. So, node i maps to row i of X, etc.
1) Clustering on cell × marker data (7 points): Use a clustering algorithm of your choice to generate a
cell-to-cluster partition for the cells, using the matrix, X. Use normalized mutual information (NMI) to compute overlap between the true and predicted cell labels. Note that because not all cells are labeled,
you can compute this only based on the labeled cells. Feel free to use an available implementation, such as, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_
mutual_info_score.html.
2) Graph Partitioning (7 points): Use a graph clustering algorithm to partition G into clusters. Consult
course notes from days 2-3 for some ideas about this. Similar to Problem 2 – question (1) compute NMI
between the labels obtained in graph partitioning and the true cell labels.
3) (3 points) Comment on any observations you observe between the quality of the partitions obtained clustering on X in comparison to partitioning G. Which approach do you think works better, using the original
data, or the graph?
4) Rare Cell-types (5 points) Plasmacytoid_DC_cells, or pDCs (label 21) are a popular rare cell type,
meaning many clustering algorithms will not be able to reliably find them. Report the number of distinct
clusters where you found pDCs in both the clustering of X and in the partitioning of G.
Problem 2 continued on next page. . . 2
Your Name Here Comp790-Computational Biology (Homework #1) Problem 2 (continued)
5) Cell Classification (10 points) Select cells from X with the following labels, {11, 12, 17, 18} and {1, 2, 3}.
In general, cells with labels {11, 12, 17, 18} are T-cells and cells with labels {1, 2, 3} are monocytes. Convert
this to a binary classification problem by labeling T-cells with 0 and monocytes with 1. Use your favorite
classifier to predict the labels of these cells. Use an ROC curve to visualize the performance. If the performance was not good, explain what could have gone wrong. If your performance is very good, can you
identify features from X that were helpful in predicting labels?
Problem 3
(30 points total) node2vec
We will use the implementation of node2vec available in github, https://github.com/aditya-grover/
node2vec to create vector representations for each node in G encoded in cell_graph.edgelist.
1) (Clustering on Node2Vec Features (10 points)) First, use default parameters and follow the instructions in the README on the graph in cell_graph.edgelist. This will create a 128-dimensional vector
for each node. Cluster the nodes based on these vectors and compare to the ground truth labels in the
last column of Levine_matrix.csv using NMI. Compare your results to Problem 2, question 3. Does an
embedding of the graph offer any apparent advantages in classifying cells?
2) (Parameters, part 1 (5 points)) Try a few different values for the number of dimensions –dimensions,
such that some of them are less than 128, and some of them are more than 128. Cluster cells again with
the embeddings obtained in different dimensions. Again, you can compute the NMI between the cluster
assignments and the ground truth labels. Comment on some observations, and show a plot of NMI plotted
against the number of dimensions used.
3) (Parameters, part 2 (5 points)) Recall that the parameters p and q control the ‘breadth’ vs ‘depth of
the walk’. Choose one of these parameters to vary, and repeat the previous question using the default 128
dimensions, but varying values for either p or q. Comment on some observations, and show a plot of NMI
against p or q (whichever one you chose).
4) (Cell Classification, Part II (10 points)) Repeat Problem 2, question (5). However, instead of using
only X as the feature matrix, we are going to combine the marker expressions with node2vec features. Let
N be your matrix generated through node2Vec. Create a new matrix called X = [X|N]. That is, you will
simply concatenate X and N. Formulate the same classification problem from Problem 2, question (5) to
classify T-cells from monocytes. Again, report your ROC curve. Comment on the performance, especially
in comparison to the results obtained in Problem 2, question (5).
3

PlaceholderComputational Biology (Homework #1) Problem 1
$30.00
Open chat
Need help?
Hello
Can we help?