Assignment 4: Variant Analysis and Mappability


5/5 - (2 votes)

## Assignment 4: Variant Analysis and Mappability

Some of the tools you will need to use may only run in a linux or mac environment.
If you do not have access to a linux/mac machine, download and install a virtual
machine or ubuntu instance following the directions here:

Alternatively, you might also want to try out this docker instance that has these tools preinstalled:

### Assignment Overview

In this assignment you will take a look at mappability, get some experience with small variant analysis, and analyze some variant data.

For questions 1, 2 and 4 of this assignment, you will be using chromosome 22 from build 38 of the human genome – download it here: []( Whenever we refer to chromosome 22 in this assignment, this is what you should use.

As a reminder, any questions about the assignment should be posted to [Piazza](

### Question 1. Mappability Analysis [15 pts]

– 1a. Using k = 100, extract every k-mer from chromosome 22, and align them back to chromosome 22 using bowtie2. How many 100-mers are mapped back to the correct position (the place they were extracted from) with mapq >= 20? How many are mapped with mapq >= 20, but to an incorrect position?

– 1b. We consider k-mers that are mapped back to their correct position in the genome with mapq >= 20 to be uniquely mappable. For each position in chromosome 22, count how many uniquely mappable k-mers are mapped to it. Make a histogram of the number of positions in chromosome 22 with a given number of uniquely mappable k-mers overlapping them. Use k = 100, and the 100-mers you generated in 1a. [Hint: What is the maximum number of k-mers that can uniquely map to a single position? How many 100-mers can overlap one base in chromosome 22?]

– 1c. Repeat the previous two questions for k = 25, 50 and 200, and discuss (in a couple of sentences) the effect of k on how many k-mers are uniquely mappable and on how many uniquely mappable k-mers overlap each position in the genome.

### Question 2. Small Variant Analysis [10 pts]

Download the read set from here:

For this question, you may find this tutorial helpful:

– 2a. Using bowtie2, how many reads align to the chromosome 22 reference? How many reads did not align? How many aligned reads had a mate that did not align (AKA singletons)? Count each read in a pair separately.
[Hint: Build the index using `bowtie2-build`, align reads using `bowtie2`, analyze with `samtools flagstat`.]

– 2b. How many reads are mapped to the reverse strand? Count each read in a pair separately.
[Hint: Find out what SAM flags mean [here]( and use samtools view.]

– 2c. How many high-quality (QUAL > 20) single nucleotide and indel variants does the sample have? Of the indels, how many are insertions and how many are deletions?
[Hint: Identify variants using `freebayes` – sort the SAM file first. Then call variants with freebayes. Filter using `bcftools filter`, and summarize using `bcftools stats`.]

### Question 3. Binomial Distribution [10 pts]

– 3a. For coverage n = 10 to 200, calculate the maximum number of minor allele reads (round down) that would make your one-sided binomial test reject the null hypothesis p=0.5 at 0.05 significance. Plot coverage on the x-axis and (number of reads)/(coverage) on the y-axis. Note this is the minimum number of reads that are necessary to believe we might have a heterozygous variant on the second haplotype rather than just mere sequencing error.

– 3b. What asymptote does the plot seem to approach? Why is this?

#### Question 4. De novo mutation analysis [20 pts]

For this question, we will be focusing on the de novo variants identified in this paper:<br>

Download the de novo variant positions from here (Supplementary Table S4):<br>

Download the gene annotation of the human genome here: <br>

Download the annotation of regulatory variants from here:<br>

**NOTE** The variants are reported using version 37 of the reference genome, but the annotation is for version 38. Fortunately, you can ‘lift-over’ the variants to the coordinates on the new reference genome using several avaible tools. I recommmend the [UCSC liftover tool]( that can do this in batch by converting the variants into BED format. Note, some variants may not successfully lift over, especially if they become repetitive and/or missing in the new reference, so please make a note of how many variants fail liftover.

– 4a. How much of the genome is annotated as a gene?

– 4b. What is the sequence of the shortest gene on chromosome 22? [Hint: `bedtools getfasta`]

– 4c. How much of the genome is an annotated regulatory sequence (any type)? [Hint `bedtools merge`]

– 4d. How much of the genome is neither gene nor regulatory sequences? [Hint: `bedtools merge` + `bedtools subtract`]

– 4e. Using the [UCSC liftover tool](, how many of the variants can be successfully lifted over to hg38?

– 4f. How many variants are in genes? [Hint: convert xlsx to BED, then `bedtools`]

– 4g. How many variants are in *any* annotated regulatory regions? [Hint: `bedtools`]

– 4h. What type of annotated regulatory region has the most variants? [Hint: `bedtools`]

– 4i. Is this a statistically significant number of variants (P-value < 0.05)? [Hint: If you don’t want to calculate this analytically, you can do an experiment. Try simulating the same number of variants as the original file 100 times, and see how many fall into this regulatory type. If at least this many variants fall into this feature type more than 5% of the trials, this is not statistically significant]

### Packaging

The solutions to the above questions should be submitted as a single PDF document that includes your name, email address, and
all relevant figures (as needed). Make sure to clearly label each of the subproblems and give the exact commands and/or code snippets you used for
solving the question. You do not need to show code for plotting. Submit your solutions by uploading the PDF to [GradeScope](, and remember to select where in your submission each question/subquestion is.

If you submit after this time, you will start to use up your late days. Remember, you are only allowed 96 hours (4 days) for the entire semester!

## Resources

#### [Bowtie2]( – Short read aligner

conda install bowtie2
bowtie2-build chr22.fa chr22
bowtie2 -x chr22 -1 sample/pair.1.fq -2 sample/pair.2.fq > sample.sam

While running alignments, if you get an error `Warning: Exhausted best-first chunk memory for read XXXX; skipping read` increase the command-line parameter `–chunkmbs`.

#### [FreeBayes]( – Small Variant Identification

conda install freebayes
freebayes -f chr22.fa sample.bam > sample.vcf

#### [bcftools]( – VCF Summarry

conda install bcftools
bcftools filter -e “QUAL<20” sample.vcf > filtered.vcf
bcftools stats filtered.vcf > stats.txt

#### [BEDTools]( – Genome Arithmetic

Get bedtools from conda, if you haven’t already.

conda install bedtools

PlaceholderAssignment 4: Variant Analysis and Mappability
Open chat
Need help?
Can we help?