Day 3 Morning


On day 1, we ran through a pipeline to map reads against a reference genome and call variants, but didn’t do much with the variants we identified. Among the most common analyses to perform on a set of variants is to construct phylogenetic trees. Here we will explore different tools for generating and visualizing phylogenetic trees, and also see how recombination can distort phylogenetic signal.

For the first several exercises, we will use the A. baumannii genomes that we worked with yesterday afternoon. The backstory on these genomes is that Abau_A, Abau_B and Abau_C are representatives of three clones (as defined by pulsed-field gel electrophoresis - a low-resolution typing method) that were circulating in our hospital.

One of the goals of our published study was to understand the relationship among these clones to discern whether:

  1. the three clones represent three independent introductions into the hospital or
  2. the three clones originated from a single introduction into the hospital, with subsequent genomic rearrangement leading to the appearance of unique clones.

The types of phylogenetic analyses you will be performing here are the same types that we used to decipher this mystery. The other two genomes you will be using are ACICU and AB0057. ACICU is an isolate from a hospital in France, and its close relationship to our isolates makes it a good reference for comparison. AB0057 is a more distantly related isolate that we will utilize as an out-group in our phylogenetic analysis. The utility of an out-group is to help us root our phylogenetic tree, and gain a more nuanced understanding of the relationship among strains.

Execute the following command to copy files for this afternoon’s exercises to your scratch directory:



cd /scratch/micro612w19_fluxod/username

cp -r /scratch/micro612w19_fluxod/shared/data/day3_morn ./

Perform whole genome alignment with Mauve and convert alignment to other useful formats

[back to top] [HOME]

An alternative approach for identification of variants among genomes is to perform whole genome alignments of assemblies. If the original short read data is unavailable, this might be the only approach available to you. Typically, these programs don’t scale well to large numbers of genomes (e.g. > 100), but they are worth being familiar with. We will use the tool mauve for constructing whole genome alignments of our five A. baumannii genomes.

i. Perform mauve alignment and transfer xmfa back to flux

Use cyberduck/scp to get genomes folder Abau_genomes onto your laptop

Run these commands on your local system/terminal:

cd ~/Desktop (or wherever your desktop is) 

mkdir Abau_mauve

cd Abau_mauve 

- Now copy Abau_genomes folder residing in your day3_morn folder using scp or cyberduck:

scp -r ./

Run mauve to create multiple alignment

i. Open mauve 
ii. File -> align with progressiveMauve 
iii. Click on “Add Sequnce” and add each of the 5 genomes you just downloaded(under Abau_genomes folder)
iv. Name the output file “mauve_ECII_outgroup” and make sure it is in the directory you created for this exercise i.e Abau_mauve
v. Click Align! 
vi. Wait for Mauve to finish and explore the graphical interface

Use cyberduck or scp to transfer your alignment back to flux for some processing

scp ~/Desktop/Abau_mauve/mauve_ECII_outgroup 
ii. Convert alignment to fasta format

Mauve produces alignments in .xmfa format (use less to see what this looks like), which is not compatible with other programs we want to use. We will use a custom script to change the alignment format to fasta format

Now run these command in day3_morn folder on flux:

module load bioperl

perl -i mauve_ECII_outgroup -o mauve_ECII_outgroup.fasta -f fasta -c

Perform some DNA sequence comparisons and phylogenetic analysis in APE, an R package

[back to top] [HOME]

There are lots of options for phylogenetic analysis. Here, we will use the ape package in R to look at our multiple alignments and construct a tree using the Neighbor Joining method.

Note that ape has a ton of useful functions for more sophisticated phylogenetic analyses!

i. Get fasta alignment you just converted to your own computer using cyberduck or scp

cd ~/Desktop/Abau_mauve

scp ./
ii. Read alignment into R

Fire up RStudio, set your working directory to ~/Desktop/Abau_mauve/ or wherever you have downloaded mauve_ECII_outgroup.fasta file and install/load ape

Use the read.dna function in ape to read in you multiple alignments. Print out the variable to get a summary.

abau_msa = read.dna('mauve_ECII_outgroup.fasta', format = "fasta") 
iii. Get variable positions

The DNA object created by read.dna can also be addressed as a matrix, where the columns are positions in the alignment and rows are your sequences. We will next treat our alignment as a matrix, and use apply and colSums to get positions in the alignment that vary among our sequences. Examine these commands in detail to understand how they are working together to give you a logical vector indicating which positions vary in your alignment.

abau_msa_bin = apply(abau_msa, 2, FUN = function(x){x == x[1]}) 

abau_var_pos = colSums(abau_msa_bin) < 5
iv. Get non-gap positions

For our phylogenetic analysis we want to focus on the core genome, so we will next identify positions in the alignment where all our genomes have sequence.

non_gap_pos = colSums(as.character(abau_msa) == '-') == 0
v. Count number of variants between sequences

Now that we know which positions in the alignment are core and variable, we can extract these positions and count how many variants there are among our genomes. Do count pairwise variants we will use the dist.dna function in ape. The model parameter indicates that we want to compare sequences by counting differences. Print out the resulting matrix to see how different our genomes are.

abau_msa_var = abau_msa[,abau_var_pos & non_gap_pos ]
var_count_matrix = dist.dna(abau_msa_var, model = "N")
vi. Construct phylogenetic tree

Now we are ready to construct our first phylogenetic tree!

We are going to use the Neighbor Joining algorithm, which takes a matrix of pairwise distances among the input sequences and produces the tree with the minimal total distance. In essence, you can think of this as a distance-based maximum parsimony algorithm, with the advantage being that it runs way faster than if you were to apply a standard maximum parsimony phylogenetic reconstruction.

As a first step we are going to build a more accurate distance matrix, where instead of counting variants, we will measure nucleotide distance using the Jukes-Cantor model of sequence evolution. This is the simplest model of sequence evolution, with a single mutation rate assumed for all types of nucleotide changes.

dna_dist_JC = dist.dna(abau_msa, model = "JC")

Next, we will use the ape function nj to build our tree from the distance matrix

abau_nj_tree = nj(dna_dist_JC)

Finally, plot your tree to see how the genomes group.


Perform SNP density analysis to discern evidence of recombination

[back to top] [HOME]

An often-overlooked aspect of a proper phylogenetic analysis is to exclude recombinant sequences. Homologous recombination in bacterial genomes is a mode of horizontal transfer, wherein genomic DNA is taken up and swapped in for a homologous sequence. The reason it is critical to account for these recombinant regions is that these horizontally acquired sequences do not represent the phylogenetic history of the strain of interest, but rather in contains information regarding the strain in which the sequence was acquired from. One simple approach for detecting the presence of recombination is to look at the density of variants across a genome. The existence of unusually high or low densities of variants is suggestive that these regions of aberrant density were horizontally acquired. Here we will look at our closely related A. baumannii genomes to see if there is evidence of aberrant variant densities.

i. Subset sequences to exclude the out-group

For this analysis we want to exclude the out-group, because we are interested in determining whether recombination would hamper our ability to reconstruct the phylogenetic relationship among our closely related set of genomes.

  • Note that the names of the sequences might be different for you, so check that if the command doesn’t work. Try printing out abau_msa_no_outgroup to check that it worked.

abau_msa_no_outgroup = abau_msa[c('ACICU_genome','AbauA_genome','AbauC_genome','AbauB_genome'),]


abau_msa_no_outgroup = abau_msa[c('ACICU_genome.fasta/1-3996847','AbauA_genome.fasta/1-3953855','AbauB_genome.fasta/1-4014916','AbauC_genome.fasta/1-4200364', 'Abau_AB0057_genome.fa/1-4050513'),]
ii. Get variable positions

Next, we will get the variable positions, as before

abau_msa_no_outgroup_bin = apply(abau_msa_no_outgroup, 2, FUN = function(x){x == x[1]}) 

abau_no_outgroup_var_pos = colSums(abau_msa_no_outgroup_bin) < 4
iii. Get non-gap positions

Next, we will get the core positions, as before

abau_no_outgroup_non_gap_pos = colSums(as.character(abau_msa_no_outgroup) == '-') == 0
iv. Create overall histogram of SNP density

Finally, create a histogram of SNP density across the genome. Does the density look even, or do you think there might be just a touch of recombination?

hist(which(abau_no_outgroup_var_pos & abau_no_outgroup_non_gap_pos), 10000)

Perform recombination filtering with Gubbins

[back to top] [HOME]

Now that we know there is recombination, we know that we need to filter out the recombinant regions to discern the true phylogenetic relationship among our strains. In fact, this is such an extreme case (~99% of variants of recombinant), that we could be totally misled without filtering recombinant regions. To accomplish this we will use the tool gubbins, which essentially relies on elevated regions of variant density to perform recombination filtering.

i. Run gubbins on your fasta alignment

Go back on flux and load modules required by gubbins

module load bioperl python-anaconda2/201607 biopython dendropy reportlab fasttree RAxML fastml/gub gubbins

Run gubbins on your fasta formatted alignment



cd /scratch/micro612w19_fluxod/username/day3_morn

sed -i 's/.fa.*//g' mauve_ECII_outgroup.fasta -v -f 50 -o Abau_AB0057_genome mauve_ECII_outgroup.fasta
ii. Create gubbins output figure

Gubbins produces a series of output files, some of which can be run through another program to produce a visual display of filtered recombinant regions. Run the script to create a pdf visualization of recombinant regions.

The inputs are:

  1. the recombination filtered tree created by gubbins (mauve_ECII_outgroup.final_tree.tre),
  2. the pdf file to create (mauve_ECII_outgroup.recombination.pdf) and
  3. a .embl representation of recombinant regions (mauve_ECII_outgroup.recombination_predictions.embl). -t mauve_ECII_outgroup.final_tree.tre -o mauve_ECII_outgroup.recombination.pdf mauve_ECII_outgroup.recombination_predictions.embl
iii. Download and view gubbins figure and filtered tree

Use cyberduck or scp to get gubbins output files into Abau_mauve on your local system

cd ~/Desktop/Abau_mauve

scp  ./
scp  ./

Open up the pdf and observe the recombinant regions filtered out by gubbins. Does it roughly match your expectations based upon your SNP density plots?

Finally, lets look at the recombination-filtered tree to see if this alters our conclusions.

To view the tree we will use the ape package in R:

# In RStudio

# Load ape library

# Path to tree file
tree_file <- '~/Desktop/Abau_mauve/mauve_ECII_outgroup.final_tree.tre'

# Read in tree
tree <- read.tree(tree_file)

# Plot tree

How does the structure look different than the unfiltered tree?

  • Note that turning back to the backstory of these isolates, Abau_B and Abau_C were both isolated first from the same patient. So this analysis supports that patient having imported both strains, which likely diverged at a prior hospital at which they resided.

Overlay metadata on your tree using R

[back to top] [HOME]

For the final exercise we will use a different dataset, composed of USA300 methicillin-resistant Staphylococcus aureus genomes. USA300 is a strain of growing concern, as it has been observed to cause infections in both hospitals and in otherwise healthy individuals in the community. An open question is whether there are sub-clades of USA300 in the hospital and the community, or if they are all the same. Here you will create an annotated phylogenetic tree of strains from the community and the hospital, to discern if these form distinct clusters.

i. Download MRSA genome alignment from flux

Use cyberduck or scp to get genomes onto your laptop

cd ~/Desktop (or wherever your desktop is) 
mkdir MRSA_genomes 
cd MRSA_genomes

scp  ./
scp  ./

ii. Look at SNP density for MRSA alignment in R

Before we embark on our phylogenetic analysis, lets look at the SNP density to verify that there is no recombination

mrsa_msa = read.dna('2016-03-09_KP_BSI_USA300.fa', format = 'fasta') 
mrsa_msa_bin = apply(mrsa_msa, 2, FUN = function(x){x == x[1]}) 
mrsa_var_pos = colSums(mrsa_msa_bin) < nrow(mrsa_msa_bin) 
hist(which(mrsa_var_pos), 10000)

Does it look like there is evidence of recombination?

***iii. Create neighbor joining tree in R ***

We will be using R to construct a neighbor-joining tree. Neighbor joining trees fall under the category of “distance-based” tree building. There are more sophisticated algorithms for tree building, including maximum likelihood and bayesian methods.

There are a number of bioinformatics tools to create trees. Note that in your research it is not a good idea to use these phylogenetic tools completely blind and I strongly encourage embarking on deeper learning yourself, or consulting with an expert before doing an analysis for a publication.

One resource here at University of Michigan is the Phylogenetic Methods course, EEB 491.

First, we will need to install and load in some packages to work with tree objects in R.


library(ape) #installed previously, load in if not already in your environment 

Next, we will create a pairwise SNV distance matrix of the MRSA samples to create a neighbor joining tree.

mrsa_msa_var = mrsa_msa[, mrsa_var_pos]
dna_dist = dist.dna(mrsa_msa_var, model = 'N', as.matrix = TRUE)

Finally, we can use the distance matrix to construct a neighbor joining tree using the function NJ()

NJ_tree = NJ(dna_dist) 

We can look at our tree using plot()


We can explore other representations of our tree in R using the flag “type” in the plot function

plot(NJ_tree, type = 'fan')
plot(NJ_tree, type = 'cladogram')
plot(NJ_tree, type = 'phylogram') #default
***iv. Add annotations to tree in R ***

Now, we will overlay our data on whether an isolate was from a community or hospital infection onto the tree.

Here is a great example of some of the different ways to annotate your trees in R:

First, we will read in our metadata.

metadata = read.table('HA_vs_CA', header = TRUE, stringsAsFactors = FALSE)

Next, we will create our isolate legend and assign colors to the legend.

isolate_legend = structure(metadata[,2], names = metadata[,1])
isolate_colors = structure(c('red', 'blue'), names = sort(unique(isolate_legend)))

Finally, we can use the function tiplabels() to add annotations to our tree and the function legend() to put a legend on our tree.

plot(NJ_tree, type = 'fan', no.margin = TRUE, 
     cex = 0.5, label.offset = 3, align.tip.label = FALSE)
tiplabels(pie = to.matrix(isolate_legend,names(isolate_colors)), 
          piecol = isolate_colors, cex = 0.3, adj = 1.4)
legend('bottomleft', legend = names(isolate_colors), 
       col = "black", = isolate_colors, pch = 21, cex = 1)
  • Do community and hospital isolates cluster together, or are they inter-mixed?

Note, there are also web tools out there to overlay metadata onto your tree and to manipulate the tree in other ways. One such tool is iTOL.