Microbial Comparative Genomics Workshop - 2021

A 3-day microbial bioinformatics workshop conducted by Dr. Evan Snitkin at the University of Michigan. This module covers the basics of microbial genomic analysis using publicly available tools that are commonly referenced in the genomics literature. Students will learn the steps and associated tools required to process, annotate and compare microbial genomes.

Date: April 21 - 23 2021

Prerequisites

Prior participation in a Data Carpentry Workshop

Workshop

Day 1 Morning

[HOME]

Goal

  • This morning we will learn how to set up our unix and conda environment, which is a necessity when it comes to working on the command line.
  • Setting up an environment will make our lives easier and running commands more enjoyable.
  • We will brush up on a few unix programs that some of you learned in the Data Carpentry workshop and see how they can be employed for accessing and parsing omics datasets.
  • We will also learn how for loops and awk can be employed to parse and extract complex information from common bioinformatics file formats.
  • At the end of the session, an R exercise will give you an overview of how you can parse and visualize omics datasets.

Installing and setting up Cyberduck for file transfer

During the workshop, we will transfer various output files from Great Lakes to your local system. Cyberduck makes it easy to drag and drop any remote file onto your local system and vice versa. Of course, you can use “scp” to transfer files, but Cyberduck provides a graphical interface to manage file transfers and helps you avoid typing long file paths and commands.

1. Go to this cyberduck website and download the executable for your operating system.
2. Double-click the downloaded zip file to unzip it, then double-click the Cyberduck icon.
3. Type sftp://greatlakes-xfer.arc-ts.umich.edu in the Quick Connect bar (or click the Open Connection button in the top left corner), press enter, and enter your Great Lakes username and password.
4. This will take you to your Great Lakes home directory /home/username. Select “Go” from the toolbar at the top, then select “Go to folder” and enter the workshop home directory path: /scratch/micro612w21_class_root/micro612w21_class/

To transfer or upload a file, you can drag and drop it into the location you want.

Getting your data onto Great Lakes and setting up environment variables

Log in to Great Lakes

ssh username@greatlakes.arc-ts.umich.edu

Set up environment variables in your .bashrc file so your environment is all set for genomic analysis!

Environment variables are variables/values that describe the environment in which programs run. All the programs and scripts on your unix system use these variables to extract information such as:

  • What is my current working directory?
  • Where are temporary files stored?
  • Where are perl/python libraries?
  • Where is Blast installed? etc.
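A few such variables are already defined in every shell session, and you can print their values with echo (these are standard shell variables, not anything specific to the workshop):

# Print a few environment variables that the shell already defines
echo $HOME    # your home directory
echo $PWD     # your current working directory
echo $PATH    # the list of directories the shell searches for programs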

In addition to the environment variables that are set up by system administrators, each user can set their own environment variables to customize their experience. This may sound like something super advanced that isn’t relevant to beginners, but that’s not true!

Some examples of ways that we will use environment variables in the class are:

  1. Create shortcuts for directories that you frequently go to,
  2. Set up a conda environment to install all the required tools and have them available in your environment,
  3. Set up a shortcut for getting onto a cluster node, so that you don’t have to write out the full command each time.

One way to set your environment variables would be to manually set them every time you log in, but this would be extremely tedious and inefficient. So, Unix has a way around this, which is to put your environment variable assignments in special files called .bashrc or .bash_profile. Every user has one or both of these files in their home directory, and what’s special about them is that the commands in them are executed every time you log in. So, if you simply put your environment variable assignments in one of these files, your environment will be set up just the way you want it each time you log in!

All the software/tools that we need in this workshop are installed in the directory “/scratch/micro612w21_class_root/micro612w21_class/shared/bin/”, and we want the shell to look for installed tools in this directory. To do this, we will save the full path to these tools in the environment variable PATH.

i. Make a backup copy of your .bashrc file in case something goes wrong.

cp ~/.bashrc ~/bashrc_backup

#Note: "~/" represents your home directory. On Great Lakes, this means /home/username
ii. Open the ~/.bashrc file using any text editor and add the following lines at the end of your .bashrc file.

Note: Replace “username” under alias shortcuts with your own umich “uniqname”.

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/sw/arcts/centos7/python3.7-anaconda/2019.07/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/sw/arcts/centos7/python3.7-anaconda/2019.07/etc/profile.d/conda.sh" ]; then
        . "/sw/arcts/centos7/python3.7-anaconda/2019.07/etc/profile.d/conda.sh"
    else
        export PATH="/sw/arcts/centos7/python3.7-anaconda/2019.07/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

##Micro612 Workshop ENV

#Aliases
alias islurm='srun --account=micro612w21_class --nodes=1 --ntasks-per-node=1 --mem-per-cpu=5GB --cpus-per-task=1 --time=12:00:00 --pty /bin/bash'
alias wd='cd /scratch/micro612w21_class_root/micro612w21_class/username/'
alias d1m='cd /scratch/micro612w21_class_root/micro612w21_class/username/day1am'
alias d1a='cd /scratch/micro612w21_class_root/micro612w21_class/username/day1pm'
alias d2m='cd /scratch/micro612w21_class_root/micro612w21_class/username/day2am'
alias d2a='cd /scratch/micro612w21_class_root/micro612w21_class/username/day2pm'
alias d3m='cd /scratch/micro612w21_class_root/micro612w21_class/username/day3am'
alias d3a='cd /scratch/micro612w21_class_root/micro612w21_class/username/day3pm'


#Great Lakes Modules
#module load Bioinformatics
#module load perl-modules



#Bioinformatics Tools
export PATH=$PATH:/scratch/micro612w21_class_root/micro612w21_class/shared/bin/quast/

Note: Replace “username” under the alias shortcuts with your own umich “uniqname”. In the text editor nano, you can do this by:

  • typing Ctrl + \. You will then be prompted to type in your search string (here, username).
  • Press return. Then you will be prompted to enter what you want to replace “username” with (here, your uniqname).
  • Press return. Then press a to replace all instances or y to accept each instance one by one.

You can also customize the alias names such as wd, d1m, etc. to suit your own needs and convenience.

The above environment settings set up various shortcuts such as “islurm” for entering an interactive Great Lakes session and “wd” for navigating to your workshop directory, load the necessary Great Lakes modules and perl libraries required by certain tools, and finally add the bioinformatics programs that we will run during the workshop to your path.

iii. Save the file and source the .bashrc file so that these changes take effect.

source ~/.bashrc
iv. Check if the $PATH environment variable is updated

echo $PATH

#You will see a long list of paths that has been added to your $PATH variable

wd

You should now be in your workshop working directory, /scratch/micro612w21_class_root/micro612w21_class/username
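You can confirm your current location with pwd:

pwd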

v. Set up a conda environment using a YML file

The YML file (micro612.yml) required for generating the conda environment is located here:

/scratch/micro612w21_class_root/micro612w21_class/shared/conda_envs/

Load the Great Lakes anaconda package and set up a conda environment in the following way:

# Load anaconda package from great lakes 
module load python3.7-anaconda/2019.07

# Set channel_priority to false so that it can install packages as per the YML file and not from loaded channels.
conda config --set channel_priority false

# Create a new conda environment - micro612 from a YML file
conda env create -f /scratch/micro612w21_class_root/micro612w21_class/shared/conda_envs/micro612.yml -n micro612

# Load your environment and use the tools
conda activate micro612

# Check if the tools were properly installed by conda and are callable from your environment 
bash /scratch/micro612w21_class_root/micro612w21_class/shared/conda_envs/check_micro612_installation.sh 

# If PyVCF and biopython fail to install from the conda channels, install them with pip
pip install PyVCF --user
pip install biopython --user

# Update one of the databases that you would need in one of the Kraken exercises 
ktUpdateTaxonomy.sh

Unix is your friend

Up until now you’ve probably accessed sequence data from NCBI by going to the website, laboriously clicking around and finally finding and downloading the data you want.

There are a lot of reasons why that is not ideal:

  • It’s frustrating and slow to deal with the web interface
  • It can be hard to keep track of where the data came from and exactly which version of a sequence you downloaded
  • It’s not conducive to downloading lots of sequence data

To download sequence data in Unix you can use a variety of commands (e.g. sftp, wget, curl). Here, we will use the curl command to download some genome assemblies from the NCBI FTP site:

  • Go to your class home directory (use your wd shortcut!)
  • Execute the following commands to copy files for this morning’s exercises to your home directory:
cp -r /scratch/micro612w21_class_root/micro612w21_class/shared/data/day1am/ ./

cd day1am/

#or 

d1m

ls
  • Now get three genome sequences with the following commands:
curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/241/685/GCF_000241685.1_ASM24168v2/GCF_000241685.1_ASM24168v2_genomic.fna.gz >Acinetobacter_baumannii.fna.gz

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/409/005/GCF_000409005.1_gkp33v01/GCF_000409005.1_gkp33v01_genomic.fna.gz > Kleb_pneu.fna.gz

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/165/655/GCF_000165655.1_ASM16565v1/GCF_000165655.1_ASM16565v1_genomic.fna.gz > E_coli.fna.gz
  • Decompress the gzip compressed fasta file using gzip command
gzip -d Acinetobacter_baumannii.fna.gz
gzip -d Kleb_pneu.fna.gz
gzip -d E_coli.fna.gz

These files are genome assemblies in fasta format. Fasta files are a common sequence data format that is composed of alternating sequence headers (sequence names and comments) and their corresponding sequences. Importantly, the sequence header lines must start with “>”. These genome assemblies have one header line for each contig in the assembly, and our goal will be to count the number of contigs/sequences. To do this we will string together two Unix commands: “grep” and “wc”. “grep” (which stands for global regular expression print) is an extremely powerful pattern matching command, which we will use to identify all the lines that start with a “>”. “wc” (which stands for word count) is a command for counting words, characters and lines in a file. To count the number of contigs in one of your fasta files enter:

grep ">" E_coli.fna | wc -l

Try this command on other assemblies to see how many contigs they contain.

Your first sequence analysis program!!!

OK, so now that we have a useful command, wouldn’t it be great to turn it into a program that you can easily apply to a large number of genome assemblies? Of course it would! So, now we are going to take our cool contig counting command and put it in a shell script that applies it to all files in a desired directory.

There will be times when you have multiple sets of files in a folder, in which case it becomes cumbersome to run individual commands on each file. To simplify this task, most programming languages have a concept of loops that can be employed to repeat a task/command on a bunch of files. Here we have three fasta files for which we want to know the number of contigs in each file. We can either run the above grep command separately on each file or use it in a “for” loop that iterates through a set of values/files until that list is exhausted.

Try the example for loop below, which loops over a bunch of numbers and prints out each value until the list is exhausted.

for i in 1 2 3 4 5; do echo "Looping ... number $i"; done

A simple for loop statement consists of three sections:

  1. a for statement that loops through values and files
  2. a do statement, which can be any type of command that you want to run on a file, or a tool that uses the current loop value as an input
  3. a done statement that indicates completion of the do statement.

Note that the list of values (1 2 3 4 5) in the above for loop can be anything at all. It can be a bunch of files in a folder with a specific extension (*.gz, *.fasta, *.fna) or a list of values generated through a separate command that we will see later.
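As a quick illustration (assuming you are in the day1am directory containing the .fna files you just downloaded), the same construct can loop over files matched by a wildcard:

# Loop over every .fna file in the current directory and print its name
for fasta in *.fna; do echo "Looping over file $fasta"; done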

We will incorporate a similar for loop in the fasta_counter.sh script, which will loop over all the *.fna files in a folder. We will provide the name of the folder through a command line argument and count the number of contigs in each file. A command line argument is a type of input that can be provided to a script, which can then be used in different ways inside the script. fasta_counter.sh needs to know which directory to look in for *.fna files. For this purpose, we will use positional parameters, a series of special variables ($0 through $9) that contain the contents of the command line.

Let’s take an example to understand what those parameters stand for:

./some_program.sh Argument1 Argument2 Argument3

In the above command, we provide three command line arguments that act as input to some_program.sh. These command line arguments can then be used inside the script, in the form of $0, $1, $2 and so on, to run a command or a tool.

Inside the script, $0 will always be the name of the script itself, $1 would contain “Argument1”, $2 would contain “Argument2”, and so on. A minimal example script that just prints out its positional parameters is sketched below.
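As a minimal sketch (print_args.sh is a hypothetical name, not one of the workshop files), such a script could look like this:

#!/bin/bash
# print_args.sh - echo the script name and its first three positional parameters
echo "Script name: $0"
echo "First argument: $1"
echo "Second argument: $2"
echo "Third argument: $3"

Running ./print_args.sh Argument1 Argument2 Argument3 would print the script name followed by each argument on its own line.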

Let’s try to incorporate a for loop inside the fasta_counter.sh script that uses the first command line argument (i.e. the directory name), searches for *.fna files in that directory, and runs the contig counting command on each of them.

  • Open “fasta_counter.sh” in nano or your favourite text editor and follow instructions for making edits so it will do what we want it to do
  • Run this script in day1am directory and verify that you get the correct results. Basic usage of the script will be:

./fasta_counter.sh

./fasta_counter.sh .

The “.” sign tells the script to use the current directory as its first command line argument ($1).
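As a rough sketch of the kind of loop the finished script might contain (the variable names here are only illustrative; follow the instructions inside fasta_counter.sh for the actual edits), it combines the positional parameter with a for loop and the grep/wc command from above:

#!/bin/bash
# fasta_counter.sh (sketch) - count contigs in every .fna file in a given directory
dir=$1   # first command line argument: the directory to search

for fasta in "$dir"/*.fna; do
    contigs=$(grep ">" "$fasta" | wc -l)   # count header lines
    echo "$fasta: $contigs contigs"
done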

Power of Unix commands

In the Software Carpentry workshop, you learned to work with the shell and automate simple tasks using basic unix commands. Let’s see how some of these commands can be employed in genomics analysis while exploring various file formats that we use in day-to-day analysis. In this session, we will explore two different bioinformatics file formats:

gff: used for describing genes and other features of DNA, RNA and protein sequences

fastq: used for storing biological sequence / sequencing reads (usually nucleotide sequence) and its corresponding quality scores

Exploring GFF files

The GFF (General Feature Format) format is a tab-separated file and consists of one line per feature, each containing 9 columns of data.

column 1: seqname - name of the genome or contig or scaffold

column 2: source - name of the program that generated this feature, or the data source (database or project name)

column 3: feature - feature type name, e.g. Gene, exon, CDS, rRNA, tRNA, CRISPR, etc.

column 4: start - Start position of the feature, with sequence numbering starting at 1.

column 5: end - End position of the feature, with sequence numbering starting at 1.

column 6: score - A floating point value.

column 7: strand - defined as + (forward) or - (reverse).

column 8: frame - One of ‘0’, ‘1’ or ‘2’. ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on..

column 9: attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature such as gene name, product name etc.

  • Use less to explore first few lines of a gff file sample.gff

less sample.gff

Note: lines starting with pound sign “#” represent comments and are used to document extra information about the features.

You will notice that the GFF format follows the version 3 specification (“##gff-version 3”), followed by the genome name (“#Genome: 1087440.3|Klebsiella pneumoniae subsp. pneumoniae KPNIH1”), the date (“#Date:02/09/2017”) when it was generated, the contig names (“##sequence-region”), and finally tab-separated lines describing the features.

You can press space bar on keyboard to read more lines and “q” key to exit less command.

  • Question: Suppose you want to find out the number of annotated features in a gff file. How would you achieve this using grep and wc?
Solution
grep -v '^#' sample.gff | wc -l
  • Question: How about counting the number of rRNA features in a gff file (third column) using grep, cut and wc? You can check the usage for cut by typing “cut --help”
Solution

cut -f 3 sample.gff | grep 'rRNA' | wc -l

#Or number of CDS or tRNA features?

cut -f 3 sample.gff | grep 'CDS' | wc -l
cut -f 3 sample.gff | grep 'tRNA' | wc -l

#Note: In the above command, we are trying to extract feature information from third column.
  • Question: Try counting the number of features on the “+” or “-” strand (column 7). One possible approach is sketched below.
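A sketch of one possible solution (not the only way) pulls out column 7 and tabulates the values:

# Count features on each strand by tabulating column 7 (skip comment lines first)
grep -v '^#' sample.gff | cut -f 7 | sort | uniq -c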

Now, let’s use what we learned above about for loops and creating shell scripts to create a script called “feature_counter.sh”. This script will take a directory as input, search for all gff files in that directory, and calculate and output the number of tRNA features in each gff file.

  • Open “feature_counter.sh” in nano or your favourite text editor and follow instructions for making edits so it will do what we want it to do
  • Run this script in day1am directory and verify that you get the correct results. Basic usage of the script will be:

./feature_counter.sh

./feature_counter.sh .
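Again, as a rough sketch only (follow the instructions inside feature_counter.sh for the actual edits), the script follows the same pattern as fasta_counter.sh, but counts tRNA features in column 3 of each gff file:

#!/bin/bash
# feature_counter.sh (sketch) - count tRNA features in every .gff file in a given directory
dir=$1   # first command line argument: the directory to search

for gff in "$dir"/*.gff; do
    trnas=$(cut -f 3 "$gff" | grep -c 'tRNA')   # count tRNA lines in column 3
    echo "$gff: $trnas tRNA features"
done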

Some more useful one-line unix commands for GFF files: here

Now we’re going to play around with a GFF in R. Specifically, we’re interested in looking at the distribution of gene length for all of the genes in the gff file.

Copy the sample.gff file to your computer using scp or cyberduck:

scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/shared/data/day1am/sample.gff ~/Desktop/

Note: You can use your choice of folder/path to copy the file instead of  “~/Desktop/”

Open a text file in RStudio and run the following commands:

# Plot histogram of gene lengths

# Read in gff file
gff = read.delim('~/Desktop/sample.gff',
                 comment.char = '#', # ignore lines that start with '#'
                 header=F,  #  no header
                 stringsAsFactors = FALSE) # do not read in as data type "factor"

# Rename columns
colnames(gff) = c('seqname','source','feature','start','end','score','strand','frame','attribute')

# Look at the head of the gff file
head(gff)

# Get the gene lengths
gene_lengths = gff$end - gff$start

# Plot a histogram of the gene lengths
hist(gene_lengths,
     breaks = 100, # 100 cells
     xlab = 'Gene Length (bp)', # change x label
     main = '') # no title

What information do you learn about gene lengths in this genome?

  • Challenge: Now that you know the gene lengths, get the length of the smallest and largest gene. Then get the attribute of the smallest and largest gene.
Solution
# Length of smallest and largest gene
min(gene_lengths) 
max(gene_lengths) 

# Attribute of smallest and largest gene
gff$attribute[gene_lengths == min(gene_lengths)]
gff$attribute[gene_lengths == max(gene_lengths)]

# Or you could use which.min() and which.max()
gff$attribute[which.min(gene_lengths)]
gff$attribute[which.max(gene_lengths)]


Submit Variant Calling Job

Before we go on a break, we will submit a variant calling job that runs all the standard variant calling commands on a sample that we will explore in this afternoon’s session. The script runs all the necessary commands associated with variant calling in an automated fashion. This will give us ample time to explore the commands that are involved in each of the steps and to explore the results that the script generates.

We will come back later to the script to understand some of the basics of shell scripting and how different commands can be tied together to run a standard process on a bunch of samples.

  • Go to your class home directory (use your wd shortcut!)
  • Execute the following commands to copy files for this afternoon’s exercises to your home directory:
wd


cp -r /scratch/micro612w21_class_root/micro612w21_class/shared/data/day1pm/ ./

We will be using sequencing reads from an Illumina-sequenced Klebsiella pneumoniae genome (sample PCMP_H326) as an input for these exercises. This sample, isolated from a hospitalized patient, is resistant to colistin, an antibiotic of last resort. We are interested in seeing if we can identify any mutations in the PCMP_H326 genome that could explain why this sample is resistant to colistin. Colistin resistance can arise through various mutations (see this review). To narrow our initial search, we will specifically look for mutations that inactivate the mgrB gene, a negative regulator of the PhoPQ two-component signalling system.

Change directory to the variant_calling folder inside the day1pm directory and list the files to find the variant_call.sh script.

d1a

cd variant_calling/

#or

cd /scratch/micro612w21_class_root/micro612w21_class/username/day1pm/variant_calling/

ls variant_call.sh

Try running the script with the help menu and check all the inputs that are required by the script to run variant calling.


./variant_call.sh -h

USAGE: variant_call.sh forward_paired reverse_paired reference_genome output_directory basename [-h] – A simple shell script to run Variant Calling steps on a pair of fastq reads.

The script requires following positional arguments as input to call variants:

  1. Forward Paired end reads
  2. Reverse Paired end reads
  3. Path to Reference Genome Fasta file
  4. Output Directory Path
  5. Analysis Base name to store result files with this prefix.

The day1pm directory also contains a slurm script called variant_call.sbat. We will run variant_call.sh on the Great Lakes Cluster using the following command (you will see this command in the variant_call.sbat).


./variant_call.sh PCMP_H326_R1.fastq.gz PCMP_H326_R2.fastq.gz KPNIH1.fasta ./ PCMP_H326_ 1>variant_call.log 2> variant_call.err

Note: An error came up while testing the availability of python packages. Before submitting the script, try running parse_snpEff.py under the scripts folder to check if the script runs without any error:

python scripts/parse_snpEff.py -h

If the script runs without any error, then you are good to move forward with submitting the slurm script. But if the script raises an error, try installing these two packages with pip:

pip install PyVCF --user
pip install biopython --user

Once you are done editing the slurm script, you can go ahead and submit the job. Make sure you are submitting the job from the variant_calling folder and that you have activated the conda environment. We will go through each of the variant calling result folders and explore the results in the afternoon session.

Change the EMAIL_ADDRESS section (#SBATCH --mail-user=) of the slurm script to your email address using your favorite text editor (we learned nano in the pre-workshop).


nano variant_call.sbat
conda activate micro612

sbatch variant_call.sbat 
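Once the job has been submitted, you can check its status with squeue, a standard Slurm command (replace username with your uniqname):

# Check the status of your submitted jobs
squeue -u username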

Day 1 Afternoon

[HOME]

Goal

  • We will perform quality control on raw illumina fastq reads to detect contamination and assess the quality of reads using Kraken and FastQC.
  • Clean reads by removing low quality reads and adapters using Trimmomatic.
  • Explore the bioinformatics steps/commands involved in the variant calling script that we ran this morning, explore and play with the various output files generated by the script, and visualize BAM and VCF files in IGV.

Overview of Genomics Pipeline

Two different ways we can process raw reads include 1) variant calling and 2) genome assembly. We’ll talk about both in this course, and we’ll keep coming back to this roadmap to give some perspective on where we are in the pipeline.

[Figure: genomics_pipeline.png - Mile high view of a genomics pipeline]

Contamination Screening using Kraken

[Figure: genomics_pipeline_qc.png - QC-ing]

When running a sequencing pipeline, it is very important to make sure that your data meet appropriate quality thresholds and are free from any contaminants. This step will help you make correct interpretations in downstream analysis and will also tell you whether you need to redo the experiment/library preparation or remove contaminant sequences.

For this purpose, we will employ Kraken, which is a taxonomic sequence classifier that assigns taxonomic labels to short DNA reads. We will screen our samples against a MiniKraken database (a pre-built database constructed from complete bacterial, archaeal, and viral genomes in RefSeq) and confirm whether the majority of reads in our sample belong to the target species.

i. Get an interactive cluster node to start running programs. Use the islurm shortcut that we created in the .bashrc file to get into an interactive slurm session.

How do you know if you are in an interactive session? You should see “username@glXXXX” in your command prompt, where XXXX refers to the cluster node number.

islurm

conda activate micro612

Navigate to kraken directory placed under day1pm directory.

cd /scratch/micro612w21_class_root/micro612w21_class/username/day1pm/kraken/
ii. Let’s run kraken on samples MRSA_CO_HA_473_R1_001.fastq.gz and MRSA_CO_HA_479_R1_001.fastq.gz, which were part of the same sequencing library

Since Kraken takes time to run, we have already placed the output of Kraken command in day1pm/kraken directory.

# Don't run these commands. MRSA_CO_HA_473_kraken and MRSA_CO_HA_479_kraken are already placed in the day1pm/kraken directory
kraken --quick --fastq-input --gzip-compressed --unclassified-out MRSA_CO_HA_473_unclassified.txt --db minikraken_20171013_4GB/ --output MRSA_CO_HA_473_kraken MRSA_CO_HA_473_R1_001.fastq.gz


kraken --quick --fastq-input --gzip-compressed --unclassified-out MRSA_CO_HA_479_unclassified.txt --db minikraken_20171013_4GB/ --output MRSA_CO_HA_479_kraken MRSA_CO_HA_479_R1_001.fastq.gz
iii. Run Kraken report to generate a concise summary report of the species found in reads file.
# Update the taxonomy database before generating kraken report
ktUpdateTaxonomy.sh

kraken-report --db minikraken_20171013_4GB/ MRSA_CO_HA_473_kraken > MRSA_CO_HA_473_kraken_report.txt

kraken-report --db minikraken_20171013_4GB/ MRSA_CO_HA_479_kraken > MRSA_CO_HA_479_kraken_report.txt

The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:

  1. Percentage of reads covered by the clade rooted at this taxon
  2. Number of reads covered by the clade rooted at this taxon
  3. Number of reads assigned directly to this taxon
  4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply ‘-’.
  5. NCBI taxonomy ID
  6. indented scientific name
less MRSA_CO_HA_473_kraken_report.txt

Let’s extract the rows at the species level (column 4 equal to “S”) and check the major species identified in our samples.

awk '$4 == "S" {print $0}' MRSA_CO_HA_473_kraken_report.txt | head

awk '$4 == "S" {print $0}' MRSA_CO_HA_479_kraken_report.txt | head
  • What is the percentage of the majority species in MRSA_CO_HA_473 and MRSA_CO_HA_479?
  • How different do they look?
  • What is the percentage of Staphylococcus aureus in MRSA_CO_HA_479?
  • The majority of reads in MRSA_CO_HA_479 remained unclassified. What could be the possible explanation? (One way to pull out these numbers is sketched below.)
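As a sketch of one approach (you could also just scan the output of the awk commands above), filter the species-level or unclassified lines and look at column 1, the percentage of reads:

# Percentage of reads assigned to Staphylococcus aureus in MRSA_CO_HA_479
awk -F'\t' '$4 == "S"' MRSA_CO_HA_479_kraken_report.txt | grep 'Staphylococcus aureus'

# Percentage of unclassified reads (rank code "U")
awk -F'\t' '$4 == "U"' MRSA_CO_HA_479_kraken_report.txt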

Let’s visualize the same information in an interactive form.

iv. Generate a HTML report to visualize Kraken report using Krona
cut -f2,3 MRSA_CO_HA_473_kraken > MRSA_CO_HA_473_krona.input
cut -f2,3 MRSA_CO_HA_479_kraken > MRSA_CO_HA_479_krona.input

ktImportTaxonomy MRSA_CO_HA_473_krona.input -o MRSA_CO_HA_473_krona.out.html
ktImportTaxonomy MRSA_CO_HA_479_krona.input -o MRSA_CO_HA_479_krona.out.html

In case you get an error saying - Taxonomy not found, run ktUpdateTaxonomy.sh command again.

ktUpdateTaxonomy.sh

Use the scp command as shown below or use cyberduck. If you don’t see the file in the cyberduck window, try refreshing it using the refresh button at the top.


scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day1pm/kraken/*.html /path-to-local-directory/

#You can use ~/Desktop/ as your local directory path

Quality Control using FastQC

Now we will run FastQC on some raw sample data to assess its quality. FastQC is a quality control tool that reads in sequence data in a variety of formats (fastq, bam, sam) and can either provide an interactive application to review the results or create an HTML-based report which can be integrated into any pipeline. It is generally the first step that you take upon receiving sequence data from a sequencing facility, to get a quick sense of its quality and whether it exhibits any unusual properties (e.g. contamination or unexpected biological features).

i. In your day1pm directory, create a new directory for saving FastQC results.
cd /scratch/micro612w21_class_root/micro612w21_class/username/day1pm/

#or

d1a

mkdir Rush_KPC_266_FastQC_results
mkdir Rush_KPC_266_FastQC_results/before_trimmomatic
ii. Verify that FastQC is in your path by invoking it from command line.
fastqc -h

FastQC can be run in two modes: “command line” or as a GUI (graphical user interface). We will be using command line version of it.

iii. Run FastQC to generate quality report of sequence reads.
fastqc -o Rush_KPC_266_FastQC_results/before_trimmomatic/ Rush_KPC_266_1_combine.fastq.gz Rush_KPC_266_2_combine.fastq.gz --extract

This will generate two results directories, Rush_KPC_266_1_combine_fastqc and Rush_KPC_266_2_combine_fastqc, in the output folder Rush_KPC_266_FastQC_results/before_trimmomatic/ provided with the -o flag.

The summary.txt file in these directories indicates if the data passed different quality control tests in text format.

You can visualize and assess the quality of data by opening html report in a local browser.

iv. Exit your cluster node so you don’t waste cluster resources and $$$!
v. Download the FastQC html report to your home computer to examine using scp or cyberduck

scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day1pm/Rush_KPC_266_FastQC_results/before_trimmomatic/*.html /path-to-local-directory/

The analysis in FastQC is broken down into a series of analysis modules. The left hand side of the main interactive display, or the top of the HTML report, shows a summary of the modules which were run, and a quick evaluation of whether the results of each module seem entirely normal (green tick), slightly abnormal (orange triangle) or very unusual (red cross).

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_morning/1.pngalt tag

Let’s first look at the quality drop at the end of the “Per Base Sequence Quality” graph. This degradation of quality towards the end of reads is commonly observed in illumina samples. The reason for this drop is that as the number of sequencing cycles performed increases, the average quality of the base calls, as reported by the Phred scores produced by the sequencer, falls.

Next, let’s check the overrepresented sequences graph and the kind of adapters that were used for sequencing these samples (TruSeq or Nextera), which comes in handy when specifying the adapter database path during the downstream filtering step (Trimmomatic).

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_morning/2.pngalt tag

Quality Trimming using Trimmomatic

Filtering out problematic sequences within a dataset is inherently a trade-off between sensitivity (ensuring all contaminant sequences are removed) and specificity (leaving all non-contaminant sequence data intact). Adapter and other technical contaminants can potentially occur in any location within the reads (start, end, read-through (between the reads), partial adapter sequences).

Trimmomatic is a tool that searches for these potential contaminant/adapter sequences within the reads at all possible locations. It takes advantage of the added evidence available in a paired-end dataset. In paired-end data, read-through/adapters can occur on both the forward and reverse reads of a particular fragment in the same position. Since the fragment was entirely sequenced from both ends, the non-adapter portions of the forward and reverse reads will be reverse complements of each other. This strategy of searching for contaminants in both reads is called ‘palindrome’ mode.

For more information on how Trimmomatic tries to achieve this, Please refer this manual.

Now we will run Trimmomatic on these raw data to remove low quality reads as well as adapters.

i. If the interactive session timed out, get an interactive cluster node again to start running programs and navigate to day1pm directory. Also, load the Conda environment - micro612.

How do you know if you are in an interactive session? You should see “username@glXXXX” in your command prompt.

islurm

cd /scratch/micro612w21_class_root/micro612w21_class/username/day1pm/

#or

d1a

conda activate micro612
ii. Create these output directories in your day1pm folder to save trimmomatic results
mkdir Rush_KPC_266_trimmomatic_results
iii. Try to invoke trimmomatic from command line.

trimmomatic -h
iv. Run the below trimmomatic commands on raw reads.

trimmomatic PE Rush_KPC_266_1_combine.fastq.gz Rush_KPC_266_2_combine.fastq.gz Rush_KPC_266_trimmomatic_results/forward_paired.fq.gz Rush_KPC_266_trimmomatic_results/forward_unpaired.fq.gz Rush_KPC_266_trimmomatic_results/reverse_paired.fq.gz Rush_KPC_266_trimmomatic_results/reverse_unpaired.fq.gz ILLUMINACLIP:/scratch/micro612w21_class_root/micro612w21_class/shared/conda_envs/day1pm/share/trimmomatic-0.39-1/adapters/TruSeq3-PE.fa:2:30:10:8:true SLIDINGWINDOW:4:15 MINLEN:40 HEADCROP:0

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_morning/trimm_parameters.pngalt tag

First, Trimmomatic searches for any matches between the reads and adapter sequences. Adapter sequences are stored in this directory of the Trimmomatic tool: /scratch/micro612w21_class_root/micro612w21_class/shared/bin/Trimmomatic/adapters/. Trimmomatic comes with a list of standard adapter fasta sequences such as TruSeq, Nextera etc. You should use the appropriate adapter fasta sequence file based on the illumina kit that was used for sequencing. You can get this information from your sequencing centre or find it in the FastQC html report (Section: Overrepresented sequences).

Short sections of each adapter sequence (contained in TruSeq3-PE.fa), called seeds, are tested in each possible position within the reads, allowing up to 2 mismatches (the seedMismatches parameter). If it finds a seed match, Trimmomatic extends the search to the entire adapter sequence and scores the alignment. The advantage here is that the full alignment is calculated only when there is a seed match, which results in considerable efficiency gains. So, when it finds a match, it moves forward with the full alignment, and when the alignment score reaches the simpleClipThreshold of 10, it finally trims off the adapter from the read.

Quoting Trimmomatic:

“’Palindrome’ trimming is specifically designed for the case of ‘reading through’ a short fragment into the adapter sequence on the other end. In this approach, the appropriate adapter sequences are ‘in silico ligated’ onto the start of the reads, and the combined adapter+read sequences, forward and reverse are aligned. If they align in a manner which indicates ‘read- through’ i.e atleast 30 bp match, the forward read is clipped and the reverse read dropped (since it contains no new data).”

v. Now create a new directory in the day1pm folder and run FastQC on the trimmomatic results.
mkdir Rush_KPC_266_FastQC_results/after_trimmomatic

fastqc -o Rush_KPC_266_FastQC_results/after_trimmomatic/ Rush_KPC_266_trimmomatic_results/forward_paired.fq.gz Rush_KPC_266_trimmomatic_results/reverse_paired.fq.gz --extract

Get these html reports to your local system.


scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day1pm/Rush_KPC_266_FastQC_results/after_trimmomatic/*.html /path-to-local-directory/

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_morning/3.pngalt tag

After running Trimmomatic, you should notice that the sequence quality improved (Per base sequence quality) and the reads no longer contain any contaminants/adapters (Overrepresented sequences).

Next, take a look at the per base sequence content graph, and notice that the head bases (~9 bp) are slightly imbalanced. In a perfect scenario, the nucleotide content lines should run parallel to each other, and should be reflective of the overall A/C/T/G content of your input sequence.

Quoting FastQC: “It’s worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichment of a number of different K-mers at the 5’ end of the reads. Whilst this is a true technical bias, it isn’t something which can be corrected by trimming and in most cases doesn’t seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.”

This doesn’t look very bad, but you can remove the red cross sign by trimming these imbalanced head bases using the HEADCROP:9 flag in the above command. This removes the first 9 bases from the start of each read. Often, the start of the read is not of good quality, which is why this improves the overall read quality.

vi. Let’s run trimmomatic again with HEADCROP:9 and save the results in a different directory called Rush_KPC_266_trimmomatic_results_with_headcrop/
mkdir Rush_KPC_266_trimmomatic_results_with_headcrop/


time trimmomatic PE Rush_KPC_266_1_combine.fastq.gz Rush_KPC_266_2_combine.fastq.gz Rush_KPC_266_trimmomatic_results_with_headcrop/forward_paired.fq.gz Rush_KPC_266_trimmomatic_results_with_headcrop/forward_unpaired.fq.gz Rush_KPC_266_trimmomatic_results_with_headcrop/reverse_paired.fq.gz Rush_KPC_266_trimmomatic_results_with_headcrop/reverse_unpaired.fq.gz ILLUMINACLIP:/scratch/micro612w21_class_root/micro612w21_class/shared/conda_envs/day1pm/share/trimmomatic-0.39-1/adapters/TruSeq3-PE.fa:2:30:10:8:true SLIDINGWINDOW:4:20 MINLEN:40 HEADCROP:9

Unix gem: time in the above command shows how long a command takes to run.

vii. Run FastQC ‘one last time’ on updated trimmomatic results with headcrop and check report on your local computer
mkdir Rush_KPC_266_FastQC_results/after_trimmomatic_headcrop/

fastqc -o Rush_KPC_266_FastQC_results/after_trimmomatic_headcrop/ --extract -f fastq Rush_KPC_266_trimmomatic_results_with_headcrop/forward_paired.fq.gz Rush_KPC_266_trimmomatic_results_with_headcrop/reverse_paired.fq.gz

Download the reports again and see the difference.


scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day1pm/Rush_KPC_266_FastQC_results/after_trimmomatic_headcrop/*.html /path-to-local-directory/

The red cross sign disappeared!

Let’s have a look at an example of bad Illumina data here

Earlier, we performed some quality control steps on our sequencing data to make them clean and usable for various downstream analyses. Now we will perform our first sequence analysis, specifically variant calling: we will map these reads to a reference genome and try to find the differences between them.

Read mapping is one of the most common bioinformatics operations carried out on NGS data. The main goal of read mapping/aligning is to find the best possible position in the reference genome to which each read can be aligned. Reads are generally mapped to a reference genome that is sufficiently closely related to allow the reads to be aligned accurately. There are a number of tools that can map reads to a reference genome, and they differ from each other in algorithm, speed and accuracy. Most of these tools work by first building an index of the reference sequence, which works like a dictionary for fast search/lookup, and then applying an alignment algorithm that uses this index to align short read sequences against the reference.

These alignments have a vast number of uses, including:

  1. variant/SNP calling: finding differences between your sequenced organism’s genome and the reference genome
  2. coverage estimation: determining whether you have sufficient reads to cover each position of the reference genome
  3. gene expression analysis: determining the level of expression of each gene in a genome

In this session, we will cover the important steps that are part of any read mapping/variant calling bioinformatics pipeline.

Variant Calling for Colistin-resistant Klebsiella pneumoniae

[Figure: genomics_pipeline_variant_calling.png - Mile high view of a genomics pipeline]

At the end of our morning session, we submitted a variant calling job to run all the variant calling steps on PCMP_H326 genome.

The goal of this exercise is to learn what standard variant calling steps are involved, how to combine and run these steps in an automated fashion and explore results that were generated at each step by the script.

Let us see what inputs and commands the variant_call.sh script needs to run variant calling on PCMP_H326.

Navigate to the variant_calling folder in the day1pm directory and try running the script with the help menu. Check all the inputs that are required by the script.


./variant_call.sh -h

USAGE: variant_call.sh forward_paired reverse_paired reference_genome output_directory basename [-h] – A simple shell script to run Variant Calling steps on a pair of fastq reads.

The script requires following positional arguments as input to call variants:

  1. Forward Paired end reads
  2. Reverse Paired end reads
  3. Path to Reference Genome Fasta file
  4. Output Directory Path. A new directory will be created at this path using the Analysis Base name that you provide. For example, if the output path is /dir_1/dir_2/ and the Analysis Base name is sample_name, a new directory named sample_name_varcall_result will be created in /dir_1/dir_2/
  5. Analysis Base name to store result files with this prefix.

If you remember, we ran the shell script in following fashion inside variant_calling directory.


./variant_call.sh PCMP_H326_R1.fastq.gz PCMP_H326_R2.fastq.gz KPNIH1.fasta ./ PCMP_H326_ 1>variant_call.log 2> variant_call.err

Note: “1>” will print the standard output of the script to variant_call.log and “2>” will print standard error to variant_call.err.

The script generates a PCMP_H326__varcall_result folder in your day1pm folder, and the results for each step of variant calling are organized into 6 different step folders. Each of these folders represents a particular step involved in variant calling, starting from cleaning the reads through calling variants.

These 6 folders are:

1. Step1_cleaning

2. Step2_mapping

3. Step3_samtobamconversion

4. Step4_removeduplicates

5. Step5_variantcalling

6. Step6_variantfilteraion

Try listing the folders in PCMP_H326__varcall_result:


ls -1ad PCMP_H326__varcall_result/*

Let’s visualize the steps of this process:

Note: We ran all the commands involved in these steps for you in variant_call.sh and in the next section we will concentrate on exploring the output files.

[Figure: detailed_var_call.png - variant calling steps]

Step1_cleaning

This folder contains results generated by Trimmomatic. Since we already went through the results earlier, we will be skipping the Trimmomatic command and move on to Step 2.

Step2_mapping

[Image: https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_after/1_1.png]

This folder contains results that were generated by mapping reads against a finished reference genome using BWA

Choosing the right read mapper is crucial and should be based on the type of analysis and data you are working with. Each aligner is meant to be used with specific types of data, for example:

For whole genome or whole exome sequencing data: use BWA for long reads (> 50/100 bp) and Bowtie2 for short reads (< 50/100 bp). For transcriptomic data (RNA-Seq): use a splice-aware mapper such as TopHat. (Not applicable for microbial data)

Here, we will be using BWA aligner to map the reads against a reference genome, KPNIH1.

BWA is one of several read mappers that are based on the Burrows-Wheeler transform. If you feel like challenging yourself, you can read the BWA paper here

Read Mapping is a time-consuming step that involves searching the reference and finding the optimal location for the alignment for millions of reads. Creating an index file of a reference sequence for quick lookup/search operations significantly decreases the time required for read alignment. Imagine indexing a genome sequence like the index at the end of a book. If you want to know on which page a word appears or a chapter begins, it is much more efficient to look it up in a pre-built index than going through every page of the book. Similarly, an index of a large DNA sequence allows aligners to rapidly find shorter sequences embedded within it.

Note: each read mapper has its own unique way of indexing a reference genome and therefore the reference index created by BWA cannot be used for Bowtie. (Most Bioinformatics tools nowadays require some kind of indexing or reference database creation)

You can move over to Step 2: Read Mapping section in the shell script to see what commands were used for this step.

Most bioinformatics programs, such as BWA and Bowtie, use a computational strategy known as ‘indexing’ to speed up their mapping algorithms. Like the index at the end of a book, an index of a large DNA sequence allows one to rapidly find shorter sequences embedded within it.

The following command creates an index for the reference genome required for BWA aligner.

# Don't run this command - we already ran it as a part of the variant_call.sh script.

bwa index KPNIH1.fasta

This command creates fai index file using samtools that is required by GATK in downstream steps.

# Don't run this command - we already ran it as a part of the variant_call.sh script.

samtools faidx KPNIH1.fasta

Quoting BWA: “BWA consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.”

For other algorithms employed by BWA, you can refer to BWA manual

The second part of read mapping involves aligning both left and right end reads to our reference using BWA alignment algorithm ‘mem’.

The following command was used to do this job using both forward and reverse end reads along with a reference genome.

# Don't run this command - we already ran it as a part of the variant_call.sh script.

bwa mem -M -R "@RG     ID:16   SM:PCMP_H326_R1.fastq.gz        LB:1    PL:Illumina" -t 8 /nfs/esnitkin/micro612w21_class_root/micro612w21_class/shared/data/day1pm/KPNIH1.fasta /nfs/esnitkin/micro612w21_class_root/micro612w21_class/shared/data/day1pm/PCMP_H326__varcall_result/forward_paired.fq.gz /nfs/esnitkin/micro612w21_class_root/micro612w21_class/shared/data/day1pm/PCMP_H326__varcall_result/reverse_paired.fq.gz > /nfs/esnitkin/micro612w21_class_root/micro612w21_class/shared/data/day1pm/PCMP_H326__varcall_result/PCMP_H326__aln.sam

Read group tells aligners/other tools that certain reads were sequenced together on a specific lane. If you have multiplexed samples in a single lane, you will get multiple samples in a single read group. If you sequenced the same sample in several lanes, you will have multiple read groups for the same sample.

This string with the -R flag says that all reads belong to ID:16 and library LB:1, have sample name SM:PCMP_H326_R1.fastq.gz, and were sequenced on the illumina platform PL:Illumina.

You can extract this information from fastq read header.


@D00728:16:hf2mtbcxx:2:1101:1580:2235 1:N:0:ACTGAGCG+GTAAGGAG
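If you want to see a read header yourself, you can peek at the first line of the gzipped fastq file (from the variant_calling directory), e.g.:

# Print the first read header from the gzipped fastq file
zcat PCMP_H326_R1.fastq.gz | head -n 1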

The output of BWA and most short-read aligners is a SAM file. SAM stands for Sequence Alignment/Map format and is considered the standard output format for most read aligners. It is a TAB-delimited format that describes how each read was aligned to the reference sequence. Detailed information about the SAM specs can be obtained from this pdf document.

Step3_samtobamconversion

The next step involves manipulating the SAM files generated by the BWA aligner using Samtools

BAM is the compressed binary equivalent of SAM and is usually much smaller in size than the SAM format. Since parsing a SAM file is slow, most downstream tools require the SAM file to be converted to BAM so that it can be easily sorted and indexed.

The first section for this step will ask samtools to convert SAM format(-S) to BAM format(-b)

# Don't run this command - we already ran it as a part of the variant_call.sh script.

samtools view -Sb ./PCMP_H326__varcall_result/PCMP_H326__aln.sam > ./PCMP_H326__varcall_result/PCMP_H326__aln.bam

The next section sorts the converted BAM file using samtools.

Most downstream tools such as GATK require your BAM file to be indexed and sorted by reference genome position.

Before indexing this BAM file, we will sort the data by position (the default) using samtools. Some RNA-Seq/gene expression tools require the BAM to be sorted by read name instead, which is achieved by passing the -n flag.

# Don't run this command - we already ran it as a part of the variant_call.sh script.

samtools sort -O BAM -o ./PCMP_H326__varcall_result/PCMP_H326__aln_sort.bam ./PCMP_H326__varcall_result/PCMP_H326__aln.bam

Step4_removeduplicates

This step will mark duplicates(PCR optical duplicates) and remove them using PICARD

Illumina sequencing involves PCR amplification of adapter-ligated DNA fragments so that we have enough starting material for sequencing. Therefore, some amount of duplication is inevitable. Ideally, you amplify up to ~65 fold (4% reads), but higher rates of PCR duplicates, e.g. 30%, arise when there is too little starting material, such that greater amplification of the library is needed, or when smaller fragments, which are easier to PCR amplify, end up over-represented.

For an in-depth explanation about how PCR duplicates arise in sequencing, please refer to this interesting blog

Picard identifies duplicates by searching for reads that have the same start position on the reference or, for PE reads, the same start positions for both ends. It chooses a representative from each group of duplicate reads based on the best base quality scores and other criteria and retains it while removing the other duplicates. This step plays a significant role in removing false positive variant calls (such as sequencing errors) that are supported by PCR duplicate reads.

[Image: https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_after/picard.png]

To remove PCR duplicate reads from BAM file, PICARD needs a sequence dictionary of reference fasta file which can be generated with picard CreateSequenceDictionary command.

# Don't run this command - we already ran it as a part of the variant_call.sh script.

picard CreateSequenceDictionary REFERENCE=KPNIH1.fasta OUTPUT=KPNIH1.dict

Once the sequence dictionary is created, PICARD can be run for removing duplicates

# Don't run this command - we already ran it as a part of the variant_call.sh script.

picard MarkDuplicates REMOVE_DUPLICATES=true INPUT=./PCMP_H326__varcall_result/PCMP_H326__aln_sort.bam OUTPUT=./PCMP_H326__varcall_result/PCMP_H326__aln_marked.bam METRICS_FILE=./PCMP_H326__varcall_result/PCMP_H326__markduplicates_metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT

The output of Picard remove duplicate step is a new bam file “PCMP_H326__aln_marked.bam” without PCR duplicates.

This bam file should be indexed before it can be used for variant calling.

# Don't run this command - we already ran it as a part of the variant_call.sh script.

samtools index ./PCMP_H326__varcall_result/PCMP_H326__aln_marked.bam
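If you want a quick summary of how many reads mapped, samtools flagstat is a handy check (this is not part of variant_call.sh; adjust the BAM path to wherever the file ended up, e.g. the Step4_removeduplicates folder):

# Summarize mapping statistics for the de-duplicated BAM
samtools flagstat ./PCMP_H326__varcall_result/PCMP_H326__aln_marked.bam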

Step5_variantcalling

One of the downstream uses of read mapping is finding differences between our sequence data and a reference. This is achieved by carrying out variant calling using one of the available variant callers (samtools, gatk, freebayes, etc.). Each variant caller uses a different statistical framework to discover SNPs and other types of mutations. For those of you who are interested in finding out more about the statistics involved, please refer to the samtools paper; samtools is one of the most commonly used variant callers.

The GATK best practices guide will provide more details about various steps that you can incorporate in your analysis.

There are many published articles that compare different variant callers but this is a very interesting blog post that compares the performance and accuracy of different variant callers.

Here we will use samtools mpileup to perform this operation on our BAM file and generate a VCF (variant call format) file.

This step calls variants using samtools mpileup and bcftools.

# Don't run this command - we already ran it as a part of the variant_call.sh script.

samtools mpileup -ug -f KPNIH1.fasta ./PCMP_H326__varcall_result/PCMP_H326__aln_marked.bam | bcftools call -O v -v -c -o ./PCMP_H326__varcall_result/PCMP_H326__aln_mpileup_raw.vcf

#In the above command, we are using samtools mpileup to generate a pileup formatted file from BAM alignments and genotype likelihoods (-g flag) in BCF format (binary version of vcf). This bcf output is then piped to bcftools, which calls variants and outputs them in vcf format (-c flag for using consensus calling algorithm  and -v for outputting variants positions only)

Let’s go through the VCF file and try to understand a few important VCF specifications and criteria that we can use for filtering low confidence SNPs.

Go to Step5_variantcalling folder under PCMP_H326__varcall_result folder and take a look at pre-filtered vcf file - “PCMP_H326__aln_mpileup_raw.vcf” generated by samtools mpileup.


cd PCMP_H326__varcall_result/Step5_variantcalling/

less PCMP_H326__aln_mpileup_raw.vcf
  1. CHROM, POS: 1st and 2nd column represent the reference genome name and reference base position where a variant was called
  2. REF, ALT: 4th and 5th columns represent the reference allele at the position and alternate/variant allele called from the reads
  3. QUAL: Phred-scaled quality score for the assertion made in ALT
  4. INFO: Additional information that provides technical scores and observations for each variant. Important parameters to look for: depth (DP), mapping quality (MQ), FQ (consensus score), allele frequency for each ALT allele (AF)

VCF format stores a large variety of information and you can find more details in this pdf.

Let’s count the number of raw unfiltered variants in PCMP_H326__aln_mpileup_raw.vcf by running these grep/wc commands:

grep -v '#' PCMP_H326__aln_mpileup_raw.vcf | wc -l

grep -v '#' PCMP_H326__aln_mpileup_raw.vcf | grep 'INDEL' | wc -l

Step6_variantfilteraion

This step filters variants and generates the final processed files using GATK:

There are various tools that you can try for variant filtration, such as vcftools, GATK, vcfutils, etc. Here we will use the GATK VariantFiltration utility to filter out low confidence variants.

Make sure you change the directory to Step6_variantfilteraion folder under PCMP_H326__varcall_result.

cd ../Step6_variantfilteraion

The GATK VariantFiltration command that was used to filter the variants is:

# Dont run this command - we already ran it as a part of variant_call.sh script.

gatk VariantFiltration -R KPNIH1.fasta -O ./PCMP_H326__varcall_result/PCMP_H326__filter_gatk.vcf --variant ./PCMP_H326__varcall_result/PCMP_H326__aln_mpileup_raw.vcf --filter-expression "FQ < 0.025 && MQ > 50 && QUAL > 100 && DP > 15" --filter-name pass_filter

This command will add a ‘pass_filter’ label to the 7th (FILTER) column for those variant positions that passed our filter criteria:

  1. DP: Depth of reads. More than 15 reads support a variant call at this position.
  2. MQ: Root Mean Square Mapping Quality. This provides an estimate of the overall mapping quality of the reads supporting a variant call. The root mean square (the square root of the mean of the squared mapping qualities) reflects both the average and the spread of the mapping qualities, so positions supported by poorly mapped reads are penalized.
  3. QUAL stands for phred-scaled quality score for the assertion made in ALT. High QUAL scores indicate high confidence calls.
  4. FQ stands for consensus quality. A positive value suggests a heterozygous call and a negative value suggests a homozygous call. In bacterial analysis, this plays an important role in flagging regions where a gene may be duplicated in a particular sample. We will learn more about this later while visualizing our BAM files in IGV.

Check if the pass_filter was added properly and count the number of variants that passed the filter using grep.

grep 'pass_filter' PCMP_H326__filter_gatk.vcf | head

Caveat: These filter criteria should be applied carefully, after giving some thought to the type of library, coverage, average mapping quality, type of analysis and other such requirements.
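For instance, before settling on a QUAL cutoff you could quickly summarize the QUAL distribution in the raw VCF. A minimal sketch (QUAL is the 6th column; adjust the path if you are in the Step6 folder, since the raw VCF lives in ../Step5_variantcalling/):

# Report the minimum, median and maximum QUAL value across all raw variants
grep -v '^#' PCMP_H326__aln_mpileup_raw.vcf | cut -f6 | sort -n | awk '{a[NR]=$1} END{print "min="a[1], "median="a[int((NR+1)/2)], "max="a[NR]}'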

The next section removes indels and keeps only the SNPs that passed our filter criteria using vcftools:

vcftools is a program package written specifically for working with VCF files. It saves you time by making the common operations that you would like to perform on a vcf file available through a single command. One such operation is removing INDEL records from a vcf file.

The below command was used to remove indels from our final vcf file and keep only the variants that passed the filter criteria (positions with pass_filter in their FILTER column).

# Dont run this command - we already ran it as a part of variant_call.sh script.

vcftools --vcf ./PCMP_H326__varcall_result/PCMP_H326__filter_gatk.vcf --keep-filtered pass_filter --remove-indels --recode --recode-INFO-all --out ./PCMP_H326__varcall_result/PCMP_H326__filter_onlysnp
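As a quick sanity check on the vcftools output, you can count how many SNPs survived. Note that with --recode, vcftools appends .recode.vcf to the --out prefix, so the file name below is inferred from that convention (adjust the path to wherever the file was written):

# Count SNPs remaining after removing indels and failed sites
grep -v '^#' PCMP_H326__filter_onlysnp.recode.vcf | wc -l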

Step 7 bgzip and Tabix index vcf files for IGV visualization.

The below commands were used to compress the final annotated vcf file and tabix index it so that it can be used for IGV visualization.

# Dont run this command - we already ran it as a part of variant_call.sh script.

bgzip -fc PCMP_H326__filter_gatk_ann.vcf > PCMP_H326__filter_gatk_ann.vcf.gz

tabix PCMP_H326__filter_gatk_ann.vcf.gz

Variant Annotation using snpEff

variant_annot.png: variant annotation using snpEff

Variant annotation is one of the crucial steps in any variant calling pipeline. Most variant annotation tools build their own database, or use an external one, to assign function and predict the effect of variants on genes. We will cover the basic steps of annotating the variants in our vcf file using snpEff.

You can annotate these variants before performing any filtering steps that we did earlier or you can decide to annotate just the final filtered variants.

snpEff contains a database of about 20,000 reference genomes built from trusted and public sources. Let’s check if snpEff contains a database for our reference genome.

Make sure you change the directory to Step6_variantfilteraion


cd PCMP_H326__varcall_result/Step6_variantfilteraion
i. Check snpEff internal database for your reference genome:

You can check what reference genomes are available in snpEff’s internal database using the below command.


snpEff databases | grep -w 'Klebsiella_pneumoniae_subsp_pneumoniae_kpnih1'

The existing KPNIH1 reference database doesn’t contain mgrB annotation in it so we built a custom database out of a custom KPNIH1 genbank file. The procedure to configure a custom database can be found here.

ii. Run snpEff for variant annotation.

The variants in PCMP_H326__filter_gatk.vcf file were annotated using this snpEff command and the annotated variants were printed in PCMP_H326__filter_gatk_ann.vcf.

# Dont run this command - we already ran it as a part of variant_call.sh script.

snpEff -csvStats PCMP_H326__filter_gatk_stats -dataDir /scratch/micro612w21_class_root/micro612w21_class/shared/bin/snpEff/data/ -d -no-downstream -no-upstream -c /scratch/micro612w21_class_root/micro612w21_class/shared/bin/snpEff/snpEff.config KPNIH1 PCMP_H326__filter_gatk.vcf > PCMP_H326__filter_gatk_ann.vcf

The STDOUT of snpEff prints out some useful details such as genome name and version being used, no. of genes, protein-coding genes and transcripts, chromosome and plasmid names etc.

snpEff will add an extra field named ‘ANN’ at the end of the INFO field. Let’s go through the ANN field added after the annotation step.

Now go to Step6_variantfilteraion folder under PCMP_H326__varcall_result.

cd PCMP_H326__varcall_result/Step6_variantfilteraion

# Dont run this command - we already ran it as a part of variant_call.sh script.
bgzip -fc PCMP_H326__filter_gatk_ann.vcf > PCMP_H326__filter_gatk_ann.vcf.gz
tabix PCMP_H326__filter_gatk_ann.vcf.gz

We compressed this final annotated vcf file and tabix indexed it so that it can be used for IGV visualization
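Besides IGV, the bgzip + tabix combination also lets you pull variants out of a specific region directly from the command line. For example, using the mgrB region we will visit later in IGV:

# Retrieve annotated variants falling within a region of the KPNIH1 chromosome
tabix PCMP_H326__filter_gatk_ann.vcf.gz "gi|661922017|gb|CP008827.1|:3359811-3360323"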

Now go to Step6_variantfilteraion folder under PCMP_H326__varcall_result and look at one of the annotated variants.

grep 'ANN=' PCMP_H326__filter_gatk_ann.vcf | head -n1

or to print on separate lines

grep -o 'ANN=.*GT:PL' PCMP_H326__filter_gatk_ann.vcf | head -n1 | tr '|' '\n' | cat --number

The ANN field will provide information such as the impact of variants (HIGH/LOW/MODERATE/MODIFIER) on genes and transcripts along with other useful annotations.

Detailed information of the ANN field and sequence ontology terms that it uses can be found here.
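As a quick example of working with the ANN field, you can tally the predicted impact of the first annotation attached to each variant (per the snpEff ANN specification, the impact sits in the 3rd pipe-delimited subfield):

# Count variants by impact (HIGH/MODERATE/LOW/MODIFIER) using the first ANN entry per variant
grep -o 'ANN=[^;]*' PCMP_H326__filter_gatk_ann.vcf | cut -d'|' -f3 | sort | uniq -c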

Let’s see how many SNPs and Indels passed the filter using grep and wc.


#No. of Variants:
grep '^gi|' PCMP_H326__filter_gatk_ann.vcf | wc -l

#No. of Variants that passed the filter:
grep '^gi|.*pass_filter' PCMP_H326__filter_gatk_ann.vcf | wc -l

#No. of SNPs that passed the filter:
grep '^gi|.*pass_filter' PCMP_H326__filter_gatk_ann.vcf | grep -v 'INDEL' | wc -l

#No. of Indels that passed the filter:
grep '^gi|.*pass_filter' PCMP_H326__filter_gatk_ann.vcf | grep 'INDEL' | wc -l

We wrote a small python parser script, “parse_snpEff.py”, located under the scripts folder, to parse the annotated vcf file and print out some important annotation-related fields in a table format. We already ran this script as part of variant_call.sh as shown below.

# Dont run this command - we already ran it as a part of variant_call.sh script.

python scripts/parse_snpEff.py -genbank KPNIH1.gb -vcf PCMP_H326__varcall_result/Step6_variantfilteraion/PCMP_H326__filter_gatk_ann.vcf

Let’s explore the annotated variants that were parsed and printed out in snpEff_parsed.csv

This script generates a new csv file called snpEff_parsed.csv with extracted annotated information printed in a table format. Let’s take a look at these parsed annotated features from variant filtering.

First, make sure you’re in the directory where snpEff_parsed.csv is stored, i.e. Step6_variantfilteraion

ls snpEff_parsed.csv

Let’s look at what the different columns contain:

head -n 1 snpEff_parsed.csv | tr ',' '\n' | cat -n

Now that we know what the different columns are, let’s look at the whole file using less (can you figure out what the -S flag does?):

less -S snpEff_parsed.csv

How many high-impact variants are there based on the snpEff annotation?

grep HIGH snpEff_parsed.csv | wc -l

What proteins are these high-impact variants in?

grep HIGH snpEff_parsed.csv | cut -d',' -f13

How many of each type of high-impact variants are there?

grep HIGH snpEff_parsed.csv | cut -d',' -f7  | sort | uniq -c

What are the actual variant changes?

 grep HIGH snpEff_parsed.csv | cut -d',' -f8

Let’s get the protein and variant change together:

grep HIGH snpEff_parsed.csv | cut -d',' -f8,9

What if we want to look specifically for mutations in mgrB? (what does the -i flag do?)

grep -i mgrb snpEff_parsed.csv

Generate Alignment Statistics

While analyzing sequencing data, we often need to verify that our analysis steps worked as expected. Summary statistics about the alignment help us make that call.

We already ran PICARD’s CollectAlignmentSummaryMetrics and CollectWgsMetrics as a part of variant_call.sh to gather alignment statistics from the BAM file.

Go to your day1pm/variant_calling directory.


d1a

cd variant_calling
i. Collect Alignment statistics using Picard

The below picard CollectAlignmentSummaryMetrics command can be used to generate alignment statistics from the input PCMP_H326__aln_marked.bam file. The output of this command is stored in AlignmentSummaryMetrics.txt file.

# Dont run this command - we already ran it as a part of variant_call.sh script.

picard CollectAlignmentSummaryMetrics R=KPNIH1.fasta I=PCMP_H326__varcall_result/Step5_variantcalling/PCMP_H326__aln_marked.bam O=AlignmentSummaryMetrics.txt

Open the file AlignmentSummaryMetrics.txt and explore the various statistics it contains; the definition for each can be found here

The file AlignmentSummaryMetrics.txt contains many columns, and at times it becomes difficult to extract information from a particular column if we don't know the exact column number. Run the below unix gem to print each column name with its number.

grep 'CATEGORY' AlignmentSummaryMetrics.txt | tr '\t' '\n' | cat --number
  • Question: Extract alignment percentage from AlignmentSummaryMetrics file. (% of reads aligned to reference genome)
grep -v '#' AlignmentSummaryMetrics.txt | cut -f7

Try to explore other statistics and their definitions from Picard AlignmentSummaryMetrics link

ii. Estimate read coverage/read depth using Picard

Read coverage/depth describes the average number of reads that align to, or “cover,” known reference bases. Sequencing depth is one of the most crucial issues in the design of next-generation sequencing experiments. This paper reviews current guidelines and precedents on the issue of coverage, as well as their underlying considerations, for four major study designs: de novo genome sequencing, genome resequencing, transcriptome sequencing and genomic location analyses.

After read mapping, it is important to make sure that the reference bases are represented by enough read depth before making any inferences such as variant calling.
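As a rough, Picard-independent sanity check, you can also estimate the mean depth directly from the BAM with samtools depth. A minimal sketch (-a reports every reference position, including zero-coverage ones; this can take a minute on a full BAM):

# Average per-base depth across the reference
samtools depth -a PCMP_H326__varcall_result/Step5_variantcalling/PCMP_H326__aln_marked.bam | awk '{sum+=$3} END{if (NR>0) print "mean depth =", sum/NR}'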

To extract this information, we ran PICARD’s CollectWgsMetrics as a part of variant_call.sh on input PCMP_H326__aln_marked.bam file and the output is stored in WgsMetrics.txt file.

# Dont run this command - we already ran it as a part of variant_call.sh script.

picard CollectWgsMetrics R=KPNIH1.fasta I=PCMP_H326__varcall_result/Step5_variantcalling/PCMP_H326__aln_marked.bam O=WgsMetrics.txt

Open the file “WgsMetrics.txt” and explore the various statistics it contains; the definition for each can be found here.

Print column names

grep 'GENOME_TERRITORY' WgsMetrics.txt | tr '\t' '\n' | cat --number

Since “WgsMetrics.txt” also contains histogram information, we will run commands on only the first few lines to extract information.

  • Question: Extract mean coverage information from “WgsMetrics.txt”
grep -v '#' WgsMetrics.txt | cut -f2 | head -n3
  • Question: What percentage of bases attained at least 5X sequence coverage?
grep -v '#' WgsMetrics.txt | cut -f15 | head -n3
  • Question: What percentage of bases had significantly high coverage? Regions with unusually high depth sometimes indicate either repetitive regions or PCR amplification bias.
grep -v '#' WgsMetrics.txt | cut -f27 | head -n3

Visualize BAM and VCF files in IGV (Integrative Genome Viewer)

While these various statistical/text analyses are helpful, visualization of all of these various output files can help in making some significant decisions and inferences about your entire analysis. There are a wide variety of visualization tools out there that you can choose from for this purpose.

We will be using IGV (Integrative Genome Viewer) here, developed by the Broad Institute for viewing BAM and VCF files for manual inspection of some of the variants.

  • Required Input files:
  • KPNIH1 reference fasta
  • KPNIH1 genbank file
  • PCMP_H326__aln_marked.bam
  • PCMP_H326__aln_marked.bam.bai
  • PCMP_H326__filter_gatk_ann.vcf.gz
  • PCMP_H326__filter_gatk_ann.vcf.gz.tbi

Note: This IGV exercise requires an annotated vcf file, so make sure you have completed the snpEff exercise successfully.

Let’s make a separate folder (make sure you are in the day1pm/variant_calling folder) for the files that we need for visualization and copy them to that folder:

d1a 
cd variant_calling
mkdir IGV_files

cp KPNIH1.fasta KPNIH1.gb PCMP_H326__varcall_result/*/PCMP_H326__aln_marked.bam* PCMP_H326__varcall_result/*/PCMP_H326__filter_gatk_ann.vcf.gz* IGV_files/

Open a new terminal and run the scp command or cyberduck to get these files to your local system:



scp -r username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day1pm/variant_calling/IGV_files/ /path-to-local-directory/


#You can use ~/Desktop/ as your local directory path

Start IGV.

Load the following files (each is a separate panel or ‘track’):

  • Genomes → Load Genome from File → navigate to IGV_files → KPNIH1.fasta
    • Shows where in the genome you are at the top
  • File → Load from File → navigate to IGV_files → KPNIH1.gb
    • Shows what genes are present (looks like a blue bar when zoomed out)
  • File → Load from File → navigate to IGV_files → PCMP_H326__filter_gatk_ann.vcf.gz
    • Shows variants found in the sample (when you zoom in)
  • File → Load from File → navigate to IGV_files → PCMP_H326__aln_marked.bam
    • Shows coverage and reads aligned to the reference (when you zoom in)

By default, the whole genome is shown:

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_after/igv_zoomed_out.png

Using the plus sign in the top right corner of the window, zoom in by clicking 3 times

  • You should see grey bars in the vcf track and blue bars in the fastq track, both showing variant positions
  • You can hover over the bars to get more information about the variant
  • Light grey and light blue indicate homozygous variants, while dark grey and dark blue indicate heterozygous variants
  • You can navigate to different sections of the genome by moving the red bar at the top

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_after/igv_zoomed_in1.png

Zoom in ~5 more times until you see reads appear in the bottom part of the window

  • You should see coverage and reads mapped in bottom half of the window
  • Different colors indicate different variants
  • In the Coverage track, the y-axis indicates read coverage
  • You can now also see distinct genes in the genbank annotation track
  • You can hover over a read to get more information about it

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_after/igv_zoomed_in2.png

To see all of the reads, you can click the square with the arrows pointing to each corner, found in the top-middle-right of the window:

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_after/igv_zoomed_in3.png

If you zoom in all the way, you can see the nucleotide sequence at the bottom of the screen as well as nucleotide variants in the reads:

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_after/igv_zoomed_in4.png

Now that you know the basics of how to use IGV, let’s navigate to the mgrB gene to look at mutations that might make this sample resistant to colistin.

  • In the top middle bar of the window, type in gi|661922017|gb|CP008827.1|:3,359,811-3,360,323
  • Look at the gene annotation by hovering over the blue bar to see what gene it is
  • What is the nucleotide of the SNP in the sample? The amino acid change?
  • Do you think this variant might be the cause of colistin resistance? Why or why not?

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_after/igv_mgrb.png

Now let’s look at an example of a heterozygous variant - a variant position where more than one allele (variant) is observed with sufficiently high read depth.

  • Navigate to gi|661922017|gb|CP008827.1|:2,265,249-2,273,465
  • You can see that there are a lot of heterozygous variants in one region.
    • We removed these types of variants during our Variant Filtration step using the FQ value. (An unusually high FQ is suggestive of a heterozygous variant, whereas a negative FQ value is suggestive of a true variant supported by the mapped reads.)
  • In the region with lots of heterozygous variants, the read coverage is much higher than in the flanking regions (the regions on either side), and much higher than the rest of the genome coverage.
  • Why do you think this region contains many heterozygous variants and a higher read coverage than the rest of the genome?
  • You can also see that there are some places with no reads and no coverage. What does this mean?

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day1_after/igv_het.png

You can inspect these types of HET variants later for gene duplication or copy number analysis (by extracting variant positions with high FQ values). Adding these details will give better resolution when inferring phylogenetic trees.

You can refer to the IGV User Guide for more information about how to use IGV.

Exercise: Daptomycin resistance in VRE

Today we ran a variant calling pipeline for a colistin-resistant isolate against a susceptible reference. In that case the susceptible reference genome was somewhat arbitrarily selected, in that it had no epidemiologic relationship to the resistant isolate. This worked out because we had an idea of what gene the resistance mutation should be in, and we were able to fish it out from the relatively large number of differences. In this exercise we will take a different approach of comparing our resistant isolate to a susceptible isolate from the same patient. In particular, these samples both come from a patient infected with VRE before and after treatment with daptomycin. The first sample was the patient’s initial sample and is susceptible to daptomycin, and the second was taken after daptomycin resistance emerged during treatment. Your goal is to map reads from the resistant genome (VRE_daptoR) to the susceptible reference (VRE_daptoS_ref_strain.fa) and search for variants that may be associated with resistance. To accomplish this you will run the programs from this session to generate filtered variant files (VCF), and then explore these variants in IGV to see what genes they are in (we have provided you with a gff file with gene annotations that can be loaded into IGV - VRE_daptoS_gene_annot.gff3). To help with your interpretation, see whether any of the genes hit, such as cardiolipin synthases, were reported in this paper, which was the first to identify putative daptomycin resistance loci (hint: non-coding variants can be functional, so make sure you look at what genes are downstream of inter-genic variants).

Your steps should be:

  1. Load micro612 environment.
  2. Change directory to VRE_dapto_resistance under day1pm folder.
  3. We have placed a SLURM script - variant_call.sbat that you would need to edit before submitting it to cluster. You would need to add variant_call.sh command at the end of the SLURM script by giving VRE_daptoR_R1.fastq.gz, VRE_daptoR_R2.fastq.gz and VRE_daptoS_ref_strain.fasta as your inputs. Read comments in variant_call.sbat for more assistance.

One last thing - don’t forget to add the path to an output directory and a unique name for saving the results.

Your command should look like this:


./variant_call.sh Forward_reads Reverse_reads Reference.fasta output_directory_path(ex: ./) unique_prefix(ex: VRE_daptoR)

Note - we skipped the snpEff annotation step because our reference genome is not available in the snpEff database; therefore the variant_call.sh under VRE_dapto_resistance doesn’t include snpEff commands.

Wait for this variant call job to finish before you proceed to next exercise.

  1. Examine onlysnp vcf file and check how many high quality SNPs we end up with. Note there aren’t very many, supporting the premise that daptomycin resistance evolved in the patient during treatment.
  2. Since there are so few variants you will explore them in IGV to find the one putatively impacting cardiolipin synthase function. Load files into IGV and examine gene annotations around each variant, and see if any match cardiolipin synthase. You would need - VRE_daptoS_gene_annot.gff3, VRE_daptoS_ref_strain.fasta, relevant bam and vcf files for visualizing the alignments/variants in IGV.

Exercise: Colistin resistance in Acinetobacter

In the second exercise we will try to find a mutation that is in a colistin-resistant Acinetobacter isolate from a patient, but not in a colistin-susceptible isolate from the same patient. In this case, it turned out that despite being from the same patient, the resistant and susceptible genomes are quite different. Therefore, we will focus on differences in a known resistance gene (pmrB). Your task is to run the variant calling for SRR7591080 (colR) and SRR7591083 (colS) against the ACICU reference genome (ACICU.fasta). You will then look for pmrB mutations that are in the resistant strain but not in the susceptible one. Did the mutation you find match the one from the paper (i.e. patient 1)?

Your steps should be:

  1. Load micro612 environment.
  2. Create two SLURM scripts comparing your colR and colS genomes to the reference genome and submit both of them to the cluster. We have already placed two SLURM scripts - SRR7591080.sbat and SRR7591083.sbat that you can edit and add in the variant_call.sh commands at the end with forward read, reverse read, reference genome, output directory path and a prefix as parameters for the respective samples.
  3. This version of variant_call.sh also includes snpEff commands so you will find the annotated vcf file in Step6_variantfilteraion folder. Once the job completes, search the snpEff_parsed.csv of the colR and colS genomes for pmrB mutations. Did the resistant isolate have a unique mutation? Does this match the mutation found in the paper?

Day 2 Morning

[HOME]

Goal

  • Assemble short reads illumina data into a genomic assembly using Spades Assembler, assess the quality of assembly with Quast and annotate it using Prokka.
  • Generate multiple sample report using a bunch of pre-assembled data as input and visualize the summary report.
  • Compare assembled genome to a reference genome using Abacas and ACT.

On day 1 we worked through a pipeline to map short-read data to a pre-existing assembly and identify single-nucleotide variants (SNVs) and small insertions/deletions. However, what this sort of analysis misses is the existence of sequence that is not present in your reference. Today we will tackle this issue by assembling our short reads into larger sequences, which we will then analyze to characterize the functions unique to our sequenced genome.

genome_assembly.png

Execute the following command to copy files for this morning’s exercises to your workshop home directory:

> Note: Make sure you change 'username' in the commands below to your 'uniqname'. 

wd

#or 

cd /scratch/micro612w21_class_root/micro612w21_class/username

> Note: Check if you are in your home directory(/scratch/micro612w21_class_root/micro612w21_class/username) by executing 'pwd' in terminal. 'pwd' stands for present working directory and it will display the directory you are in.

pwd

> Note: Copy files for this morning's exercise in your home directory.

cp -r /scratch/micro612w21_class_root/micro612w21_class/shared/data/day2am ./

Genome Assembly using Spades Pipeline

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day2_morning/intro.png

There are a wide range of tools available for the assembly of microbial genomes. These assemblers fall into two general algorithmic categories, which you can learn more about here. In the end, most assemblers will perform well on microbial genomes, unless there is unusually high GC-content or an over-abundance of repetitive sequences, both of which make accurate assembly difficult.

Here we will use the Spades assembler with default parameters. Because genome assembly is a computationally intensive process, we will submit our assembly jobs to the cluster, and move ahead with some pre-assembled genomes, while your assemblies are running.

assembly_details_spades.png

i. Create directory to hold your assembly output.

Create a new directory for the spades output in your day2am folder

> Note: Make sure you change 'username' in the below command with your 'uniqname'. 

d2m

#or

cd /scratch/micro612w21_class_root/micro612w21_class/username/day2am

> We will create a new directory in day2am to save genome assembly results:

mkdir MSSA_SRR5244781_assembly_result 

Now, we will use a genome assembly tool called Spades for assembling the reads.

ii. Test out Spades to make sure it’s in your path

Load the workshop conda environment micro612. To make sure that your conda paths are set up correctly, try running Spades with the -h (help) flag, which should produce usage instructions.

> check if spades is working. 

conda activate micro612

spades.py -h     
iii. Submit a cluster job to assemble

Since it takes a huge amount of memory and time to assemble genomes using spades, we will run a slurm script on great lakes cluster for this step.

Now, open the spades.sbat file residing in the day2am folder with nano and add the following spades command to the bottom of the file. Replace the EMAIL_ADDRESS in the spades.sbat file with your actual email address. This will make sure that whenever the job starts, aborts or ends, you will get an email notification.

> Open the spades.sbat file using nano:

nano spades.sbat

> Now replace the EMAIL_ADDRESS in spades.sbat file with your actual email-address. This will make sure that whenever the job starts, aborts or ends, you will get an email notification.

> Copy and paste the below command to the bottom of spades.sbat file.

spades.py -1 forward_paired.fq.gz -2 reverse_paired.fq.gz -o MSSA_SRR5244781_assembly_result/ --careful
iv. Submit your job to the cluster with sbatch
sbatch spades.sbat
v. Verify that your job is in the queue with the squeue command
squeue -u username 

Submit PROKKA annotation job

Since Prokka annotation is time intensive, we will submit an annotation job now and go over the results at the end of this session.

Before we submit the job, run this command to make sure that prokka is setup properly in your environment.

prokka --setupdb

In your day2am directory, you will find a prokka.sbat script. Open this file using nano and change the EMAIL_ADDRESS to your email address.

nano prokka.sbat

Now add these lines at the end of the slurm script.


mkdir SRR5244781_prokka 
prokka -kingdom Bacteria -outdir SRR5244781_prokka -force -prefix SRR5244781_contigs_ordered SRR5244781_contigs_ordered.fasta

Submit the job using sbatch

sbatch prokka.sbat

Assembly evaluation using QUAST

The output of an assembler is a set of contigs (contiguous sequences) that are composed of the short reads that we fed in. Once we have an assembly we want to evaluate how good it is. This is somewhat qualitative, but there are some standard metrics that people use to quantify the quality of their assembly. Useful metrics include: i) number of contigs (the fewer the better), ii) N50 (the size of the contig such that contigs of that size or larger contain at least 50% of the assembled bases; the bigger the better). In general you want your assembly to have fewer than 200 contigs and an N50 greater than 50 Kb, although these numbers are highly dependent on the properties of the assembled genome.
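To make the N50 definition concrete, here is a minimal sketch of how you could compute it by hand from a contig fasta with awk (quast reports this for you, so this is purely illustrative):

# 1) get each contig length, 2) sort largest first, 3) walk down until the running total passes half the assembly size
awk '/^>/{if (len>0) print len; len=0; next} {len+=length($0)} END{if (len>0) print len}' SRR5244781_contigs.fasta \
  | sort -rn \
  | awk '{lens[NR]=$1; total+=$1} END{half=total/2; run=0; for(i=1;i<=NR;i++){run+=lens[i]; if(run>=half){print "N50 =", lens[i]; exit}}}'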

To evaluate some example assemblies we will use the tool quast. Quast produces a series of metrics describing the quality of your genome assemblies.

assembly_details_quast.png

i. Run quast on a set of previously generated assemblies

Now to check the example assemblies residing in your day2am folder, run the below quast command. Make sure you are in day2am folder in your home directory using ‘pwd’

Since Quast needs an older version of python, we will deactivate the current environment and run quast outside the micro612 environment.

conda deactivate 

module load python2.7-anaconda/2019.03

export PATH=$PATH:/scratch/micro612w21_class_root/micro612w21_class/shared/bin/quast/

quast.py -o quast SRR5244781_contigs.fasta SRR5244821_contigs.fasta

The command above will generate a report file in /scratch/micro612w21_class_root/micro612w21_class/username/day2am/quast

ii. Explore quast output

QUAST creates output in different formats such as html, pdf and text. Now let’s check the report.txt file residing in the quast folder for assembly statistics. Open report.txt using less.

less quast/report.txt

Check the difference between the different assembly statistics. Also check the different types of report it generated.

Generating multiple sample reports using multiqc

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day2_morning/multiqc.jpeg

Let’s imagine a real-life scenario where you are working on a project which requires you to analyze and process hundreds of samples. Having a few samples with extremely bad quality is very commonplace. Including these bad samples into your analysis without adjusting their quality threshold can have a profound effect on downstream analysis and interpretations.

  • Question: How will you find those bad apples?

Yesterday, we learned how to assess and control the quality of samples as well as screen for contaminants. The problem with these and similar tools is that they work on a per-sample basis and produce a single report/log per sample. Therefore, it becomes cumbersome to dig through each sample’s report and make appropriate quality control calls.

Thankfully, there is a tool called multiqc which parses a results directory containing output from various tools, reads the log reports created by those tools (e.g. FastQC, FastqScreen, Quast), aggregates them and creates a single report summarizing all of these results so that you have everything in one place. This helps greatly in identifying outliers and removing or reanalyzing them individually.

Let’s take a look at one such multiqc report that was generated using FastQC results on C. difficile samples.

Download the html report Cdiff_multiqc_report.html from your day2am folder.

#Note: Make sure you change 'username' in the below command to your 'uniqname'.


scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day2am/Cdiff_multiqc_report.html /path-to-local-directory/

  • Question: Open this report in a browser and try to find the outlier sample/s
  • Question: What is the most important parameter to look for while identifying contamination or bad samples?
  • Question: What is the overall quality of data?

Let’s run multiqc on one such directory where we ran and stored FastQC, FastQ Screen and Quast reports.

If you are not in the day2am folder, navigate to it and change directory to multiqc_analysis

d2m

#or

cd /scratch/micro612w21_class_root/micro612w21_class/username/day2am/

cd multiqc_analysis

#Activate workshop conda environment

conda activate micro612

multiqc -h

#Run multiqc on sample reports

multiqc ./ --force --filename workshop_multiqc

#Check if workshop_multiqc.html report was generated

ls

#transfer this report to your local system and open it in a browser for visual inspection

scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day2am/multiqc_analysis/workshop_multiqc.html /path-to-local-directory/

The report contains the Assembly, Fastq Screen and FastQC reports for a mixture of 51 organisms’ sequence data. Sample names for the assembly statistics end with “l500_contigs”.

  • Question: Which two sample’s genome length i.e column Length (Mbp) stand out from all the other genome lengths and what is the reason (hint – check their GC % and their FastQ Screen result)?
  • Question: Which sample has the worst N50 value? What do you think must be the reason (hint - check out the general stats)?
  • Question: Which sample has the second worst N50 value? Is it the same or a different issue from the worst sample (hint - check out general stats and then the fastQC results)?

Compare assembly to reference genome and post-assembly genome improvement

Once we feel confident in our assembly by using quast or multiQC, let’s compare it to our reference to see if we can identify any large insertions/deletions using a graphical user interface called Artemis Comparison Tool (ACT) for visualization.

In order to simplify the comparison between the assembly and the reference, we first need to orient the contigs relative to the reference.

assembly_details_abacas.png

i. Run abacas to orient contigs to the reference

To orient our contigs relative to the reference we will use a tool called abacas. ABACAS aligns contigs to a reference genome and then stitches them together to form a “pseudo-chromosome”.

Go back to great lakes and into the directory where the assembly is located.

d2m

#or

cd /scratch/micro612w21_class_root/micro612w21_class/username/day2am/

Now, we will run abacas using these input parameters:

  1. your reference sequence (-r FPR3757.fasta),
  2. your contig file (-q SRR5244781_contigs.fasta),
  3. the program to use to align contigs to reference (-p nucmer),
  4. append unmapped contigs to end of file (-b),
  5. use default nucmer parameters (-d),
  6. append contigs into pseudo-chromosome (-a),
  7. the prefix for your output files (-o SRR5244781_contigs_ordered)

Source the relevant abacas dependencies and check that abacas can be properly invoked:

source /scratch/micro612w21_class_root/micro612w21_class/shared/bin/PAGIT_2020/PAGIT/sourceme.pagit

abacas.1.3.1.pl -h

Run abacas on assembly:

abacas.1.3.1.pl -r FPR3757.fasta -q SRR5244781_contigs.fasta -p nucmer -b -d -a -o SRR5244781_contigs_ordered
ii. Use ACT to view contig alignment to reference genome
  • Make a new directory by the name ACT_contig_comparison in your day2am folder and copy relevant abacas/ACT comparison files to it.
mkdir ACT_contig_comparison

cp FPR3757.gb SRR5244781_contigs_ordered* ACT_contig_comparison/
  • Use scp to get the ordered fasta sequence and .crunch file onto your laptop
> Don't forget to change username and /path-to-local-ACT_contig_comparison-directory/ in the below command

scp -r username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day2am/ACT_contig_comparison/ /path-to-local-directory/
  • Read files into ACT
Go to File on top left corner of ACT window -> open 
Sequence file 1 = FPR3757.gb 
Comparison file 1  = SRR5244781_contigs_ordered.crunch 
Sequence file 2  = SRR5244781_contigs_ordered.fasta

Click Apply button

Don't close the ACT window
  • Notice that the alignment is totally beautiful now!!! Scan through the alignment and play with ACT features to look at genes present in reference but not in assembly. Keep the ACT window open for further visualizations.

https://github.com/alipirani88/Comparative_Genomics/blob/master/_img/day2_morning/beautiful.png

Genome Annotation

genome_annotation.png

Identify protein-coding genes with Prokka

annotation_details_prokka.png

From our ACT comparison of our assembly and the reference we can clearly see that there is unique sequence in our assembly. However, we still don’t know what that sequence encodes! To try to get some insight into the sorts of genes unique to our assembly we will run a genome annotation pipeline called Prokka. Prokka works by first running de novo gene prediction algorithms to identify protein coding genes and tRNA genes. Next, for protein coding genes Prokka runs a series of comparisons against databases of annotated genes to generate putative annotations for your genome.

Earlier, we submitted a prokka job which should be completed by now. In this exercise, we will go over the prokka results and copy annotation files to our local system that we can then use for ACT visualization.
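Before copying anything, it is worth a quick look at what Prokka produced. A minimal sketch (the file names assume Prokka's default output naming with the prefix we used in prokka.sbat):

# View Prokka's short summary of annotated features
cat SRR5244781_prokka/SRR5244781_contigs_ordered.txt

# Count the predicted CDS features in the GFF annotation
grep -c $'\tCDS\t' SRR5244781_prokka/SRR5244781_contigs_ordered.gff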

i. Use scp or cyberduck to get the Prokka annotated genome onto your laptop. Don't forget to change username in the below command
cd SRR5244781_prokka

ls 

scp -r username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day2am/SRR5244781_prokka/ /path-to-local-ACT_contig_comparison-directory/
ii. Reload comparison into ACT now that we’ve annotated the un-annotated!

annotation_details_ACT.png

Read files into ACT

Go to File on top left corner of ACT window -> open 
Sequence file 1 = FPR3757.gb 
Comparison file 1  = SRR5244781_contigs_ordered.crunch 
Sequence file 2  = SRR5244781_contigs_ordered.gbf
  • Play around with ACT to see what types of genes are unique to the MSSA genome SRR5244781 compared to the MRSA genome!

The MRSA reference genome is on the top and the MSSA assembly is on the bottom of your screen.

What genes (in general) do you expect to be in the MRSA genome but not the MSSA genome? Some sort of resistance genes, right? Indeed USA300 MRSA acquired the SCCmec cassette (which contains a penicillin binding protein and mecR1) which confers resistance to methicillin and other beta-lactam antibiotics.

Click on GoTo->FPR3757.gb->Navigator-> GoTo and search by gene name. Search for mecR1. Is it in the MSSA genome?

It also acquired the element ACME. One gene on ACME is arcA.

Click on GoTo->FPR3757.gb->Navigator-> GoTo and search by gene name. Search for arcA. Is it in the MSSA genome? Do you see other arc genes that may be in an operon with arcA?

Scroll through the length of the genome. Are there any genes in the MSSA genome that are not in the MRSA genome?

See this diagram and paper for more information on the features of USA300 MRSA:

Image from David & Daum Clin Microbiol Rev. 2010 Jul;23(3):616-87. doi: 10.1128/CMR.00081-09.

Using abacas and ACT to compare VRE/VSE genome

Now that we learned how ACT can be used to explore and compare genome organization and differences, try comparing VSE_ERR374928_contigs.fasta, a Vancomycin-susceptible Enterococcus against a Vancomycin-resistant Enterococcus reference genome Efaecium_Aus0085.fasta that are placed in VRE_vanB_comparison folder under day2am directory.

The relevant reference genbank file that can be used in ACT is Efaecium_Aus0085.gbf.

For this exercise, you will use abacas to order VSE_ERR374928_contigs.fasta against the reference genome Efaecium_Aus0085.fasta and then use the relevant ordered.crunch and ordered.fasta files along with Efaecium_Aus0085.gbf for ACT visualization. Use feature search tool in ACT to search for “vanB” in the resistant genome.

Prep for this afternoon

Before lunch, we’re going to start a job running ARIBA, which takes about 40 minutes to finish, and a job running Roary, which takes about 20 minutes to finish. That way, the results will be there when we’re ready for them!

Execute the following command to copy files for this afternoon’s exercises to your scratch directory, and then load the micro612 conda environment if it’s not already loaded:

cd /scratch/micro612w21_class_root/micro612w21_class/username

# or

wd

cp -r /scratch/micro612w21_class_root/micro612w21_class/shared/data/day2pm/ ./

# Deactivate current conda 
conda deactivate 

# Log out and log back in please

# Create a new conda environment - day2pm - from a YML file
conda env create -f /scratch/micro612w21_class_root/micro612w21_class/shared/data/day2pm/day2pm.yaml -n day2pm

# Load your environment and use the tools
conda activate day2pm

Next, let’s start the ariba job:

d2a

# or 

cd /scratch/micro612w21_class_root/micro612w21_class/username/day2pm

# list files
ls

# change directories
cd ariba

# modify email address and look at ariba command
nano ariba.sbat

# run job
sbatch ariba.sbat

Now, let’s start the Roary job:

cd ../roary

# or 

d2a
cd roary

Start the Roary job:

# modify email address and look at roary command
nano roary.sbat

# run roary
sbatch roary.sbat

Day 2 Afternoon

[HOME]

Goal

This morning we learned how to perform basic genome annotation and comparison using Prokka and ACT. Now we will up the ante and do some more sophisticated comparative genomics analyses!

  • First, we will create custom BLAST databases to identify specific antibiotic resistance genes of interest in a set of genomes.
  • Second, we will use the tool ARIBA to identify the complete antibiotic resistome in our genomes.
  • Third, we will move beyond antibiotic resistance, and look at the complete set of protein coding genes in our input genomes.
  • Finally, we will go back to ACT to understand the sorts of genomic rearrangements underlying observed variation in gene content.

comp_genomics.png

For BLAST and ARIBA, we will be looking at 8 Klebsiella pneumoniae genomes from human and environmental sources. Six of these genomes are from this paper, and the other two are sequences from our lab. We are interested in learning more about potential differences in the resistomes of human and environmental isolates.

For the pan-genome analysis, we will be looking at four closely related Acinetobacter baumannii strains. However, despite being closely related, these genomes have major differences in gene content, as A. baumannii has a notoriously flexible genome! In fact, in large part due to its genomic flexibility, A. baumannii has transitioned from a harmless environmental contaminant to a pan-resistant super-bug in a matter of a few decades. If you are interested in learning more, check out this nature review or this paper I published a few years back analyzing the very same genomes you will be working with.

Navigate to the day2pm directory and load the conda environment you created before break for this session:


d2a

or

cd /scratch/micro612w21_class_root/micro612w21_class/username/day2pm

Load your environment and use the tools

conda activate day2pm

Determine which genomes contain KPC genes using BLAST

comp_genomics_details_blast.png

Before comparing full genomic content, let’s start by looking for the presence of particular genes of interest. Some K. pneumoniae harbor a KPC gene that confers resistance to carbapenems, a class of antibiotics of last resort (more information here and here). We will see if any of our samples have a KPC gene by comparing the genes in our genomes to KPC genes extracted from the antibiotic resistance database (ARDB). These extracted genes can be found in the file blast/data/blast_kleb/ardb_KPC_genes.pfasta, which we will use to generate a BLAST database.

First, change directories to the blast directory:

#change directory to day2pm
d2a

cd blast
i. Run makeblastdb on the file of KPC genes to create a BLAST database.

makeblastdb takes as input:

  1. an input fasta file of protein or nucleotide sequences (data/blast_kleb/ardb_KPC_genes.pfasta) and
  2. a flag indicating whether to construct a protein or nucleotide database (in this case protein: -dbtype prot).
makeblastdb -in data/blast_kleb/ardb_KPC_genes.pfasta -dbtype prot
ii. BLAST K. pneumoniae protein sequences against our custom KPC database.

Run BLAST!

The input parameters are:

  1. query sequences (-query data/blast_kleb/kpneumo_all.pfasta),
  2. the database to search against (-db data/blast_kleb/ardb_KPC_genes.pfasta),
  3. the name of a file to store your results (-out KPC_blastp_results.tsv),
  4. output format (-outfmt 6),
  5. e-value cutoff (-evalue 1e-100),
  6. number of database sequences to return (-max_target_seqs 1) (Note that when using large databases, this might not give you the best hit. See here for more details.)
blastp -query data/blast_kleb/kpneumo_all.pfasta -db data/blast_kleb/ardb_KPC_genes.pfasta -out KPC_blastp_results.tsv -outfmt 6 -evalue 1e-100 -max_target_seqs 1

Use less to look at KPC_blastp_results.tsv. Which genomes have a KPC gene?

less KPC_blastp_results.tsv

Here is more information about the content for each of the output file columns.
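For example, since -outfmt 6 output is tab-delimited with percent identity in column 3 and e-value in column 11, you can quickly rank the hits by identity:

# Sort BLAST hits by percent identity (column 3), best first
sort -t$'\t' -k3,3gr KPC_blastp_results.tsv | head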

  • Exercise: In this exercise you will try a different type of blasting – blastx. Blastx compares a nucleotide sequence to a protein database by translating the nucleotide sequence in all six frames and running blastp. Your task is to determine which Enterococcus genomes are vancomycin resistant (VRE, vs. VSE) by blasting against a database of van genes. The required files are located in blast/data/blast_ent folder in the day2pm directory.

Your steps should be:

  1. Concatenate the data/blast_ent/*.fasta files (VRE/VSE genomes) into a single file (your blast query file) using the cat command.
  2. Create a blastp database from data/blast_ent/ardb_van.pfasta
  3. Run blastx
  4. Verify that only the VRE genomes hit the database
  5. For extra credit, determine which van genes were hit by using grep to search for the hit gene ID in data/blast_ent/ardb_van.pfasta
Solution
d2a

cd blast/data/blast_ent

# Make sure you are in blast_ent folder
cat *.fasta > VRE_VSE_genomes.fasta

makeblastdb -in ardb_van.pfasta -dbtype prot

blastx -query VRE_VSE_genomes.fasta -db ardb_van.pfasta -out van_blastp_results.tsv -outfmt 6 -evalue 1e-100 -max_target_seqs 1
  • Exercise: Experiment with the -outfmt parameter, which controls different output formats that BLAST can produce. You can use blastp -help | less to get more information about the different output formats. You can search for the -outfmt flag by typing /outfmt and then typing n to get to the next one.

Identify antibiotic resistance genes with ARIBA directly from paired end reads

Now let’s look at the full spectrum of antibiotic resistance genes in our Klebsiella genomes!

ARIBA (Antimicrobial Resistance Identification By Assembly) is a tool that identifies antibiotic resistance genes by running local assemblies. The input is a FASTA file of reference sequences (can be a mix of genes and noncoding sequences) and paired sequencing reads. ARIBA reports which of the reference sequences were found, plus detailed information on the quality of the assemblies and any variants between the sequencing reads and the reference sequences.

ARIBA is compatible with various databases and also contains a utility to download different databases such as: argannot, card, megares, plasmidfinder, resfinder, srst2_argannot, vfdb_core. Today, we will be working with the card database (ariba/data/CARD/ in your day2pm directory).
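For reference, a single ARIBA run on one sample follows the pattern below. This is only a sketch with placeholder read names; the exact command used for the workshop samples is in ariba.sbat.

# ariba run <prepared reference database dir> <forward reads> <reverse reads> <output dir>
ariba run data/CARD/ sample_R1.fastq.gz sample_R2.fastq.gz results/card/sample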

i. Run ARIBA on input paired-end fastq reads for resistance gene identification.

The fastq reads are in the ariba/data/kpneumo_fastq/ directory. Since ARIBA is memory intensive, we started this job at the beginning of the afternoon. We should have our results by now, but first let’s look at the ARIBA command.

# navigate to ariba directory
cd /scratch/micro612w21_class_root/micro612w21_class/username/day2pm

# or

d2a

cd ariba

# load conda environment if not already loaded
conda activate day2pm

# look at ariba commands
less ariba.sbat
ii. Run ARIBA summary function to generate a summary report.

ARIBA has a summary function that summarises the results from one or more sample runs of ARIBA and generates an output report with various levels of information determined by the -preset parameter. The parameter -preset minimal will generate a minimal report showing only the presence/absence of resistance genes, whereas -preset all will output all the extra information related to each database hit, such as read and reference sequence coverage, variants and their associated annotations (if the variant confers resistance to an antibiotic) etc.

# look at ariba output
ls results
ls results/card
ls results/card/*

# run ariba summary on the per-sample reports

ariba summary --preset minimal results/kpneumo_card_minimal_results results/card/*/report.tsv

ariba summary --preset all results/kpneumo_card_all_results results/card/*/report.tsv

The ARIBA summary generates three output files:

  1. kpneumo_card*.csv file that can be viewed in your favorite spreadsheet program (e.x. Microsoft Excel).
  2. kpneumo_card*.phandango.{csv,tre} that allow you to view the results in Phandango. You can drag-and-drop these files straight into Phandango.

Let’s copy these files, along with a metadata file, to the local system using cyberduck or scp.

mkdir ~/Desktop/micro612
mkdir ~/Desktop/micro612/day2pm

scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day2pm/ariba/results/kpneumo_card* ~/Desktop/micro612/day2pm
scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day2pm/ariba/data/kpneumo_source.tsv ~/Desktop/micro612/day2pm
scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day2pm/ariba/data/mlst_typing/kpneumo_mlst.tsv ~/Desktop/micro612/day2pm

Drag and drop these two files onto the Phandango website. What types of resistance genes do you see in these Klebsiella genomes?

iii. Explore full ARIBA matrix in R

Now, fire up RStudio and read in the ARIBA full report kpneumo_card_all_results.csv so we can take a look at the results.

# Read in data
ariba_full  = read.csv(file = '~/Desktop/micro612/day2pm/kpneumo_card_all_results.csv', row.names = 1)
rownames(ariba_full) = gsub('_1|_R1|/report.tsv|card/|results/','',rownames(ariba_full))

# Subset to get description for each gene
ariba_full_match = ariba_full[, grep('match',colnames(ariba_full))]

# Make binary for plotting purposes
ariba_full_match[,] = as.numeric(ariba_full_match != 'no')

# Make a heatmap!

# install pheatmap if you don't already have it installed 
#install.packages('pheatmap')

# load pheatmap
library(pheatmap)

# load metadata about sample source
annots = read.table('~/Desktop/micro612/day2pm/kpneumo_source.tsv',row.names=1)
colnames(annots) = 'Source'

# plot heatmap
pheatmap(ariba_full_match,annotation_row = annots)

Bacteria of the same species can be classified into different sequence types (STs) based on the sequence identity of certain housekeeping genes using a technique called multilocus sequence typing (MLST). The different sequences of these housekeeping genes present within a bacterial species are assigned distinct allele numbers and, for each isolate, the alleles at each of the seven genes define the allelic profile or sequence type (ST). Sometimes, different sequence types are associated with different environments or different antibiotic resistance genes. We want to know what sequence type(s) our genomes come from, and if there are certain ones that are associated with certain sources or certain antibiotic resistance genes.

We already pre-ran Ariba MLST on all 8 of our K. pneumoniae genomes. Use the MLST results kpneumo_mlst.tsv that we previously downloaded to add a second annotation column to the heatmap we created above to visualize the results.

Go to your R studio and overlay MLST metadata as an additional row annotation to your previous heatmap

#read in MLST data
annots_mlst = read.table('~/Desktop/micro612/day2pm/kpneumo_mlst.tsv',row.names=1)
colnames(annots_mlst) = 'ST'

#make sure order of genomes is the same as source annotation
annots_mlst = annots_mlst[row.names(annots), , drop = FALSE]

#change ST from numeric to character, so that heatmap doesn't treat ST as a continuous variable
annots_mlst$ST = as.character(annots_mlst$ST)

#paste together annotations
Row_annotations = cbind(annots, annots_mlst)

#create new heatmap with source and mlst
pheatmap(ariba_full_match, annotation_row = Row_annotations)

The commands and steps that were used for MLST typing are given below:

Solution

Don’t run this exercise

Steps:

  1. Check if you have an MLST database for your species of interest using ariba pubmlstspecies.
  2. Download your species MLST database. You can look at the manual or run the command ariba pubmlstget -h to help figure out how to download the correct MLST database. I would suggest downloading it to the data directory.
  3. Copy the ariba.sbatch file to a new file called mlst.sbatch.
  4. Modify the mlst.sbatch script in the following ways:
    1. Change the database directory to the K. pneumoniae MLST database you just downloaded.
    2. Change the mkdir line to make a results/mlst directory.
    3. Modify the output directory outdir line: Change card to mlst.
  5. Submit the mlst.sbatch script. It should take about 7 minutes to run.
  6. Once the run completes, run scripts/summarize_mlst.sh results/mlst to look at the MLST results. If you want, you can save it to its own file. What sequence types are present?
# Make sure you are in ariba directory under day2pm folder and running the below commands from ariba directory.
d2a
cd ariba

# Check if you have an mlst database for your species of interest
ariba pubmlstspecies

# Download your species mlst database
ariba pubmlstget "Klebsiella pneumoniae" data/MLST_db

# Set ARIBA database directory to the get_mlst database that we just downloaded.
db_dir=data/MLST_db/ref_db/

# Run ariba mlst with this database
samples=$(ls data/kpneumo_fastq/*1.fastq.gz) #forward reads

# Generate mlst folder under results folder to save ariba MLST results
mkdir results/mlst_typing

# Run for loop, where it generates ARIBA command for each of the forward end files.
for samp in $samples; do   
samp2=${samp//1.fastq/2.fastq} #reverse reads   
outdir=results/mlst_typing/$(echo $samp | cut -d/ -f3 | sed 's/_.*1.fastq.gz//g')
echo "Results will be saved in $outdir"
echo "Running: ariba run --force $db_dir $samp $samp2 $outdir  #ariba command "
ariba run --force $db_dir $samp $samp2 $outdir  #ariba command 
done

# Once the run completes, run the summarize_mlst.sh script to print out the mlst reports
bash scripts/summarize_mlst.sh results/mlst_typing

Perform pan-genome analysis with Roary

comp_genomics_details_roary.png

Roary is a pan genome pipeline, which takes annotated assemblies in GFF3 format and calculates the pan-genome. The pan-genome is just a fancy term for the full complement of genes in a set of genomes.

The way Roary does this is by:

  1. Roary gets all the coding sequences from GFF files, converts them into protein, and creates pre-clusters of all the genes.
  2. Then, using BLASTP and MCL, Roary will create gene clusters, and check for paralogs.
  3. Finally, Roary will take every isolate and order them by presence/absence of genes.
i. Generate pan-genome matrix using Roary and GFF files

Change your directory to day2pm


# Make sure to change username with your uniqname

cd /scratch/micro612w21_class_root/micro612w21_class/username/day2pm/roary

# or 

d2a
cd roary

Let’s look at the Roary command:

less roary.sbat

The roary command runs the pan-genome pipeline on the GFF files placed in Abau_genomes_gff, using 4 threads (-p), saves the results in an output directory Abau_genomes_roary_output (-f), generates R plots from the .Rtab output files and aligns core genes (-n).

Change directory to Abau_genomes_roary_output to explore the results.

cd Abau_genomes_roary_output

ls

Output files:

  1. summary_statistics.txt: This file is an overview of your pan-genome analysis showing the number of core genes (present in all isolates) and accessory genes (genes absent from one or more isolates or unique to a given isolate). See the quick look just after this list.
  2. gene_presence_absence.csv: This file contains detailed information about each gene, including its annotation, and can be opened in any spreadsheet software to manually explore the results. It contains a plethora of information such as gene names and their functional annotations, whether a gene is present in a genome or not, and minimum/maximum/average sequence lengths.
  3. gene_presence_absence.Rtab: This file is similar to the gene_presence_absence.csv file, however it just contains a simple tab delimited binary matrix with the presence and absence of each gene in each sample. It can be easily loaded into R using the read.table function for further analysis and plotting. The first row is the header containing the name of each sample, and the first column contains the gene name. A 1 indicates the gene is present in the sample, a 0 indicates it is absent.
  4. core_gene_alignment.aln: a multi-FASTA alignment of all of the core genes that can be used to generate a phylogenetic tree.
ii. Explore pan-genome matrix gene_presence_absence.csv and gene_presence_absence.Rtab using R

We’re going to use information from gene_presence_absence.csv and gene_presence_absence.Rtab. Let’s take a look at these from the Terminal using less:

# annotation information is here
less -S gene_presence_absence.csv
# presence/absence information
less -S gene_presence_absence.Rtab

Read matrices into R, generate exploratory plots and query pan-genome

Use scp or cyberduck to get gene_presence_absence.csv and gene_presence_absence.Rtab onto your laptop desktop folder.

scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day2pm/roary/Abau_genomes_roary_output/gene_presence_absence.csv ~/Desktop/micro612/day2pm
scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day2pm/roary/Abau_genomes_roary_output/gene_presence_absence.Rtab ~/Desktop/micro612/day2pm
i. Prepare and clean data
  • Fire up RStudio and read both files in.
# read in annotations (only need 3rd column)
annots = read.csv('~/Desktop/micro612/day2pm/gene_presence_absence.csv')[,3]
# read in presence-absence heatmap
pg_matrix = read.delim('~/Desktop/micro612/day2pm/gene_presence_absence.Rtab', row.names = 1, header=TRUE)
  • Use head, str, dim, etc. to explore the matrix.
ii. Generate exploratory heatmaps.
  • Make a heatmap for the full matrix
# load pheatmap
library(pheatmap)

# make heatmap
pheatmap(pg_matrix,col= c('black', 'red'), show_rownames = F)
  • Make a heatmap for variable genes (present in at least one, but not all of the genomes)
# subset matrix to include only variable genes
pg_matrix_subset = pg_matrix[rowSums(pg_matrix > 0) > 0 & rowSums(pg_matrix > 0) < 4 ,] 
# make heatmap of variable genes
pheatmap(pg_matrix_subset,col= c('black', 'red'), show_rownames = F)
iii. Query pan-genome
  • Which genomes are most closely related based upon shared gene content?

We will use the apply function to determine the number of genes shared by each pair of genomes.

apply(pg_matrix_subset, 2, function(x){
  colSums(pg_matrix_subset == x & pg_matrix_subset == 1)
})
  • What is the size of the core genome?

Let's first get an overview of how many genes are present in different numbers of genomes (0, 1, 2, 3 or 4) by plotting a histogram. Here, we combine hist with rowSums to accomplish this.

hist(rowSums(pg_matrix > 0), col="red")

Next, let's figure out how big the core genome is (i.e. how many genes are common to all of our genomes).

sum(rowSums(pg_matrix > 0) == 4)
  • What is the size of the accessory genome?

Let's use a similar approach to determine the size of the accessory genome (i.e. those genes present in only a subset of our genomes).

sum(rowSums(pg_matrix > 0) < 4 & rowSums(pg_matrix > 0) > 0)
  • What types of genes are unique to a given genome?

So far we have quantified the core and accessory genomes; now let's see if we can get an idea of what types of genes are core vs. accessory. Let's start by looking at those genes present in only a single genome.

# genes present only in a single genome
unique_annots = annots[rowSums(pg_matrix > 0) == 1]
unique_annots

What do you notice about these genes?

  • What is the number of hypothetical genes in core vs. accessory genome?

Looking at unique genes we see that many are annotated as “hypothetical”, indicating that the sequence looks like a gene, but has no detectable homology with a functionally characterized gene.

Determine the fraction of “hypothetical” genes in unique vs. core.

# genes present in all genomes
core_annots = annots[rowSums(pg_matrix > 0) == 4]

# fraction of unique and core hypothetical genes
# unique hypothetical
sum(grepl("hypothetical" , unique_annots)) / sum(rowSums(pg_matrix > 0) == 1)
# core hypothetical
sum(grepl("hypothetical" , core_annots)) / sum(rowSums(pg_matrix > 0) == 4)

Why does this make sense?

Perform genome comparisons with ACT


In the previous exercises we were focusing on gene content, but losing the context of the structural variation underlying gene content variation (e.g. large insertions and deletions). Here we will use ACT to compare two of our genomes (note that you can use ACT to compare more than two genomes if desired).

i. Create ACT alignment file with BLAST

As we saw this morning, to compare genomes in ACT we need to use BLAST to create the alignments. We will do this on great lakes.

# navigate to the act directory (make sure to change username to your uniqname)
cd /scratch/micro612w21_class_root/micro612w21_class/username/day2pm/act

# or

d2a
cd act

Run blast on two A. baumannii genomes:

blastn -query data/Abau_genomes/AbauA_genome.fasta -db data/Abau_BLAST_DB/ACICU_genome.fasta -outfmt 6 -evalue 1e-20 > AbauA_vs_ACICU.blast
ii. Read in genomes, alignments and annotation files

Use scp or cyberduck to transfer the following files onto your laptop:

  1. data/Abau_genomes/AbauA_genome.fasta
  2. data/Abau_genomes/ACICU_genome.fasta
  3. AbauA_vs_ACICU.blast
  4. data/Abau_ACT_files/AbauA_genome_gene.gff
  5. data/Abau_ACT_files/ACICU_genome_gene.gff
iii. Explore genome comparison and features of ACT

Read in genomes and alignment into ACT


Go to File -> open 
Sequence file 1  = ACICU_genome.fasta 
Comparison file 1  = AbauA_vs_ACICU.blast
Sequence file 2  = AbauA_genome.fasta

Previously we used annotation present in GenBank files. Here we will use ACT-specific annotation files to get a prettier display (resistance genes = red, transposable elements = bright green).


Go to File -> ACICU_genome.fasta -> Read an entry file = ACICU_genome_gene.gff

Go to File -> AbauA_genome.fasta -> Read an entry file = AbauA_genome_gene.gff

Play around in ACT to gain some insight into the sorts of genes present in large insertion/deletion regions. See if you can find:

  1. differences in phage content,
  2. membrane biosynthetic gene cluster variation and
  3. antibiotic resistance island variation.

Day 3 Morning

[HOME]

Goal

On day 1, we ran through a pipeline to map reads against a reference genome and call variants, but didn’t do much with the variants we identified. Among the most common analyses to perform on a set of variants is to construct phylogenetic trees.

  • In this section, we will employ Parsnp to align genomes and create a multiple sequence alignment and visualize how they align in Gingr.
  • Explore different tools for generating and visualizing phylogenetic trees.
  • Perform recombination filtering with Gubbins and see how recombination can distort phylogenetic signal.


For the first several exercises, we will use the A. baumannii genomes that we worked with yesterday afternoon. The backstory on these genomes is that Abau_A, Abau_B and Abau_C are representatives of three clones (as defined by pulsed-field gel electrophoresis - a low-resolution typing method) that were circulating in our hospital.

One of the goals of our published study was to understand the relationship among these clones to discern whether:

  1. the three clones represent three independent introductions into the hospital or
  2. the three clones originated from a single introduction into the hospital, with subsequent genomic rearrangement leading to the appearance of unique clones.

The types of phylogenetic analyses you will be performing here are the same types that we used to decipher this mystery. The other two genomes you will be using are ACICU and AB0057. ACICU is an isolate from a hospital in France, and its close relationship to our isolates makes it a good reference for comparison. AB0057 is a more distantly related isolate that we will utilize as an out-group in our phylogenetic analysis. The utility of an out-group is to help us root our phylogenetic tree, and gain a more nuanced understanding of the relationship among strains.

Execute the following command to copy files for this afternoon’s exercises to your scratch directory:

wd

#or

cd /scratch/micro612w21_class_root/micro612w21_class/username

cp -r /scratch/micro612w21_class_root/micro612w21_class/shared/data/day3am ./

Perform Whole genome alignment with Parsnp and convert alignment to other useful formats

An alternative approach for identification of variants among genomes is to perform whole genome alignments of assemblies. If the original short read data is unavailable, this might be the only approach available to you. Typically, these programs don’t scale well to large numbers of genomes (e.g. > 100), but they are worth being familiar with. We will use the tool Parsnp for constructing whole genome alignments of our five A. baumannii genomes.

i. Perform genome alignment with Parsnp

Create a conda environment day3am that will install Parsnp/Harvesttools for you. Run these commands to generate a new conda environment.

# Deactivate conda environment if you have loaded it.
conda deactivate

conda env create -f /scratch/micro612w21_class_root/micro612w21_class/shared/data/day3am/day3am.yml -n day3am

conda activate day3am

# invoke the parsnp and harvesttools help menus to check that they were installed properly
parsnp -h

harvesttools -h

cd day3am

Now we will ask parsnp to align all the genomes in Abau_genomes directory and also ask parsnp to use Reference_genome/ACICU_genome.fasta as a reference genome.


parsnp -c -d Abau_genomes/ -r Abau_genomes/Reference_genome/ACICU_genome.fasta -o parsnp_results -p 2 -v

Parsnp will generate various output files in parsnp_results folder:

  • Newick formatted core genome SNP tree: parsnp_results/parsnp.tree
  • SNPs used to infer phylogeny: parsnp_results/parsnp.vcf
  • Gingr formatted binary archive: parsnp_results/parsnp.ggr
  • XMFA formatted multiple alignment: parsnp_results/parsnp.xmfa
ii. Convert Gingr formatted binary file to fasta format

We will use harvesttools to convert parsnp.ggr to a multi-fasta alignment output (concatenated LCBs) file - parsnpLCB.aln

cd parsnp_results

harvesttools -i parsnp.ggr -M parsnpLCB.aln
iii. View multiple genome alignment in gingr

Gingr is a visualization tool that accompanies parsnp. So, let’s download the files we created using parsnp and load them into gingr.

Go to your local system terminal or open a new terminal tab and run these commands to create a new directory and copy files from great lakes to your local Desktop.

mkdir ~/Desktop/Abau_parsnp

cd ~/Desktop/Abau_parsnp

# Make sure to replace username with your uniqname
scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day3am/parsnp_results/parsnpLCB.aln ./
scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day3am/parsnp_results/parsnp.tree ./
scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day3am/Abau_genomes/Reference_genome/ACICU.gb ./

Now, fire up gingr and use File->open tab to read in these files one by one - .aln, .tree and .gb files.

Notice the structure of the tree (i.e. which genomes are closely related to one another) and whether variants are or aren’t evenly spaced across the genome.

Perform some DNA sequence comparisons and phylogenetic analysis in APE, an R package

We’ve done some initial visualization of the parsnp output files in gingr, now we are going to do some further exploration in R.

First we will determine how many variants separate each genome, second we will visualize the phylogenetic tree produced by parsnp and finally we will look for evidence of recombination among our genomes.

i. Read alignment into R

Fire up RStudio, set your working directory to ~/Desktop/Abau_parsnp/ or wherever you have downloaded your parsnp files and install/load ape

Use the read.dna function in ape to read in your multiple alignment. Print out the variable to get a summary.

#SET YOUR DIRECTORY
setwd("~/Desktop/Abau_parsnp/")

#INSTALL THE ape PACKAGE FOR PHYLOGENETIC ANALYSIS
install.packages("ape")
library(ape)

#READ IN THE MULTIPLE GENOME ALIGNMENT AND CHANGE THE NAMES TO REMOVE FILE EXTENSIONS
abau_msa = read.dna('parsnpLCB.aln', format = "fasta") 
row.names(abau_msa) = gsub(".fa|.fasta", "", row.names(abau_msa))
ii. Get variable positions

The DNA object created by read.dna can also be addressed as a matrix, where the columns are positions in the alignment and rows are your sequences. We will next treat our alignment as a matrix, and use apply and colSums to get positions in the alignment that vary among our sequences. Examine these commands in detail to understand how they are working together to give you a logical vector indicating which positions vary in your alignment.


abau_msa_bin = apply(abau_msa, 2, FUN = function(x){x == x[1]}) 

abau_var_pos = colSums(abau_msa_bin) < 5
iii. Get non-gap positions

For our phylogenetic analysis we want to focus on the core genome, so we will next identify positions in the alignment where all our genomes have sequence.

non_gap_pos = colSums(as.character(abau_msa) == '-') == 0
iv. Count number of variants between sequences

Now that we know which positions in the alignment are core and variable, we can extract these positions and count how many variants there are among our genomes. To count pairwise variants we will use the dist.dna function in ape. The model parameter indicates that we want to compare sequences by counting differences. Print out the resulting matrix to see how different our genomes are.


abau_msa_var = abau_msa[,abau_var_pos & non_gap_pos ]
var_count_matrix = dist.dna(abau_msa_var, model = "N")

Examining the pairwise distances among our isolates, we see that our genomes have thousands of variants between them. Based on the estimated evolutionary rate of Acinetobacter, we would expect this amount of variation to take hundreds of years to accumulate via mutation and vertical inheritance. Thus, based on this observation it seems unlikely that these are related by recent transmission. However, before we reach our final conclusion we need to assess whether all of this variation was due to mutation and vertical inheritance, or if some of the variation is due to horizontal transfer via recombination.

v. View phylogenetic tree

First, let’s read in the tree produced by parsnp and plot it using ape.

parsnp_tree = read.tree('parsnp.tree')
plot(parsnp_tree)

Next, let’s root our tree by the outgroup so that the structure is correct.

parsnp_tree_rooted = root(parsnp_tree, 'Abau_AB0057_genome.fa')
plot(parsnp_tree_rooted)

Now that the tree is rooted, let’s drop the outgroup so we can more clearly see the tree structure for our isolates of interest.

parsnp_tree_rooted_drop = drop.tip(parsnp_tree_rooted, c('Abau_AB0057_genome.fa', 'ACICU_genome.fasta.ref'))
plot(parsnp_tree_rooted_drop)

Notice that with this tree, genomes A and B are more closely related, suggesting that they share a more recent common ancestor than C. In the realm of genomic epidemiology we would infer that A and B are more closely related in a putative transmission chain. Let’s see if the tree structure, and therefore our understanding of the outbreak is influenced by filtering out recombinant regions.

Perform SNP density analysis to discern evidence of recombination

An often-overlooked aspect of a proper phylogenetic analysis is to exclude recombinant sequences. Homologous recombination in bacterial genomes is a mode of horizontal transfer, wherein genomic DNA is taken up and swapped in for a homologous sequence. The reason it is critical to account for these recombinant regions is that these horizontally acquired sequences do not represent the phylogenetic history of the strain of interest, but rather contain information about the strain from which the sequence was acquired. One simple approach for detecting the presence of recombination is to look at the density of variants across a genome. The existence of unusually high or low densities of variants suggests that these regions were horizontally acquired. Here we will look at our closely related A. baumannii genomes to see if there is evidence of aberrant variant densities.

i. Subset sequences to exclude the out-group

For this analysis we want to exclude the out-group, because we are interested in determining whether recombination would hamper our ability to reconstruct the phylogenetic relationship among our closely related set of genomes.

  • Note that the names of the sequences might be different for you, so check them if the command doesn’t work. Try printing out abau_msa_no_outgroup to check that it worked.
abau_msa_no_outgroup = abau_msa[c('ACICU_genome.ref','AbauA_genome','AbauB_genome','AbauC_genome'),]
ii. Get variable positions

Next, we will get the variable positions, as before


abau_msa_no_outgroup_bin = apply(abau_msa_no_outgroup, 2, FUN = function(x){x == x[1]}) 

abau_no_outgroup_var_pos = colSums(abau_msa_no_outgroup_bin) < 4
iii. Get non-gap positions

Next, we will get the core positions, as before


abau_no_outgroup_non_gap_pos = colSums(as.character(abau_msa_no_outgroup) == '-') == 0
iv. Create overall histogram of SNP density

Finally, create a histogram of SNP density across the genome.

hist(which(abau_no_outgroup_var_pos & abau_no_outgroup_non_gap_pos), 10000)

Does the density look even, or do you think there might be just a touch of recombination?

Perform recombination filtering with Gubbins

Now that we know there is recombination, we know that we need to filter out the recombinant regions to discern the true phylogenetic relationship among our strains. In fact, this is such an extreme case (~99% of variants are recombinant) that we could be totally misled without filtering recombinant regions. To accomplish this we will use the tool gubbins, which essentially relies on elevated regions of variant density to perform recombination filtering.

i. Run gubbins on your fasta alignment

Go back to great lakes and load the modules required by gubbins

# Deactivate day3am conda environment
conda deactivate

module load Bioinformatics

module load gubbins/2.3.1

Run gubbins on your fasta formatted alignment

d3m

cd parsnp_results

run_gubbins -v -f 50 -o Abau_AB0057_genome.fa parsnpLCB.aln
ii. View recombination regions detected by gubbins in Phandango

Phandango is a web based tool that is useful for visualizing output from many common microbial genomic analysis programs. Here we will use it to visualize the recombination regions detected by gubbins.

First let’s download a summary of recombinant regions in gff format onto your local system. Type these commands on your local terminal.


cd ~/Desktop/Abau_parsnp

scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day3am/parsnp_results/parsnpLCB.recombination_predictions.gff  ./

scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day3am/parsnp_results/parsnpLCB.node_labelled.final_tree.tre  ./

Next, go to the Phandango website (https://jameshadfield.github.io/phandango/#/), and just drag the gff file - parsnpLCB.recombination_predictions.gff and your parsnp tree - parsnp.tree into your web browser.

Does gubbins seem to have identified recombinant regions where we saw elevated variant density? In addition, which genomes seem to share the most recombinant regions?

Finally, let's look at the recombination-filtered tree to see if this alters our conclusions.

To view the tree we will use the ape package in R:


# In RStudio

# Load ape library
library(ape)

# Path to tree file
tree_file <- '~/Desktop/Abau_parsnp/parsnpLCB.node_labelled.final_tree.tre'

# Read in tree
gubbins_tree <- read.tree(tree_file)

# Drop the outgroup for visualization purposes
gubbins_tree_noOG = drop.tip(gubbins_tree, c('Abau_AB0057_genome.fa'))

plot(gubbins_tree_noOG)

How does the structure look different than the unfiltered tree?

  • Note that turning back to the backstory of these isolates, Abau_B and Abau_C were both isolated first from the same patient. So this analysis supports that patient having imported both strains, which likely diverged at a prior hospital at which they resided.

Overlay metadata on your tree using R

For the final exercise we will use a different dataset, composed of USA300 methicillin-resistant Staphylococcus aureus genomes. USA300 is a strain of growing concern, as it has been observed to cause infections in both hospitals and in otherwise healthy individuals in the community. An open question is whether there are sub-clades of USA300 in the hospital and the community, or if they are all the same. Here you will create an annotated phylogenetic tree of strains from the community and the hospital, to discern if these form distinct clusters.

i. Download MRSA genome alignment from great lakes

Use cyberduck or scp to get genomes onto your local system.


cd ~/Desktop (or wherever your desktop is) 
mkdir MRSA_genomes 
cd MRSA_genomes

scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day3am/2016-03-09_KP_BSI_USA300.fa  ./
scp username@greatlakes-xfer.arc-ts.umich.edu:/scratch/micro612w21_class_root/micro612w21_class/username/day3am/HA_vs_CA  ./

ii. Look at SNP density for MRSA alignment in R

Before we embark on our phylogenetic analysis, let's look at the SNP density to verify that there is no recombination.

Fire up R studio and run these commands.

library(ape)
mrsa_msa = read.dna('2016-03-09_KP_BSI_USA300.fa', format = 'fasta') 
mrsa_msa_bin = apply(mrsa_msa, 2, FUN = function(x){x == x[1]}) 
mrsa_var_pos = colSums(mrsa_msa_bin) < nrow(mrsa_msa_bin) 
hist(which(mrsa_var_pos), 10000)

Does it look like there is evidence of recombination?

iii. Create neighbor joining tree in R

We will be using R to construct a neighbor-joining tree. Neighbor joining trees fall under the category of “distance-based” tree building. There are more sophisticated algorithms for tree building, including maximum likelihood and Bayesian methods.

There are a number of bioinformatics tools to create trees. Note that in your research it is not a good idea to use these phylogenetic tools completely blind; I strongly encourage embarking on deeper learning yourself, or consulting with an expert, before doing an analysis for a publication.

One resource here at University of Michigan is the Phylogenetic Methods course, EEB 491.

First, we will need to install and load in some packages to work with tree objects in R.


install.packages('phangorn')
install.packages('phytools')

library(ape) #installed previously, load in if not already in your environment 
library(phangorn)
library(phytools)

Next, we will create a pairwise SNV distance matrix of the MRSA samples to create a neighbor joining tree.

mrsa_msa_var = mrsa_msa[, mrsa_var_pos]
dna_dist = dist.dna(mrsa_msa_var, model = 'N', as.matrix = TRUE)

Finally, we can use the distance matrix to construct a neighbor joining tree using the function nj()

NJ_tree = nj(dna_dist) 

We can look at our tree using plot()

plot(NJ_tree)

We can explore other representations of our tree in R using the flag “type” in the plot function

plot(NJ_tree, type = 'fan')
plot(NJ_tree, type = 'cladogram')
plot(NJ_tree, type = 'phylogram') #default
iv. Add annotations to tree in R

Now, we will overlay our data on whether an isolate was from a community or hospital infection onto the tree.

Here is a great example of some of the different ways to annotate your trees in R: https://rdrr.io/cran/ape/man/nodelabels.html

First, we will read in our metadata.

metadata = read.table('HA_vs_CA', header = TRUE, stringsAsFactors = FALSE)

Next, let’s clean up the tree tip label names to match the metadata IDs, and then drop the tree tips we don’t have metadata for.

NJ_tree$tip.label = gsub('_R.*','',NJ_tree$tip.label)
NJ_tree = drop.tip(NJ_tree, setdiff(NJ_tree$tip.label, metadata$ID))

Next, we will create our isolate legend and assign colors to the legend. It’s important that the labels are in the same order as the tree tips. So, we will use an sapply statement (which is like a for loop) to iterate through the tree tip labels, and figure out the label for each tree tip id.

isolate_legend = sapply(NJ_tree$tip.label, function(id){
  metadata$SOURCE[metadata$ID == id]
})
isolate_colors = structure(c('blue', 'red'), names = sort(unique(isolate_legend)))

Finally, we can use the function tiplabels() to add annotations to our tree and the function legend() to put a legend on our tree.

plot(NJ_tree, type = 'fan', no.margin = TRUE, 
     cex = 0.5, label.offset = 3, align.tip.label = FALSE)
tiplabels(pie = to.matrix(isolate_legend,names(isolate_colors)), 
          piecol = isolate_colors, cex = 0.3, adj = 1.4)
legend('bottomleft', legend = names(isolate_colors), 
       col = "black", pt.bg = isolate_colors, pch = 21, cex = 1)
  • Do community and hospital isolates cluster together, or are they inter-mixed?

Note, there are also web tools out there to overlay metadata onto your tree and to manipulate the tree in other ways. One such tool is iTOL.

Day 3 Afternoon

[HOME]

Klebsiella pneumoniae comparative genomic analysis

To finish up the workshop we are going to go through the process of working up a complete dataset, from start to finish. This set of genomes originated from a regional outbreak of bla-KPC carrying Klebsiella pneumoniae – one of the most concerning healthcare associated pathogens. The goal is to follow up on a previously published epidemiologic analysis, and see if genomics supports prior epidemiologic conclusions and can provide additional insights.

The results of this genomics analysis were published in this paper.

We have our genomes, and we know in which regional facility each isolate originated.

The goal of this exercise is to:

  1. process our genomes (QC, variant calling),
  2. perform a phylogenetic analysis and
  3. overlay our meta-data.

To make this more difficult, the instructions will be much more vague than in previous sessions, and you will be challenged to use what you have learned, both in the past three days and in the prior workshop, to complete this analysis.

Hopefully we’ve prepared you to take on the challenge, but remember this is an open book test!

Feel free to lean on materials from the workshops, manuals of tools and Google (and of course instructors and neighbors).

Execute the following command to copy files for this afternoon’s exercises to your scratch directory:


cd /scratch/micro612w21_class_root/micro612w21_class/username

or 

wd

cp -r /scratch/micro612w21_class_root/micro612w21_class/shared/data/day3pm ./

Perform QC on fastq files

On the first morning you ran FastQC to evaluate the quality of a single genome. However, a typical project will include many genomes and you will want to check the quality of all of your samples. From the bash workshop, I hope you can appreciate that you do not want to process 100 genomes by typing 100 commands – rather you want to write a short shell script to do the work for you!

i. Edit the shell script fastqc.sh located in /scratch/micro612w21_class_root/micro612w21_class/username/day3pm (replace username with your uniqname) to run FastQC on all fastq files.

Important info about this shell script

  • The shell script includes a for loop that loops over all of the genomes in the target directory
  • The tricky part of this shell script is that each fastqc command takes two files (forward and reverse reads). So, you need to take advantage of the fact that the forward and reverse read files both have the same prefix, and you can loop over these prefixes.
  • You should be able to get prefixes by piping the following unix commands: ls, cut, sort, uniq
  • The prefix should be a part of both forward and reverse reads. For example, the file_prefix for samples Rush_KPC_264_1_sequence.fastq.gz and Rush_KPC_264_2_sequence.fastq.gz should be Rush_KPC_264
  • When you are testing your shell script, comment out (using #) the lines below echo so you can check that the script is ‘echo’-ing the correct commands.
  • Try running multiqc inside the script by adding the multiqc command with an appropriate output directory (see the sketch after this list).
  • Don’t run multiqc inside the for loop; it should be run only after the for loop ends.
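
To make these hints concrete, here is a minimal sketch of the kind of loop fastqc.sh could contain. This is not the provided script; the output directory name and the exact cut fields are assumptions you will need to adapt to the real file names.

# Minimal sketch of fastqc.sh, assuming paired files named <prefix>_1_sequence.fastq.gz / <prefix>_2_sequence.fastq.gz
# (the output directory and the cut fields are assumptions; adapt them to your data)
fastq_dir=$1
out_dir=fastqc_results
mkdir -p $out_dir

# build the list of unique sample prefixes (e.g. Rush_KPC_264) from the file names
prefixes=$(ls $fastq_dir/*.fastq.gz | xargs -n 1 basename | cut -d_ -f1-3 | sort | uniq)

for prefix in $prefixes; do
    echo "Running FastQC on ${prefix}"
    fastqc -o $out_dir $fastq_dir/${prefix}_1_sequence.fastq.gz $fastq_dir/${prefix}_2_sequence.fastq.gz
done

# run multiqc once, after the for loop has finished
multiqc -o $out_dir $out_dir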

The fastq files are located in:

/scratch/micro612w21_class_root/micro612w21_class/shared/data/day3pm_fastq/

Rather than copying these to your directory, analyze the files directly in that directory, so everyone doesn’t have to copy 25G to their home directories.

Copy and paste the command to run fastqc.sh into a slurm script and submit this slurm script as a job to great lakes.

Your slurm script will contain the following command after the slurm preamble (make sure your $SLURM_SUBMIT_DIR is set inside the slurm script):

bash fastqc.sh /scratch/micro612w21_class_root/micro612w21_class/shared/data/day3pm_fastq/
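
Putting it together, a minimal fastqc.sbat might look like the sketch below (the #SBATCH values are placeholders; copy the preamble from the sbat scripts you used earlier in the workshop):

#!/bin/bash
#SBATCH --job-name=fastqc
#SBATCH --account=micro612w21_class   # placeholder; use the account from your earlier sbat scripts
#SBATCH --partition=standard          # placeholder partition
#SBATCH --time=02:00:00               # placeholder walltime
#SBATCH --cpus-per-task=1
#SBATCH --mem=4g

cd $SLURM_SUBMIT_DIR

bash fastqc.sh /scratch/micro612w21_class_root/micro612w21_class/shared/data/day3pm_fastq/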

ii. Examine output of FastQC to verify that all samples are OK

Check the multiqc report of your fastq files.

Explore ARIBA CARD and MLST typing on day3pm_fastq samples

On Day 2 afternoon, you explored ARIBA’s MLST results that were generated for a single genome. However, a typical public health epidemiology project will include many genomes sampled from different sites, and you will want to sequence type each of these samples to study their genetic diversity and determine if they contain antibiotic resistance genes.

Since both MLST typing and CARD resistance gene detection take a while to run on these samples, we already processed them and have placed the results in the day3pm folder.

i. Explore MLST typing results.

Try exploring the mlst.sh and mlst.sbat scripts that were used to generate these results, and try to understand how the for loop runs the samples sequentially.

You can find the processed MLST results for day3pm_fastq samples in this folder:

/scratch/micro612w21_class_root/micro612w21_class/shared/data/day3pm/MLST_results

Explore and run summarize_mlst.sh and check the dominant sequence type that these samples belong to.

ii. Summarize ARIBA CARD results

Now go to the CARD_results folder under day3pm and explore the card.sh and card.sbat scripts that were used to generate these results.

Try to understand how the for loop runs the samples sequentially.

Use Ariba’s summary function that we used in day2pm and summarize ARIBA CARD results that are placed in:

/scratch/micro612w21_class_root/micro612w21_class/shared/data/day3pm/CARD_results
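
As a reminder from day2pm, the summary call has the form ariba summary <output_prefix> <report files>; a sketch for these results might look like this (the report.tsv glob is an assumption about how the per-sample output folders are laid out):

# summarize all per-sample CARD reports into one matrix plus Phandango input files
ariba summary card_summary /scratch/micro612w21_class_root/micro612w21_class/shared/data/day3pm/CARD_results/*/report.tsv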

iii. Drag and drop the Phandango-associated files onto the Phandango website and check what types of antibiotic resistance genes are present in these genomes.
iv. Explore the full ARIBA matrix in R and plot a heatmap to visualize the presence/absence of various resistance genes.

How many of these samples contain the carbapenem resistance gene KPC?
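
A minimal sketch of this exercise in R, assuming you have downloaded the ariba summary matrix as card_summary.csv to a day3pm folder on your desktop (the file name, path and the 'yes' coding of calls are assumptions; check your actual summary file first):

# read the ARIBA summary matrix (first column = sample name)
library(pheatmap)
card = read.csv('~/Desktop/micro612/day3pm/card_summary.csv', row.names = 1)

# convert the yes/no calls to a binary matrix and plot presence/absence
card_bin = apply(card, 2, function(x) as.numeric(x == 'yes'))
rownames(card_bin) = rownames(card)
pheatmap(card_bin, col = c('black', 'red'), show_rownames = TRUE)

# count samples with any KPC cluster present (column names are assumptions)
kpc_cols = grep('KPC', colnames(card_bin))
sum(rowSums(card_bin[, kpc_cols, drop = FALSE]) > 0)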

Examine results of Snippy pipeline

On the afternoon of day 1 we saw how many steps are involved in calling variants relative to a reference genome. However, the same steps are applied to every sample, which makes this very pipeline friendly! So, you could write your own shell script to string together these commands, or take advantage of one of several published pipelines.

Here, we will use Snippy to perform rapid haploid variant calling and generate a core genome alignment. Snippy takes a tab-delimited file containing a list of paths to your fastq samples, along with a reference genome in GenBank format, and generates a runme.sh shell script that can be run inside a SLURM script.

We already created a runme.sh under snippy_results using snippy-multi command from snippy. Try exploring this shell script and understand what it means.
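
For reference, the inputs to snippy-multi look roughly like the sketch below (the sample ID, fastq paths and reference name here are made up for illustration; the real input file and reference are in the snippy_results folder):

# input.tab: one sample per line - ID, forward reads, reverse reads (tab separated)
# Rush_KPC_264    /path/to/Rush_KPC_264_1_sequence.fastq.gz    /path/to/Rush_KPC_264_2_sequence.fastq.gz

# write the per-sample snippy commands (plus a final snippy-core step) into runme.sh
snippy-multi input.tab --ref reference.gbk --cpus 4 > runme.sh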

Each line of runme.sh contains a snippy command for one of your samples, which are located in /nfs/esnitkin/Workshop_Backups/micro612w20_class/shared/data/day3pm_fastq/

Using runme.sh inside a slurm script - snippy.sbat, we already ran the pipeline for you, since it takes ~2 hours to run on 44 samples. The results of this run are located in the snippy_results folder.

Snippy creates an individual folder Rush_KPC_* for each of the samples containing their variant call results. Each of the output files is described on Snippy’s Github page - Section Calling SNPs -> Output Files

Since we are running Snippy pipeline on multiple samples, Snippy also produces an alignment of “core SNPs” which can be used to build a phylogeny.

Snippy generates various other files with a prefix core.* that can be further used to explore variants/alignment statistics.

i. Look at overall statistics for variant calling in excel

Snippy produces an overall summary file of its run, core.txt, which includes a tab-separated list of alignment/core-size statistics such as:

  1. Number of ALIGNED and UNALIGNED bases for a given sample.
  2. Number of filtered VARIANT (variants passing all the filters)
  3. Number of bases with low coverage - LOWCOV

Use less to look at this file and then apply unix commands to extract and sort individual columns.

HINTS: The following unix commands can be used to get a sorted list of the number of VARIANT and ALIGNED bases for each sample: tail, cut, sort
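
For example, a pipeline of this shape will give you samples sorted by one of these columns (the column number for ALIGNED is an assumption; check the header of core.txt with less first and adjust -f accordingly):

# skip the header, keep the sample ID and the ALIGNED column (assumed here to be column 3), sort numerically
tail -n +2 core.txt | cut -f1,3 | sort -t$'\t' -k2,2 -nr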

ii. Optional: explore core.vcf to see how the variant calls for each of your samples are represented in a single multi-sample VCF file.

Recombination detection and tree generation

i. Plot the distribution of variants across the genome in R

The positions of variants are embedded in the second column of core.tab, but you have to do some work to isolate them!

HINTS

  • You will need to use “cut” command to extract the second column “POS” from core.tab
  • Note that for cut you can specify tab as the delimiter as follows: cut -d$'\t'
  • You should redirect the output of your cut command (a list of SNP positions) to a file called ‘snp_positions.txt’. For example, the first line of your snp_positions.txt should be:
12695
  • Finally, download this file, read it into R using ‘read.table’ and use ‘hist’ to plot a histogram of these core positions (see the sketch after this list)
  • Do you observe clustering of variants that would be indicative of recombination?
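
A minimal sketch of the extraction step, run on great lakes in the snippy_results folder (POS is the second column per the note above; the header line is skipped so that the first line of the output is a position):

# pull the POS column out of core.tab, skipping the header, and save it to a file
tail -n +2 core.tab | cut -d$'\t' -f2 > snp_positions.txt

Then download snp_positions.txt, read it into R with read.table and plot it with hist, as in the morning session.
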
ii. Create a neighbor-joining tree from the core SNP alignment in R

Download core.aln to your home computer
Read the fasta file into R, create a distance matrix, and make a neighbor joining tree. Make sure you load in the necessary packages! 
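
If you get stuck, here is a minimal sketch mirroring the MRSA exercise from this morning (the file path is an assumption; point it to wherever you saved core.aln):

# load the phylogenetics package used this morning
library(ape)

# read the core SNP alignment, compute pairwise SNP distances and build a neighbor joining tree
kpneumo_aln = read.dna('~/Desktop/micro612/day3pm/core.aln', format = 'fasta')
kpneumo_dist = dist.dna(kpneumo_aln, model = 'N', as.matrix = TRUE)
kpneumo_nj = nj(kpneumo_dist)
plot(kpneumo_nj)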

Phylogenetic tree annotation and visualization

Follow along with the Day 3 morning exercise, where we annotated an MRSA tree with CA vs. HA metadata, to overlay facility information on the Klebsiella neighbor-joining tree you just made.

  1. Read in annotation file called Rush_KPC_facility_codes.txt to R. This file is comma delimited.
  2. Drop tip labels from the tree that are not in the annotation file. Hint use setdiff() and drop.tip()
  3. Midpoint root the tree for visualization purposes using the function midpoint.root()
my_tree = midpoint.root(my_tree) #where my_tree is what you named your neighbor-joining tree above 
  4. Use sapply() to make your isolate_legend in the same order as your my_tree$tip.label.
  5. We provided color hex codes in Rush_KPC_facility_codes.txt so everyone’s facilities are labeled with the same color. Use the following commands to extract the colors from the metadata and create your legend colors.
colors = metadata[,c('Color', 'Facility')] # metadata = whatever your variable from reading in the annotation file from step 1 is called 
colors = colors[!duplicated(metadata[,c('Color', 'Facility')]),]
isolate_colors = structure(colors[,'Color'], names = colors[,'Facility'])
  6. Use plot(), tiplabels() and legend() to plot your tree with metadata overlaid as we did previously. (A condensed sketch of steps 1-4 follows below.)
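
Here is a condensed sketch of steps 1-4 (the ID column name and the file path are assumptions; the Color and Facility columns come from the hints above, and my_tree is your neighbor-joining tree from the previous exercise):

# load the packages used this morning
library(ape)
library(phytools)

# read the comma-delimited facility annotation file
metadata = read.table('~/Desktop/micro612/day3pm/Rush_KPC_facility_codes.txt', header = TRUE, sep = ',', stringsAsFactors = FALSE)

# drop tree tips without metadata, then midpoint root for visualization
my_tree = drop.tip(my_tree, setdiff(my_tree$tip.label, metadata$ID))
my_tree = midpoint.root(my_tree)

# build the facility legend in tree-tip order, as in the MRSA exercise
isolate_legend = sapply(my_tree$tip.label, function(id){ metadata$Facility[metadata$ID == id] })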

To visualize the data on the tree better, you can use the argument show.tip.label = FALSE in plot() to remove the tree tip labels from the plot. You can also play around with the tree layout type (e.g. phylogram, fan, etc) using the argument type. You can change the placement of the colored tip labels by changing the number value of the parameter adj in tiplabels().

After you’ve overlayed facility metadata on your tree, answer the following questions:


Which facilities appear to have a lot of intra-facility transmission based on grouping of isolates from the same facility? 
Which patient’s infections might have originated from the blue facility?

Helpful resources for microbial genomics

If you were not able to follow the video, here is the link to Illumina sequencing

[HOME]

General Bioinformatics resources

Short read processing

Genome assembly

Genome alignment

Visualization of genomic data

Genome annotation

Phylogenetic tools and resources

Databases

Video Resources you should watch and follow

101 Questions: a series of interviews with notable bioinformaticians

Bioinformatics is just like bench science and should be treated as such

Unix/Command line