UPDATE: I fixed a few bugs in the script and it should be running better now. Also you can run it from sam files or a counts table now, if you reads have already been mapped.
No time to blog these days. Big changes coming. Anyway, I just uploaded a script I wrote so the lab could do primary RNA-seq analysis pretty easily.
I am a fan of the DESeq/edgeR counts approach to differential expression. Conversion to RPKM values before calculation of differential gene expression, as is done in Cufflinks, seems like it adds unnecessary complexity to the problem and potentially clouds the results. Making counts tables for DESeq is a little laborious and mistake prone if it is not scripted so that was my main reason to get this script up and running. Also, everything R is a little rough for beginners, so I wanted make things a little easier.
As I keep saying, I’m no bioinformatician so keep that in mind but it’s pretty basic stuff here, so I think it is all good.
The script can be found here:
From the readme file from the script:
ezDESeq.sh Version April 2, 2012
This script runs DESeq in a nice easy automated way. It does the following:
1) Maps reads to the genome with Tophat.
2) Removes reads mapped to chrM.
3) Assigns mapped reads genes using HTseq-count.
4) Merges HTseq-count output into a counts table.
5) Runs DESeq using the script DESeqWrapper.R.
6) Runs goseq using the script GOseqWrapper.R.
7) Makes bedGraph files for visualization on a genome browser.
To run ezDESeq.sh first you must install the appropriate dependencies and download the appropriate reference genomes and deposit them in the correct location.
1) Install samtools. It must be in the PATH.
2) Install bedtools. It must be in the PATH.
3) Install the latest version of R (it is preinstalled on Macs).
4) Install HTseq. http://www-huber.embl.de/users/anders/HTSeq/doc/index.html
5) Install the R/Bioconductor packages: a) DESeq b) goseq
6) Unzip and untar the file ezDESeq.tar.gz with the command:
tar zxvf ezDESeq.tar.gz
(if you are reading this you have already done this)
7) Prepare reference genomes.
a) Place the directory ‘genomes’ in the root directory.
b) Download iGenomes for hg19, mm9 or whatever species you want. http://tophat.cbcb.umd.edu/igenomes.html
c) Uzip the iGenome and place the directory ‘hg19’ (or whatever genome you are using) it in the /genomes directory. If I remember correctly, the hg19 folder is inside a folder called hg19. Whatever it is, you should have check that the references have the following paths:
Bowtie Index: /genomes/hg19/Sequence/BowtieIndex/genome
GTF file: /genomes/hg19/Annotation/Archives/archive-current/Genes/genes.gtf
Genome Index file: /genomes/hg19/Sequence/WholeGenomeFasta/genome.fa.fai
Obviously ‘hg19’ in the paths above is replaced by the reference genome you are using.
8) Place the directory NGS_Scripts in your root directory.
To actually run ezDESeq.sh:
1. Create a directory with the name of your experiment, e.g. ‘malakia’ or whatever you like.
2. In the experiment directory (e.g malakia) create two more directories. One for each condition, e.g. ‘control’ and ‘treated’. You can use whatever two names you like.
3. Place your .fastq files (gziped is ok) in either the control or treated directories. For example, the path to your fastq files should be like this:
4. Open a terminal and change directories using the ‘cd’ command so that you are in the experiment directory (e.g. malakia).
5. Type sh /NGS_Scripts/ezDESeq.sh in the command line and follow the prompts.
1. If you do not have access to your root directory you can run ezDESeq.sh from your home directory but you must do the following:
a) Create the file structure as described above except place the directories in your home directory and not your root directory.
b) Go through the script ezDESeq.sh and changed every place where ‘/genomes/…’ is written with ‘~/genomes/…’ and replace ‘/NGS_Scripts/…’ with ‘~/NGS_Scripts/…’
2. Other genomes besides hg19 and mm9 can easily be added. The only thing you have to do is download the iGenome, place it in the genomes directory and when prompted by the script type in the name of the directory that contains your genomes, e.g. hg18. You can ignore the fact that it does not suggest the genome you want to use.