Skip to content

ezDESeq

April 19, 2012

UPDATE: I fixed a few bugs in the script and it should be running better now.  Also you can run it from sam files or a counts table now, if you reads have already been mapped.

No time to blog these days. Big changes coming. Anyway, I just uploaded a script I wrote so the lab could do primary RNA-seq analysis pretty easily.

I am a fan of the DESeq/edgeR counts approach to differential expression. Conversion to RPKM values before calculation of differential gene expression, as is done in Cufflinks, seems like it adds unnecessary complexity to the problem and potentially clouds the results. Making counts tables for DESeq is a little laborious and mistake prone if it is not scripted so that was my main reason to get this script up and running.  Also, everything R is a little rough for beginners, so I wanted make things a little easier.

As I keep saying, I’m no bioinformatician so keep that in mind but it’s pretty basic stuff here, so I think it is all good.

The script can be found here:

https://sourceforge.net/projects/ethanomics/files/

From the readme file from the script:

ezDESeq.sh Version April 2, 2012

This script runs DESeq in a nice easy automated way. It does the following:

1) Maps reads to the genome with Tophat.

2) Removes reads mapped to chrM.

3) Assigns mapped reads genes using HTseq-count.

4) Merges HTseq-count output into a counts table.

5) Runs DESeq using the script DESeqWrapper.R.

6) Runs goseq using the script GOseqWrapper.R.

7) Makes bedGraph files for visualization on a genome browser.

To run ezDESeq.sh first you must install the appropriate dependencies and download the appropriate reference genomes and deposit them in the correct location.

1) Install samtools. It must be in the PATH.

2) Install bedtools. It must be in the PATH.

3) Install the latest version of R (it is preinstalled on Macs).

4) Install HTseq. http://www-huber.embl.de/users/anders/HTSeq/doc/index.html

5) Install the R/Bioconductor packages: a) DESeq b) goseq

6) Unzip and untar the file ezDESeq.tar.gz with the command:
tar zxvf ezDESeq.tar.gz
(if you are reading this you have already done this)

7) Prepare reference genomes.

a) Place the directory ‘genomes’ in the root directory.

b) Download iGenomes for hg19, mm9 or whatever species you want. http://tophat.cbcb.umd.edu/igenomes.html

c) Uzip the iGenome and place the directory ‘hg19’ (or whatever genome you are using) it in the /genomes directory. If I remember correctly, the hg19 folder is inside a folder called hg19. Whatever it is, you should have check that the references have the following paths:

Bowtie Index: /genomes/hg19/Sequence/BowtieIndex/genome

GTF file: /genomes/hg19/Annotation/Archives/archive-current/Genes/genes.gtf

Genome Index file: /genomes/hg19/Sequence/WholeGenomeFasta/genome.fa.fai

Obviously ‘hg19’ in the paths above is replaced by the reference genome you are using.

8) Place the directory NGS_Scripts in your root directory.

To actually run ezDESeq.sh:

1. Create a directory with the name of your experiment, e.g. ‘malakia’ or whatever you like.

2. In the experiment directory (e.g malakia) create two more directories. One for each condition, e.g. ‘control’ and ‘treated’. You can use whatever two names you like.

3. Place your .fastq files (gziped is ok) in either the control or treated directories. For example, the path to your fastq files should be like this:

/pathToYourExpDirectory/malakia/control/yourfiles.fastq /pathToYourExpDirectory/malakia/treated/yourfiles.fastq

4. Open a terminal and change directories using the ‘cd’ command so that you are in the experiment directory (e.g. malakia).

5. Type sh /NGS_Scripts/ezDESeq.sh in the command line and follow the prompts.

Other stuff:
1. If you do not have access to your root directory you can run ezDESeq.sh from your home directory but you must do the following:
a) Create the file structure as described above except place the directories in your home directory and not your root directory.
b) Go through the script ezDESeq.sh and changed every place where ‘/genomes/…’ is written with ‘~/genomes/…’ and replace ‘/NGS_Scripts/…’ with ‘~/NGS_Scripts/…’
2. Other genomes besides hg19 and mm9 can easily be added. The only thing you have to do is download the iGenome, place it in the genomes directory and when prompted by the script type in the name of the directory that contains your genomes, e.g. hg18. You can ignore the fact that it does not suggest the genome you want to use.

About these ads

From → Posts

20 Comments
  1. drosoform permalink

    Dear ETHANol,
    Thank you for providing this script! I haven’t used it yet but it looks like it would be very useful – just looking at the steps you listed is helpful.

  2. ethanomics permalink

    Glad to help out. There are two ways to do science. Keep everything to yourself and then publish your data (the fear driven model). Or put everything you know that is interesting or helpful out there for everyone to see (the sharing model). While sometimes the fear driven model is necessary, for the most part following the sharing model will increase your positive impact on science, which I believe is the best way to judge the work of the academic scientist. The more you share, the more influence you have. The more influence you have, the more power you have……. So the next time you run across a scientist that doesn’t share, just laugh because in the long run it’s their career they are ruining, not yours.

    Back to the point, while it’s a nice automated pipeline for a vanilla DESeq run, you are right it is probably more useful in a sense that it gives an idea things you can do and maybe a little code to copy-paste. Everything is different so you have to know what to modify to make stuff do what you want it to do.

  3. newbie permalink

    Hi, Thank you very much for your script.
    I was looking at your script to follow what is being done and I saw this “sed -n -e :a -e ’1,5!{P;N;D;};N;ba’ ” commend right before you merge the tables. Can you explain what this does? I know very little about sed.

    Thank you.

  4. ethanomics permalink

    It removes the last 5 lines of the output from HTseq-count, which are the mapping stats. Thus, should not be included in the counts table. Earlier in the script it writes these lines to the terminal so you can see what they are.

    Just for the record, I don’t understand the code any better then you. I got it from sed one-liners.
    http://sed.sourceforge.net/sed1line.txt
    http://www.catonmat.net/blog/sed-one-liners-explained-part-one/

  5. newbie permalink

    I have htseq-count output files named c1_o.count.txt c2s_o.count.txt c3s_o.count.txt c4s_o.count.txt and I want to use the merge script of yours and it complained as follows

    Error in read.table(file = file, header = header, sep = sep, quote = quote, :
    more columns than column names

    Is this because I did not run the see commend? I looked at the htseq-count outputs and did not see any stats so I just tried the merge. Out put does not have any column header so is this why it’s complaining?
    Thank you.

  6. ethanomics permalink

    Yeah the files need headers and the column that has the gene names MUST be called ‘gene’. The column with the counts can be called whatever you like.

    Or you can alter this line of the script ‘multimergeRNA.R’:
    Reduce(function(x,y) {merge(x,y,all=TRUE,by=”gene”,sort=FALSE)}, datalist)

    by=”gene” specifies the column name by which the rows are aligned.

    You probably also want to get rid of the last few lines of the HT-seq count output, which gives the alignment statistics.

  7. Hi, May you send me a copy of your ezdeseq.sh, I can’t find it in your links. Thanks!!

  8. ethanomics permalink

    Hi, It’s here:
    http://sourceforge.net/projects/ethanomics/files/ethanomicsNGS.tgz/download
    location subject to change but should can always be found by the scripts link above.

    Let me know if you have any problems getting it up and running.

    Ethan

  9. Kevin permalink

    Dear Ethan,

    thanks a lot for this very nice script. How does it deal with paired-end sequencing?

    Thanks!

    Kevin

    • ethanomics permalink

      Hi Kevin,
      The simple answer is it doesn’t (I’ve never worked with paired end data), but it should be really easy to modify to take paired end data. There’s a guy here that is running it with paired end data and he’s a beginner. I’ll take a look into in on Monday (which probably really means Tuesday or Wednesday these days). I think if you have SAM files you should just be able to run it, although the SAM files need to be sorted as described in the HTseq-count documentation.

      Ethan

  10. Kevin permalink

    Hi Ethan,
    did you have time to look into using your script with paired-end data?

    Thanks a lot!

    Kevin

  11. BrianM permalink

    Thanks for the great scripts. I wanted to add that if you’re using Ubuntu (12.04) you’ll need to change sh default from dash to bash to get the scripts to work.

    sudo update-alternatives –install /bin/sh sh /bin/bash 100

    • ethanomics permalink

      Glad you found them helpful!! It seems they annoyingly need modification for every system.

  12. Ali permalink

    HI
    If I already have tophat output, is there any way to feed it to your script then?

  13. Dear Ethan,

    greetings from Greece. We have are using the ezDESeq script for zebrafish reads. So far the scripts makes the sam files, but then it gives me this error message:

    23700000 sam lines processed.
    23800000 sam lines processed.
    23900000 sam lines processed.
    23959636 sam lines processed.
    no_feature 1423910
    ambiguous 12315994
    too_low_aQual 0
    not_aligned 0
    alignment_not_unique 0
    -e

    -e merging data sets into counts table in R for MUT
    [1] “/Users/claudia/Documents/RNA-Seq_Analysis/hearts/MUT/countstable”
    Error in fix.by(by.x, x) : ‘by’ must specify uniquely valid column(s)
    Calls: multimerge -> Reduce -> f -> merge -> merge.data.frame -> fix.by
    Execution halted

    Does this mean that there is a problem with the sorted bam file?

    Cheers,
    C.

    • ethanomics permalink

      Hi Claudia,
      I’ve altered the scripts so they run on our Linux server here in Australia. To run them on Mac you have to open the script with a text editor (just use the Mac X-code program) and do a find-replace.
      Fine = echo -e
      Replace = echo

      If it still doesn’t work e-mail me the output file countstable.txt and we can go from there. Because debugging is what I like to do on a Friday evening.

      Are you guys making TALENs yet?

      Ethan

  14. ethanomics permalink

    Hi Ali, Yes it takes sam and bam files. It may take a little tweaking to get it up and running on you system. If you get errors let me know.

  15. Hello,
    Many thanks for the wonderful guide to using DeSeq!
    I was wondering if you tried to wrap DeSeq2 recently.
    Please let me know if you have.
    Best,
    Menorca

  16. ethanomics permalink

    Hi Menorca,
    As it turns out, that is what I spent a good chunk of my day today doing. Looks like it’s running error free. I’ll e-mail you the code shortly. DESeq2 appears to have a lot more flexibility so I’m just going with a simple pairwise workflow at the moment.

    Ethan

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 36 other followers

%d bloggers like this: