
Homemade AMPure XP beads

Certainly not my protocol, and the guy who wrote it says it certainly isn’t his either. But it’s about time someone put this together so we can all start saving a lot of money.

Anyway, the protocol was nicely written in plain English by these guys (thanks!!!!):

B. Faircloth & T. Glenn, November 19, 2011, Ecol. and Evol. Biology, Univ. of California – Los Angeles

From this publication: Cost-effective, high-throughput DNA sequencing libraries for multiplexed target capture

Which of course came from some previous publications; see the protocol. It is here: Homemade AMPure XP beads


And a big thanks from the community to the above contributors!!!!!


Ethanomics is Back!!!

After totally abandoning the blog for the last few months, I’m going to start posting again. Trying to finish up work in Greece and arrange the move to Australia was just way too much. It would have all worked out if US immigration had not lost my passport. And it still would have all been good if DHL had not lost my RNA samples and a TruSeq RNA kit. Apart from the move, I have mostly been doing data analysis these past months, and I never really feel that what I am doing on the bioinformatics side is innovative or interesting enough to post. When it comes to data analysis I am mostly the guy asking the stupid questions. I like to think of bioinformatics as BIOinformatics, with a big emphasis on the bio side, and I try not to push too much on the technical side.

And here’s a shameless plug for where I’ll be calling home for a while, i.e. the Ryan Lister Lab at the University of Western Australia Center for Plant Energy Biology.  If you are smart, creative and not too big of a jerk there is empty bench space and every piece of equipment in the lab is brand spanking new.

And here is a little photo from my scientific past.  I found it hanging on the wall in the office of my undergraduate advisor, Manny Ares at UCSC. Click on his name if you want to learn more about splicing or how the waves are in Santa Cruz.  One of my best gels ever, so I was stoked to see it was still alive.  For those interested, it is a 2-D denaturing (urea) acrylamide gel showing that our inside out group I intron RNA cyclase was indeed producing non-linear RNA (Ford and Ares 1994).  And yes that is a Tom Cech autographed Group I intron at the top above my head.  That’s some good stuff.

Science Down Under

In July I will be joining the lab of Ryan Lister (a.k.a. the King of the Methylome) at the University of Western Australia in the Center for Plant Energy Biology. I will be the first to admit that I know very little about plants and have never really had much of an interest in plants. But going back to my undergraduate days, I have always been interested in the basic biology of gene regulation and epigenetics. The organism, disease model or whatever has always been an excuse or tool.  I think it is going to be a great opportunity to do some really good science with some good people. The only thing I am a little nervous about is the supply chain. The time it takes to get reagents here in Greece is way too long. It won’t be like Boston but I don’t think it will be anything like the bureaucratic headaches I have encountered here in Greece.

Anyway, the point I wanted to make was that there is some good science going on in Australia. The economy is doing much better than the US and Europe. The politicians do not appear to be dead set on slashing spending for everything except the military (US) or on counterproductive austerity measures (US and Europe – I’m a big Paul Krugman fan). There are talented scientists. And, importantly, the pay scale is better for scientists than in the US. So I would suggest taking a serious look at Australia; I did, and I think it is going to be a good move. And if all else fails, I can always work in the mines there.


UPDATE: I fixed a few bugs in the script and it should be running better now. Also, you can now run it from sam files or a counts table, if your reads have already been mapped.

No time to blog these days. Big changes coming. Anyway, I just uploaded a script I wrote so the lab could do primary RNA-seq analysis pretty easily.

I am a fan of the DESeq/edgeR counts approach to differential expression. Conversion to RPKM values before calculation of differential gene expression, as is done in Cufflinks, seems like it adds unnecessary complexity to the problem and potentially clouds the results. Making counts tables for DESeq is a little laborious and mistake-prone if it is not scripted, so that was my main reason to get this script up and running. Also, everything R is a little rough for beginners, so I wanted to make things a little easier.

As I keep saying, I’m no bioinformatician so keep that in mind but it’s pretty basic stuff here, so I think it is all good.

The script can be found here:

From the script’s readme file: Version April 2, 2012

This script runs DESeq in a nice easy automated way. It does the following:

1) Maps reads to the genome with Tophat.

2) Removes reads mapped to chrM.

3) Assigns mapped reads to genes using HTseq-count.

4) Merges HTseq-count output into a counts table.

5) Runs DESeq using the script DESeqWrapper.R.

6) Runs goseq using the script GOseqWrapper.R.

7) Makes bedGraph files for visualization on a genome browser.
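Step 4 above, merging the per-sample HTseq-count outputs into a single counts table, is the most mistake-prone part to do by hand. A minimal sketch of that merge in Python (not the script itself; the function names and file names are mine), assuming the standard two-column HTSeq-count output of gene ID, tab, count:

```python
import csv
from collections import OrderedDict

def merge_htseq_counts(count_files, sample_names):
    """Merge per-sample HTSeq-count outputs (gene<TAB>count)
    into one table keyed by gene ID."""
    table = OrderedDict()
    for fname, sample in zip(count_files, sample_names):
        with open(fname) as fh:
            for line in fh:
                gene, count = line.rstrip("\n").split("\t")
                # HTSeq-count appends summary lines like __no_feature; skip them
                if gene.startswith("__"):
                    continue
                table.setdefault(gene, {})[sample] = int(count)
    return table

def write_counts_table(table, sample_names, out_path):
    """Write a tab-delimited counts table: one row per gene,
    one column per sample (missing genes get 0)."""
    with open(out_path, "w") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(["gene"] + sample_names)
        for gene, counts in table.items():
            writer.writerow([gene] + [counts.get(s, 0) for s in sample_names])
```

DESeq (and edgeR) want exactly this kind of raw-count matrix, which is why the script builds it rather than RPKM values.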

To run it, you must first install the appropriate dependencies, download the appropriate reference genomes, and deposit them in the correct locations.

1) Install samtools. It must be in the PATH.

2) Install bedtools. It must be in the PATH.

3) Install the latest version of R (it is preinstalled on Macs).

4) Install HTseq.

5) Install the R/Bioconductor packages: a) DESeq b) goseq

6) Unzip and untar the file ezDESeq.tar.gz with the command:
tar zxvf ezDESeq.tar.gz
(if you are reading this you have already done this)

7) Prepare reference genomes.

a) Place the directory ‘genomes’ in the root directory.

b) Download iGenomes for hg19, mm9 or whatever species you want.

c) Unzip the iGenome and place the directory ‘hg19’ (or whatever genome you are using) in the /genomes directory. If I remember correctly, the hg19 folder is inside a folder called hg19. Whatever it is, you should check that the references have the following paths:

Bowtie Index: /genomes/hg19/Sequence/BowtieIndex/genome

GTF file: /genomes/hg19/Annotation/Archives/archive-current/Genes/genes.gtf

Genome Index file: /genomes/hg19/Sequence/WholeGenomeFasta/genome.fa.fai

Obviously ‘hg19’ in the paths above is replaced by the reference genome you are using.

8) Place the directory NGS_Scripts in your root directory.
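Before launching the pipeline, it is worth checking that the reference files from step 7 actually ended up at the expected paths. A small sketch of such a check (my own helper, not part of the script), assuming a Bowtie 1 index whose files end in .ebwt:

```python
import os

# Expected files under /genomes/<genome>/, per the paths listed above.
# Assumption: the Bowtie index prefix 'genome' expands to files like
# genome.1.ebwt for Bowtie 1.
REFERENCE_FILES = [
    "Sequence/BowtieIndex/genome.1.ebwt",
    "Annotation/Archives/archive-current/Genes/genes.gtf",
    "Sequence/WholeGenomeFasta/genome.fa.fai",
]

def missing_references(genomes_root, genome):
    """Return the expected reference files that are missing for a
    genome (e.g. 'hg19') under the given genomes directory."""
    base = os.path.join(genomes_root, genome)
    return [p for p in REFERENCE_FILES
            if not os.path.exists(os.path.join(base, p))]
```

An empty list means the iGenome landed where the script expects it; anything else tells you which download or move step went wrong.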

To actually run

1. Create a directory with the name of your experiment, e.g. ‘malakia’ or whatever you like.

2. In the experiment directory (e.g malakia) create two more directories. One for each condition, e.g. ‘control’ and ‘treated’. You can use whatever two names you like.

3. Place your .fastq files (gzipped is ok) in either the control or treated directories. For example, the paths to your fastq files should look like this:

/pathToYourExpDirectory/malakia/control/yourfiles.fastq
/pathToYourExpDirectory/malakia/treated/yourfiles.fastq

4. Open a terminal and change directories using the ‘cd’ command so that you are in the experiment directory (e.g. malakia).

5. Type sh /NGS_Scripts/ in the command line and follow the prompts.

Other stuff:
1. If you do not have access to your root directory you can run from your home directory but you must do the following:
a) Create the file structure as described above except place the directories in your home directory and not your root directory.
b) Go through the script and change every place where ‘/genomes/…’ is written to ‘~/genomes/…’, and replace ‘/NGS_Scripts/…’ with ‘~/NGS_Scripts/…’
2. Other genomes besides hg19 and mm9 can easily be added. The only thing you have to do is download the iGenome, place it in the genomes directory and when prompted by the script type in the name of the directory that contains your genomes, e.g. hg18. You can ignore the fact that it does not suggest the genome you want to use.

More thoughts on the TruSeq RNA Sample Prep Kit

I thought I would give a little run down on the data I am getting back using Illumina’s TruSeq Sample Preparation Kit v2.

I aligned the reads to some known common contaminating/abundant sequence found in RNA-seq data sets. This is the data from one sample but they all follow pretty much the same profile:

Sequence                            Number of reads    Percent of reads
Total reads                         9205240            100
TruSeq Index Adapter reads          38                 0
TruSeq Universal Adapter reads      572352             6
chrM reads                          326135             3
Ribosomal DNA reads                 72997              0
5S DNA reads                        632                0
phiX174 reads                       1                  0
polyA reads                         0                  0
polyC reads                         16                 0

So here it looks pretty good. About 6% of the reads are aligning to adapter sequence (a little high but not much of an issue) and another 3% to mitochondrial sequences. rDNA aligning reads are minimal. So that leaves about 90% to stuff we in theory want.
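For the record, the percent column is just each count over the total number of reads, rounded down to a whole percent, which is why rDNA shows as 0 despite its ~73k reads:

```python
def percent_of_total(counts, total_reads):
    """Express each category's read count as a whole-number percentage
    of the total, rounding down (as in the table above)."""
    return {name: (100 * n) // total_reads for name, n in counts.items()}

# The three biggest contaminant categories from the table
counts = {
    "TruSeq Universal Adapter": 572352,
    "chrM": 326135,
    "ribosomal DNA": 72997,
}
pct = percent_of_total(counts, 9205240)
# pct == {'TruSeq Universal Adapter': 6, 'chrM': 3, 'ribosomal DNA': 0}
```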

I then took the original FASTQ file from the HiSeq and adapter/quality trimmed it with Trimmomatic. This cut about 17% of the reads from the original FASTQ file. I then aligned the trimmed FASTQ file with Bowtie to hg19. Since it was a 50 bp single-end run, Bowtie should give a good representation of alignable reads. Regardless, 62% of the reads mapped, which is pretty good. Of the mapped reads, 90% mapped to exons (not including chrM). So the bottom line is that from a start of 9.2 million reads, about 3.8 million (41%) mapped to traditional mRNAs. Aligning with Tophat may bump that up a tiny bit.

The sequence quality on the FastQC reports was good, but the ‘per base sequence content’ (see the FastQC page for more on this) looks a little funny.

Given the moderate level of adapter contamination, it was pretty obvious that at least part of this is from adapter dimer sequences. So I removed the sequences containing the TruSeq Universal Adapter using the FASTX-Toolkit fastx_clipper program.

Side note: fastx_clipper is way too slow for general use, but seems to be more effective at removing adapter sequences than Trimmomatic. Looking deeper into this would take more time than I have now.
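To make concrete what adapter clipping is doing, here is a toy clipper in Python. This is not the fastx_clipper or Trimmomatic algorithm (real clippers also tolerate mismatches and work on quality scores); it just cuts a read where the adapter, or an adapter prefix running off the 3’ end, is found:

```python
def clip_adapter(read, adapter, min_match=8):
    """Toy 3' adapter clipper: cut the read at the first full adapter
    occurrence, or where a long-enough adapter prefix runs off the
    end of the read. Real clippers also tolerate mismatches."""
    idx = read.find(adapter)              # full adapter inside the read
    if idx != -1:
        return read[:idx]
    for i in range(len(read) - min_match + 1):
        if adapter.startswith(read[i:]):  # adapter prefix at the 3' end
            return read[:i]
    return read

# Start of the Illumina adapter sequence that gets read through into
# the insert when fragments are short
ADAPTER = "AGATCGGAAGAGC"
```

A read that ends partway into the adapter still gets clipped, which is the case a naive exact-match search would miss.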

Anyway, after removing sequences with the Universal Adapter sequence, the per base sequence content looks much better after 13 bp.

But what is up with the first 13 bp? This has been really bugging me, so I finally spent some time looking into it. As it turns out, there is a publication on this phenomenon (Biases in Illumina transcriptome sequencing caused by random hexamer priming). Apparently this bias is caused by priming with random hexamers. It is well known that priming with random hexamers results in bias, so it would not be unexpected to see biases in per base nucleotide content in the first 6 bases of the read. What is strange is that this bias continues until position 13, and the authors of the paper are as stumped as I am as to exactly why this happens. Anyway, the important point is that this does not reflect errors in sequencing, so there is no need to trim these bases.

The final thing, which is a little strange, is that the contaminating adapter sequence is the Universal Adapter sequence. When self-ligated adapter dimers form during library preparation and are then PCR amplified, you get the Universal Adapter sequences fused to the Index Adapter sequences. The sequencing primer anneals to the Universal Adapter sequence and thus should result in the sequencing of the Index Adapter. This is what I see when there is some adapter-dimer contamination in my ChIP-seq library preparation protocol (which is always under 1%). But here the contaminating sequence is the Universal Adapter sequence. How this happens I have no idea. I think there is something about the TruSeq kit Illumina is not letting out.

All in all, apart from a couple peculiarities and some minor modifications to the protocol, the kit seems to produce good data, albeit without strand information.

And bottom bottom line, there are better things to do with a Sunday evening than over-QC’ing RNA-seq data.

UPDATE: Just a quick little update. I got some more data back from libraries made with the TruSeq RNA kit and it had essentially zero adapter contamination. I don’t know what happened with the previous samples. The libraries we got back were made by a talented Ph.D. student instead of me, so maybe that had something to do with it. Or it could be that we are using a new batch of AMPure beads. Or maybe it was something to do with the sequencing run. Who knows, who cares. Anyway, the data is looking much better. The percentage of good reads, i.e. high quality and mapping to mRNAs, is up significantly. So I’d say I’m much happier with the performance of the kit than with what we got from the first round of samples.

MeDIP-seq protocol results

I finally got the data back from my MeDIP-seq experiment, after months of waiting for everything and anything (special thanks to all our ‘wonderful’ Greek distributors of scientific products). Enough to make me pull my hair out and make my collaborator Andrija wonder if he’ll ever get his Ph.D. Anyway, the protocol itself seems to work really well. Below are a couple of UCSC genome browser screen shots. The first is of a methylated locus, and you see a nice peak in the middle of all the tracks. The second is the GAPDH promoter, and there is no signal despite the high GC content. So if you are looking for a good protocol for making TruSeq compatible/barcoded MeDIP-seq libraries, it is here: Barcoded MeDIP-seq protocol with TruSeq adapters.

Now on to the data analysis, where the tools appear to still be a little rough around the edges. And most importantly of all hoping that there is some biologically interesting data to get out of this experiment.

New protocols and scripts for ChIP

I just uploaded one of the first scripts I run when I get ChIP-seq data back.   It takes all the FASTQ files in a directory and maps them with Bowtie and eventually spits out a bedGraph file for visualization, as well as a lot of other sam, bam and bed files.  It’s all described in the readme.txt.  It is all located here: bowtie_wrapper_ChIP.

I also uploaded the native ChIP protocol I use. It’s not original material, but it is a good protocol nonetheless and can be found here: Native ChIP Protocol

Thank you blog readers

Just wanted to say thanks to all you people out there that click on my blog. I started the blog basically for myself. I assumed no one would read it: my mom wouldn’t understand it, and no one in the science world would find it. But here we are, a couple weeks into February, and the WordPress stats say there have been over 2700 page views. I’m not sure exactly what that means, but it seems like a lot to me. What’s kind of cool too is that you can look at the location of the IP addresses, and it is basically a map of where you can do genomics. Boston is the clear winner, which makes me a little nostalgic, but clearly you don’t have to be at the epicenter to have an impact.

Ethanomics gets a Sourceforge account

I have written some little scripts to manipulate NGS text files and streamline some NGS workflows, but there doesn’t appear to be a good way to post them on the blog, so I opened a Sourceforge account, which makes me laugh, because what on earth is a biologist doing with a Sourceforge account. Anyway, I’ll try to annotate them, write up some documentation and post them in the hopefully near future, if for no other reason than to keep me from forgetting what they do.

Library Quantitation: Science or Voodoo

If you call Illumina about problems with library quantitation, they won’t give you a straight answer. They aren’t hiding anything; they just don’t have a good answer. Library quantitation is a little bit science and a little bit voodoo.

The three main methods used to quantitate libraries are NanoDrop, Qubit and qPCR. NanoDrop is said to be sensitive to contaminants and thus not recommended. The Qubit HS DNA assay kit from LifeTech is highly specific to dsDNA, which is actually a problem, because significant portions of your library may be single-stranded, bubble-shaped molecules or daisy chains of molecules, owing to the high complexity of the library insert and the low complexity of the adapter sequences on the ends. This is especially a problem if your library is over-amplified (see the previous post on how to avoid this). Both of these methods quantitate total DNA and are not specific to your adapter-ligated DNA.

That leaves qPCR, which also has its disadvantages, but in my opinion they are smaller than those of the other two methods. The main problem with qPCR is that different libraries may amplify with different efficiencies and thus cause you to under- or over-estimate your library concentration. Ideally, your standards would be identical to your library, but that is obviously not possible. The easiest way to qPCR quantitate your libraries is to order the library quantification kit from KAPA Biosystems. The kit is basically serial dilutions of a library and some primers. So this is obviously something you can make yourself if you have a library whose concentration you are confident of (see my library quantitation protocol), but you probably would not be reading this post if you did.

So in conclusion this is my advice:

1) Quantitate your library with qPCR (this is the science half).
2) Since it is better to get fewer clusters of high quality than a lot of clusters with low quality, use a little less than the amount Illumina recommends.
3) After you run your samples make a note of whether your cluster density was too low or too high (this is the voodoo half).
a) If your cluster density was too high, next time around use less library.
b) If your cluster density was too low, next time around use more library.
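The science half is just standard-curve arithmetic: over the kit’s dilution series, Cq is linear in log10(concentration), so you fit the standards and interpolate your library’s Cq. A sketch with made-up numbers (a slope near -3.32 corresponds to ~100% amplification efficiency):

```python
import math

def fit_standard_curve(concentrations, cqs):
    """Least-squares fit of Cq = slope * log10(conc) + intercept
    over the serially diluted standards."""
    xs = [math.log10(c) for c in concentrations]
    n = len(xs)
    mx, my = sum(xs) / n, sum(cqs) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, cqs))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def library_concentration(cq, slope, intercept):
    """Invert the standard curve for an unknown library's Cq."""
    return 10 ** ((cq - intercept) / slope)

def efficiency(slope):
    """Amplification efficiency implied by the slope (1.0 = perfect
    doubling per cycle)."""
    return 10 ** (-1 / slope) - 1
```

The efficiency check is the part worth watching: if your library amplifies with a different efficiency than the standards, the interpolated concentration is off, which is where the voodoo half comes in.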

Final note: It is very important at the final step of your library preparation, after PCR amplification, that you purify your library with AMPure beads to remove the unused PCR primer. Unused PCR primer interferes with the binding of your library to the flow cell (A large genome center’s improvements to the Illumina sequencing system).

Final final note: In theory you can use the Bioanalyzer for library quantitation. Don’t do this. Use it to check the size of your libraries and that you have no adapter dimers, but don’t use it for quantitation.