More thoughts on the TruSeq RNA Sample Prep Kit

March 12, 2012

I thought I would give a little rundown on the data I am getting back using Illumina’s TruSeq RNA Sample Preparation Kit v2.

I aligned the reads to a set of common contaminating or abundant sequences found in RNA-seq data sets. This is the data from one sample, but all the samples follow pretty much the same profile:

Sequence                           Number of reads   Percent of reads
Total number of reads                      9205240                100
TruSeq Index Adapter reads                      38                  0
TruSeq Universal Adapter reads              572352                  6
chrM reads                                  326135                  3
ribosomal DNA reads                          72997                  0
5S DNA reads                                   632                  0
phiX174 reads                                    1                  0
polyA reads                                      0                  0
polyC reads                                     16                  0

So here it looks pretty good. About 6% of the reads are aligning to adapter sequence (a little high but not much of an issue) and another 3% to mitochondrial sequences. rDNA aligning reads are minimal. So that leaves about 90% to stuff we in theory want.
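As a quick check on the "about 90%" figure, here is a minimal Python sketch that recomputes the percentages from the counts in the table. The counts come straight from the table; the assumption that each read is counted in at most one category is mine, and the `percent` helper is only illustrative.

```python
# Read counts copied from the table above (one sample).
total_reads = 9205240
counts = {
    "TruSeq Index Adapter": 38,
    "TruSeq Universal Adapter": 572352,
    "chrM": 326135,
    "ribosomal DNA": 72997,
    "5S DNA": 632,
    "phiX174": 1,
    "polyA": 0,
    "polyC": 16,
}


def percent(n, total):
    """Return n as a percentage of total."""
    return 100.0 * n / total


percents = {name: percent(n, total_reads) for name, n in counts.items()}

# Assumption: no read is counted in more than one category,
# so the categories can simply be summed.
contaminant_reads = sum(counts.values())
remaining_reads = total_reads - contaminant_reads
remaining_percent = percent(remaining_reads, total_reads)

for name, p in percents.items():
    print(f"{name:26s} {counts[name]:>9d} {p:6.2f}%")
print(f"{'remaining':26s} {remaining_reads:>9d} {remaining_percent:6.2f}%")
```

The table lists whole-number percentages, while this sketch prints two decimals, so small categories such as ribosomal DNA show up as a fraction of a percent rather than 0.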

I then took the original FASTQ file from the HiSeq and adapter/quality trimmed it with Trimmomatic. This cut about 17% of the reads from the original FASTQ file. I then aligned the trimmed FASTQ file to hg19 with Bowtie. Since it was a 50 bp single-end run, Bowtie should give a good representation of the alignable reads. Regardless, 62% of the reads mapped, which is pretty good. Of the mapped reads, 90% mapped to exons (not including chrM). So the bottom line is that from a start of 9.2 million reads, about 3.8 million (41%) mapped to traditional mRNAs. Aligning with Tophat may bump that up a tiny bit.

The sequence quality in the FASTQC reports was good, but the ‘per base sequence content’ (see the FASTQC page for more on this) looks a little funny.

Given the moderate level of adapter contamination, it was pretty obvious that at least part of this is from adapter dimer sequences. So I removed the reads containing the TruSeq Universal Adapter sequence using the fastx_clipper program from the FASTX toolkit:
fastx_clipper -Q33 -a AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT -C -v -i RNA_C8R3_pos.fastq -o RNA_C8R3_pos.trimmed.fastq
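To make the idea concrete, here is a rough Python sketch of "drop every read that contains adapter sequence". It is not how fastx_clipper works internally; it is a simplified stand-in that looks for exact matches of short adapter windows. The function names, the seed length of 12, and the two toy reads are all made up for illustration. Only the adapter sequence is taken from the command above.

```python
from io import StringIO

# Adapter sequence taken from the fastx_clipper command above.
UNIVERSAL_ADAPTER = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        line = handle.readline()
        if not line:
            return
        header = line.rstrip("\n")
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def contains_adapter(seq, adapter=UNIVERSAL_ADAPTER, seed=12):
    """True if any exact `seed`-length window of the adapter occurs in seq.

    Assumption: exact matching of short windows, so reads with
    mismatches inside the adapter can be missed.
    """
    windows = (adapter[i:i + seed] for i in range(len(adapter) - seed + 1))
    return any(w in seq for w in windows)


def drop_adapter_reads(handle):
    """Return (records without adapter, number of records dropped)."""
    kept, dropped = [], 0
    for record in read_fastq(handle):
        if contains_adapter(record[1]):
            dropped += 1
        else:
            kept.append(record)
    return kept, dropped


# Toy input: read1 is the last 33 bases of the adapter, read2 is unrelated.
toy_records = [
    ("@read1", UNIVERSAL_ADAPTER[25:]),
    ("@read2", "GCTAGCTAGGATCCATGCAAGTCGATCGTAGCTAGCTAGGCTAAC"),
]
toy_fastq = "".join(f"{h}\n{s}\n+\n{'I' * len(s)}\n" for h, s in toy_records)

kept, dropped = drop_adapter_reads(StringIO(toy_fastq))
print(f"kept {len(kept)} read(s), dropped {dropped}")
```

On a real FASTQ file the same functions could be pointed at an open file handle instead of the `StringIO` toy input, though for millions of reads a dedicated tool is the sensible choice.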

Side note: fastx_clipper is way too slow for general use, but seems to be more effective at removing adapter sequences than Trimmomatic. Looking deeper into this would take more time than I have now.

Anyway, after removing sequences with the Universal Adapter sequence, the per base sequence content looks much better after 13 bp.

But what is up with the first 13 bp? This has been really bugging me, so I finally spent some time looking into it. As it turns out, there is a publication on this phenomenon (Biases in Illumina transcriptome sequencing caused by random hexamer priming). Apparently this bias is caused by priming with random hexamers. It is well known that priming with random hexamers results in bias, so it would not be unexpected to see biases in per base nucleotide content in the first 6 bases of the read. What is strange is that this bias continues until position 13. The authors of the paper are as stumped as I am as to exactly why this happens. Anyway, the important point is that this does not reflect errors in sequencing, so there is no need to trim these bases.
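For anyone wondering what the ‘per base sequence content’ plot is actually measuring, here is a minimal Python sketch that tabulates the fraction of A, C, G and T at each of the first 13 read positions. The function name and the four toy reads are invented for illustration; this is not FASTQC's own code, just the same basic idea.

```python
from collections import Counter


def per_base_content(reads, n_positions=13):
    """Return one dict per position with the fraction of A, C, G and T.

    Other symbols (such as N) count toward the position total but are
    not reported, so the four fractions can sum to less than 1.
    """
    counters = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions]):
            counters[i][base] += 1
    table = []
    for counter in counters:
        total = sum(counter.values())
        table.append({b: (counter[b] / total if total else 0.0) for b in "ACGT"})
    return table


# Toy reads, made up for illustration only.
toy_reads = [
    "ACGTACGTACGTACGT",
    "ACGGACGTTCGTACGA",
    "TCGTACGAACGTACGT",
    "ACGTCCGTACGAACGT",
]

content = per_base_content(toy_reads)
for pos, fractions in enumerate(content, start=1):
    print(pos, " ".join(f"{b}={fractions[b]:.2f}" for b in "ACGT"))
```

With unbiased reads the four fractions should stay roughly constant from one position to the next. The priming bias shows up as fractions that jump around in the first 13 positions and then settle down.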

The final thing, which is a little strange, is that the contaminating adapter sequence is the Universal Adapter sequence. When self-ligated adapter dimers form during library preparation and are subsequently PCR amplified, you get the Universal Adapter sequence fused to the Index Adapter sequence. The sequencing primer anneals to the Universal Adapter sequence and thus should result in the sequencing of the Index Adapter. This is what I see when there is some adapter-dimer contamination in my ChIP-seq library preparation protocol (which is always under 1%). But here the contaminating sequence is the Universal Adapter sequence. How this happens I have no idea. I think there is something about the TruSeq kit Illumina is not letting out.

All in all, apart from a couple peculiarities and some minor modifications to the protocol, the kit seems to produce good data, albeit without strand information.

And bottom bottom line, there are better things to do with a Sunday evening than over QC’ing RNA-seq data.

UPDATE: Just a quick little update. I got some more data back from libraries made with the TruSeq RNA kit and it had essentially zero adapter contamination. I don’t know what happened with the previous samples. The libraries we got back were made by a talented Ph.D. student instead of me, so maybe that had something to do with it. Or it could be that we are using a new batch of AMPure beads. Or maybe it was something to do with the sequencing run. Who knows, who cares. Anyway, the data is looking much better. The percentage of good reads, i.e. high quality and mapping to mRNAs, is up significantly. So I’d say I’m much happier with the performance of the kit than I was with the first round of samples.

6 Comments
  1. Ian

    I’ve noticed the funky 13-base thing too – and also noticed that more often than not, trimming some or all of it off has modestly improved the percentage of reads that actually align to anything.

  2. Neetha

    Hi Ethan,
    Great blog! Wanted to know more about your experiences with the TruSeq v2 kit. I did a set of sequencing (not with TruSeq) and have had a lot of problems replicating it through RT-PCR. We have to redo the sequencing, and I was wondering if you could give more insight into your sample variation (between biological replicates) which might be protocol dependent. Thanks!

    • ethanomics

      Hi Neetha,
      Thanks for the compliments. My experience with the TruSeq RNA Sample Prep kit has been very positive. We have made over a hundred libraries in the last couple of months and they have all worked really well. Importantly, it works for all the first-time users as well. The validation on a technical level seems to be spot on, i.e. when we do RT-qPCR on the same RNA samples that we sequenced, the data matches perfectly with the RNA-seq data.

      As far as biological variation, that really depends on the system and happens before you start your experiment, so the method of sample and library preparation is not going to be an issue. Systems with a lot of variance are going to need a lot of replicates. Somehow I got involved in this Euro FP7 project called PreDicta and we will be doing some RNA-seq for the project along with some other stuff. But anyway, we will be looking at a bunch of clinical samples. From some of the preliminary gene-level data I have seen, it is clear that we are going to need a lot of replicates to get anything out of it.

      The only issue I see with the TruSeq RNA kit is that it appears Illumina cannot keep up with demand and it seems to be perpetually back ordered. And of course that it is not strand-specific. I’ve wanted to adapt it to a UDG strand-specific method but it appears I will leave Greece before the UDGase I ordered ever comes.

      Ethan

  3. Rens

    Hi Ethan,
    Maybe a nice addition to your observations: I’m working on a de novo assembly project with data sequenced in 2011. Since this is DNA and not RNA like in your story, you might expect little overlap; however, I am experiencing the exact same problem of per base sequence variation in the first 10 (or so) bases. As in your story, I have narrowed it down to a bit of sequence from the universal adapter. Any thoughts?

  4. ethanomics

    I have about as much insight as the paper I cited above, which is none. It’s really weird and hard to explain.

    I think in the paper they attributed it to random priming (which is so weird it makes me suspicious that they are right) but since DNA libraries are not random primed it seems like there is another source.

    Strange weird stuff. One of those things that make me a little curious and annoyed.

    • Rens

      At least it’s good to know I’m not the only one confused/annoyed. Thanks for the quick response!
