Skip to content

Some thoughts on mapping short reads

January 24, 2012

I hinted in the last post that I would talk about mapping short reads to a reference genome. I am by no means an expert on this subject but there are still a couple points I would like to address. The first is what to do with reads that map to multiple locations on the reference genome. For most all applications the without question the best thing to do is to discard them, although this is not the default setting for Bowtie. If you think about RNA-seq, if you have a read that maps to two different locations there is a 50-50 chance of mapping it to the wrong location. For example, there are two theoretical genes that a read maps to, which we will call GAPDH and Oct-1. GAPDH is expressed at a 1000 fold higher level than Oct-1 so if we randomly assign this read to either gene a small increase in the GAPDH level will make it look like there is a large increase in the Oct-1 level. For DNA based methods like ChIP-seq and RRBS-seq you don’t have expression levels to make the situation worse but you are still adding noise to the system and perhaps producing inaccurate map of what is going on. Thus, it is important to throw away the desire the keep as many reads as possible and just keep the ones that map to the reference genome in a single location.

The second issue is more interesting and I don’t really have a good answer. Here are two alignments I did with Bowtie from a ChIP-seq experiment.

thanos-lab:Desktop ethanford$ bowtie -m 1 -S -v 1 mm9_bowtie_index/mm9 input_es_covaris.fastq input_es_covaris.sam

# reads processed: 25352110

# reads with at least one reported alignment: 18265448 (72.05%)

# reads that failed to align: 1819586 (7.18%)

# reads with alignments suppressed due to -m: 5267076 (20.78%)

Reported 18265448 alignments to 1 output stream(s)

thanos-lab:Desktop ethanford$ bowtie -m 1 -S -v 0 mm9_bowtie_index/mm9 input_es_covaris.fastq input_es_covaris1.sam

# reads processed: 25352110

# reads with at least one reported alignment: 18512820 (73.02%)

# reads that failed to align: 3427050 (13.52%)

# reads with alignments suppressed due to -m: 3412240 (13.46%)

Reported 18512820 alignments to 1 output stream(s)

For those of you that don’t speak Bowtie, the –m1 argument means that only reads that map to a single place in the reference genome are kept. The –v 1 argument in the top alignment says allow one mismatch when performing the alignment. The –v 0 argument in the lower alignment says do not allow any mismatches when performing the alignment.

What is really interesting is that the total number of aligned reads is almost the same in both cases. Increasing the mapping stringency by not allowing any mismatches causes less reads to map as you would expect but this increased stringency also increases the number of reads that map uniquely.  The net effect being the number of mapped reads is about the same the reads that are getting mapped are quite different.  Which alignment is better? I’m not sure. For RNA-seq I would like to throw away all the reads that potentially map to multiple locations using low stringency and than map with a higher stringency. Bowtie doesn’t have this option. You could align with two different two different conditions and take the common reads but that is a lot more data processing. In the end, I found it an interesting thing to think about and it really highlights how short read mapping is a little voodoo mixed in with some complicated algorithms.

Advertisements

From → Posts

2 Comments
  1. Mikael Huss permalink

    Hi, unfortunately throwing away all multi-mapping reads can be dangerous in RNA-seq. I was working on a project where RNA-seq had been run on human ES cells, and using only uniquely mapping reads (which was the default in the software we used), we got essentially no expression of the Nanog gene, which was known from microarrays, qPCR etc to be highly expressed in practically all ES cell lines. The reason was that there was a pseudogene with very similar sequence and almost all of the reads that were presumably from the “real” Nanog mapped to two loci and were thrown out by the pipeline.

  2. ethanomics permalink

    Hi, Thanks for the insight!!!!!! We should be getting some ES/iPS data back soon, so this is something to really consider. I guess it’s sort of a damned if you do, damned if you don’t situation. I still really dislike the idea of randomly mapping reads that map to more then one location, but I had not thought of the idea of throwing away the reads from an entire gene due to a pseudogene(s). Clearly, it is important to calculate mapability when computing RPKM values. Is anybody doing this? It seems like a lot of this could be addressed by someone with more of an interest in bioinformatics then me. It would be interesting to see a list with the mappability of 50 bp reads to all the genes. I believe it would be pretty straight forward to do. Make collection every possible 50 bp read in the genome and then map them. Divide expected reads mapped by actual reads mapped.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: