To filter or not to filter duplicate reads – ChIP-seq

January 6, 2012

When it comes to filtering duplicate sequences for ChIP-seq, although it appears to be recommended in most workflows I have read, in general I think it is a bad idea. The first question to ask is: are the duplicate sequences coming from PCR amplification? PCR duplicates are pretty easy to spot by visually inspecting the data in a genome browser. When you have PCR duplicates, the sequences stack up on each other with very uneven coverage, i.e. you have a stack, some blank space, another stack and so on. Duplicates that come from enrichment in your ChIP look like nice peaks surrounded by an otherwise more even distribution of reads. So if you have a lot of PCR duplicates (this should show up in your FastQC plots as well), then you have to filter duplicates, but your data is not very good and you really need to do your experiment again. Really, don’t waste much of your time on bad data.
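
Looking at the browser is the real test, but a quick count can tell you how much of the library sits in stacks before you open it. The following is only a minimal sketch: it assumes you have already pulled each mapped read out of your alignment file as a (chromosome, start, strand) tuple, and the function name `duplication_summary` and the toy reads are made up for illustration. Note that a high duplicate fraction alone does not tell you whether the stacks are PCR artifacts or real enrichment; that still takes a look at the coverage.

```python
from collections import Counter

def duplication_summary(reads):
    """Summarize duplication for an iterable of (chrom, start, strand) tuples.

    Reads with an identical chromosome, start and strand are counted as
    members of the same stack.
    """
    stacks = Counter(reads)
    total = sum(stacks.values())
    distinct = len(stacks)
    return {
        "total_reads": total,
        "distinct_positions": distinct,
        # Fraction of reads that a duplicate filter would throw away.
        "duplicate_fraction": (total - distinct) / total if total else 0.0,
        "tallest_stack": max(stacks.values(), default=0),
    }

# Hypothetical toy data: one stack of four reads plus two single reads.
reads = [("chr1", 100, "+")] * 4 + [("chr1", 250, "-"), ("chr1", 300, "+")]
summary = duplication_summary(reads)
print(summary)
```
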
If you do not have PCR duplicates, filtering duplicates will have the effect of decreasing the amplitude of your real peaks (since your reads are most concentrated at peaks) and perhaps masking some true binding sites for your epitope. To put it another way, you are filtering out real signal and keeping all the background. An extreme case of this can be observed with nucleosomal ChIP, i.e. where the chromatin is fragmented with micrococcal nuclease. In the case of a positioned nucleosome, all the reads from a peak will occur at the exact same position on either side of the nucleosome. Finally, a third type of duplicate comes from repetitive sequences. Ideally these will be filtered out during mapping (see next post), but sometimes they still get mapped due to structural differences between the genome you are working with and the reference genome you are mapping to. These types of duplicates ideally should be matched by your input control sample, but sometimes they still get called as peaks by peak calling programs. These types of peaks look nothing like ‘real peaks’ to the human eye, so it is easy to see if they are contaminating your data set.
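
To make the "filtering out real signal and keeping all the background" point concrete, here is a toy sketch. Everything in it is invented for illustration: the read positions, the 100 bp bin size, and the helper names `coverage_by_bin` and `remove_duplicates`. It also ignores strand for simplicity. The background reads all fall on distinct positions, while the peak reads pile up on a few positions, as they would around a well-positioned nucleosome.

```python
from collections import Counter

def coverage_by_bin(positions, bin_size=100):
    """Count read start positions in fixed-width bins."""
    return Counter(pos // bin_size for pos in positions)

def remove_duplicates(positions):
    """Keep a single read per start position (strand ignored here)."""
    return sorted(set(positions))

# Hypothetical background: one read every 50 bp, no two at the same position.
background = list(range(25, 1000, 50))
# Hypothetical enriched site: ten reads concentrated on three positions.
peak = [500] * 4 + [505] * 3 + [510] * 3
reads = background + peak

before = coverage_by_bin(reads)
after = coverage_by_bin(remove_duplicates(reads))

# Bin 5 covers 500-599 and holds the peak; bin 0 is background only.
print("peak bin:", before[5], "->", after[5])        # 12 -> 5
print("background bin:", before[0], "->", after[0])  # 2 -> 2
```

In this toy case the peak bin drops from six times the background to two and a half times, while the background bins are untouched.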

To summarize:
1. If you have a lot of PCR duplicates, then you must filter the duplicates. Perhaps after filtering, you can identify some binding sites. But importantly, you need to do the experiment again using more starting material or a more efficient library preparation protocol.

2. If your library is mostly free of PCR duplicates, then you should not filter duplicate sequences. But keep a watchful eye out for peaks that look nothing like a normal ChIP-seq peak.

Final note: To some extent this holds true for RNA-seq as well.

