3.ChIP-seq
本章主要介绍 ChIP-seq 相关分析方法,包括 peak calling 和 motif analysis。
染色质免疫共沉淀测序(Chromatin immunoprecipitation sequencing,简称 ChIP-seq)被用于分析蛋白质与 DNA 的交互作用。常见的 ChIP-seq 实验主要研究转录因子和组蛋白修饰如何通过与DNA相互作用来调控基因表达,从而影响表型。ChIP-seq 对于充分了解许多生物过程和疾病状态至关重要。

1) Pipeline

2) Data Structure

2a) getting software & data

方法1: 使用docker

docker images的下载链接如附表所示,docker中我们已经准备好所需文件,例如 .bam 文件位于 Docker 中的 /home/test/chip-seq/input

方法2: 自行下载和安装

    1.
    Install software:HOMER
    2.
    Download data: 下载链接
我们使用 GSE61210 中的两个样本:GSM1499619input.bam)和GSM1499607ip.bam)。 用户也可以参考 这里 从 CHIP-seq 数据生成自己的 .bam 文件。

2b) input

Format
Description
Notes
.bam
将CHIP-seq的 Reads 比对到参考基因组
-

2c) output

2c.1) 核心输出文件如下所示:

Text
File format
File description
Peak calling
peak file
each row contatins information of a peak
Motif analysis
homerResults.html
de novo motif table in HTML format
peak table in peak file
1
# Column Headers:
2
#PeakID chr start end strand Normalized Tag Count focus ratio findPeaks Score Total Tags Control Tags (normalized to IP Experiment) Fold Change vs Control p-value vs Control Fold Change vs Local p-value vs Local Clonal Fold Change
3
chrIII-1 chrIII 78346 78578 + 69987.1 0.862 5971.000000 5966.0 201.6 29.60 0.00e+00 26.54 0.00e+00 0.50
4
chrIII-2 chrIII 133 365 + 61226.9 0.775 5364.000000 5227.0 116.3 44.93 0.00e+00 59.92 0.00e+00 0.61
5
chrI-1 chrI 141663 141895 + 41225.6 0.854 3515.000000 3514.0169.1 20.78 0.00e+00 17.09 0.00e+00 0.50
6
chrII-1 chrII 165145 165377 + 35334.5 0.845 3015.000000 3018.0171.1 17.64 0.00e+00 14.47 0.00e+00 0.50
7
chrII-2 chrII 555827 556059 + 34817.1 0.790 2973.000000 2970.0159.6 18.61 0.00e+00 10.12 0.00e+00 0.50
Copied!
图1. motif table in homerResults.html

2c.2) detailed description of Peak calling output

peak file contains the following columns
column
information
description
1
PeakID
a unique name for each peak
2
chr
chromosome where peak is located
3
starting position of peak
4
ending position of peak
5
Strand (+/-)
6
Normalized Tag Counts
number of tags found at the peak, normalized to 10 million total mapped tags (or defined by the user)
7
Focus Ratio
fraction of tags found appropriately upstream and downstream of the peak center
8
Peak score
position adjusted reads from initial peak region - reads per position may be limited
9
total Tags
number of tags found at the peak
10
Control Tags
normalized to IP Experiment
11
Fold Change vs Control
putative peaks have 4-fold more normalized tags in the target experiment than the control
12
p-value vs Control
HOMER uses the poisson distribution to determine the chance that the differences in tag counts are statistically significant (sequencing-depth dependent), requiring a cumulative poisson p-value of 0.0001
13
Fold Change vs Local
HOMER requires the tag density at peaks to be 4-fold greater than in the surrounding 10 kb region
14
p-value vs Local
the comparison must also pass a poisson p-value threshold of 0.0001
15
Clonal Fold Change
The fold threshold can be set with the -C <#> option (default: -C 2), if this ratio gets too high, the peak may arise from expanded repeats, and should be discarded

2c.3) detailed description of Motif analysis output

Detailed output files of Motif analysis will produce many files, we only explain the main output -- homerResults.html in above. Here we will briefly introduce other files.
file name
description
homerMotifs.all.motifs
Simply the concatenated file composed of all the homerMotifs.motifs<#> files.
motifFindingParameters.txt
command used to execute findMotifsGenome.pl.
knownResults.txt
text file containing statistics about known motif enrichment (open in EXCEL).
seq.autonorm.tsv
autonormalization statistics for lower-order oligo normalization.
homerResults.html
HTML formatted output of de novo motif finding:
knownResults.html
HTML formatted output of known motif finding.
knownResults/ directory
contains files for the knownResults.html webpage, including known<#>.motif files for use in finding specific instance of each motif.

3) Running Steps

首先进入到容器(在自己电脑的 Terminal 中运行,详情请参见 这里):
1
docker exec -it bioinfo_tsinghua bash
Copied!
以下步骤均在 /home/test/chip-seq/ 下进行:
1
cd /home/test/chip-seq/
Copied!
准备输出目录
1
mkdir output
Copied!

3a) Peak Calling

The common call peak software are HOMER and MACS, here we mainly introduce the usage of HOMER. You can get manual for MACS here.
HOMER contains a program called findPeaks that performs all of the peak calling analysis. Before we use findPeaks to call peak, we need to convert our .bam file into tag file by using makeTagDirectory:
1
makeTagDirectory input/ip input/ip.part.bam
2
makeTagDirectory input/input input/input.part.bam
Copied!
After this step, input/ip & input/input will contain several .tags.tsv files, as well as a file named tagInfo.txt. This file contains information about your sequencing run, including the total number of tags considered. This file is used by later peak-calling programs to quickly reference information about the experiment. Then we call peak by using these tag file:
1
findPeaks input/ip/ -style factor -o output/part.peak -i input/input/
Copied!
Important parameters
parameter
meaning
-style
Specialized options for specific analysis strategies, such as factor (transcription factor ChIP-Seq) and histone (histone modification ChIP-Seq).
-o
File name for to output peaks, default: stdout.
-i
Input tag directory, experiment to use as IgG/Input/Control.
输出文件为 /home/test/chip-seq/output/part.peak, 示例如下
1
#PeakID chr start end strand Normalized Tag Count focus ratio findPeaks Score Total TagControl Tags (normalized to IP Experiment) Fold Change vs Control p-value vs Control Fold Change vs Local p-value vs Local Clonal Fold Change
2
chrIII-1 chrIII 78346 78578 + 69987.1 0.862 5971.000000 5966.0 201.6 29.60 0.00e+00 26.54 0.00e+00 0.50
3
chrIII-2 chrIII 133 365 + 61226.9 0.775 5364.000000 5227.0 116.3 44.93 0.00e+00 59.92 0.00e+00 0.61
4
chrI-1 chrI 141663 141895 + 41225.6 0.854 3515.000000 3514.0 169.1 20.78 0.00e+00 17.09 0.00e+00 0.50
5
chrII-1 chrII 165145 165377 + 35334.5 0.845 3015.000000 3018.0 171.1 17.64 0.00e+00 14.47 0.00e+00 0.50
6
chrII-2 chrII 555827 556059 + 34817.1 0.790 2973.000000 2970.0 159.6 18.61 0.00e+00 10.12 0.00e+00 0.50
7
chrIII-3 chrIII 163527 163759 + 31266.1 0.826 2662.000000 2670.0 186.0 14.35 0.00e+00 14.16 0.00e+00 0.51
Copied!

3b) Motif Analysis

HOMER contains a program called findMotifsGenome.pl that can find enriched motifs in ChIP-Seq peaks
1
findMotifsGenome.pl output/part.peak sacCer2 output/part.motif.output -len 8
Copied!
Important parameters
    Region Size (-size <#>, -size <#>,<#>, -size given, default: 200) The size of the region used for motif finding is important. If analyzing ChIP-Seq peaks from a transcription factor, Chuck would recommend 50 bp for establishing the primary motif bound by a given transcription factor and 200 bp for finding both primary and "co-enriched" motifs for a transcription factor. When looking at histone marked regions, 500-1000 bp is probably a good idea。
    Motif length (-len <#> or -len <#>,<#>,..., default 8,10,12) In general, it's best to try out enrichment with shorter lengths (i.e. less than 15) before trying longer lengths.
    Number of motifs to find (-S <#>, default 25) Specifies the number of motifs of each length to find. 25 is already quite a bit.
最重要的输出文件为 /home/test/chip-seq/output/part.motif.output/homerResults.html, 示例输出参见 这里

4) Tips/Utilities

4a) Preparation .bam from ChIP-seq data

If you want to get these two .bam files by yourself, you can follows these steps.
    1.
    download data
      1.
      The fastq data for yeast ChIP-seq was downloaded from GSE61210 .
      2.
      Input data was downloaded from GSM1499619;
      3.
      IP data was downloaded from GSM1499607.
    2.
    build yeast bowtie index
    Yeast sacCer2 genome data was downloaded from UCSC http://hgdownload.soe.ucsc.edu/goldenPath/sacCer2/bigZips/chromFa.tar.gz.
    Index was build with commad:
    1
    tar -xvf chromfa.tar.gz
    2
    cat *.fa >yeast.allchrom.fa
    3
    mkdir bowtie_index_yeast
    4
    bowtie-build yeast.allchrom.fa bowtie_index_yeast/sacCer2
    Copied!
    3.
    mapping
    1
    bowtie -p 4 -m 1 -v 3 --best --strata bowtie_index_yeast/sacCer2 \
    2
    -q input/ip.fastq -S input/ip.sam
    Copied!
    4.
    sampling
    As the .sam file is too big for tutorial example, so we selected parts of them as example file.
    1
    samtools sort input/ip.sam >input/ip.sorted.bam
    2
    samtools index input/ip.sorted.bam
    3
    samtools view input/ip.sorted.bam chrI chrII chrIII -b >input/ip.part.bam
    Copied!

4b) peak file header

The top portion of the peak file will contain parameters and various analysis information
图2. peak file 元信息
Two fields need special notification:
    genome size represents the total effective number of mappable bases in the genome;
    Approximate IP effeciency describes the fraction of tags found in peaks versus genomic background. This provides an estimate of how well the ChIP worked. Certain antibodies like H3K4me3, ERa, or PU.1 will yield very high IP efficiencies (>20%), while most rand in the 1-20% range. Once this number dips below 1% it's a good sign the ChIP didn't work very well and should probably be optimized.

4c) peak calling using MACS

MACS can also be used to call peaks (the program is already installed in Docker).
1
macs2 callpeak -t input/ip.part.bam -c input/input.part.bam --outdir output/macs_peak \
2
--name=yeast_macs_p05 --format=BAM --gsize=1.2e7 --tsize=50 --pvalue=1e-5
Copied!
The main output is /home/test/chip-seq/output/macs_peak/yeast_macs_p05_peaks.xls, which is a tabular file containing information about called peaks. You can open it in excel and sort/filter using excel functions. Information include:
    chromosome name
    start position of peak
    end position of peak
    length of peak region
    absolute peak summit position
    pileup height at peak summit, -log10(pvalue) for the peak summit (e.g. pvalue =1e-10, then this value should be 10)
    fold enrichment for this peak summit against random Poisson distribution with local lambda, -log10(qvalue) at peak summit

4d) Visualize peaks

Advanced users: for a better ChIP-seq visualization, ChIPseeker was used in R software.
1
# Prepare the environment of software
2
rm(list = ls())
3
options(stringsAsFactors = FALSE)
4
5
# Library packages
6
library("RMariaDB")
7
library("rtracklayer")
8
library("ChIPseeker")
9
library("org.Sc.sgd.db")
10
library("clusterProfiler")
11
library("GenomicFeatures")
12
library("ggplot2")
13
library("ggupset")
14
15
16
# Display available genomes list
17
ucscGenomes()[ , "db"]
18
# Display the list of table known to work
19
supportedUCSCtables()
20
# Retrieve full datasets (responds to genome ID of fasta) for Yeast from UCSC
21
TxDb.Ssaccer.USUC.sacCer2.ensGene <- makeTxDbFromUCSC(genome ="sacCer2", tablename = "ensGene")
22
23
# Find the position of peak in WGS
24
txdb <- TxDb.Ssaccer.USUC.sacCer2.ensGene
25
# Read the exported peak file from MACS
26
yeast <- readPeakFile("yeast_macs_p05_peaks.narrowPeak")
27
covplot(yeast, weightCol = 5)
28
29
# ChIP peaks binding TSS region
30
promoter <- getPromoters(TxDb = txdb, upstream = 3000, downstream = 3000)
31
tagMatrix <- getTagMatrix(peak = yeast, windows = promoter)
32
tagHeatmap(tagMatrix = tagMatrix, xlim = c(-3000, 3000), color = "red")
33
34
# Average profile of ChIP peaks binding to TSS region
35
plotAvgProf(tagMatrix, xlim = c(-3000, 3000), conf = 0.95, resample = 1000,
36
xlab = "Genomone Region (5'->3')", ylab = "Read Count Frequency")
37
38
# Peaks annotation
39
peakAnno <- annotatePeak(yeast, tssRegion = c(-3000, 3000), TxDb = txdb, annoDb = "org.Sc.sgd.db")
40
41
# Visualize the genome annotation
42
plotAnnoPie(peakAnno)
43
plotAnnoBar(peakAnno)
44
vennpie(peakAnno)
45
upsetplot(peakAnno)
46
47
# Visualize distribution of TF-binding loci relative to TSS
48
plotDistToTSS(peakAnno)
Copied!

5) Homework

    1.
    解释 findPeaksfindMotifsGenome.pl 主要参数的含义。
    2.
    提交文件: 提交Snf1蛋白在DNA上结合的 peak的位置, fold change, p-value等信息 和 motif 及p-value信息 (peak 和 motif最好放到一个word/pdf文件中提交)。
      Snf1蛋白是染色体重塑复合体中的重要组成部分之一,为了探究Snf1蛋白会特异性识别哪些DNA序列,Workman小组在酵母中做了该蛋白的ChIP-seq实验。
      其中Snf1蛋白IP实验数据为 ip.chrom_part.bam,Input背景数据为input.chrom_part.bam。请同学们根据本节课所学内容寻找Snf1蛋白在DNA上结合的 peak 和 motif。
      作业的数据:bam 文件在 Docker 的 /home/test/chip-seq/homework/ 文件夹中。
      筛选标准 1. peak:Fold Change (vs Control) >=8,且 p-value (vs Control) <
      10810^{-8}
      ; 2. motif: p-value <
      101010^{-10}
      ;

6) Teaching Video (link)

    PART III: ChIP-seq Analysis

7) References

Last modified 9h ago