2.2.Differential Expression with Cufflinks

本章我们将:

  1. 理解和掌握基本的基因差异表达分析方法(Differential Expression Analysis);学会 Differential Expression Analysis 的基本软件。

  2. 使用 topHat,cufflinkscuffdiff这一系列工具完成基因表达定量和差异表达分析。

  • 本教程的实现主要参考2012, Nature Protocol, Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks这篇文章。这是一个比较早期的流程(我们在mapping一章也已经提到,在实际工作中仍然还在用tophat的研究者已经不多了),但是从学习的角度来说还是值得我们去了解的。

  • 2016的另一篇文章针对相同的目的用更优化的工具给出了一些更新,2016,Nature Protocol,Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown,有兴趣的同学可以自行参考。

  • 本章讲授的cufflinks-cuffdiff不需要太多的R语言基础,适合初学者学习。目前在实际工作中更常见的作法是用2.1节介绍的方法构建表达矩阵,再用2.3节介绍的deseq2或edgeR来进行差异分析。所以推荐熟悉R语言的同学优先尝试2.3节介绍的方法。

  • cufflinks-cuffdiff流程和featureCount-deseq2/edgeR流程的主要差异总结如下:

    • cufflink流程会通过转录本组装的方式注释一些gtf/gff文件中没有注释的转录本,featureCount只考虑gtf/gff文件中提供的基因。

    • cufflink流程在转录本的水平上对基因表达进行估计,featureCount在基因的水平上统计落在每一个基因上的reads数量。

    • 感兴趣的同学可以结合2016,Genome Biology,A survey of best practices for RNA-seq data analysis这篇文章进一步了解。

1) Pipeline

2) Data Structure

2a) getting software & data

Method 1: Docker

  1. software (already available in Docker,docker images的下载链接如附表所示)

    1. R & cummeRbund package (BioConductor)

    2. tophat

    3. bamtools

    4. cufflinks

  2. Data (already available in Docker)

    The Yeast RNA-seq data were downloaded from GSE42983,有两种条件,每种两个生物重复

    • wild-type : wt1.fq & wt2.fq

    • H2O2 treatment (oxidative stress): wt1X.fq & wt2X.fq

    We random sample 1M or 10K reads per sample, 10K reads are in /home/test/diff-exp/Raw_reads_10k of the Docker.

Method 2: Cluster

  • 本节课的singularity镜像为/data/images/bioinfo_tsinghua.simg,具体使用方法见集群和singlarity使用说明。

Method 3: Directly Download

  1. Install software:tophat,bamtools,cufflinks

  2. Dowload data: 所用到的数据可以直接从 Files needed 中的Files/ 路径下的相应文件夹中下载所需要的数据。

2b) input

format

description

Notes

.fastq

RNA sequences of each sample

raw data of RNA-seq

.fa

sequences of the whole genome

.gff

genome annotation file

Default genome annotation file is from GENCODE, is similar to gtf file

.ebwt

bowtie index file

used to align RNA sequences to genome

2c) output

format

description

Notes

.bam

genome mapped reads (binary format of sam)

generated by align step

transcript.gtf

assembled transcripts of each sample

generated by assemble step

merged.gtf

all annotation of assembled transcripts

generated by merge step

txt/tsv

differentially expressed genes

generated by DE gene identify step

pdf

R plot to explore differentially expressed genes

generated by R package

cuffdiff output

format

description

gene_exp.diff

Differentially expressed genes

isoform_exp.diff

Differentially expressed transcripts

cds_exp.diff

Differentially expressed coding sequences

cds.diff

Genes with differential CDS output

promoters.diff

Genes with differential promoter use

splicing.diff

Differentially spliced TSS groups

tss_group_exp.diff

Differentially expressed TSS groups

3) Running Steps

和之前的章节一样,首先进入到docker容器:

docker exec -it bioinfo_tsinghua bash

以下步骤均在 /home/test/diff-exp/ 下进行:

cd /home/test/diff-exp/

3a) Align the RNA-seq reads to the genome

(1) map reads using Tophat

tophat maps short sequences from spliced transcripts to whole genome.

mkdir mapping

tophat -p 4 -G yeast_annotation.gff --no-coverage-search -o mapping/wt1_thout \
    bowtie_index/YeastGenome Raw_reads_10k/wt1.fq  

tophat -p 4 -G yeast_annotation.gff --no-coverage-search -o mapping/wt2_thout \
    bowtie_index/YeastGenome Raw_reads_10k/wt2.fq 

tophat -p 4 -G yeast_annotation.gff --no-coverage-search -o mapping/wt1X_thout \
    bowtie_index/YeastGenome Raw_reads_10k/wt1X.fq 

tophat -p 4 -G yeast_annotation.gff --no-coverage-search -o mapping/wt2X_thout \
    bowtie_index/YeastGenome Raw_reads_10k/wt2X.fq

注意:这一步会出现一个不影响输出文件结果和进程退出码的samtools view failed info,点击这里查看讨论。简要描述:The issue appears to be that python's handling of SIGPIPEs in this kind of pipeline is nonstandard. Python ignores SIGPIPE on startup, because it prefers to check every write and raise an IOError exception rather than taking the signal. This is all well and good for Python itself, but most Unix subprocesses don’t expect to work this way.

用法:tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]]

  • -p/--num-threads default: 1

  • -G/--GTF GTF/GFF with known transcripts

  • -o/--output-dir default: ./tophat_out

(2) Extract mapped reads on chr I only (for quick running)

bamtools是一个用途和samtools比较相似的工具。

我们这里从bam文件中挑出对应chr1上的一部分reads用于后续分析来减少计算的开销,在实际的生物信息分析中是不需要这一步的。

  • index .bam file

    index command generates index for .bam file

bamtools index -in mapping/wt1_thout/accepted_hits.bam 

bamtools index -in mapping/wt2_thout/accepted_hits.bam 

bamtools index -in mapping/wt1X_thout/accepted_hits.bam 

bamtools index -in mapping/wt2X_thout/accepted_hits.bam
  • extract

    filter command filters .bam files by user-specified criteria

bamtools filter -region chrI -in mapping/wt1_thout/accepted_hits.bam \
    -out mapping/wt1_thout/chrI.bam

bamtools filter -region chrI -in mapping/wt2_thout/accepted_hits.bam \
    -out mapping/wt2_thout/chrI.bam

bamtools filter -region chrI -in mapping/wt1X_thout/accepted_hits.bam \
    -out mapping/wt1X_thout/chrI.bam

bamtools filter -region chrI -in mapping/wt2X_thout/accepted_hits.bam \
    -out mapping/wt2X_thout/chrI.bam

(1) Assemble transcripts for each sample:

usage: cufflinks -p number_of_cores -o output_dir input_file

cufflinks -p 4 -o assembly/wt1_clout  mapping/wt1_thout/chrI.bam 

cufflinks -p 4 -o assembly/wt2_clout  mapping/wt2_thout/chrI.bam 

cufflinks -p 4 -o assembly/wt1X_clout mapping/wt1X_thout/chrI.bam 

cufflinks -p 4 -o assembly/wt2X_clout mapping/wt2X_thout/chrI.bam

(2) Create a file called assemblies.txt which lists the assembly files for each sample in the assembly folder. Its content is as follows:

assembly/wt1X_clout/transcripts.gtf
assembly/wt1_clout/transcripts.gtf
assembly/wt2X_clout/transcripts.gtf   
assembly/wt2_clout/transcripts.gtf

You can create that file manually by vim or use this command: ls assembly/*/transcripts.gtf > assembly/assemblies.txt

(3) Merge all assemblies to one file containing merged transcripts:

cuffmerge takes two or more Cufflinks .gtf files and merges them into a single unified transcript catalog

cuffmerge -g yeast_chrI_annotation.gff -s bowtie_index/YeastGenome.fa \
    -p 4 -o assembly/merged assembly/assemblies.txt

3c) Identify differentially expressed genes and transcripts

Run cuffdiff based on the merged transcripts and mapped reads for each sample

cuffdiff -o diff_expr -b bowtie_index/YeastGenome.fa \
    -p 4 -u assembly/merged/merged.gtf \
    mapping/wt1_thout/chrI.bam,mapping/wt2_thout/chrI.bam \
    mapping/wt1X_thout/chrI.bam,mapping/wt2X_thout/chrI.bam

参数意义:

  • -o: 输出文件夹

  • -b: Bowtie index

  • -p: 使用的线程数

  • -u: 转录本注释文件,一般使用 merge 得到的文件

  • 最后两行为每个样本的 .bam 文件,这里上一行为对照组,下一行为实验组,组内的不同样本用英文逗号分隔

这里我们用10k的数据生成结果,可能无明显差异表达。可以查看1M_reads_diff_expr文件夹(已经预先跑好)中查看1M数据的结果。

3d) Explore differential analysis results with CummeRbund package

由于 10k reads 结果太少,我们用 1M reads 的结果(已经预先跑好)来画图

Rscript  plot_DE_chart.R

图片可在 1M_reads_diff_expr 中查看。

Tips: input/output file is hard-coded in plot_DE_chart.R, if you want to plot your own results, you need to edit plot_DE_chart.R (replace 1M_reads_diff_expr/ with your own output directory of cutdiff).

4) Homework

  1. 提交差异表达的gene的 name,注明相关的fold change, p-value,FDR(q value) 等信息, 并提交绘制的 Volcano Plot。(基因列表和图最好放到一个word/pdf文件中提交)。

    • 使用教程中的数据,操作时,跳过3a)(2) Extract,(直接用accepted_hits.bam 来进行 Assemble transcripts),要求找到差异表达的基因(满足log2(fold change)\lvert log_2(fold\ change) \rvert > 1, FDR < 0.05)。

    • 差异表达的基因有的没有基因 name, 可以用 gene id 代替。

    • 画 Volcano Plot 时,需要将所有的基因进行绘图,包括未差异表达的基因。画图用 I-3.R-3.2.Plot with R 中 7 Volcano plots 的代码画图。

  2. 寻找 differentially expressed genes/transcripts时要对数据做怎样的处理?这些处理有哪些统计学上的考虑?

Last updated