3.1.GO

1) Pipeline

2) Data Structure

参考 文件获取方式下载 GO_gene.txt 文件, 内容为一组人类基因的ensembl ID:

ENSG00000001036
ENSG00000003756
ENSG00000008018
ENSG00000012048
ENSG00000043355
ENSG00000074755
ENSG00000079616
ENSG00000089280
ENSG00000100591
ENSG00000100941
ENSG00000101109
ENSG00000101974
ENSG00000104611
ENSG00000104738
ENSG00000105738
ENSG00000113318
ENSG00000114867
ENSG00000116221
ENSG00000116857
ENSG00000117724
ENSG00000119285
ENSG00000121774
ENSG00000127663
ENSG00000127884
ENSG00000128159
ENSG00000129187
ENSG00000130640
ENSG00000131473
ENSG00000134287
ENSG00000134644
ENSG00000136628
ENSG00000137273
ENSG00000146263
ENSG00000153187
ENSG00000160285
ENSG00000164818
ENSG00000164944
ENSG00000167325
ENSG00000167548
ENSG00000170448
ENSG00000179632
ENSG00000183207
ENSG00000187954
ENSG00000196700
ENSG00000196924
ENSG00000198604
ENSG00000198886
ENSG00000198899
ENSG00000206503
ENSG00000223609
ENSG00000272822

2a) Inputs

File format

Information contained in file

File description

Notes

txt

Gene encode id

The file contain the gene encode id

-

2b) Outputs

File format

Information contained in file

File description

Notes

txt

Output information

The gene ontology of each gene

-

3) Running Steps

open http://geneontology.org/

3a) Input gene name

3b) Output the results

Reference list

User upload

Mapped IDs:

21042 out of 21042

50 out of 50

Unmapped IDs:

0

1

Multiple mapping information:

0

0

3c) Display the results

We only display results with False Discovery Rate (FDR) < 0.05.

GO termReferenceInput numberexpectedFold Enrichment+/-raw P valueFDR

DNA replication

208

6

0.49

12.14

+

1.11E-05

1.25E-02

  • 通过和数据库比对,我们可以知道在数据库参考基因组中的21042基因中,被注释到DNA replication 的有208个,在用户上传的50个可以识别的基因中有6个基因被注释为DNA replication。

  • expected 0.4942= 208*50/21042

  • Fold Enrichment 12.14=6/0.4942

  • +/- 富集用“+”表示

  • raw P value 可以用下面公式计算

  • N: numbers of one organism's genes annotated with GO or of the user-provided background . 这里N等于21042

  • n: numbers of genes mapped to the background in the query list

    . 这里n等于50

  • K: numbers of genes in one GO term

    . 这里K等于208

  • k: the counts of genes mapped to the GO term in the query list

    . 这里k等于6

4) Tips/Utilities

其他一些网页工具和R package也常被用来做富集分析,有兴趣的同学可自行了解:

  • 网页工具:

    • David (教程中的[3.2.kegg](https://book.ncrnalab.org/teaching/part-ii.-basic-analyses/3.function-analysis/3.2.kegg)是基于David实现的,供大家参考)

5) Homework

  1. wt.light.vs.dark.all.txt(这是我们在差异表达一节获得的野生型的结果)中选取显著上调的(FDR<0.05, logFC>1)的基因进行GO分析。

  2. 请问上面的例子中, Fold Enrichment和P value是如何计算的? 请写出公式,并解释原理。此外,在定义显著富集的 GO terms 时为什么一般不是参考P value的大小,而是要计算一个 FDR来做为参考?

Last updated