1.1. Data Introduction

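The data consist mainly of samples from two cancer types and from healthy donors: 99 Colorectal Cancer, 36 Prostate Cancer, and 50 Healthy Control samples. The label file described in 3) additionally lists 6 Pancreatic Cancer samples. The shared data directory is /BioII/chenxupeng/student/ on the cnode server.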

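  • The data directory contains the prebuilt expression matrix together with the corresponding labels and annotations.

  • The other folders under data hold the files readers need to carry out mapping and expression-matrix construction themselves for five healthy-control samples: Sample_N1, Sample_N7, Sample_N13, Sample_N19, and Sample_N25.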

1) mapping-related files

Path: the hg38_index, raw_data, and RNA_index folders under /BioII/chenxupeng/student/data/.

data        path
raw data    /BioII/chenxupeng/student/data/raw_data/*.fastq
hg38        /BioII/chenxupeng/student/data/hg38_index/GRCh38.p10.genome.fa
gtf         /BioII/chenxupeng/student/data/gtf
RNA index   /BioII/chenxupeng/student/data/RNA_index/

For details, see 11.1 Helps: Mapping Guide.

2) expression matrix

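Path: /BioII/chenxupeng/student/data/expression_matrix/

Each row of the expression matrix is a feature and each column is a sample. The data for three samples, Sample_N13, Sample_N19, and Sample_N25, have been removed; readers must map these samples and build their part of the expression matrix themselves (see 11.2 Requirement: Expression Matrix).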

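import pandas as pd
import numpy as np

# Load the comma-separated count matrix (rows: features, columns: samples).
scirepount = pd.read_csv('data/expression_matrix/GSE71008.txt', index_col=0)
# Preview the first five samples for the first five features.
scirepount.iloc[:,:5].head()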

                                        Sample_1S10  Sample_1S11  Sample_1S12  Sample_1S13  Sample_1S14
transcript
ENST00000473358.1|MIR1302-2HG-202|1544            0            0            0            0            0
ENST00000469289.1|MIR1302-2HG-201|843             0            0            0            0            0
ENST00000466430.5|AL627309.1-201|31638            0            0            0            0            0
ENST00000471248.1|AL627309.1-203|18221            0            0            0            0            0
ENST00000610542.1|AL627309.1-205|12999            0            0            0            0            0

scirepount.shape
(89619, 188)
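The feature labels in the preview above pack several fields into one string separated by `|`. The sketch below splits such labels into separate columns so they can be joined with the annotation files. It runs on three labels copied from the preview; the helper name `split_feature_ids` and the column names are hypothetical, and treating the third field as a length is an assumption that the source does not confirm. It also assumes every label has exactly three fields, which may not hold for all 89619 rows.

```python
import pandas as pd

# Three feature labels copied from the preview of the expression matrix.
features = pd.Index(
    [
        "ENST00000473358.1|MIR1302-2HG-202|1544",
        "ENST00000469289.1|MIR1302-2HG-201|843",
        "ENST00000466430.5|AL627309.1-201|31638",
    ],
    name="transcript",
)


def split_feature_ids(index):
    """Split 'id|name|number' labels into a three-column DataFrame.

    Column names are illustrative; the meaning of the third field
    (assumed here to be a length) is not stated in the source.
    """
    # A single-character separator is treated as a literal, not a regex.
    parts = index.to_series().str.split("|", expand=True)
    parts.columns = ["transcript_id", "transcript_name", "length"]
    parts["length"] = parts["length"].astype(int)
    return parts


feature_info = split_feature_ids(features)
print(feature_info[["transcript_id", "transcript_name"]])
```

The result keeps the original label as its index, so it can be aligned with the rows of the expression matrix directly.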

3) sample labels

Path: /BioII/chenxupeng/student/data/labels

# Load the class label of each sample, indexed by sample ID.
scirep_samplenames = pd.read_csv('data/labels/scirep_classes.txt', index_col=0)
scirep_samplenames.head()

                         label
sample_id
Sample_1S3   Colorectal Cancer
Sample_1S6   Colorectal Cancer
Sample_1S9   Colorectal Cancer
Sample_1S12  Colorectal Cancer
Sample_1S15  Colorectal Cancer

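# The five healthy-control samples that readers map from raw data themselves.
delete_sample = ['Sample_N1','Sample_N7','Sample_N13','Sample_N19','Sample_N25']
# The two of those five that were not removed from the provided matrix.
check_sample = ['Sample_N1','Sample_N7']
# Count the samples in each class.
np.unique(scirep_samplenames['label'],return_counts=True)
(array(['Colorectal Cancer', 'Healthy Control', 'Pancreatic Cancer',
        'Prostate Cancer'], dtype=object), array([99, 50,  6, 36]))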
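Because the label file covers more samples than the provided matrix has columns, it can help to align the two before any analysis. The following is a minimal sketch on toy data, not the course's own procedure: the helper `align_labels`, the feature names, and the count values are all hypothetical, and only the sample IDs and class names come from the tables above.

```python
import pandas as pd

# Toy stand-in for the expression matrix (rows: features, columns: samples).
counts = pd.DataFrame(
    [[0, 3, 5], [2, 0, 1]],
    index=["feature_a", "feature_b"],  # hypothetical feature names
    columns=["Sample_1S3", "Sample_1S6", "Sample_N1"],
)

# Toy stand-in for the label table; Sample_N13 has no column in `counts`.
labels = pd.DataFrame(
    {
        "label": [
            "Colorectal Cancer",
            "Colorectal Cancer",
            "Healthy Control",
            "Healthy Control",
        ]
    },
    index=pd.Index(
        ["Sample_1S3", "Sample_1S6", "Sample_N1", "Sample_N13"],
        name="sample_id",
    ),
)


def align_labels(counts, labels):
    """Return a samples-by-features matrix, matching labels, and unmatched IDs."""
    shared = [s for s in counts.columns if s in labels.index]
    missing = [s for s in labels.index if s not in counts.columns]
    X = counts[shared].T  # transpose so that each row is one sample
    y = labels.loc[shared, "label"]
    return X, y, missing


X, y, missing = align_labels(counts, labels)
print(X.shape, missing)
```

Listing the labeled samples that have no column makes it easy to confirm that only the samples readers are expected to process themselves are absent.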

4) other annotations

Path: /BioII/chenxupeng/student/data/other_annotations

4.1) gene annotation

A feature's transcript ID can be used to look up its transcript_name, gene_type, and other information.

geneannotation = pd.read_table('data/other_annotations/transcript_anno.txt')
geneannotation.iloc[:,:5].head()

  chrom  start    end               name  score
0  chr1  14629  14657      piR-hsa-18438      0
1  chr1  17368  17436  ENSG00000278267.1      0
2  chr1  18535  18563       piR-hsa-7508      0
3  chr1  26805  26836      piR-hsa-23387      0
4  chr1  29553  31097  ENSG00000243485.5      0

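4.2) batch information

Batch information records the experimental conditions under which each sample was processed, such as processing time and differences in the specifications of the materials used. Such differences can cause large variation among samples of the same class, which is known as a batch effect.

For the exoRBase data, the samples of each cancer type come from a different laboratory, so batch coincides with sample class. For the scirep data and hcc data, the batch information is as follows: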

scirepbatch = pd.read_csv('data/other_annotations/scirep_batch.txt',index_col=0)
scirepbatch.head()

            RNA Isolation batch  library prepration day  gel cut size selection
Sample_1S1                    2                      22                       7
Sample_1S2                    2                      22                       8
Sample_1S3                    2                      22                       1
Sample_2S1                    2                      22                       2
Sample_2S2                    2                      22                       3
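One simple way to see whether a batch variable is confounded with sample class is to cross-tabulate the two. The sketch below does this with `pd.crosstab` on toy data; the sample IDs, batch values, and labels are invented for illustration, and `batch_by_label` is a hypothetical helper. Only the column name `RNA Isolation batch` is taken from the table above.

```python
import pandas as pd

# Toy batch table and labels; all values here are made up for illustration.
batch = pd.DataFrame(
    {"RNA Isolation batch": [1, 1, 2, 2]},
    index=["s1", "s2", "s3", "s4"],
)
labels = pd.Series(
    ["Cancer", "Cancer", "Healthy", "Cancer"],
    index=["s1", "s2", "s3", "s4"],
    name="label",
)


def batch_by_label(batch_column, labels):
    """Count samples per (batch, label) pair for samples present in both inputs."""
    shared = batch_column.index.intersection(labels.index)
    return pd.crosstab(batch_column.loc[shared], labels.loc[shared])


table = batch_by_label(batch["RNA Isolation batch"], labels)
print(table)
```

If every class falls into its own batch, as described for the exoRBase data, each row of such a table has a single non-zero cell, and batch cannot be separated from class.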

5) RNA type statistics

scireprnastats = pd.read_csv('data/other_annotations/scirep_rna_stats.txt',index_col=0)
scireprnastats.iloc[:,:5].head()

           Sample_1S10  Sample_1S11  Sample_1S12  Sample_1S13  Sample_1S14
Y_RNA            88835       127497       145142        90106       105377
cleanN         9034303     10963430     11077344     10262615     11065325
hg38other      1462269      2044478      2624270      1476586      1806268
libSizeN      11362190     13437632     13905951     12271219     13619701
lncRNA           26733        38346        35639        25523        31489
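Raw counts per RNA type are hard to compare across samples with different sequencing depths, so a per-sample fraction is often more informative. The sketch below divides each row by the `libSizeN` row, using values copied from the table above for two samples. It assumes that `libSizeN` is the total library size of each sample, which the source does not state explicitly, and `fraction_of_library` is a hypothetical helper name.

```python
import pandas as pd

# Values copied from the preview above (two samples, three rows).
stats = pd.DataFrame(
    {
        "Sample_1S10": [88835, 11362190, 26733],
        "Sample_1S11": [127497, 13437632, 38346],
    },
    index=["Y_RNA", "libSizeN", "lncRNA"],
)


def fraction_of_library(stats, total_row="libSizeN"):
    """Divide every row by the chosen total row, sample by sample.

    Assumes `total_row` holds the total read count of each sample.
    """
    # axis="columns" aligns the total row's index with the sample columns.
    return stats.div(stats.loc[total_row], axis="columns")


fractions = fraction_of_library(stats)
print(fractions.loc[["Y_RNA", "lncRNA"]])
```

Passing a different `total_row`, for example `cleanN`, changes the denominator without altering the rest of the calculation.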