1.1.Data Introduction

数据介绍

我们使用的数据主要包括两种癌症和正常人样本，其中Colorectal Cancer, Prostate Cancer和Healthy Control的样本数量分别为99，36和50。数据存放公共目录为cnode服务器的/BioII/chenxupeng/student/目录。

data目录下为已经建好的expression matrix，相应的label和annotation
data目录下另外的文件夹中存放的文件是读者用于自己完成对五个正常人样本Sample_N1, Sample_N7, Sample_N13, Sample_N19, Sample_N25进行mapping和创建expression matrix等操作的。

1) mapping相关文件

路径：包括/BioII/chenxupeng/student/data/目录下的hg38_index, raw_data, RNA_index文件夹。

data

path

raw data

/BioII/chenxupeng/student/data/raw_data/*.fastq

hg38

/BioII/chenxupeng/student/data/hg38_index/GRCh38.p10.genome.fa

gtf

/BioII/chenxupeng/student/data/gtf

RNA index

/BioII/chenxupeng/student/data/RNA_index/

具体内容参考 11.1 Helps: mapping指南

2) expression matrix

路径：/BioII/chenxupeng/student/data/expression_matrix/

expression matrix每一行为一个feature，每一列为一个样本，其中我们去掉了Sample_N13, Sample_N19, Sample_N25三个样本的相应数据，需要读者自己完成mapping和构建expression matrix（详见 11.2 Requirement: Expression Matrix）。

import pandas as pd
import numpy as np
scirepount = pd.read_table('data/expression_matrix/GSE71008.txt',sep=',',index_col=0)

scirepount.iloc[:,:5].head()

	Sample_1S10	Sample_1S11	Sample_1S12	Sample_1S13	Sample_1S14
transcript
ENST00000473358.1\|MIR1302-2HG-202\|1544	0	0	0	0	0
ENST00000469289.1\|MIR1302-2HG-201\|843	0	0	0	0	0
ENST00000466430.5\|AL627309.1-201\|31638	0	0	0	0	0
ENST00000471248.1\|AL627309.1-203\|18221	0	0	0	0	0
ENST00000610542.1\|AL627309.1-205\|12999	0	0	0	0	0

scirepount.shape

(89619, 188)

3) sample labels

路径：/BioII/chenxupeng/student/data/labels

scirep_samplenames = pd.read_table('data/labels/scirep_classes.txt',delimiter=',' , index_col=0)

scirep_samplenames.head()

label

sample_id

Sample_1S3

Colorectal Cancer

Sample_1S6

Colorectal Cancer

Sample_1S9

Colorectal Cancer

Sample_1S12

Colorectal Cancer

Sample_1S15

Colorectal Cancer

delete_sample = ['Sample_N1','Sample_N7','Sample_N13','Sample_N19','Sample_N25']
check_sample = ['Sample_N1','Sample_N7']

np.unique(scirep_samplenames['label'],return_counts=True)

(array(['Colorectal Cancer', 'Healthy Control', 'Pancreatic Cancer',
        'Prostate Cancer'], dtype=object), array([99, 50,  6, 36]))

4) other annotations

路径：/BioII/chenxupeng/student/data/other_annotations

4a) gene annotation

可以通过feature的transcript id找到feature的transcript_name, gene_type等信息

geneannotation = pd.read_table('data/other_annotations/transcript_anno.txt')

geneannotation.iloc[:,:5].head()

	chrom	start	end	name	score
0	chr1	14629	14657	piR-hsa-18438	0
1	chr1	17368	17436	ENSG00000278267.1	0
2	chr1	18535	18563	piR-hsa-7508	0
3	chr1	26805	26836	piR-hsa-23387	0
4	chr1	29553	31097	ENSG00000243485.5	0

4b) batch信息

batch信息记录了对不同样本采取的不同实验条件，包括处理时间，处理材料的规格差异等，可能会造成同类样本的较大差异，称为batch effect。

对于exoRBase数据，每一种癌症样本均来自不同的实验室，因此其batch与样本类别重合。对于scirep数据和hcc数据，batch信息如下：

scirepbatch = pd.read_csv('data/other_annotations/scirep_batch.txt',index_col=0)

scirepbatch.head()

	RNA Isolation batch	library prepration day	gel cut size selection
Sample_1S1	2	22	7
Sample_1S2	2	22	8
Sample_1S3	2	22	1
Sample_2S1	2	22	2
Sample_2S2	2	22	3

5) RNA type 统计信息

scireprnastats = pd.read_csv('data/other_annotations/scirep_rna_stats.txt',index_col=0)

scireprnastats.iloc[:,:5].head()

	Sample_1S10	Sample_1S11	Sample_1S12	Sample_1S13	Sample_1S14
Y_RNA	88835	127497	145142	90106	105377
cleanN	9034303	10963430	11077344	10262615	11065325
hg38other	1462269	2044478	2624270	1476586	1806268
libSizeN	11362190	13437632	13905951	12271219	13619701
lncRNA	26733	38346	35639	25523	31489

PreviousArchive: Version 2018 Next1.2.Requirement

Last updated 4 years ago