2. Machine Learning with R
This chapter introduces basic machine learning in R. To follow the examples, install the R packages mentioned in this chapter.

1) Practice Guide

We use a Random Forest model as the example for this machine learning walkthrough. Here we only carry out a single, simple round of training and prediction; for more advanced topics such as feature selection and cross-validation, please refer to Machine Learning with Python.

1a) Load R packages

The following R packages are required before running the code:
    randomForest: builds Random Forest models
    ROCR: plots ROC curves and computes AUC
    GGally: visualizes pairwise relationships (correlations) between features
    mlbench: provides commonly used machine learning benchmark datasets
The code below loads the required R packages. The require() function checks whether each package is already installed and loads it if so; if a package is not installed, install.packages() is called to install it and library() then loads it.
for (pkg in c('randomForest', 'ROCR', 'GGally', 'mlbench')) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}
Set the random seed so that the results in this tutorial are reproducible:
set.seed(1234)

1b) Load data

We use the iris dataset built into R, which has 4 features and 3 classes. Each class corresponds to one flower species and contains 50 samples, giving 150 samples in total.
head(iris)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
4.6           3.1          1.5           0.2          setosa
5.0           3.6          1.4           0.2          setosa
5.4           3.9          1.7           0.4          setosa
For simplicity, we keep only the two classes versicolor and virginica and treat this as a binary classification problem.
# Remove the samples whose class is setosa
df <- iris[iris$Species != 'setosa', ]
# Drop the last column (Species) to produce the input matrix
all_data <- df[, 1:(ncol(df) - 1)]
# The vector of classes to predict. factor() rebuilds the factor so that only the
# 2 remaining classes (instead of the original 3) are kept as levels
all_classes <- factor(df$Species)
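As a quick sanity check, base R's table() confirms that the two remaining classes are balanced (50 samples each):

# Count the samples per class; both versicolor and virginica should have 50 samples
table(all_classes)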

1c) Dividing data

We first randomly hold out 20% of the samples as an independent validation set; the remaining 80% form the training set.
# The total number of samples is the number of rows of the input matrix
n_samples <- nrow(all_data)
# Compute the size of the training set (80%)
n_train <- floor(n_samples * 0.8)

# Generate a random permutation of all sample indices
indices <- sample(1:n_samples)
# Take the first n_train entries of the permutation as the training set indices
indices <- indices[1:n_train]

# Select the training set
# Extract from all_data the rows listed in indices, in that order, into a new matrix
train_data <- all_data[indices, ]
# Extract from all_classes the elements listed in indices, in that order, into a new vector
train_classes <- all_classes[indices]
# Select the validation set
# -indices selects the samples whose indices are NOT in indices
validation_data <- all_data[-indices, ]
validation_classes <- all_classes[-indices]

1d) Model training on training set

The following code trains a Random Forest classifier on the training set. Note that the randomForest() argument that controls the number of trees is ntree; the call below passes trees = 100, which is not a recognized argument, so the default of 500 trees is used (as the printed model summary shows). A corrected call is given after the summary.
# Train a random forest classifier ('trees' is not a valid argument name; see the note above)
rf_classifier <- randomForest(train_data, train_classes, trees = 100)
The returned object rf_classifier contains the trained model:
rf_classifier
Call:
 randomForest(x = train_data, y = train_classes, trees = 100)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 8.75%
Confusion matrix:
           versicolor virginica class.error
versicolor         36         3  0.07692308
virginica           4        37  0.09756098
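To actually train a forest of 100 trees, the number must be passed through the ntree argument. A corrected call is sketched below; it is stored in a separate variable so that the outputs shown in the rest of this tutorial (produced by the 500-tree model above) are unchanged.

# Random forest with 100 classification trees instead of the default 500
rf_classifier_100 <- randomForest(train_data, train_classes, ntree = 100)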

1e) Predict on validation set

Once the model has been trained, the predict function can be used to make predictions on the validation set and evaluate the model:
    Predict the class labels
predicted_classes <- predict(rf_classifier, validation_data)
print(validation_classes)
 [1] versicolor versicolor versicolor versicolor versicolor versicolor
 [7] versicolor versicolor versicolor versicolor versicolor virginica
[13] virginica  virginica  virginica  virginica  virginica  virginica
[19] virginica  virginica
Levels: versicolor virginica
print(predicted_classes)
        55         56         57         64         70         77         79
versicolor versicolor versicolor versicolor versicolor versicolor versicolor
        80         81         93         95        101        105        106
versicolor versicolor versicolor versicolor  virginica  virginica  virginica
       107        116        118        135        139        141
versicolor  virginica  virginica versicolor  virginica  virginica
Levels: versicolor virginica
    Build a confusion matrix
The predicted classes and the true classes can be used to build a confusion matrix:
# Define versicolor as the positive class
positive_class <- 'versicolor'
# true positive count (TP)
TP <- sum((predicted_classes == positive_class) & (validation_classes == positive_class))
# false positive count (FP)
FP <- sum((predicted_classes == positive_class) & (validation_classes != positive_class))
# false negative count (FN)
FN <- sum((predicted_classes != positive_class) & (validation_classes == positive_class))
# true negative count (TN)
TN <- sum((predicted_classes != positive_class) & (validation_classes != positive_class))
# Build a 2x2 matrix filled (by row) with the four counts computed above
confusion <- matrix(c(TP, FP, FN, TN), nrow=2, ncol=2, byrow=TRUE)
colnames(confusion) <- c('True versicolor', 'True virginica')
rownames(confusion) <- c('Predicted versicolor', 'Predicted virginica')
confusion
                      True versicolor  True virginica
Predicted versicolor               11               2
Predicted virginica                 0               7
For more about how a confusion matrix is computed, see https://en.wikipedia.org/wiki/Confusion_matrix.
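The same counts can also be cross-checked in one line with base R's table() (the rows and columns may appear in a different order than the manually built matrix above):

# Cross-tabulate predicted classes (rows) against true classes (columns)
table(Predicted = predicted_classes, True = validation_classes)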
Based on the confusion matrix we can compute evaluation metrics such as accuracy, sensitivity, positive predictive value, and specificity:
cat('accuracy =', (TP + TN)/(TP + TN + FP + FN), '\n')
cat('sensitivity =', TP/(TP + FN), '\n')
cat('positive predictive value =', TP/(TP + FP), '\n')
cat('specificity =', TN/(TN + FP), '\n')
accuracy = 0.9
sensitivity = 1
positive predictive value = 0.8461538
specificity = 0.7777778

1f) Plot ROC on validation set

An ROC curve requires two pieces of information: the true classes and the predicted probability of a given class. First, adding the type = 'prob' argument to predict returns the predicted probability of each class:
predicted_probs <- predict(rf_classifier, validation_data, type = 'prob')
predicted_probs has two columns, one per class:
head(predicted_probs)
    versicolor  virginica
55       0.970      0.030
56       0.998      0.002
57       0.956      0.044
64       0.990      0.010
70       1.000      0.000
77       0.878      0.122
Because we treat versicolor as the positive class, we use only the predicted probability of the positive class to compute the ROC curve:
# Create a 0/1 vector of the same length as the validation set, where 1 marks the positive class
validation_labels <- vector('integer', length(validation_classes))
validation_labels[validation_classes != positive_class] <- 0
validation_labels[validation_classes == positive_class] <- 1
# Use the predicted probabilities of the positive class and the true labels to create a prediction object
pred <- prediction(predicted_probs[, positive_class], validation_labels)
Plot the ROC curve with the false positive rate (fpr) on the x-axis and the true positive rate (tpr) on the y-axis:
roc <- performance(pred, 'tpr', 'fpr')
plot(roc, main = 'ROC Curve')
[Figure: ROC curve on the validation set]
Compute the area under the ROC curve (AUC):
auc <- performance(pred, 'auc')
cat('auc =', auc@y.values[[1]], '\n')
auc = 0.9848485

2) Tips

2a) Correlation between features

Before training a model, you can compute the correlations between features and remove redundant ones. Note that when the number of features is large, examining all pairwise correlations becomes computationally expensive. A small filtering sketch is given after the plot below.
GGally::ggpairs(df, columns = 1:4, ggplot2::aes(color = Species))
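For larger feature sets, a minimal sketch of correlation-based filtering could look like the following (the 0.9 cutoff is an arbitrary choice for illustration; caret's findCorrelation() automates this kind of filtering):

# Pairwise Pearson correlations between the features
cor_matrix <- cor(all_data)

# List the feature pairs whose absolute correlation exceeds 0.9;
# one feature from each such pair could be dropped as redundant
high_cor_pairs <- which(abs(cor_matrix) > 0.9 & upper.tri(cor_matrix), arr.ind = TRUE)
high_cor_pairs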

2b) Top 20 R packages for machine learning

    e1071 Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, etc. (142479 downloads)
    rpart Recursive Partitioning and Regression Trees. (135390)
    igraph A collection of network analysis tools. (122930)
    nnet Feed-forward Neural Networks and Multinomial Log-Linear Models. (108298)
    randomForest Breiman and Cutler's random forests for classification and regression. (105375)
    caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. (87151)
    kernlab Kernel-based Machine Learning Lab. (62064)
    glmnet Lasso and elastic-net regularized generalized linear models. (56948)
    ROCR Visualizing the performance of scoring classifiers. (51323)
    gbm Generalized Boosted Regression Models. (44760)
    party A Laboratory for Recursive Partitioning. (43290)
    arules Mining Association Rules and Frequent Itemsets. (39654)
    tree Classification and regression trees. (27882)
    klaR Classification and visualization. (27828)
    RWeka R/Weka interface. (26973)
    ipred Improved Predictors. (22358)
    lars Least Angle Regression, Lasso and Forward Stagewise. (19691)
    earth Multivariate Adaptive Regression Spline Models. (15901)
    CORElearn Classification, regression, feature evaluation and ordinal evaluation. (13856)
    mboost Model-Based Boosting. (13078)

3) Homework

    Following the workflow in this tutorial, train a classifier to perform binary classification on the given BreastCancer dataset, report the model's predictive performance (accuracy, precision, recall, roc_auc, etc.), and plot the ROC curve.
The BreastCancer dataset is included in the mlbench R package; alternatively, the data file can be downloaded from the corresponding folder under the Bioinformatics Tutorial / Files path on Tsinghua Cloud (see the link Files needed by this Tutorial) and then read into R. The dataset has 9 features and two classes, benign and malignant. For more information about the BreastCancer dataset, see the mlbench documentation. The meaning of each feature is listed below:
Id       Cl.thickness  Cell.size  Cell.shape  Marg.adhesion  Epith.c.size  Bare.nuclei  Bl.cromatin  Normal.nucleoli  Mitoses  Class
1000025  5             1          1           1              2             1            3            1                1        benign
1002945  5             4          4           5              7             10           3            2                1        benign
1015425  3             1          1           1              2             2            3            1                1        benign
1016277  6             8          8           1              3             4            3            7                1        benign
1017023  4             1          1           3              2             1            3            1                1        benign
1017122  8             10         10          8              7             10           9            7                1        malignant
    Cl.thickness: Clump Thickness
    Cell.size: Uniformity of Cell Size
    Cell.shape: Uniformity of Cell Shape
    Marg.adhesion: Marginal Adhesion
    Epith.c.size: Single Epithelial Cell Size
    Bare.nuclei: Bare Nuclei
    Bl.cromatin: Bland Chromatin
    Normal.nucleoli: Normal Nucleoli
    Mitoses: Mitoses
    Data preprocessing: 1) remove samples containing missing values (NA); 2) normalize the data (a loading/cleaning sketch is given after this list)
    Labels: benign is the positive class and malignant is the negative class
    Data splitting: split into training and validation sets using the 80%/20% scheme from this tutorial, and call set.seed(1234) at the very beginning of your script so that the split is reproducible
    Classifier: choose one of LR, SVM, DT, or RF
    Programming tool: R
    Submission: upload a Word/PDF attachment that records the code used in each step, reports the model's accuracy, precision, recall, roc_auc, etc. on the validation set, and includes the ROC curve
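As a starting point, a minimal sketch for loading BreastCancer from mlbench and removing samples with missing values might look like this (in mlbench the feature columns are stored as factors, so they are converted to numeric here; the normalization step is left to you):

library(mlbench)
data(BreastCancer)

# Drop the Id column and remove samples containing missing values (NA)
bc <- na.omit(BreastCancer[, -1])

# Convert the 9 feature columns (stored as factors) to numeric
bc_features <- data.frame(lapply(bc[, 1:9], function(x) as.numeric(as.character(x))))

# Class labels: benign (positive) vs. malignant (negative)
bc_classes <- factor(bc$Class)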

4) More reading

    Reference link for the code used in this tutorial
    Another version, with more compact code, can be found here.
    Code for other machine learning models:
      randomForest.R: Random Forest
      logistic_regression.R: Logistic Regression
      svm.R: SVM
      plot_result.R: Plot your training and testing performance
    caret: implements a large number of machine learning models in R and includes a GitBook tutorial.