Learning Notes: Random Forest

Date: 2023-07-11 | admin | Internet


Predicting iris species with a random forest: a code walkthrough

I. Random Forests
1. Applications
Remote sensing
Object detection
Kinect motion sensor: a training set is used to recognize body parts (where the hands are, where the feet are, how the body parts move). A random forest classifier can learn to identify a player's body parts while they dance and represent them in a form the computer understands, and the game can then score the player's moves.
2. Why use random forests?
Resistant to overfitting
High accuracy (most important)
Can handle missing data
3. What is a random forest?
An ensemble of decision trees: the final decision is the class that the majority of the trees vote for.
4. Random forests vs. decision trees
A decision tree is a tree-shaped diagram used to decide on a course of action; each branch of the tree represents a possible decision, event, or outcome.
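The majority-vote idea can be sketched in a few lines of plain Python (a toy illustration, not sklearn's actual implementation):

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class that most trees voted for."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Five hypothetical trees vote on one sample: class 1 wins 3 to 2
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # -> 1
```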

II. Walking through the code

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
np.random.seed(0)

These lines load the modules we need; sklearn.datasets is not the data itself but a module that lets us import datasets;
load_iris is the iris dataset from Ronald Fisher's 1936 paper, in which different parts of flowers were measured and the species was predicted from the measurements;
pandas gives us a data format, the DataFrame, a bit like an Excel spreadsheet;
numpy is "numerical Python" and lets us do various kinds of math on arrays;
np.random.seed(0) does not display anything, but it fixes the random seed to 0, making every "random" result reproducible.
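What fixing the seed does can be checked directly: re-seeding reproduces exactly the same "random" numbers (a minimal sketch using numpy):

```python
import numpy as np

np.random.seed(0)               # fix the seed: results become reproducible
a = np.random.uniform(0, 1, 3)

np.random.seed(0)               # re-seeding gives the identical sequence
b = np.random.uniform(0, 1, 3)

print(np.array_equal(a, b))     # -> True
```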

#creating an object called iris with the iris data
iris=load_iris()
#creating a dataframe with the four feature variables
df=pd.DataFrame(iris.data,columns=iris.feature_names)
df.head()

df defines a DataFrame holding the iris data we want to look at;
if it is unclear where the data comes from, print(iris) shows the whole dataset object, including 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'];
df.head() is one of pandas' handy DataFrame methods: it prints the first five rows of the dataset together with the column headers.

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       ...,
       [5.9, 3. , 5.1, 1.8]]),
 'target': array([0, 0, ..., 0, 1, 1, ..., 1, 2, 2, ..., 2]),
 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n... 150 instances (50 in each of three classes), 4 numeric predictive attributes, no missing values ...',
 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
 'filename': 'F:\\Anaconda3\\lib\\site-packages\\sklearn\\datasets\\data\\iris.csv'}

(output of print(iris), abridged)

#adding a new column for the species name
df['species']=pd.Categorical.from_codes(iris.target,iris.target_names)
#viewing the top 5 rows
df.head()

df['species'] is the key line that creates the new column; in the iris output above, target values 0, 1, 2 correspond to the three species in 'target_names': array(['setosa', 'versicolor', 'virginica']).

#creating test and train data
df['is_train']=np.random.uniform(0,1,len(df))<=.75
#viewing the top 5 rows
df.head()
The second line draws a uniform random number between 0 and 1 for every row; rows whose value is at most 0.75 become True, the rest False. On average about 75% of the rows end up in the training set and the remaining ~25% in the test set.
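The split can be tried on its own, independent of the iris data (a self-contained sketch; note that 75% is only the expected fraction, and the exact counts vary with the seed):

```python
import numpy as np

np.random.seed(0)
n = 150
# each row gets a uniform draw; True (~75% of rows) marks a training row
is_train = np.random.uniform(0, 1, n) <= .75
print("train:", is_train.sum(), "test:", (~is_train).sum())
```

For an exact split fraction, sklearn's train_test_split (in sklearn.model_selection) is the more common choice.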

#creating dataframes with the training rows and the test rows
train,test=df[df['is_train']==True],df[df['is_train']==False]
#show the number of observations for the test and training dataframes
print("number of observations in the training data:",len(train))
print("number of observations in the test data:",len(test))

The boolean is_train column splits the data into two DataFrames, train and test.

#create a list of the feature column names
features=df.columns[:4]
#view features
print(features)

This takes the first four column names of df, i.e. the names of the feature columns, for later use.
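A self-contained illustration of slicing the column index (using the iris feature names and a single made-up row of data):

```python
import pandas as pd

df = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]],
                  columns=['sepal length (cm)', 'sepal width (cm)',
                           'petal length (cm)', 'petal width (cm)'])
features = df.columns[:4]   # slice of the column index: the four feature names
print(list(features))
```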

#converting each species name into digits
y=pd.factorize(train['species'])[0]
#view target
print(y) 

This converts each flower's species label into an integer code;
the training set's species column has three values: 0, 1 and 2 stand for setosa, versicolor and virginica. pd.factorize returns a tuple, and the [0] selects the array of codes; with this split the training set has 118 rows.
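pd.factorize can be tried on a small hand-made Series to see exactly what it returns (a toy example, not the real training data):

```python
import pandas as pd

species = pd.Series(['setosa', 'setosa', 'versicolor', 'virginica', 'versicolor'])
codes, uniques = pd.factorize(species)
print(codes)          # integer code per row, in order of first appearance
print(list(uniques))  # the labels the codes refer back to
```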


#creating a random forest classifier
clf=RandomForestClassifier(n_jobs=2 , random_state=0)
#training the classifier
clf.fit(train[features],y)

n_jobs sets how many CPU cores sklearn may use in parallel; it affects speed, not the result. train[features] is the actual training data and y is the target labels. fit() does not print anything by itself; it just trains the model and returns the classifier, so seeing no output (and no error) here is normal.
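That fit() appears to "do nothing" can be verified on toy data: it trains the model in place and returns the estimator itself (a sketch with made-up, trivially separable data, not the iris set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy data: the class label equals the first feature
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5, dtype=float)
y = np.array([0, 0, 1, 1] * 5)

clf = RandomForestClassifier(n_jobs=2, random_state=0)
model = clf.fit(X, y)     # fit() trains in place and returns the estimator
print(model is clf)       # -> True: there is no separate output to expect
```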

#applying the trained classifier to the test
clf.predict(test[features])
test[features]

Run the prediction on the test set.

#viewing the predicted probabilities of observations 11-20 of the test set
clf.predict_proba(test[features])[10:20]

As shown above, test observations 11-20 are predicted as 0, 0, 0, 1, 1, 1, 2, 2, 1, 1; this call shows the vote fractions (per-class probabilities) behind each of those predictions.
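The relationship between predict_proba and predict can be sketched with made-up probabilities: predict simply picks the class with the largest vote fraction:

```python
import numpy as np

# hypothetical per-class vote fractions for three test samples
proba = np.array([[0.9, 0.1, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.1, 0.2, 0.7]])
pred = proba.argmax(axis=1)   # the class with the highest probability wins
print(pred)                   # -> [0 1 2]
```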

#mapping names for the plants for each predicted plant class
preds=iris.target_names[clf.predict(test[features])]
#view the predicted species for the first 25 observations
preds[0:25]

The predictions for the test set;
indexing iris.target_names with the predicted 0/1/2 codes converts them back into species names.

#viewing the actual species for the first observations
test['species'].head()

The actual labels of the test set, for comparison with the predictions.

#creating confusion matrix
pd.crosstab(test['species'],preds,rownames=['Actual Species'],colnames=['Predicted Species'])

Comparing the predictions against the actual labels lets us judge the random forest classifier's performance;
crosstab is a cross-tabulation function that takes two label series and builds a confusion matrix. The table below shows 13 + 5 + 12 = 30 correct predictions, with 2 flowers that are actually versicolor misclassified as virginica, so the accuracy is 30/32 ≈ 94%.
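The accuracy calculation described above can be reproduced with sklearn.metrics, using hypothetical labels constructed to match the counts in the text:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# hypothetical actual vs. predicted labels matching the counts above:
# 13 setosa right, 5 versicolor right + 2 wrong, 12 virginica right
actual    = [0]*13 + [1]*7 + [2]*12
predicted = [0]*13 + [1]*5 + [2]*2 + [2]*12

cm = confusion_matrix(actual, predicted)
print(cm)
print(accuracy_score(actual, predicted))  # (13+5+12)/32 = 0.9375
```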

preds=iris.target_names[clf.predict([[5.0,3.8,1.4,2.0],[5.0,3.6,1.6,2.3]])]
preds

Now you can predict whatever measurements you like!