In this analysis I used random forest regression, together with data standardization and hyperparameter tuning. Here, I use a random forest classifier to perform a binary classification of good versus not-so-good wines.

First, import the packages:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Load the data:

```python
data = pd.read_csv('winequality-red.csv')
data.head()
data.describe()
```

The columns are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality.

Every column has 1599 values, so there are no missing values. Let's check whether there are duplicate rows:

```python
extra = data[data.duplicated()]
extra.shape
```

There are 240 duplicate rows, but we won't delete them for now, because the quality ratings were given by different tasters.

Data visualization:

```python
sns.set()
data.hist(figsize=(10, 10), color='red')
plt.show()
```

Only quality is a discrete variable, and it is concentrated mainly at 5 and 6. Next, let's look at the correlations between the variables:

```python
colormap = plt.cm.viridis
plt.figure(figsize=(12, 12))
plt.title('Correlation of Features', y=1.05, size=15)
sns.heatmap(data.astype(float).corr(), cmap=colormap, linewidths=0.1, vmax=1.0,
            square=True, linecolor='white', annot=True)
```

Observation: alcohol has the highest correlation with wine quality, followed by the various acidity measures, sulphates, density, and chlorides.

Use the classifier to split the wines into two groups: wines with quality > 5 count as "good wine":

```python
y = data.quality                  # set 'quality' as target
X = data.drop('quality', axis=1)  # the rest are features
print(y.shape, X.shape)           # check correctness

# Create a new binary target y1: 1 for quality > 5, 0 otherwise
y1 = (y > 5).astype(int)
y1.head()

# plot histogram
ax = y1.plot.hist(color='green')
ax.set_title('Wine quality distribution', fontsize=14)
ax.set_xlabel('aggregated target value')
```

Train a prediction model with the random forest classifier:

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.metrics import confusion_matrix
```

Split the data into training and test sets:

```python
seed = 8  # set seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y1, test_size=0.2,
                                                    random_state=seed)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Train and evaluate the random forest classifier with cross-validation:

```python
# Instantiate the Random Forest Classifier
RF_clf = RandomForestClassifier(random_state=seed)
RF_clf

# Compute k-fold cross-validation on the training set and look at the mean accuracy
cv_scores = cross_val_score(RF_clf, X_train, y_train, cv=10, scoring='accuracy')
print('The accuracy scores for the iterations are {}'.format(cv_scores))
print('The mean accuracy score is {}'.format(cv_scores.mean()))
```

Make predictions:

```python
RF_clf.fit(X_train, y_train)
pred_RF = RF_clf.predict(X_test)

# Print 5 results to see
for i in range(0, 5):
    print('Actual wine quality is', y_test.iloc[i], 'and predicted is', pred_RF[i])
```

There is one error among the first five predictions. Let's look at the metrics:

```python
print(accuracy_score(y_test, pred_RF))
print(log_loss(y_test, pred_RF))
print(confusion_matrix(y_test, pred_RF))
```

There are 81 misclassifications in total. Compared with the logistic regression classifier, the random forest classifier performs better. Let's tune the random forest classifier's hyperparameters:

```python
from sklearn.model_selection import GridSearchCV

grid_values = {'n_estimators': [50, 100, 200],
               'max_depth': [None, 30, 15, 5],
               'max_features': ['auto', 'sqrt', 'log2'],
               'min_samples_leaf': [1, 20, 50, 100]}

grid_RF = GridSearchCV(RF_clf, param_grid=grid_values, scoring='accuracy')
grid_RF.fit(X_train, y_train)
grid_RF.best_params_
```

Apart from the number of estimators, the recommended values are the defaults:

```python
RF_clf = RandomForestClassifier(n_estimators=100, random_state=seed)
RF_clf.fit(X_train, y_train)
pred_RF = RF_clf.predict(X_test)
print(accuracy_score(y_test, pred_RF))
print(log_loss(y_test, pred_RF))
print(confusion_matrix(y_test, pred_RF))
```

With hyperparameter tuning, the accuracy of the random forest classifier has improved to 82.5%, and the log loss has dropped accordingly. The number of misclassifications has also fallen to 56.

Using the random forest classifier as a basic recommender that classifies a red wine as "recommended" (quality 6 and above) or "not recommended" (quality 5 and below), a prediction accuracy of 82.5% seems reasonable.
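As a minimal sketch of that recommender idea (the `recommend_wine` helper and the sample row below are illustrative assumptions, not part of the original analysis), the tuned classifier can be wrapped so that its 0/1 output is mapped back to a "recommended" / "not recommended" label for a single wine:

```python
# Minimal sketch: wrap the tuned classifier as a recommend / don't-recommend helper.
# recommend_wine and the sample row are illustrative assumptions; the input must be a
# one-row DataFrame with the same feature columns (and order) as X used for training.
def recommend_wine(clf, sample_df):
    """Return 'recommended' if the classifier predicts quality > 5, else 'not recommended'."""
    label = clf.predict(sample_df)[0]
    return 'recommended' if label == 1 else 'not recommended'

# Example usage with one row from the test set
sample = X_test.iloc[[0]]
print(recommend_wine(RF_clf, sample))
```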