**Contents**

1. Logistic regression for binary classification
2. Spam filtering
   - 2.1 Performance metrics
   - 2.2 Accuracy
   - 2.3 Precision and recall
   - 2.4 F1 score
   - 2.5 ROC and AUC
3. Grid search for hyperparameter tuning
4. Multi-class classification
5. Multi-label classification
   - 5.1 Multi-label performance metrics

These are study notes for *scikit-learn Machine Learning (2nd Edition)*. Logistic regression is commonly used for classification tasks.

## 1. Logistic regression for binary classification

From *Statistical Learning Methods* (《统计学习方法》): the logistic regression (LR) model is built on the logistic distribution. Let $X$ be a continuous random variable; $X$ follows a logistic distribution if it has the distribution function and density function

$$F(x) = P(X \leq x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$

$$f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma \left(1 + e^{-(x-\mu)/\gamma}\right)^2}$$

The binomial logistic regression model applies this form to a linear function of the features, $P(Y=1 \mid x) = \frac{1}{1 + e^{-(w \cdot x + b)}}$. In logistic regression, an instance is predicted as the positive class when the predicted probability exceeds the threshold, and as the negative class otherwise.

## 2. Spam filtering

Extract TF-IDF features from the messages and classify them with logistic regression.

```python
import pandas as pd

data = pd.read_csv('SMSSpamCollection', delimiter='\t', header=None)
data
data[data[0] == 'ham'][0].count()   # 4825 ham (normal) messages
data[data[0] == 'spam'][0].count()  # 747 spam messages

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X = data[1].values
y = data[0].values

from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
y = lb.fit_transform(y)  # 'ham' -> 0, 'spam' -> 1

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=520)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

pred = classifier.predict(X_test)
for i, pred_i in enumerate(pred[:5]):
    print('预测为%s, 信息为%s,真实为%s' % (pred_i, X_test_raw[i], y_test[i]))
```

```
预测为0, 信息为Aww thats the first time u said u missed me without asking if I missed u first. You DO love me! :),真实为[0]
预测为0, 信息为Poor girl cant go one day lmao,真实为[0]
预测为0, 信息为Also remember the beads dont come off. Ever.,真实为[0]
预测为0, 信息为I see the letter B on my car,真实为[0]
预测为0, 信息为My love ! How come it took you so long to leave for Zahers? I got your words on ym and was happy to see them but was sad you had left. I miss you,真实为[0]
```

### 2.1 Performance metrics

Confusion matrix:

```python
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, pred)  # renamed to avoid shadowing the imported function
plt.matshow(cm)
plt.rcParams['font.sans-serif'] = 'SimHei'  # avoid garbled Chinese characters in the plot
plt.title('混淆矩阵')
plt.ylabel('真实')
plt.xlabel('预测')
plt.colorbar()
```

### 2.2 Accuracy

```python
scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))
```

```
Accuracies: [0.94976077 0.95933014 0.96650718 0.95215311 0.95688623]
Mean accuracy: 0.9569274847434318
```

Accuracy is not a very suitable metric here: it cannot distinguish between the two kinds of error, a positive instance predicted as negative and a negative instance predicted as positive.
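Accuracy is also easy to inflate on imbalanced data like this. As a quick check (not from the book; a minimal sketch that assumes the `X_train` and `y_train` built above are still in scope), compare against a majority-class baseline that always predicts "ham":

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Majority-class baseline: always predict the most frequent class ('ham').
# Assumes X_train / y_train from the spam-filtering code above are in scope;
# .ravel() flattens the column vector produced by LabelBinarizer.
baseline = DummyClassifier(strategy='most_frequent')

acc = cross_val_score(baseline, X_train, y_train.ravel(), cv=5, scoring='accuracy')
rec = cross_val_score(baseline, X_train, y_train.ravel(), cv=5, scoring='recall')

print('Baseline accuracy: %s' % np.mean(acc))  # high, roughly the share of ham messages
print('Baseline recall:   %s' % np.mean(rec))  # 0.0, the baseline never flags any spam
```

A model that never catches a single spam message still scores roughly 87% accuracy (the share of ham messages), which is why the metrics below look at each kind of error separately.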
### 2.3 Precision and recall

See also [Hands On ML] 3. Classification: MNIST handwritten digit prediction. Looking at precision or recall in isolation is not meaningful.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

precisions = precision_score(y_test, pred)
print('Precision: %s' % precisions)
recalls = recall_score(y_test, pred)
print('Recall: %s' % recalls)
```

```
Precision: 0.9852941176470589
Recall: 0.6979166666666666
```

A precision of 0.985 means that messages flagged as spam are almost always genuinely spam; a recall of 0.698 means that about 30% of the spam messages were predicted as non-spam.

### 2.4 F1 score

The F1 score balances precision and recall.

```python
f1s = f1_score(y_test, pred)
print('F1 score: %s' % f1s)  # F1 score: 0.8170731707317074
```

### 2.5 ROC and AUC

The closer a classifier's AUC is to 1, the better; a random classifier has an AUC of 0.5.

```python
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

false_positive_rate, recall, thresholds = roc_curve(y_test, pred)
auc = roc_auc_score(y_test, pred)  # renamed to avoid shadowing the imported function

plt.title('受试者工作特性')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
```

## 3. Grid search for hyperparameter tuning

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),  # <step name>__<parameter name>
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

if __name__ == '__main__':
    df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
    X = df[1].values
    y = df[0].values
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1,
                               scoring='accuracy', cv=3)
    grid_search.fit(X_train, y_train)

    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))

    predictions = grid_search.predict(X_test)
    print('Accuracy: %s' % accuracy_score(y_test, predictions))
    print('Precision: %s' % precision_score(y_test, predictions))
    print('Recall: %s' % recall_score(y_test, predictions))
```

```
Best score: 0.985
Best parameters set:
	clf__C: 10
	clf__penalty: 'l2'
	vect__max_df: 0.5
	vect__max_features: 5000
	vect__ngram_range: (1, 2)
	vect__stop_words: None
	vect__use_idf: True
Accuracy: 0.9791816223977028
Precision: 1.0
Recall: 0.8605769230769231
```

Compared with the default hyperparameters, tuning improved recall.
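One caveat about the grid above (my note, not from the book): on scikit-learn 0.22 and later, `LogisticRegression` defaults to the `'lbfgs'` solver, which does not support the `'l1'` penalty, so those parameter combinations will fail to fit. A minimal adjustment is to pin a solver that handles both penalties:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# 'liblinear' supports both the 'l1' and 'l2' penalties used in the parameter grid
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(solver='liblinear'))
])
```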
## 4. Multi-class classification

Predicting the sentiment of movie reviews.

```python
data = pd.read_csv('./chapter5_movie_train.csv', header=0, delimiter='\t')
data
data['Sentiment'].describe()
```

```
count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64
```

On average the sentiment is fairly neutral.

```python
data['Sentiment'].value_counts() / data['Sentiment'].count()
```

```
2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64
```

About 50% of the examples carry a neutral sentiment.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('./chapter5_movie_train.csv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
```

```
Best score: 0.619
Best parameters set:
	clf__C: 10
	vect__max_df: 0.25
	vect__ngram_range: (1, 2)
	vect__use_idf: False
```

Performance metrics:

```python
predictions = grid_search.predict(X_test)

print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))
```

```
Accuracy: 0.6292323465333846
Confusion Matrix:
[[ 1013  1742   682   106    11]
 [  794  5914  6275   637    49]
 [  196  3207 32397  3686   222]
 [   28   488  6513  8131  1299]
 [    1    59   548  2388  1644]]
Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.29      0.36      3554
           1       0.52      0.43      0.47     13669
           2       0.70      0.82      0.75     39708
           3       0.54      0.49      0.52     16459
           4       0.51      0.35      0.42      4640

    accuracy                           0.63     78030
   macro avg       0.55      0.48      0.50     78030
weighted avg       0.61      0.63      0.62     78030
```

## 5. Multi-label classification

In multi-label classification, a single instance can be assigned several labels. Two common problem-transformation strategies:

- Convert each observed label set into one combined label (an instance tagged L1 and L2 gets the new label "L1 and L2", and so on). Drawback: this creates a large number of label combinations, and the model can only predict combinations that appear in the training data, so many possible label sets are never covered.
- Train one binary classifier per label ("does this instance have L1?", "does it have L2?", and so on). Drawback: the relationships between labels are ignored. (A minimal sketch of this approach follows the metrics below.)

### 5.1 Multi-label performance metrics

- Hamming loss: the average fraction of incorrectly predicted labels; 0 is best.
- Jaccard similarity: the size of the intersection of the predicted and true label sets divided by the size of their union; 1 is best.

```python
from sklearn.metrics import hamming_loss, jaccard_score
# help(jaccard_score)

print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]])))
print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]])))
print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]])))

print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]]), average=None))
print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]]), average=None))
print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]]), average=None))
```

```
0.0
0.25
0.5
[1. 1.]
[0.5 1. ]
[0. 1.]
```
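To tie §5 and §5.1 together, here is a minimal sketch (my addition, with hypothetical toy data) of the one-binary-classifier-per-label strategy, using `OneVsRestClassifier` to produce the label-indicator matrices these metrics expect:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import hamming_loss, jaccard_score

# Hypothetical toy data: 6 instances, 2 features, 3 possible labels each
X = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0],
              [0.5, 0.5], [0.2, 0.9], [0.9, 0.1]])
Y = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1],
              [1, 1, 0], [1, 0, 0], [0, 1, 1]])  # label-indicator matrix

# Binary relevance: one logistic regression fitted per label column
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)
pred = clf.predict(X)  # also a label-indicator matrix

print(hamming_loss(Y, pred))                      # fraction of wrongly predicted labels
print(jaccard_score(Y, pred, average='samples'))  # per-instance Jaccard, averaged
```

The key point is that both the predictions and the ground truth are indicator matrices, which is exactly the format `hamming_loss` and `jaccard_score` consume.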