## Contents

1. Types of unsupervised learning
2. Challenges of unsupervised learning
3. Preprocessing and scaling: kinds of preprocessing; applying data transformations; scaling training and test data the same way; the effect of preprocessing on supervised learning
4. Dimensionality reduction, feature extraction and manifold learning: principal component analysis (PCA); non-negative matrix factorization (NMF); manifold learning with t-SNE
5. Clustering: k-means; agglomerative clustering; DBSCAN; comparing and evaluating clustering algorithms; summary of clustering methods

## 1. Types of unsupervised learning

Two kinds of unsupervised learning:

- Unsupervised transformations of a dataset
  - algorithms that create a new representation of the data
    - the new representation may be easier for people or for other machine learning algorithms to understand
  - common applications
    - dimensionality reduction: take a high-dimensional representation of the data with many features and find a new way to represent it that summarizes the essential characteristics with fewer features; a common use is reducing the data to two dimensions for visualization
    - finding the parts or components that "make up" the data: a common use is topic extraction from collections of text documents — the task is to find the unknown topics that are discussed and to learn which topics appear in each document, for example to track discussions on social media
- Clustering
  - partitions the data into distinct groups of similar items
  - common application: smart grouping of photos in an album — extract all faces and put similar-looking faces into the same group

## 2. Challenges of unsupervised learning

- The main challenge is evaluating whether the algorithm learned something useful.
- Unsupervised algorithms are usually applied to data that does not contain any label information, so we do not know what the correct output should be; there is no way to "tell" the algorithm what we are looking for.
- Often the only way to evaluate the result of an unsupervised algorithm is to inspect it manually.
- As a consequence, unsupervised algorithms are mostly used in an exploratory setting, when a data scientist wants to understand the data better, rather than as part of a larger automatic system.
- Another common application is as a preprocessing step for supervised algorithms:
  - it can improve the accuracy of the supervised algorithm
  - it can reduce memory consumption and computation time

## 3. Preprocessing and scaling

For algorithms that are sensitive to the scaling of the data, the features can be adjusted so that the representation is better suited to these algorithms. Usually this is a simple per-feature scaling and shifting of the data.

### 3.1 Different kinds of preprocessing

```python
from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_scaling()

plt.tight_layout()
plt.show()
```

- Left panel: a synthetic two-class dataset with two features
  - the first feature lies between 10 and 15
  - the second feature lies between 1 and 9
- Right panels: four ways to transform the data
  - StandardScaler
    - ensures that each feature has mean 0 and variance 1
    - brings all features to the same magnitude
    - does not enforce any particular minimum or maximum value
  - RobustScaler
    - ensures that the statistical properties of each feature lie in the same range
    - uses the median and the quartiles
    - ignores data points that are very different from the rest (outliers)
  - MinMaxScaler
    - shifts all features so that they lie exactly between 0 and 1
  - Normalizer
    - scales each data point so that its feature vector has Euclidean length 1
    - projects every data point onto a sphere of radius 1
    - each data point is scaled by a different factor
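The four scalers differ only in the statistics they use. As a minimal sketch (not part of the original notes; the toy array below is made up for illustration), they can be compared side by side on the same data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

# tiny made-up dataset: the last row acts as an outlier in the second feature
X = np.array([[1.0, 10.0],
              [2.0, 12.0],
              [3.0, 14.0],
              [4.0, 100.0]])

for scaler in [StandardScaler(), RobustScaler(), MinMaxScaler(), Normalizer()]:
    # fit_transform learns the per-feature statistics and applies the transformation
    print(scaler.__class__.__name__)
    print(scaler.fit_transform(X))
```

StandardScaler and MinMaxScaler are pulled strongly by the outlier in the second feature, RobustScaler (median and quartiles) much less so, and Normalizer rescales each row to unit Euclidean length instead of working per feature.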
### 3.2 Applying data transformations

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

scaler = MinMaxScaler()
scaler.fit(X_train)

# transform the training data
X_train_scaled = scaler.transform(X_train)

# print dataset properties after scaling
print("per-feature minimum after scaling:\n {}".format(X_train_scaled.min(axis=0)))
# per-feature minimum after scaling:
# [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
#  0. 0. 0. 0. 0. 0.]
print("per-feature maximum after scaling:\n {}".format(X_train_scaled.max(axis=0)))
# per-feature maximum after scaling:
# [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
#  1. 1. 1. 1. 1. 1.]

# transform the test data
X_test_scaled = scaler.transform(X_test)

# print test-set properties after scaling
print("per-feature minimum after scaling:\n {}".format(X_test_scaled.min(axis=0)))
# per-feature minimum after scaling:
# [ 0.0336031   0.0226581   0.03144219  0.01141039  0.14128374  0.04406704
#   0.          0.          0.1540404  -0.00615249 -0.00137796  0.00594501
#   0.00430665  0.00079567  0.03919502  0.0112206   0.          0.
#  -0.03191387  0.00664013  0.02660975  0.05810235  0.02031974  0.00943767
#   0.1094235   0.02637792  0.          0.         -0.00023764 -0.00182032]

print("per-feature maximum after scaling:\n {}".format(X_test_scaled.max(axis=0)))
# per-feature maximum after scaling:
# [0.9578778  0.81501522 0.95577362 0.89353128 0.81132075 1.21958701
#  0.87956888 0.9333996  0.93232323 1.0371347  0.42669616 0.49765736
#  0.44117231 0.28371044 0.48703131 0.73863671 0.76717172 0.62928585
#  1.33685792 0.39057253 0.89612238 0.79317697 0.84859804 0.74488793
#  0.9154725  1.13188961 1.07008547 0.92371134 1.20532319 1.63068851]
```

Because the scaler was fitted on X_train, all features of the scaled training set lie between 0 and 1, whereas the minima and maxima of the scaled test set fall outside that range.

### 3.3 Scaling training and test data the same way

```python
import mglearn
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2)
X_train, X_test = train_test_split(X, random_state=5, test_size=.1)

# plot the training and test sets
fig, axes = plt.subplots(1, 3, figsize=(13, 4))
axes[0].scatter(X_train[:, 0], X_train[:, 1], c=mglearn.cm2(0), label="Training set", s=60)
axes[0].scatter(X_test[:, 0], X_test[:, 1], marker='^', c=mglearn.cm2(1), label="Test set", s=60)
axes[0].legend(loc='upper left')
axes[0].set_title("Original Data")

# scale the data with MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# visualize the properly scaled data
axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=mglearn.cm2(0), label="Training set", s=60)
axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^', c=mglearn.cm2(1), label="Test set", s=60)
axes[1].set_title("Scaled Data")

# rescale the test set separately (wrong!)
test_scaler = MinMaxScaler()
test_scaler.fit(X_test)

X_test_scaled_badly = test_scaler.transform(X_test)

# visualize the improperly scaled data
axes[2].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=mglearn.cm2(0), label="training set", s=60)
axes[2].scatter(X_test_scaled_badly[:, 0], X_test_scaled_badly[:, 1], marker='^', c=mglearn.cm2(1), label="test set", s=60)
axes[2].set_title("Improperly Scaled Data")

for ax in axes:
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")

plt.tight_layout()
plt.show()
```

- Left: the unscaled two-dimensional dataset
- Center: the same data scaled with a MinMaxScaler fitted on the training set
- Right: training set and test set scaled separately (incorrect)

#### Shortcuts and efficient alternatives

```python
scaler.fit(X).transform(X)
# is equivalent to
scaler.fit_transform(X)
```

### 3.4 The effect of preprocessing on supervised learning

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

svm = SVC(C=100)
svm.fit(X_train, y_train)

print("test score: {:.3f}".format(svm.score(X_test, y_test)))
# test score: 0.944

# preprocessing using 0-1 scaling
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# learn an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)

# score on the scaled test set
print("test score: {:.3f}".format(svm.score(X_test_scaled, y_test)))
# test score: 0.965

# preprocessing using zero-mean, unit-variance scaling
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# learn an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)

# score on the scaled test set
print("test score: {:.3f}".format(svm.score(X_test_scaled, y_test)))
# test score: 0.958
```
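To make it harder to accidentally fit the scaler on the test data, the scaler and the model can be chained. A minimal sketch using a scikit-learn pipeline (not part of the original notes, but standard scikit-learn usage):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# the pipeline fits the scaler on the training data only,
# then applies the same transformation to the test data inside score()
pipe = make_pipeline(StandardScaler(), SVC(C=100))
pipe.fit(X_train, y_train)
print("test score: {:.3f}".format(pipe.score(X_test, y_test)))
```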
## 4. Dimensionality reduction, feature extraction and manifold learning

### 4.1 Principal component analysis (PCA)

- A method that rotates the dataset
  - the rotated features are statistically uncorrelated
  - after the rotation, a subset of the new features is usually selected according to how important they are for explaining the data

```python
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_pca_illustration()

plt.tight_layout()
plt.show()
```

- Top left: the original data points
  - the algorithm finds the direction of maximum variance, "Component 1" — the direction that contains the most information
  - it then finds the direction that is orthogonal to the first and contains the most remaining information
  - the directions found in this way are called principal components: the main directions of variance in the data
  - there are as many principal components as original features
- Top right: the data rotated so that the first principal component is parallel to the x-axis and the second to the y-axis
  - before the rotation the mean is subtracted, so the transformed data is centered around zero
- Bottom left: only the first principal component is kept, reducing the two-dimensional data to one dimension
- Bottom right: the rotation is undone and the mean is added back; this removes noise from the data and visualizes the information kept by the principal component

#### 4.1.1 Applying PCA to the cancer dataset for visualization

Compute a per-feature histogram for each of the two classes:

```python
import mglearn
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

fig, axes = plt.subplots(15, 2, figsize=(10, 20))
malignant = cancer.data[cancer.target == 0]
benign = cancer.data[cancer.target == 1]

ax = axes.ravel()

for i in range(30):
    _, bins = np.histogram(cancer.data[:, i], bins=50)
    ax[i].hist(malignant[:, i], bins=bins, color=mglearn.cm3(0), alpha=.5)
    ax[i].hist(benign[:, i], bins=bins, color=mglearn.cm3(2), alpha=.5)
    ax[i].set_title(cancer.feature_names[i])
    ax[i].set_yticks(())

ax[0].set_xlabel("Feature magnitude")
ax[0].set_ylabel("Frequency")
ax[0].legend(["malignant", "benign"], loc="best")

fig.tight_layout()
fig.show()
```

Use PCA to capture the main interactions between features. First scale the data with StandardScaler:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

scaler = StandardScaler()
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)
```

Learn and apply the PCA transformation. By default, PCA only rotates (and shifts) the data and keeps all principal components:

```python
from sklearn.decomposition import PCA

# keep the first two principal components of the data
pca = PCA(n_components=2)  # n_components: number of components to keep

# fit the PCA model to the breast cancer data
pca.fit(X_scaled)

# transform the data onto the first two principal components
X_pca = pca.transform(X_scaled)

print("Original shape: {}".format(str(X_scaled.shape)))
# Original shape: (569, 30)

print("Reduced shape: {}".format(str(X_pca.shape)))
# Reduced shape: (569, 2)
```

Plot the first two principal components:

```python
import mglearn
from matplotlib import pyplot as plt

# (uses X_pca from the previous snippet)
plt.figure(figsize=(8, 8))
mglearn.discrete_scatter(X_pca[:, 0], X_pca[:, 1], cancer.target)

plt.legend(cancer.target_names, loc="best")
plt.gca().set_aspect("equal")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")

plt.tight_layout()
plt.show()
```

A downside of PCA is that the two axes of this plot are hard to interpret. The principal components are stored in the `components_` attribute of the PCA object; visualize the coefficients with a heat map:

```python
plt.matshow(pca.components_, cmap='viridis')
plt.yticks([0, 1], ["First component", "Second component"])
plt.colorbar()
plt.xticks(range(len(cancer.feature_names)), cancer.feature_names, rotation=60, ha='left')

plt.xlabel("Feature")
plt.ylabel("Principal components")
plt.tight_layout()
plt.show()
```
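How many components are worth keeping can be judged from `explained_variance_ratio_`, which reports the fraction of the total variance captured by each principal component. A short sketch on the scaled cancer data (not part of the original notes):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)

pca = PCA().fit(X_scaled)  # keep all 30 components

# fraction of the variance explained by each component, and the cumulative sum
print(pca.explained_variance_ratio_[:5])
print(np.cumsum(pca.explained_variance_ratio_)[:5])
```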
#### 4.1.2 Eigenfaces for feature extraction

- Idea: find a representation that is better suited to analysis than the given raw representation
- Example application: images
  - an image is made up of pixels, usually stored as RGB intensities

```python
from matplotlib import pyplot as plt
from sklearn.datasets import fetch_lfw_people
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

image_shape = people.images[0].shape

fix, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})

for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])

print("people.images.shape: {}".format(people.images.shape))
# people.images.shape: (3023, 87, 65)
# 3023 images of 87 x 65 pixels

print("Number of classes: {}".format(len(people.target_names)))
# Number of classes: 62
# 62 different people

plt.tight_layout()
plt.show()
```

The dataset is somewhat skewed: the number of samples differs a lot between the classes.

```python
import numpy as np

# count how often each target appears
counts = np.bincount(people.target)

# print the counts next to the target names
for i, (count, name) in enumerate(zip(counts, people.target_names)):
    print("{0:25} {1:3}".format(name, count), end='   ')
    if (i + 1) % 3 == 0:
        print()
# Alejandro Toledo           39   Alvaro Uribe               35   Amelie Mauresmo            21
# Andre Agassi               36   Angelina Jolie             20   Ariel Sharon               77
# ... (62 names in total; most people have a few dozen images,
#      while Colin Powell has 236 and George W Bush has 530)
```

To make the data less skewed, take at most 50 images of each person:

```python
mask = np.zeros(people.target.shape, dtype=np.bool_)

for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1

X_people = people.data[mask]
y_people = people.target[mask]

# scale the grayscale values to be between 0 and 1 instead of 0 and 255,
# for better numeric stability
X_people = X_people / 255.
```

Use a single-nearest-neighbor classifier (1-NN), which looks for the face most similar to the face being classified:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# split the data into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)

# build a KNeighborsClassifier using one neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

print("test score: {:.3f}".format(knn.score(X_test, y_test)))
# test score: 0.215
```
Use PCA with the whitening option enabled, which rescales the principal components to the same scale (the result is the same as applying a StandardScaler after the transformation):

```python
from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_pca_whitening()

plt.tight_layout()
plt.show()
```

```python
from sklearn.decomposition import PCA

# (X_train, X_test, y_train, y_test as prepared above)
pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)  # extract and fit the first 100 principal components

X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

print("X_train_pca.shape: {}".format(X_train_pca.shape))
# X_train_pca.shape: (1547, 100)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_pca, y_train)

print("test score: {:.3f}".format(knn.score(X_test_pca, y_test)))
# test score: 0.297
```

Visualize the principal components:

```python
# fit a PCA without whitening to look at the components themselves
pca = PCA(n_components=100, random_state=0).fit(X_train)

image_shape = people.images[0].shape

fix, axes = plt.subplots(3, 5, figsize=(15, 12), subplot_kw={'xticks': (), 'yticks': ()})

for i, (component, ax) in enumerate(zip(pca.components_, axes.ravel())):
    ax.imshow(component.reshape(image_shape), cmap='viridis')
    ax.set_title("{}. component".format(i + 1))

plt.tight_layout()
plt.show()
```

The new feature values found by the PCA rotation let us express a test point as a weighted sum of the principal components, with x_0, x_1, ... as the coefficients of that data point along the components.
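This "weighted sum" view can be checked directly: for a PCA without whitening, `inverse_transform` is exactly the mean plus the coefficients times the components. A minimal sketch on the cancer data (assuming `whiten=False`; not part of the original notes):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)

pca = PCA(n_components=2).fit(X_scaled)
X_pca = pca.transform(X_scaled)  # the coefficients x_0, x_1 of each data point

# reconstruct each point as mean + x_0 * component_0 + x_1 * component_1
X_approx = pca.mean_ + np.dot(X_pca, pca.components_)

print(np.allclose(X_approx, pca.inverse_transform(X_pca)))
# True
```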
Transforming the face data: reduce the data to only a few principal components, then rotate back into the original space. Going back to the original feature space is done with `inverse_transform`:

```python
import mglearn
from matplotlib import pyplot as plt

# (X_train, X_test and image_shape as prepared above)
mglearn.plots.plot_pca_faces(X_train, X_test, image_shape)

plt.tight_layout()
plt.show()
```

Use the first two principal components to visualize all faces of the dataset in a scatter plot:

```python
# (whitened PCA with 100 components fitted on X_train, as above)
mglearn.discrete_scatter(X_train_pca[:, 0], X_train_pca[:, 1], y_train)

plt.xlabel("First principal component")
plt.ylabel("Second principal component")

plt.tight_layout()
plt.show()
```

### 4.2 Non-negative matrix factorization (NMF)

- extracts useful features
- writes each data point as a weighted sum of some components
- requires both the components and the coefficients to be non-negative
- can only be applied to data where every feature is non-negative
- particularly useful for data that is created as an addition of several independent sources
  - an audio track with multiple people speaking
  - music with many instruments

#### 4.2.1 Applying NMF to synthetic data

```python
from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_nmf_illustration()

plt.tight_layout()
plt.show()
```

- Left: with two components, all data points can be written as a positive combination of them
- Right: with a single component, NMF finds a component that points toward the mean
- NMF uses a random initialization, so different random seeds can produce different results
#### 4.2.2 Applying NMF to face images

- The main parameter of NMF is the number of components to extract
  - it must be smaller than the number of input features

Effect of the number of components on how well NMF reconstructs the data (shown here with `mglearn.plots.plot_nmf_faces`, which is presumably the intended call; the notes originally repeated the `plot_nmf_illustration` snippet at this point):

```python
from matplotlib import pyplot as plt
import mglearn

# (X_train, X_test and image_shape as prepared in section 4.1.2)
mglearn.plots.plot_nmf_faces(X_train, X_test, image_shape)

plt.tight_layout()
plt.show()
```

The reconstructions are slightly worse than with PCA.

Extract some components and inspect the data:

```python
from matplotlib import pyplot as plt
from sklearn.decomposition import NMF

nmf = NMF(n_components=15, random_state=0)
nmf.fit(X_train)

X_train_nmf = nmf.transform(X_train)
X_test_nmf = nmf.transform(X_test)

fix, axes = plt.subplots(3, 5, figsize=(15, 12), subplot_kw={'xticks': (), 'yticks': ()})
for i, (component, ax) in enumerate(zip(nmf.components_, axes.ravel())):
    ax.imshow(component.reshape(image_shape))
    ax.set_title("{}. component".format(i))

plt.tight_layout()
plt.show()
```

Plot the images for which components 4 and 7 have the largest coefficients:

```python
import numpy as np

compn = 4
# sort by the 4th component, plot the first 10 images
inds = np.argsort(X_train_nmf[:, compn])[::-1]
fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})
for i, (ind, ax) in enumerate(zip(inds, axes.ravel())):
    ax.imshow(X_train[ind].reshape(image_shape))

plt.tight_layout()
plt.show()

compn = 7
# sort by the 7th component, plot the first 10 images
inds = np.argsort(X_train_nmf[:, compn])[::-1]
fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})
for i, (ind, ax) in enumerate(zip(inds, axes.ravel())):
    ax.imshow(X_train[ind].reshape(image_shape))

plt.tight_layout()
plt.show()
```

Working with a synthetic signal:

```python
import mglearn
from matplotlib import pyplot as plt

S = mglearn.datasets.make_signals()

plt.figure(figsize=(6, 1))
plt.plot(S, '-')
plt.xlabel("Time")
plt.ylabel("Signal")

plt.tight_layout()
plt.show()
```

Decompose the mixed signal back into the original sources:

```python
import mglearn
import numpy as np
from matplotlib import pyplot as plt
from sklearn.decomposition import NMF, PCA

S = mglearn.datasets.make_signals()

# mix the data into a 100-dimensional state
A = np.random.RandomState(0).uniform(size=(100, 3))
X = np.dot(S, A.T)

# recover the signals with NMF
nmf = NMF(n_components=3, random_state=42)
S_ = nmf.fit_transform(X)

# recover the signals with PCA
pca = PCA(n_components=3)
H = pca.fit_transform(X)

models = [X, S, S_, H]
names = ['Observations (first three measurements)',
         'True sources',
         'NMF recovered signals',
         'PCA recovered signals']
fig, axes = plt.subplots(4, figsize=(8, 4), gridspec_kw={'hspace': .5}, subplot_kw={'xticks': (), 'yticks': ()})

for model, name, ax in zip(models, names, axes):
    ax.set_title(name)
    ax.plot(model[:, :3], '-')

plt.tight_layout()
plt.show()
```

### 4.3 Manifold learning with t-SNE

- Manifold learning algorithms
  - mainly useful for visualization
  - allow much more complex mappings and often give better visualizations
  - compute a new representation of the training data but do not allow transforming new data: they can only transform the data they were trained on, so they cannot be applied to a separate test set
- t-SNE
  - idea: find a two-dimensional representation of the data that preserves the distances between points as well as possible
  - procedure
    - start with a random two-dimensional representation of each data point
    - then try to make points that are close in the original feature space closer, and points that are far apart in the original feature space farther apart
  - puts more emphasis on points that are close by: it tries to preserve the information about which points are neighbors
  - based purely on how close points are in the original space, it can separate the classes cleanly

Load the handwritten digits dataset:

```python
from matplotlib import pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

fig, axes = plt.subplots(2, 5, figsize=(10, 5), subplot_kw={'xticks': (), 'yticks': ()})
for ax, img in zip(axes.ravel(), digits.images):
    ax.imshow(img)

plt.tight_layout()
plt.show()
```

Visualize the data reduced to two dimensions with PCA: plot the first two principal components and color each point by its class:

```python
from matplotlib import pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

# build a PCA model
pca = PCA(n_components=2)
pca.fit(digits.data)

# transform the digits data onto the first two principal components
digits_pca = pca.transform(digits.data)
colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525",
          "#A83683", "#4E655E", "#853541", "#3A3120", "#535D8E"]

plt.figure(figsize=(10, 10))
plt.xlim(digits_pca[:, 0].min(), digits_pca[:, 0].max())
plt.ylim(digits_pca[:, 1].min(), digits_pca[:, 1].max())

for i in range(len(digits.data)):
    # plot the digits as text instead of using scatter points
    plt.text(digits_pca[i, 0], digits_pca[i, 1], str(digits.target[i]),
             color=colors[digits.target[i]],
             fontdict={'weight': 'bold', 'size': 9})

plt.xlabel("First principal component")
plt.ylabel("Second principal component")

plt.tight_layout()
plt.show()
```

The classes 0, 4 and 6 are separated relatively well.
Apply t-SNE to the dataset. The TSNE class has no `transform` method; call `fit_transform` instead, which builds the model and immediately returns the transformed data:

```python
from matplotlib import pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()

tsne = TSNE(random_state=42)

# use fit_transform instead of fit, as TSNE has no transform method
digits_tsne = tsne.fit_transform(digits.data)
colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525",
          "#A83683", "#4E655E", "#853541", "#3A3120", "#535D8E"]

plt.figure(figsize=(10, 10))
plt.xlim(digits_tsne[:, 0].min(), digits_tsne[:, 0].max() + 1)
plt.ylim(digits_tsne[:, 1].min(), digits_tsne[:, 1].max() + 1)

for i in range(len(digits.data)):
    # plot the digits as text instead of using scatter points
    plt.text(digits_tsne[i, 0], digits_tsne[i, 1], str(digits.target[i]),
             color=colors[digits.target[i]],
             fontdict={'weight': 'bold', 'size': 9})

plt.xlabel("t-SNE feature 0")
plt.ylabel("t-SNE feature 1")

plt.tight_layout()
plt.show()
```

Most classes form a single dense group.

## 5. Clustering

- The task of partitioning the dataset into groups, called clusters
- Goal: split the data so that points within one cluster are very similar and points in different clusters are very different
- The algorithm assigns (or predicts) a number to each data point that indicates which cluster it belongs to

### 5.1 k-means clustering

- Tries to find cluster centers that are representative of certain regions of the data
- Steps (a NumPy sketch of this loop follows the examples below):
  1. assign each data point to the closest cluster center
  2. set each cluster center to the mean of the data points assigned to it
  3. repeat these two steps until the assignment of points no longer changes

Illustration of the algorithm:

```python
from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_kmeans_algorithm()

plt.tight_layout()
plt.show()
```

- Triangles are cluster centers, circles are data points, and colors indicate cluster membership
- Looking for three clusters:
  - the algorithm is initialized by declaring three random data points as cluster centers
  - the iterative algorithm then runs: each data point is assigned to the closest center, and each center is set to the mean of the points assigned to it
  - this is repeated twice; in the third iteration the assignment of points to centers does not change, so the algorithm stops

Boundaries implied by the cluster centers:

```python
from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_kmeans_boundaries()

plt.tight_layout()
plt.show()
```

Using k-means:

```python
from matplotlib import pyplot as plt
import mglearn
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# generate synthetic two-dimensional data
X, y = make_blobs(random_state=1)

# build the clustering model
kmeans = KMeans(n_clusters=3)  # n_clusters: number of clusters (default 8)

kmeans.fit(X)

# print the cluster label of each point
print("Cluster memberships:\n{}".format(kmeans.labels_))
# Cluster memberships:
# [0 2 2 2 1 1 1 2 0 0 2 2 1 0 1 1 1 0 2 2 1 2 1 0 2 1 1 0 0 1 0 0 1 0 2 1 2
#  2 2 1 1 2 0 2 2 1 0 0 0 0 2 1 1 1 0 1 2 2 0 0 2 1 1 2 2 1 0 1 0 2 2 2 1 0
#  0 2 1 1 0 2 0 2 2 1 0 0 0 0 2 0 1 0 0 2 2 1 1 0 1 0]

# the predict method can also assign cluster labels to new points
print(kmeans.predict(X))
# (for the training data, the same labels as above)

# plot the cluster assignments and the cluster centers
mglearn.discrete_scatter(X[:, 0], X[:, 1], kmeans.labels_, markers='o')
mglearn.discrete_scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], [0, 1, 2],
                         markers='^', markeredgewidth=2)

plt.tight_layout()
plt.show()
```

- Every element gets a label
  - but there is no ground truth, and the labels themselves carry no a-priori meaning

Using more or fewer cluster centers:

```python
from matplotlib import pyplot as plt
import mglearn
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y = make_blobs(random_state=1)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# using two cluster centers
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
assignments = kmeans.labels_

mglearn.discrete_scatter(X[:, 0], X[:, 1], assignments, ax=axes[0])

# using five cluster centers
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)
assignments = kmeans.labels_

mglearn.discrete_scatter(X[:, 0], X[:, 1], assignments, ax=axes[1])

plt.tight_layout()
plt.show()
```
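The two update steps listed at the start of section 5.1 can be written out directly. A minimal NumPy sketch of the k-means loop (not part of the original notes; it ignores details such as empty clusters and multiple restarts):

```python
import numpy as np
from sklearn.datasets import make_blobs

def simple_kmeans(X, n_clusters, n_iter=10, seed=0):
    rng = np.random.RandomState(seed)
    # initialize the centers with randomly chosen data points
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # step 1: assign each point to the closest center
        distances = np.linalg.norm(X[:, np.newaxis, :] - centers, axis=2)
        labels = distances.argmin(axis=1)
        # step 2: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(n_clusters)])
        if np.allclose(new_centers, centers):  # assignments no longer change
            break
        centers = new_centers
    return labels, centers

X, _ = make_blobs(random_state=1)
labels, centers = simple_kmeans(X, n_clusters=3)
print(centers)
```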
#### 5.1.1 Failure cases of k-means

- Each cluster is defined solely by its center
  - every cluster is therefore a convex shape
  - k-means can only capture relatively simple shapes
- k-means assumes that all clusters have, in some sense, the same diameter: it always draws the boundary between clusters exactly in the middle between the cluster centers

```python
from matplotlib import pyplot as plt
import mglearn
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X_varied, y_varied = make_blobs(n_samples=200, cluster_std=[1.0, 2.5, 0.5], random_state=170)

y_pred = KMeans(n_clusters=3, random_state=0).fit_predict(X_varied)

mglearn.discrete_scatter(X_varied[:, 0], X_varied[:, 1], y_pred)

plt.legend(["cluster 0", "cluster 1", "cluster 2"], loc="best")
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

plt.tight_layout()
plt.show()
```

Clusters 0 and 1 both contain some points that are far away from the other points in their cluster.

k-means also assumes that all directions are equally important for each cluster:

```python
import numpy as np
from matplotlib import pyplot as plt
import mglearn
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# generate some randomly grouped data
X, y = make_blobs(random_state=170, n_samples=600)
rng = np.random.RandomState(74)

# transform the data so it is stretched
transformation = rng.normal(size=(2, 2))
X = np.dot(X, transformation)

# cluster the data into three clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_pred = kmeans.predict(X)

# plot the cluster assignments and the cluster centers
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=mglearn.cm3)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='^', c=[0, 1, 2], s=100, linewidth=2)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

plt.tight_layout()
plt.show()
```

k-means also performs poorly when the clusters have complex shapes:

```python
from matplotlib import pyplot as plt
import mglearn
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

# generate synthetic two_moons data (with less noise this time)
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# cluster the data into two clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
y_pred = kmeans.predict(X)

# plot the cluster assignments and the cluster centers
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap=mglearn.cm2, s=60)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='^', c=[mglearn.cm2(0), mglearn.cm2(1)], s=100, linewidth=2)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

plt.tight_layout()
plt.show()
```

#### 5.1.2 Vector quantization, or seeing k-means as decomposition

- Vector quantization: viewing k-means as a decomposition method in which each point is represented by a single component (its cluster center)

Side-by-side comparison of PCA, NMF and k-means, showing the extracted components and the reconstructions of test-set faces using 100 components (or cluster centers):

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.decomposition import NMF, PCA
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
mask = np.zeros(people.target.shape, dtype=np.bool_)

for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1

X_people = people.data[mask]
y_people = people.target[mask]
X_people = X_people / 255.

image_shape = people.images[0].shape

X_train, X_test, y_train, y_test = train_test_split(X_people, y_people, stratify=y_people, random_state=0)

nmf = NMF(n_components=100, random_state=0)
nmf.fit(X_train)
pca = PCA(n_components=100, random_state=0)
pca.fit(X_train)
kmeans = KMeans(n_clusters=100, random_state=0)
kmeans.fit(X_train)

X_reconstructed_pca = pca.inverse_transform(pca.transform(X_test))
X_reconstructed_kmeans = kmeans.cluster_centers_[kmeans.predict(X_test)]
X_reconstructed_nmf = np.dot(nmf.transform(X_test), nmf.components_)

fig, axes = plt.subplots(3, 5, figsize=(8, 8), subplot_kw={'xticks': (), 'yticks': ()})

fig.suptitle("Extracted Components")
for ax, comp_kmeans, comp_pca, comp_nmf in zip(axes.T, kmeans.cluster_centers_, pca.components_, nmf.components_):
    ax[0].imshow(comp_kmeans.reshape(image_shape))
    ax[1].imshow(comp_pca.reshape(image_shape), cmap='viridis')
    ax[2].imshow(comp_nmf.reshape(image_shape))

axes[0, 0].set_ylabel("kmeans")
axes[1, 0].set_ylabel("pca")
axes[2, 0].set_ylabel("nmf")

plt.tight_layout()

fig, axes = plt.subplots(4, 5, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(8, 8))

fig.suptitle("Reconstructions")
for ax, orig, rec_kmeans, rec_pca, rec_nmf in zip(axes.T, X_test, X_reconstructed_kmeans, X_reconstructed_pca, X_reconstructed_nmf):
    ax[0].imshow(orig.reshape(image_shape))
    ax[1].imshow(rec_kmeans.reshape(image_shape))
    ax[2].imshow(rec_pca.reshape(image_shape))
    ax[3].imshow(rec_nmf.reshape(image_shape))

axes[0, 0].set_ylabel("original")
axes[1, 0].set_ylabel("kmeans")
axes[2, 0].set_ylabel("pca")
axes[3, 0].set_ylabel("nmf")

plt.tight_layout()
plt.show()
```
Encoding the data with many more clusters than input dimensions:

```python
from matplotlib import pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(X)
y_pred = kmeans.predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=60, cmap='Paired')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='^', s=60, c=range(kmeans.n_clusters),
            linewidth=2, cmap='Paired')
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

print("Cluster memberships:\n{}".format(kmeans.labels_))
# Cluster memberships:
# [9 2 5 4 2 7 9 6 9 6 1 0 2 6 1 9 3 0 3 1 7 6 8 6 8 5 2 7 5 8 9 8 6 5 3 7 0
#  9 4 5 0 1 3 5 2 8 9 1 5 6 1 0 7 4 6 3 3 6 3 8 0 4 2 9 6 4 8 2 8 4 0 4 0 5
#  6 4 5 9 3 0 7 8 0 7 5 8 9 8 0 7 3 9 7 1 7 2 2 0 4 5 6 7 8 9 4 5 4 1 2 3 1
#  8 8 4 9 2 3 7 0 9 9 1 5 8 5 1 9 5 6 7 9 1 4 0 6 2 6 4 7 9 5 5 3 8 1 9 5 6
#  3 5 0 2 9 3 0 8 6 0 3 3 5 6 3 2 0 2 3 0 2 6 3 4 4 1 5 6 7 1 1 3 2 4 7 2 7
#  3 8 6 4 1 4 3 9 9 5 1 7 5 8 2]

plt.tight_layout()
plt.show()
```

Using the distance to each of the cluster centers as features gives a very expressive representation of the data; it is obtained with the `transform` method:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(X)

distance_features = kmeans.transform(X)
print("Distance feature shape: {}".format(distance_features.shape))
# Distance feature shape: (200, 10)

print("Distance features:\n{}".format(distance_features))
# Distance features:
# [[0.9220768  1.46553151 1.13956805 ... 1.16559918 1.03852189 0.23340263]
#  [1.14159679 2.51721597 0.1199124  ... 0.70700803 2.20414144 0.98271691]
#  [0.78786246 0.77354687 1.74914157 ... 1.97061341 0.71561277 0.94399739]
#  ...
#  [0.44639122 1.10631579 1.48991975 ... 1.79125448 1.03195812 0.81205971]
#  [1.38951924 0.79790385 1.98056306 ... 1.97788956 0.23892095 1.05774337]
#  [1.14920754 2.4536383  0.04506731 ... 0.57163262 2.11331394 0.88166689]]
```
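As a quick illustration of how expressive these distance features are (not in the original notes), they can be fed into a simple linear classifier, which typically separates the two half-moons much better than the same classifier on the raw two features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(X_train)

# distances to the 10 cluster centers as the new representation
X_train_dist = kmeans.transform(X_train)
X_test_dist = kmeans.transform(X_test)

logreg = LogisticRegression().fit(X_train_dist, y_train)
print("test score: {:.3f}".format(logreg.score(X_test_dist, y_test)))
```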
#### 5.1.3 Strengths and weaknesses

- Strengths
  - a very popular clustering algorithm
  - relatively easy to understand and implement
  - runs relatively quickly
  - scales easily to large datasets
- Weaknesses
  - relies on a random initialization
    - the output of the algorithm depends on the random seed
    - by default, scikit-learn runs the algorithm 10 times with 10 different random initializations and returns the best result (the one with the smallest sum of within-cluster variances)
  - relatively restrictive assumptions about the shape of clusters
  - requires specifying the number of clusters to look for, which may not be known in a real-world application

### 5.2 Agglomerative clustering

- A family of clustering algorithms built on the same principle
- Principle: the algorithm starts by declaring each point its own cluster, then merges the two most similar clusters until some stopping criterion is satisfied
  - stopping criterion in scikit-learn: the number of clusters
- Linkage criteria specify how the "most similar cluster" is measured; they are defined between two existing clusters, and scikit-learn implements three options:
  - `ward`
    - the default; picks the two clusters to merge such that the variance within all clusters increases the least
    - usually leads to clusters that are relatively equally sized
    - works for most datasets
  - `average`
    - merges the two clusters that have the smallest average distance between all their points
  - `complete`
    - merges the two clusters that have the smallest maximum distance between their points

The agglomerative clustering process on a two-dimensional dataset, looking for three clusters:

```python
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_agglomerative_algorithm()

plt.tight_layout()
plt.show()
```

Agglomerative clustering on the simple three-cluster data:

```python
from matplotlib import pyplot as plt
import mglearn
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, y = make_blobs(random_state=1)
agg = AgglomerativeClustering(n_clusters=3)

assignment = agg.fit_predict(X)
mglearn.discrete_scatter(X[:, 0], X[:, 1], assignment)

plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

plt.tight_layout()
plt.show()
```

#### Hierarchical clustering and dendrograms

Looking at all possible clusterings jointly:

```python
from matplotlib import pyplot as plt
import mglearn

mglearn.plots.plot_agglomerative()

plt.tight_layout()
plt.show()
```

A dendrogram can also handle multidimensional data:

```python
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, ward

X, y = make_blobs(random_state=0, n_samples=12)

# apply ward clustering to the data array X
# SciPy's ward function returns an array that specifies the distances
# bridged when performing agglomerative clustering
linkage_array = ward(X)

# plot the dendrogram for the linkage array containing the distances between clusters
dendrogram(linkage_array)

# mark the cuts in the tree that signify two or three clusters
ax = plt.gca()
bounds = ax.get_xbound()
ax.plot(bounds, [7.25, 7.25], '--', c='k')
ax.plot(bounds, [4, 4], '--', c='k')

ax.text(bounds[1], 7.25, ' two clusters', va='center', fontdict={'size': 15})
ax.text(bounds[1], 4, ' three clusters', va='center', fontdict={'size': 15})

plt.xlabel("Sample index")
plt.ylabel("Cluster distance")

plt.tight_layout()
plt.show()
```

- x-axis: the data points; y-axis: when in the clustering process the clusters were merged; branch length: the distance between the merged clusters

### 5.3 DBSCAN

- Strengths
  - does not require setting the number of clusters a priori
  - can capture clusters with complex shapes
  - can identify points that do not belong to any cluster
  - scales to relatively large datasets
- Weakness
  - runs somewhat more slowly
- Principle: identifies points in "crowded" regions of the feature space
  - crowded regions = dense regions, where many data points are close together
  - points within dense regions are called core samples
  - if there are at least `min_samples` data points within a distance of `eps` of a given data point, that point is a core sample
  - DBSCAN puts core samples that are closer to each other than `eps` into the same cluster
- Idea: clusters form dense regions of data, separated by regions that are relatively empty
- Steps:
  1. pick an arbitrary point and find all points within distance `eps` of it
  2. if fewer than `min_samples` points are within distance `eps` of the starting point, it is labeled as noise (it belongs to no cluster)
  3. if there are more than `min_samples` points within distance `eps`, the point is labeled a core sample and assigned a new cluster label
  4. all neighbors (within `eps`) of the point are visited; if they have not been assigned a cluster yet, they get the cluster label that was just created, and if they are core samples their neighbors are visited in turn
  5. the cluster grows until there are no more core samples within distance `eps` of the cluster
  6. pick another point that has not yet been visited and repeat

Cluster assignments for different values of eps and min_samples:

```python
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_dbscan()

plt.tight_layout()
plt.show()
# min_samples: 2 eps: 1.000000  cluster: [-1  0  0 -1  0 -1  1  1  0  1 -1 -1]
# min_samples: 2 eps: 1.500000  cluster: [0 1 1 1 1 0 2 2 1 2 2 0]
# min_samples: 2 eps: 2.000000  cluster: [0 1 1 1 1 0 0 0 1 0 0 0]
# min_samples: 2 eps: 3.000000  cluster: [0 0 0 0 0 0 0 0 0 0 0 0]
# min_samples: 3 eps: 1.000000  cluster: [-1  0  0 -1  0 -1  1  1  0  1 -1 -1]
# min_samples: 3 eps: 1.500000  cluster: [0 1 1 1 1 0 2 2 1 2 2 0]
# min_samples: 3 eps: 2.000000  cluster: [0 1 1 1 1 0 0 0 1 0 0 0]
# min_samples: 3 eps: 3.000000  cluster: [0 0 0 0 0 0 0 0 0 0 0 0]
# min_samples: 5 eps: 1.000000  cluster: [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
# min_samples: 5 eps: 1.500000  cluster: [-1  0  0  0  0 -1 -1 -1  0 -1 -1 -1]
# min_samples: 5 eps: 2.000000  cluster: [-1  0  0  0  0 -1 -1 -1  0 -1 -1 -1]
# min_samples: 5 eps: 3.000000  cluster: [0 0 0 0 0 0 0 0 0 0 0 0]
```

- A label of -1 means noise; solid markers are points that belong to a cluster, hollow markers are noise points; larger markers are core samples, smaller markers are boundary points
- After scaling the data with StandardScaler or MinMaxScaler it is sometimes easier to find a good value for `eps`

Running DBSCAN on the two_moons dataset:
```python
import mglearn
from matplotlib import pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# scale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

dbscan = DBSCAN()
clusters = dbscan.fit_predict(X_scaled)

# plot the cluster assignments
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm2, s=60)

plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

plt.tight_layout()
plt.show()
```

### 5.4 Comparing and evaluating clustering algorithms

#### 5.4.1 Evaluating clustering with ground truth

- Metrics that evaluate a clustering against a ground-truth clustering:
  - adjusted Rand index (ARI): best value 1, around 0 for unrelated clusterings
  - normalized mutual information (NMI): best value 1, around 0 for unrelated clusterings

Comparing k-means, agglomerative clustering and DBSCAN using ARI:

```python
import numpy as np
import mglearn
from matplotlib import pyplot as plt
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.cluster import adjusted_rand_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# scale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

fig, axes = plt.subplots(1, 4, figsize=(15, 3), subplot_kw={'xticks': (), 'yticks': ()})

# the algorithms to compare
algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]

# create a random cluster assignment as a reference
random_state = np.random.RandomState(seed=0)
random_clusters = random_state.randint(low=0, high=2, size=len(X))

# plot the random assignment
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60)
axes[0].set_title("Random assignment - ARI: {:.2f}".format(adjusted_rand_score(y, random_clusters)))

for ax, algorithm in zip(axes[1:], algorithms):
    # plot the cluster assignments
    clusters = algorithm.fit_predict(X_scaled)
    ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)
    ax.set_title("{} - ARI: {:.2f}".format(algorithm.__class__.__name__, adjusted_rand_score(y, clusters)))

plt.tight_layout()
plt.show()
```

Do not use accuracy_score to evaluate a clustering: accuracy requires the assigned cluster labels to exactly match the ground truth, but the cluster labels themselves have no meaning:

```python
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.metrics import accuracy_score

# these two labelings correspond to the same clustering
clusters1 = [0, 0, 1, 1, 0]
clusters2 = [1, 1, 0, 0, 1]

# accuracy is zero, as none of the labels agree
print("Accuracy: {:.2f}".format(accuracy_score(clusters1, clusters2)))
# Accuracy: 0.00

# the adjusted rand score is 1, as the two clusterings are identical
print("ARI: {:.2f}".format(adjusted_rand_score(clusters1, clusters2)))
# ARI: 1.00
```

#### 5.4.2 Evaluating clustering without ground truth

- Scoring metrics that need no ground truth, e.g. the silhouette coefficient
  - measures the compactness of a cluster; higher is better, with a maximum of 1 (for each sample it is (b − a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points in the nearest other cluster)
  - compact clusters do not allow for complex shapes

Comparing k-means, agglomerative clustering and DBSCAN using the silhouette score:

```python
import numpy as np
import mglearn
from matplotlib import pyplot as plt
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.cluster import silhouette_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# scale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)

X_scaled = scaler.transform(X)

fig, axes = plt.subplots(1, 4, figsize=(15, 3), subplot_kw={'xticks': (), 'yticks': ()})

# the algorithms to compare
algorithms = [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]

# create a random cluster assignment as a reference
random_state = np.random.RandomState(seed=0)
random_clusters = random_state.randint(low=0, high=2, size=len(X))

# plot the random assignment
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=random_clusters, cmap=mglearn.cm3, s=60)
axes[0].set_title("Random assignment: {:.2f}".format(silhouette_score(X_scaled, random_clusters)))

for ax, algorithm in zip(axes[1:], algorithms):
    # plot the cluster assignments
    clusters = algorithm.fit_predict(X_scaled)
    ax.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm3, s=60)
    ax.set_title("{}: {:.2f}".format(algorithm.__class__.__name__, silhouette_score(X_scaled, clusters)))

plt.tight_layout()
plt.show()
```

- A better strategy for evaluating clusterings is to use robustness-based clustering metrics
  - add some noise to the data, or use different parameter settings, run the algorithm and compare the results
  - idea: if many algorithm parameters and many perturbations of the data return the same result, it is likely to be trustworthy

#### 5.4.3 Comparing the algorithms on the faces dataset
Load the face data and use its eigenface representation, produced by a PCA with `whiten=True` and 100 components:

```python
import numpy as np
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
mask = np.zeros(people.target.shape, dtype=np.bool_)

for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1

X_people = people.data[mask]
y_people = people.target[mask]
X_people = X_people / 255.

# extract eigenfaces from the lfw data and transform the data
pca = PCA(n_components=100, whiten=True, random_state=0)  # 100 components
pca.fit(X_people)

X_pca = pca.transform(X_people)
```

##### Analyzing the faces dataset with DBSCAN

Apply DBSCAN with default parameters:

```python
from sklearn.cluster import DBSCAN

# DBSCAN with default parameters
dbscan = DBSCAN()
labels = dbscan.fit_predict(X_pca)
print("Unique labels: {}".format(np.unique(labels)))
# Unique labels: [-1]
```

All data points were labeled as noise. There are two ways to improve this: increase the `eps` parameter or decrease the `min_samples` parameter.

Decreasing `min_samples`:

```python
dbscan = DBSCAN(min_samples=3)
labels = dbscan.fit_predict(X_pca)
print("Unique labels: {}".format(np.unique(labels)))
# Unique labels: [-1]
```

Nothing changes. Increasing `eps`:

```python
dbscan = DBSCAN(min_samples=3, eps=15)
labels = dbscan.fit_predict(X_pca)
print("Unique labels: {}".format(np.unique(labels)))
# Unique labels: [-1  0]
```

This yields a single cluster plus noise points. Look at how the points are distributed:
```python
# count the number of points in all clusters and in the noise
# bincount does not allow negative numbers, so we add 1;
# the first number in the result corresponds to the noise points
print("Number of points per cluster: {}".format(np.bincount(labels + 1)))
# Number of points per cluster: [  37 2026]
```

Look at all the noise points:

```python
from matplotlib import pyplot as plt

image_shape = people.images[0].shape

noise = X_people[labels == -1]

fig, axes = plt.subplots(3, 9, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(12, 4))
for image, ax in zip(noise, axes.ravel()):
    ax.imshow(image.reshape(image_shape))

plt.tight_layout()
plt.show()
```

This kind of analysis — trying to find data points that do not fit the rest of the dataset — is called outlier detection.

Results for different values of `eps`:

```python
for eps in [1, 3, 5, 7, 9, 11, 13]:
    print("\neps={}".format(eps))
    dbscan = DBSCAN(eps=eps, min_samples=3)
    labels = dbscan.fit_predict(X_pca)
    print("Clusters present: {}".format(np.unique(labels)))
    print("Cluster sizes: {}".format(np.bincount(labels + 1)))
# eps=1
# Clusters present: [-1]
# Cluster sizes: [2063]
#
# eps=3
# Clusters present: [-1]
# Cluster sizes: [2063]
#
# eps=5
# Clusters present: [-1  0]
# Cluster sizes: [2059    4]
#
# eps=7
# Clusters present: [-1  0  1  2  3  4  5  6]
# Cluster sizes: [1954   75    4   14    6    4    3    3]
#
# eps=9
# Clusters present: [-1  0  1]
# Cluster sizes: [1199  861    3]
#
# eps=11
# Clusters present: [-1  0]
# Cluster sizes: [ 403 1660]
#
# eps=13
# Clusters present: [-1  0]
# Cluster sizes: [ 119 1944]
```

Show the images in the seven clusters found with eps=7:

```python
dbscan = DBSCAN(min_samples=3, eps=7)
labels = dbscan.fit_predict(X_pca)

for cluster in range(max(labels) + 1):
    mask = labels == cluster
    n_images = np.sum(mask)
    fig, axes = plt.subplots(1, n_images, figsize=(n_images * 1.5, 4), subplot_kw={'xticks': (), 'yticks': ()})
    for image, label, ax in zip(X_people[mask], y_people[mask], axes):
        ax.imshow(image.reshape(image_shape))
        ax.set_title(people.target_names[label].split()[-1])
    plt.tight_layout()

plt.show()
```
##### Analyzing the faces dataset with k-means

Extract clusters with k-means:

```python
from sklearn.cluster import KMeans

# (uses X_pca from the eigenface preprocessing above)
# extract clusters with k-means
km = KMeans(n_clusters=10, random_state=0)
labels_km = km.fit_predict(X_pca)

print("Cluster sizes k-means: {}".format(np.bincount(labels_km)))
# Cluster sizes k-means: [ 70 198 139 109 196 351 207 424 180 189]
```

The clusters have similar sizes. Visualize the cluster centers:

```python
from matplotlib import pyplot as plt

fig, axes = plt.subplots(2, 5, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(12, 4))
for center, ax in zip(km.cluster_centers_, axes.ravel()):
    ax.imshow(pca.inverse_transform(center).reshape(image_shape))

plt.tight_layout()
plt.show()
```

Plot the five most typical and the five most atypical images for each cluster center:

```python
import mglearn

mglearn.plots.plot_kmeans_faces(km, pca, X_pca, X_people, y_people, people.target_names)

plt.tight_layout()
plt.show()
```

##### Analyzing the faces dataset with agglomerative clustering

Extract clusters:

```python
from sklearn.cluster import AgglomerativeClustering

# extract clusters with ward agglomerative clustering
agglomerative = AgglomerativeClustering(n_clusters=10)
labels_agg = agglomerative.fit_predict(X_pca)

print("Cluster sizes agglomerative clustering: {}".format(np.bincount(labels_agg)))
# Cluster sizes agglomerative clustering: [264 100 275 553  49  64 546  52  51 109]
```
Compute the ARI to measure whether the two partitions of the data produced by agglomerative clustering and by k-means are similar:

```python
from sklearn.metrics import adjusted_rand_score

print("ARI: {:.3f}".format(adjusted_rand_score(labels_agg, labels_km)))
# ARI: 0.088
```

Plot the dendrogram (truncated to the top levels of the hierarchy):

```python
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import ward, dendrogram

linkage_array = ward(X_pca)

# plot the dendrogram for the linkage array containing the distances between clusters
plt.figure(figsize=(20, 5))
dendrogram(linkage_array, p=7, truncate_mode='level', no_labels=True)

plt.xlabel("Sample index")
plt.ylabel("Cluster distance")

plt.tight_layout()
plt.show()
```

Visualize the ten clusters:

```python
from matplotlib import pyplot as plt

n_clusters = 10
for cluster in range(n_clusters):
    mask = labels_agg == cluster
    fig, axes = plt.subplots(1, 10, subplot_kw={'xticks': (), 'yticks': ()}, figsize=(15, 8))
    axes[0].set_ylabel(np.sum(mask))
    for image, label, asdf, ax in zip(X_people[mask], y_people[mask], labels_agg[mask], axes):
        ax.imshow(image.reshape(image_shape))
        ax.set_title(people.target_names[label].split()[-1], fontdict={'fontsize': 9})
    plt.tight_layout()

plt.show()
```

### 5.5 Summary of clustering methods

- Applying and evaluating clustering is a highly qualitative process, and it is usually most helpful in the exploratory phase of data analysis
- Three clustering algorithms:
  - k-means
    - allows specifying the desired number of clusters
    - represents each cluster by the mean of its points
    - can also be viewed as a decomposition method in which each data point is represented by its cluster center
  - DBSCAN
    - defines closeness via the `eps` parameter, which indirectly influences the size of the clusters
    - can detect "noise points" that are not assigned to any cluster
    - can help determine the number of clusters automatically
    - allows clusters to have complex shapes
  - Agglomerative clustering
    - allows specifying the desired number of clusters
    - provides the whole hierarchy of possible partitions of the data, which is easy to inspect with a dendrogram
- All three algorithms provide some control over the granularity of the clustering
- All three methods can be used on large, real-world datasets, are relatively easy to understand, and allow clustering into many clusters