Feature Engineering

A saying circulates widely in the industry: data and features determine the upper bound of machine learning, while models and algorithms merely approach that bound. Feature engineering therefore occupies an important place in machine learning; in practice, it is often said to be the key to success.
Feature engineering is the most time- and effort-consuming part of data analysis. Unlike algorithms and models, it is not a fixed sequence of steps; it relies far more on engineering experience and trade-offs, so there is no single standard recipe. This article simply summarizes some commonly used methods.
Feature engineering covers several sub-problems: data preprocessing, feature extraction, feature selection, and feature construction.
Feature Selection

Contents: Preface; Univariate feature selection (mutual information, correlation coefficient, ANOVA, chi-square test, IV, Gini score, VIF, Pipeline implementation); Recursive feature elimination; Feature importance; Principal component analysis; Summary
Import the required packages:
import numpy as np
import pandas as pd
import re
import sys
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_X_y, check_is_fitted
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer, make_column_transformer, make_column_selector
from sklearn.pipeline import FeatureUnion, make_union, Pipeline, make_pipeline
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.feature_selection import SelectFpr, SelectFdr, SelectFwe
from sklearn.model_selection import cross_val_score
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import gc

# Setting configuration.
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')

SEED = 42

Define a timer decorator:

def timer(func):
    import time
    import functools

    def strfdelta(tdelta, fmt):
        hours, remainder = divmod(tdelta, 3600)
        minutes, seconds = divmod(remainder, 60)
        return fmt.format(hours, minutes, seconds)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        click = time.time()
        print('Starting time\t', time.strftime('%H:%M:%S', time.localtime()))
        result = func(*args, **kwargs)
        delta = strfdelta(time.time() - click,
                          '{:.0f} hours {:.0f} minutes {:.0f} seconds')
        print(f'{func.__name__} cost {delta}')
        return result
    return wrapper

Feature Selection
Preface
We now have a large number of candidate features. Some carry rich information, some carry overlapping information, and some are simply irrelevant. Although it is hard to say which features matter before fitting a model, feeding all of them into training without any screening often leads to the curse of dimensionality and can even hurt generalization, because less useful features drown out the more important ones. We therefore need feature selection: discard the useless or redundant features and keep the useful ones as training data.
There are many feature selection methods; they generally fall into three categories:
- Filter methods are the simplest: each feature is scored with a dispersion or relevance metric, and features are kept by setting a score threshold or a target number of features.
- Wrapper methods select or discard a subset of features in each round according to an objective function, usually a predictive performance score.
- Embedded methods are slightly more involved: a learning algorithm is trained first to obtain a weight for each feature, and features are then selected by weight from largest to smallest.
Feature selection tools in sklearn.feature_selection:
- VarianceThreshold (Filter): selects features by variance
- SelectKBest (Filter): commonly scored with correlation, the chi-square test, or mutual information
- SelectPercentile (Filter): keeps the highest-scoring percentile of features
- SelectFpr, SelectFdr, SelectFwe (Filter): select features based on hypothesis-test p-values
- RFECV (Wrapper): recursive feature elimination with cross-validation
- SequentialFeatureSelector (Wrapper): forward/backward search
- SelectFromModel (Embedded): trains a base model and keeps the features with large weights

Reference: how to intuitively understand the family-wise error rate (FWER) and the false discovery rate (FDR).

df = pd.read_csv('../datasets/Home-Credit-Default-Risk/created_data.csv', index_col='SK_ID_CURR')

Define a helper function to reduce memory usage:
@timer
def convert_dtypes(df, verbose=True):
    original_memory = df.memory_usage().sum()
    df = df.apply(pd.to_numeric, errors='ignore')
    # Convert booleans to integers
    boolean_features = df.select_dtypes(bool).columns.tolist()
    df[boolean_features] = df[boolean_features].astype(np.int32)
    # Convert objects to category
    object_features = df.select_dtypes(object).columns.tolist()
    df[object_features] = df[object_features].astype('category')
    # Float64 to float32
    float_features = df.select_dtypes(float).columns.tolist()
    df[float_features] = df[float_features].astype(np.float32)
    # Int64 to int32
    int_features = df.select_dtypes(int).columns.tolist()
    df[int_features] = df[int_features].astype(np.int32)
    new_memory = df.memory_usage().sum()
    if verbose:
        print(f'Original Memory Usage: {round(original_memory / 1e9, 2)} gb.')
        print(f'New Memory Usage: {round(new_memory / 1e9, 2)} gb.')
    return df

df = convert_dtypes(df)
X = df.drop('TARGET', axis=1)
y = df['TARGET']

Starting time  20:30:38
Original Memory Usage: 5.34 gb.
New Memory Usage: 2.63 gb.
convert_dtypes cost 0 hours 2 minutes 1 seconds

X.dtypes.value_counts()

float32 2104
int32 16
category 3
category 3
category 3
category 3
category 3
category 3
category 3
category 2
category 2
category 2
category 2
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
category 1
Name: count, dtype: int64

del df
gc.collect()

0

# Encode categorical features
categorical_features = X.select_dtypes(exclude='number').columns.tolist()
X[categorical_features] = X[categorical_features].apply(lambda x: x.cat.codes)

X.dtypes.value_counts()

float32 2104
int8 47
int32 16
Name: count, dtype: int64

Define a function to evaluate a dataset:

@timer
def score_dataset(X, y, categorical_features, nfold=5):
    # Create Dataset object for lightgbm
    dtrain = lgb.Dataset(X, label=y)
    # Use a dictionary to set parameters.
    params = dict(
        objective='binary',
        is_unbalance=True,
        metric='auc',
        n_estimators=500,
        verbose=0
    )
    # Training with 5-fold CV:
    print('Starting training...')
    eval_results = lgb.cv(
        params, dtrain, nfold=nfold,
        categorical_feature=categorical_features,
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(50)],
        return_cvbooster=True
    )
    boosters = eval_results['cvbooster'].boosters
    # Initialize an empty dataframe to hold feature importances
    feature_importances = pd.DataFrame(index=X.columns)
    for i in range(nfold):
        # Record the feature importances
        feature_importances[f'cv_{i}'] = boosters[i].feature_importance()
    feature_importances['score'] = feature_importances.mean(axis=1)
    # Sort features according to importance
    feature_importances = feature_importances.sort_values('score', ascending=False)
    return eval_results, feature_importances

eval_results, feature_importances = score_dataset(X, y, categorical_features)

Starting time  20:32:42
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50] cv_aggs valid auc: 0.778018 0.00319843
[100] cv_aggs valid auc: 0.783267 0.00307558
[150] cv_aggs valid auc: 0.783211 0.00299384
Early stopping, best iteration is:
[115] cv_aggs valid auc: 0.783392 0.00298777
score_dataset cost 0 hours 6 minutes 9 seconds

Univariate Feature Selection
Relief (Relevant Features) is a well-known filter-style feature selection method. It assumes that the importance of a feature subset is the sum of the relevance statistics of the features it contains, so we simply keep the k features with the largest relevance statistics, or those whose statistic exceeds a threshold.
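Relief itself is not provided by sklearn.feature_selection, so the snippet below is only a minimal sketch of the relevance statistic for a binary target. The function name relief_scores and its n_samples_used parameter are invented for this illustration, and it assumes numeric features without missing values and on comparable scales; it is not used anywhere else in this notebook.

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def relief_scores(X, y, n_samples_used=500, random_state=42):
    # Average of |x - nearest miss| - |x - nearest hit| per feature:
    # features that separate the two classes receive larger scores.
    X = pd.DataFrame(X)
    y = np.asarray(y)
    values = X.to_numpy(dtype=float)
    rng = np.random.default_rng(random_state)
    idx = rng.choice(len(values), size=min(n_samples_used, len(values)), replace=False)
    scores = np.zeros(values.shape[1])
    classes = np.unique(y)
    # One nearest-neighbour index per class.
    nn = {c: NearestNeighbors(n_neighbors=2).fit(values[y == c]) for c in classes}
    for i in idx:
        xi, ci = values[i], y[i]
        # Nearest hit: the closest same-class point other than xi itself.
        hit = values[y == ci][nn[ci].kneighbors([xi], n_neighbors=2)[1][0][1]]
        # Nearest miss: the closest point of the other class (binary target assumed).
        other = classes[classes != ci][0]
        miss = values[y == other][nn[other].kneighbors([xi], n_neighbors=1)[1][0][0]]
        scores += np.abs(xi - miss) - np.abs(xi - hit)
    return pd.Series(scores / len(idx), index=X.columns).sort_values(ascending=False)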
Commonly used filter metrics:
- VarianceThreshold (sklearn.feature_selection): variance filtering
- r_regression (sklearn.feature_selection): Pearson correlation between each feature and a continuous target (regression)
- f_regression (sklearn.feature_selection): F statistic from the t-test between each feature and the target (regression)
- mutual_info_regression (sklearn.feature_selection): estimates mutual information with a continuous target
- chi2 (sklearn.feature_selection): chi-square statistic and p-value for non-negative features (classification)
- f_classif (sklearn.feature_selection): ANOVA F statistic between each feature and the target (classification)
- mutual_info_classif (sklearn.feature_selection): estimates mutual information with a discrete target
- df.corr, df.corrwith (pandas): Pearson, Kendall, Spearman correlation
- calc_gini_scores (self-defined): Gini score
- variance_inflation_factor (statsmodels.stats.outliers_influence): VIF
- df.isna().mean() (pandas): missing-value rate
- DropCorrelatedFeatures (feature_engine.selection): drops collinear features based on correlation
- SelectByInformationValue (feature_engine.selection): selection by information value
- DropHighPSIFeatures (feature_engine.selection): drops unstable features (PSI)
Mutual Information
Mutual information analyses the relationship between each feature and the target from an information-entropy perspective, capturing both linear and non-linear dependence.
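For two discrete variables, the mutual information is defined as
$$I(X;Y)=\sum_{x}\sum_{y}p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)},$$
which is zero exactly when the two variables are independent. For continuous features, sklearn's mutual_info_classif estimates this quantity with a nearest-neighbour estimator.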
from sklearn.feature_selection import SelectKBest, mutual_info_classif

@timer
def calc_mi_scores(X, y):
    colnames = X.select_dtypes(exclude='number').columns
    X[colnames] = X[colnames].astype('category').apply(lambda x: x.cat.codes)
    discrete = [X[col].nunique() <= 50 for col in X]
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete, random_state=SEED)
    mi_scores = pd.Series(mi_scores, name='MI Scores', index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

class DropUninformative(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.0):
        self.threshold = threshold
    def fit(self, X, y):
        mi_scores = calc_mi_scores(X, y)
        self.variables = mi_scores[mi_scores > self.threshold].index.tolist()
        return self
    def transform(self, X, y=None):
        return X[self.variables]
    def get_feature_names_out(self, input_features=None):
        return self.variables

init_n = len(X.columns)
selected_features = DropUninformative(threshold=0.0) \
    .fit(X, y) \
    .get_feature_names_out()

print('The number of selected features:', len(selected_features))
print(f'Dropped {init_n - len(selected_features)} uninformative features.')

Starting time  20:38:51
calc_mi_scores cost 0 hours 17 minutes 49 seconds
The number of selected features: 2050
Dropped 117 uninformative features.

selected_categorical_features = [col for col in categorical_features if col in selected_features]
eval_results, feature_importances = score_dataset(X[selected_features], y, selected_categorical_features)

Starting time  20:56:46
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50] cv_aggs valid auc: 0.778311 0.00276223
[100] cv_aggs valid auc: 0.783085 0.00266899
[150] cv_aggs valid auc: 0.783015 0.00280856
Early stopping, best iteration is:
[122] cv_aggs valid auc: 0.783271 0.00267406
score_dataset cost 0 hours 6 minutes 7 seconds

Correlation Coefficient
The Pearson correlation coefficient is the simplest way to understand the linear relationship between two continuous variables.
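For reference, for two samples $x$ and $y$ the Pearson coefficient is
$$r_{xy}=\frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}},$$
which ranges from $-1$ to $1$; a value near $0$ means little linear association, although a non-linear dependence may still exist.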
import time

def progress(percent=0, width=50, desc='Processing'):
    import math
    tags = math.ceil(width * percent) * '#'
    print(f'\r{desc}: [{tags:-<{width}}]{percent:.1%}', end='', flush=True)

@timer
def drop_correlated_features(X, y, threshold=0.9):
    to_keep = []
    to_drop = []
    categorical = X.select_dtypes(exclude='number').columns.tolist()
    for i, col in enumerate(X.columns):
        if col in categorical:
            continue
        # The correlations
        corr = X[to_keep].corrwith(X[col]).abs()
        # Select columns with correlations above the threshold
        if any(corr > threshold):
            to_drop.append(col)
        else:
            to_keep.append(col)
        progress((i + 1) / len(X.columns))
    print('\nThe number of correlated features:', len(to_drop))
    return to_keep

The function above tends to drop whichever of two correlated features appears later, so to keep as many original features as possible we reorder the columns so that the original features come first:

original_df = pd.read_csv('../datasets/Home-Credit-Default-Risk/prepared_data.csv', nrows=5)

original_features = [f for f in X.columns if f in original_df.columns]
derived_features = [f for f in X.columns if f not in original_df.columns]

selected_features = [col for col in original_features + derived_features if col in selected_features]

# Drop features that are correlated
# init_n = len(selected_features)
selected_features = drop_correlated_features(X[selected_features], y, threshold=0.9)

print('The number of selected features:', len(selected_features))
print(f'Dropped {init_n - len(selected_features)} correlated features.')

Starting time  21:03:05
Processing: [##################################################]100.0%
The number of correlated features: 1110
drop_correlated_features cost 0 hours 33 minutes 5 seconds
The number of selected features: 940
Dropped 1227 correlated features.

In day-to-day work we usually call the feature_engine package instead:
# Drops features that are correlated
# from feature_engine.selection import DropCorrelatedFeatures

# init_n = len(selected_features)
# selected_features = DropCorrelatedFeatures(threshold=0.9) \
#     .fit(X[selected_features], y) \
#     .get_feature_names_out()

# print('The number of selected features:', len(selected_features))
# print(f'Dropped {init_n - len(selected_features)} features.')

selected_categorical_features = [col for col in categorical_features if col in selected_features]
eval_results, feature_importances = score_dataset(X[selected_features], y, selected_categorical_features)

Starting time  21:36:12
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50] cv_aggs valid auc: 0.776068 0.00333724
[100] cv_aggs valid auc: 0.781097 0.00296053
[150] cv_aggs valid auc: 0.781236 0.00298245
Early stopping, best iteration is:
[136] cv_aggs valid auc: 0.781375 0.00302538
score_dataset cost 0 hours 2 minutes 23 seconds

ANOVA
Analysis of variance (ANOVA) is mainly used to assess the relevance of continuous features in classification problems.
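For each feature, f_classif computes the one-way ANOVA F statistic across the target classes:
$$F=\frac{\sum_k n_k(\bar{x}_k-\bar{x})^2/(K-1)}{\sum_k\sum_i (x_{ki}-\bar{x}_k)^2/(N-K)},$$
where $K$ is the number of classes, $n_k$ and $\bar{x}_k$ are the size and feature mean of class $k$, and $N$ is the total number of samples. A large F (small p-value) means the feature's mean differs noticeably between classes.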
from sklearn.feature_selection import f_classif

numeric_features = [col for col in X.columns if col not in categorical_features]

f_statistic, p_values = f_classif(X[numeric_features], y)
anova = pd.DataFrame({
    'f_statistic': f_statistic,
    'p_values': p_values
}, index=numeric_features)

print('The number of irrelevant features for classification:', anova['p_values'].ge(0.05).sum())

The number of irrelevant features for classification: 274

Chi-Square Test
The chi-square test is a statistical method for measuring the association between two categorical variables.
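For a feature cross-tabulated against the target, the statistic compares observed and expected cell counts under independence:
$$\chi^2=\sum_i\frac{(O_i-E_i)^2}{E_i},$$
where $O_i$ is the observed count in cell $i$ and $E_i$ is the count expected if the feature and the target were independent; a large value (small p-value) indicates association. Note that sklearn's chi2 requires non-negative feature values.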
from sklearn.feature_selection import chi2

chi2_stats, p_values = chi2(X[categorical_features], y)
chi2_test = pd.DataFrame({
    'chi2_stats': chi2_stats,
    'p_values': p_values
}, index=categorical_features)

print('The number of irrelevant features for classification:', chi2_test['p_values'].ge(0.05).sum())

The number of irrelevant features for classification: 9

For a classification problem, the two scoring functions f_classif and chi2 can be combined into one complete screening pass: f_classif filters the continuous features and chi2 filters the discrete ones.
feature_selection = make_column_transformer(
    (SelectFdr(score_func=f_classif, alpha=0.05), numeric_features),
    (SelectFdr(score_func=chi2, alpha=0.05), categorical_features),
    verbose=True,
    verbose_feature_names_out=False
)

selected_features_by_fdr = feature_selection.fit(X, y).get_feature_names_out()
print('The number of selected features:', len(selected_features_by_fdr))
print('Dropped {} features.'.format(X.shape[1] - len(selected_features_by_fdr)))

[ColumnTransformer] ... (1 of 2) Processing selectfdr-1, total 2.7min
[ColumnTransformer] ... (2 of 2) Processing selectfdr-2, total 0.1s
The number of selected features: 1838
Dropped 329 features.

selected_categorical_features_by_fdr = [col for col in categorical_features if col in selected_features_by_fdr]
eval_results, feature_importances = score_dataset(X[selected_features_by_fdr], y, selected_categorical_features_by_fdr)

Starting time  21:44:08
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50] cv_aggs valid auc: 0.777829 0.00296151
[100] cv_aggs valid auc: 0.782637 0.00263458
[150] cv_aggs valid auc: 0.782612 0.0023263
Early stopping, best iteration is:
[129] cv_aggs valid auc: 0.782834 0.00242003
score_dataset cost 0 hours 5 minutes 41 seconds

IV (Information Value)
IV (information value) measures how well a discrete feature predicts a binary target. Features with an IV below 0.02 are generally considered useless.
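For each bin $i$ of a feature, the helper below computes the weight of evidence and sums it into the information value:
$$\mathrm{WoE}_i=\ln\frac{\text{Positive rate}_i}{\text{Negative rate}_i},\qquad \mathrm{IV}=\sum_i\left(\text{Positive rate}_i-\text{Negative rate}_i\right)\times\mathrm{WoE}_i,$$
where Positive rate$_i$ and Negative rate$_i$ are the bin's share of all positive and of all negative samples, respectively.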
def calc_iv_scores(X, y, bins=10):
    X = pd.DataFrame(X)
    y = pd.Series(y)
    assert y.nunique() == 2, 'y must be binary'
    iv_scores = pd.Series()
    # Find discrete features
    colnames = X.select_dtypes(exclude='number').columns
    X[colnames] = X[colnames].astype('category').apply(lambda x: x.cat.codes)
    discrete = [col for col in X if X[col].nunique() <= 50]
    # Compute information value
    for colname in X.columns:
        if colname in discrete:
            var = X[colname]
        else:
            var = pd.qcut(X[colname], bins, duplicates='drop')
        grouped = y.groupby(var).agg([('Positive', 'sum'), ('All', 'count')])
        grouped['Negative'] = grouped['All'] - grouped['Positive']
        grouped['Positive rate'] = grouped['Positive'] / grouped['Positive'].sum()
        grouped['Negative rate'] = grouped['Negative'] / grouped['Negative'].sum()
        grouped['woe'] = np.log(grouped['Positive rate'] / grouped['Negative rate'])
        grouped['iv'] = (grouped['Positive rate'] - grouped['Negative rate']) * grouped['woe']
        grouped.name = colname
        iv_scores[colname] = grouped['iv'].sum()
    return iv_scores.sort_values(ascending=False)

iv_scores = calc_iv_scores(X, y)
print(f'There are {iv_scores.le(0.02).sum()} features with IV <= 0.02.')

There are 987 features with IV <= 0.02.

Gini Score
The Gini score here measures how strongly a feature influences the target in a classification problem. It ranges between 0 and 1; the larger the value, the stronger the influence. A common threshold is 0.02: features scoring below it are considered unimportant.
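Concretely, the helper below bins each feature and computes
$$\text{Gini}=1-\sum_i \bar{y}_i^{\,2},$$
where $\bar{y}_i$ is the positive rate within bin $i$, in the spirit of the Gini impurity $1-\sum_k p_k^2$ used in decision trees.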
def calc_gini_scores(X, y, bins=10):
    X = pd.DataFrame(X)
    y = pd.Series(y)
    gini_scores = pd.Series()
    # Find discrete features
    colnames = X.select_dtypes(exclude='number').columns
    X[colnames] = X[colnames].astype('category').apply(lambda x: x.cat.codes)
    discrete = [col for col in X if X[col].nunique() <= 50]
    # Compute the Gini score
    for colname in X.columns:
        if colname in discrete:
            var = X[colname]
        else:
            var = pd.qcut(X[colname], bins, duplicates='drop')
        p = y.groupby(var).mean()
        gini = 1 - p.pow(2).sum()
        gini_scores[colname] = gini
    return gini_scores.sort_values(ascending=False)

gini_scores = calc_gini_scores(X, y)
print(f'There are {gini_scores.le(0.02).sum()} features with Gini <= 0.02.')

There are 0 features with Gini <= 0.02.

VIF
VIF (variance inflation factor) measures the degree of collinearity among features. A VIF below 5 is usually taken to mean there is no multicollinearity problem, while a VIF above 10 indicates serious multicollinearity.
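For feature $j$, the variance inflation factor is
$$\mathrm{VIF}_j=\frac{1}{1-R_j^2},$$
where $R_j^2$ is the $R^2$ obtained by regressing feature $j$ on all the other features; the better feature $j$ can be explained by the others, the larger its VIF.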
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

@timer
def calc_vif_scores(X, y=None):
    numeric_X = X.select_dtypes('number')
    numeric_X = add_constant(numeric_X)
    # Collinear features
    vif = pd.Series()
    for i, col in enumerate(numeric_X.columns):
        vif[col] = variance_inflation_factor(numeric_X.values, i)
        progress((i + 1) / numeric_X.shape[1])
    return vif.sort_values(ascending=False)

# vif_scores = calc_vif_scores(X)
# print(f'There are {vif_scores.gt(10).sum()} collinear features (VIF above 10).')

Pipeline Implementation
We will take the following two steps: first drop the features whose mutual information with the target is 0, then, for every pair of features with correlation above 0.9, drop one of the two.

from feature_engine.selection import DropCorrelatedFeatures

feature_selection = make_pipeline(
    DropUninformative(threshold=0.0),
    DropCorrelatedFeatures(threshold=0.9),
    verbose=True
)

# init_n = len(X.columns)
# selected_features = feature_selection.fit(X, y).get_feature_names_out()

# print('The number of selected features:', len(selected_features))
# print(f'Dropped {init_n - len(selected_features)} features.')

Of the 2167 total features, only 914 are kept, which suggests that many of the features we created are redundant.
Recursive Feature Elimination
The most common wrapper method is recursive feature elimination (RFE). RFE trains a machine learning model over multiple rounds; after each round the least important features are eliminated, and the next round is trained on the remaining feature set.
Because RFE consumes a lot of resources, we do not run it here; the code would look like this:

# from sklearn.svm import LinearSVC
# from sklearn.feature_selection import RFECV

# Use SVM as the model
# svc = LinearSVC(dual='auto', penalty='l1')

# Recursive feature elimination with cross-validation to select features.
# rfe = RFECV(svc, step=1, cv=5, verbose=1)
# rfe.fit(X, y)

# The mask of selected features.
# print(list(zip(X.columns, rfe.support_)))
# print('The number of features:', rfe.n_features_in_)
# print('The number of selected features:', rfe.n_features_)

# feature_rank = pd.Series(rfe.ranking_, index=X.columns).sort_values(ascending=False)
# print('Features sorted by their rank:', feature_rank[:10], sep='\n')

Feature Importance
Embedded methods also use a model to select features, but unlike RFE they do not repeatedly prune features and retrain; instead the model is trained once on the full feature set.
The most common approach is to select features with a base model that carries an $\ell_1$ or $\ell_2$ penalty, such as Lasso or Ridge, or simply to train a base model and keep the features with the highest weights.
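As an illustrative sketch only (the rest of this notebook relies on LightGBM importances instead), an L1-penalized model can be wrapped in sklearn's SelectFromModel. The snippet assumes a hypothetical X_scaled in which the features have already been imputed and scaled, which is not done in this notebook:

# Hedged sketch: embedded selection with an L1-penalized logistic regression.
# X_scaled is a hypothetical imputed and scaled version of X.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

l1_selector = SelectFromModel(
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1, max_iter=1000)
)
# selected = l1_selector.fit(X_scaled, y).get_feature_names_out()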
We first use the score_dataset function defined earlier to obtain an importance score for every feature:
selected_categorical_features = [col for col in categorical_features if col in selected_features]
eval_results, feature_importances = score_dataset(X[selected_features], y, selected_categorical_features)

Starting time  21:55:44
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50] cv_aggs valid auc: 0.776068 0.00333724
[100] cv_aggs valid auc: 0.781097 0.00296053
[150] cv_aggs valid auc: 0.781236 0.00298245
Early stopping, best iteration is:
[136] cv_aggs valid auc: 0.781375 0.00302538
score_dataset cost 0 hours 2 minutes 25 seconds

# Sort features according to importance
feature_importances = feature_importances.sort_values('score', ascending=False)
feature_importances['score'].head(15)

AMT_ANNUITY / AMT_CREDIT 92.6
MODE(previous.PRODUCT_COMBINATION) 62.8
EXT_SOURCE_2 EXT_SOURCE_3 60.4
MODE(installments.previous.PRODUCT_COMBINATION) 52.0
MAX(bureau.DAYS_CREDIT) 40.4
DAYS_BIRTH / EXT_SOURCE_1 38.4
MAX(bureau.DAYS_CREDIT_ENDDATE) 35.8
SUM(bureau.AMT_CREDIT_MAX_OVERDUE) 34.2
MEAN(bureau.AMT_CREDIT_SUM_DEBT) 34.0
AMT_GOODS_PRICE / AMT_ANNUITY 30.6
MODE(cash.previous.PRODUCT_COMBINATION) 29.8
MAX(cash.previous.DAYS_LAST_DUE_1ST_VERSION) 29.8
SUM(bureau.AMT_CREDIT_SUM) 29.0
MEAN(previous.MEAN(cash.CNT_INSTALMENT_FUTURE)) 29.0
AMT_CREDIT - AMT_GOODS_PRICE 28.2
Name: score, dtype: float64

Many of the features we constructed made it into the top 15, which should give us confidence that all the hard work was worth it!
Next we drop the features whose importance is 0, since these features were never actually used to split a node in any tree. Removing them is therefore a very safe choice, at least for this particular model.
# Find the features with zero importance
zero_features = feature_importances.query('score == 0.0').index.tolist()
print(f'\nThere are {len(zero_features)} features with 0.0 importance')

There are 105 features with 0.0 importance

selected_features = [col for col in selected_features if col not in zero_features]
print('The number of selected features:', len(selected_features))
print('Dropped {} features with zero importance.'.format(len(zero_features)))

The number of selected features: 835
Dropped 105 features with zero importance.

selected_categorical_features = [col for col in categorical_features if col in selected_features]
eval_results, feature_importances = score_dataset(X[selected_features], y, selected_categorical_features)

Starting time  21:58:13
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50] cv_aggs valid auc: 0.77607 0.00333823
[100] cv_aggs valid auc: 0.781042 0.00295406
[150] cv_aggs valid auc: 0.781317 0.00303434
[200] cv_aggs valid auc: 0.780819 0.00281177
Early stopping, best iteration is:
[154] cv_aggs valid auc: 0.781405 0.0029417
score_dataset cost 0 hours 2 minutes 34 seconds

After dropping the zero-importance features we still have 835 features. If we feel that this is still far too many, we can keep removing the least important ones. The figure below shows cumulative importance versus the number of features:
feature_importances = feature_importances.sort_values('score', ascending=False)

sns.lineplot(x=range(1, feature_importances.shape[0] + 1), y=feature_importances['score'].cumsum())
plt.show()

Suppose we choose to keep only the features needed to reach 95% of the total importance:

def select_import_features(scores, thresh=0.95):
    feature_imp = pd.DataFrame(scores, columns=['score'])
    # Sort features according to importance
    feature_imp = feature_imp.sort_values('score', ascending=False)
    # Normalize the feature importances
    feature_imp['score_normalized'] = feature_imp['score'] / feature_imp['score'].sum()
    feature_imp['cumsum'] = feature_imp['score_normalized'].cumsum()
    selected_features = feature_imp.query(f'cumsum <= {thresh}')
    return selected_features.index.tolist()

init_n = len(selected_features)
import_features = select_import_features(feature_importances['score'], thresh=0.95)
print('The number of import features:', len(import_features))
print(f'Dropped {init_n - len(import_features)} features.')

The number of import features: 241
Dropped 594 features.

The remaining 241 features are enough to cover 95% of the total importance.
import_categorical_features = [col for col in categorical_features if col in import_features]
eval_results, feature_importances = score_dataset(X[import_features], y, import_categorical_features)

Starting time  22:00:49
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50] cv_aggs valid auc: 0.756425 0.0029265
[100] cv_aggs valid auc: 0.759284 0.0029921
[150] cv_aggs valid auc: 0.759162 0.00314089
Early stopping, best iteration is:
[115] cv_aggs valid auc: 0.759352 0.00300464
score_dataset cost 0 hours 0 minutes 21 seconds

Before moving on, we should record the feature selection steps we took, for future reference:
- Dropped uninformative features with zero mutual information: removed 117 features.
- Dropped collinear variables with correlation above 0.9: removed 1108 features.
- Dropped features with 0.0 GBM importance: removed 108 features.
- (Optional) kept only the features needed to reach 95% of the feature importance: removed 586 features.

Let's look at the composition of the selected features:
original = set(original_features) & set(import_features)
derived = set(import_features) - set(original)

print(f'Selected features: {len(original)} original features, {len(derived)} derived features.')

Selected features: 33 original features, 208 derived features.

Of the 241 features kept, 33 are original features and 208 are derived features.
Principal Component Analysis
Besides models with an L1 penalty, common dimensionality reduction methods include principal component analysis (PCA) and linear discriminant analysis (LDA). The two are similar in spirit; this section focuses on PCA.
- PCA (sklearn.decomposition): principal component analysis
- LinearDiscriminantAnalysis (sklearn.discriminant_analysis): linear discriminant analysis
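For comparison, a minimal LDA sketch could look like the following; it is not run here, since LDA cannot handle missing values and, with a binary target, produces at most one discriminant component:

# Hedged sketch: supervised dimensionality reduction with LDA (not run here).
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

lda = make_pipeline(RobustScaler(), LinearDiscriminantAnalysis(n_components=1))
# X_lda = lda.fit_transform(X, y)   # shape (n_samples, 1) for a binary target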
Applying PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

# Standardize
pca = Pipeline([
    ('standardize', RobustScaler()),
    ('pca', PCA(n_components=None, random_state=SEED)),
], verbose=True)

principal_components = pca.fit_transform(X)
weight_matrix = pca['pca'].components_

[Pipeline] ....... (step 1 of 2) Processing standardize, total 1.1min
[Pipeline] ............... (step 2 of 2) Processing pca, total 11.8min

Here pca.components_ corresponds to the truncated $V^T$ of the SVD that sklearn's PCA uses to solve the decomposition. After the transform, pca.components_ is an array of shape (n_components, n_features), where n_components is the number of components we requested and n_features is the number of original features. Each row represents one principal component and each column one original feature, so each element of pca.components_ is the weight of that feature in that component.
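As a quick check of this interpretation, we can inspect the loadings directly; the snippet below (using the weight_matrix computed above) lists the features with the largest absolute weight in the first component:

# The features that contribute most to the first principal component.
loadings = pd.DataFrame(weight_matrix, columns=X.columns)
print(loadings.iloc[0].abs().sort_values(ascending=False).head(10))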
Visualizing the explained variance

def plot_variance(pca, n_components=10):
    evr = pca.explained_variance_ratio_[:n_components]
    grid = range(1, n_components + 1)
    # Create figure
    plt.figure(figsize=(6, 4))
    # Percentage of variance explained by each component.
    plt.bar(grid, evr, label='Explained Variance')
    # Cumulative variance
    plt.plot(grid, np.cumsum(evr), 'o-', label='Cumulative Variance', color='orange')
    plt.xlabel('The number of Components')
    plt.xticks(grid)
    plt.title('Explained Variance Ratio')
    plt.ylim(0.0, 1.1)
    plt.legend(loc='best')

plot_variance(pca['pca'])
plt.show()
We visualize the data using the first two principal components:

print('explained variance ratio (first two components): %s'
      % str(pca['pca'].explained_variance_ratio_[:2]))

sns.kdeplot(x=principal_components[:, 0], y=principal_components[:, 1], hue=y)
plt.xlim(-1e8, 1e8)
plt.ylim(-1e8, 1e8)

explained variance ratio (first two components): [0.43424749 0.33590885]
(-100000000.0, 100000000.0)

The two classes are not cleanly separated, so more principal components would be needed.
PCA can reduce the number of dimensions effectively, but it does so by mapping the original samples into a lower-dimensional space, which means the PCA features have no real business meaning. In addition, PCA assumes the data are normally distributed, which may not hold for real data. For these reasons we only demonstrate how to use PCA here and do not actually apply it to the data.
Summary
This chapter introduced many feature selection methods:
- Univariate feature selection helps us understand the data, its structure and its characteristics, and can remove irrelevant features, but it cannot detect redundant features.
- Regularized linear models can be used both for understanding features and for selecting them, but the features first need to be transformed to be roughly normally distributed.
- Feature-importance selection with embedded methods is very popular and easy to use, but it has two main problems: important features may receive low scores (the correlated-features problem), and the method favours features with many categories (the bias problem).
At this point the classic feature engineering workflow is complete. Finally, we use the LightGBM model once more to evaluate the selected features.
eval_results, feature_importances = score_dataset(X[selected_features], y, selected_categorical_features)

Starting time  22:14:44
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50] cv_aggs valid auc: 0.77607 0.00333823
[100] cv_aggs valid auc: 0.781042 0.00295406
[150] cv_aggs valid auc: 0.781317 0.00303434
[200] cv_aggs valid auc: 0.780819 0.00281177
Early stopping, best iteration is:
[154] cv_aggs valid auc: 0.781405 0.0029417
score_dataset cost 0 hours 2 minutes 25 seconds

Feature importances:
# Sort features according to importance
feature_importances['score'].sort_values(ascending=False).head(15)

AMT_ANNUITY / AMT_CREDIT 98.6
EXT_SOURCE_2 EXT_SOURCE_3 66.6
MODE(previous.PRODUCT_COMBINATION) 66.0
MODE(installments.previous.PRODUCT_COMBINATION) 54.4
MAX(bureau.DAYS_CREDIT) 41.8
MAX(bureau.DAYS_CREDIT_ENDDATE) 40.0
DAYS_BIRTH / EXT_SOURCE_1 39.8
MEAN(bureau.AMT_CREDIT_SUM_DEBT) 37.2
SUM(bureau.AMT_CREDIT_MAX_OVERDUE) 35.2
MODE(cash.previous.PRODUCT_COMBINATION) 33.4
AMT_GOODS_PRICE / AMT_ANNUITY 33.0
SUM(bureau.AMT_CREDIT_SUM) 30.8
MAX(cash.previous.DAYS_LAST_DUE_1ST_VERSION) 30.8
MEAN(previous.MEAN(cash.CNT_INSTALMENT_FUTURE)) 29.8
AMT_CREDIT - AMT_GOODS_PRICE 29.2
Name: score, dtype: float64

del X, y
gc.collect()

df = pd.read_csv('../datasets/Home-Credit-Default-Risk/created_data.csv', index_col='SK_ID_CURR')

selected_data = df[selected_features + ['TARGET']]
selected_data.to_csv('../datasets/Home-Credit-Default-Risk/selected_data.csv', index=True)