I recently got started with data science and wanted to try Kaggle for myself. After working through the beginner Titanic and House Prices projects, I found that a basic baseline is easy enough to write, but getting from there to a score worth showing still requires some theoretical analysis of the details. This post introduces the principles and techniques behind Kaggle competitions. Everything comes from the Coursera course referenced below; since the course is in English and easy to follow, the notes are kept in English.

Reference: How to Win a Data Science Competition: Learn from Top Kagglers

Feature types: numeric, categorical, ordinal, datetime, coordinate, text

## Numeric features

All models can be divided into tree-based models and non-tree-based models.

### Scaling

For example, if we apply the KNN algorithm and compute the distance between an instance and another object, the dimension with the larger scale dominates the distance.

- Tree-based models do not depend on scaling.
- Non-tree-based models depend heavily on scaling.

How to do it in sklearn (a combined preprocessing sketch appears at the end of this section):

- To [0, 1]: `sklearn.preprocessing.MinMaxScaler`, i.e. X = (X - X.min()) / (X.max() - X.min())
- To mean 0, std 1: `sklearn.preprocessing.StandardScaler`, i.e. X = (X - X.mean()) / X.std()

If we want to use KNN, we can go one step further and recall that the bigger a feature is, the more important it will be for KNN. So we can tune the scaling parameter to boost the features that seem more important to us and see if this helps.

### Outliers

Outliers pull the fitted model away from the bulk of the data. We can clip feature values between two chosen lower and upper bounds.

### Rank transformation

If we have outliers, this behaves better than scaling: it moves the outliers closer to the other objects. Linear models, KNN and neural networks can benefit from this method.

- rank([-100, 0, 1e5]) == [0, 1, 2]
- rank([1000, 1, 10]) == [2, 0, 1]

scipy: `scipy.stats.rankdata`

### Other transforms

- Log transform: np.log(1 + x)
- Raising to a power less than 1: np.sqrt(x + 2/3)

### Feature generation

Depends on:

a. prior knowledge
b. exploratory data analysis

## Ordinal and categorical features

Examples of ordinal features:

- Ticket class: 1, 2, 3
- Driver's license: A, B, C, D
- Education: kindergarten, school, undergraduate, bachelor, master, doctoral

### Processing

1. Label encoding
   - Alphabetical (sorted): [S, C, Q] -> [2, 1, 3], via `sklearn.preprocessing.LabelEncoder`
   - Order of appearance: [S, C, Q] -> [1, 2, 3], via `pandas.factorize`

   Both variants work fine with tree-based models, because trees can split on the feature and extract most of the useful categories on their own. Non-tree-based models, on the other hand, usually cannot use such a feature effectively.

2. Frequency encoding: [S, C, Q] -> [0.5, 0.3, 0.2]

   encoding = titanic.groupby('Embarked').size()
   encoding = encoding / len(titanic)
   titanic['enc'] = titanic.Embarked.map(encoding)

   If several categories share the same frequency, `scipy.stats.rankdata` can be applied to keep them distinguishable. Frequency encoding also helps linear models: if the frequency of a category is correlated with the target value, a linear model will utilize this dependency.

3. One-hot encoding: `pandas.get_dummies`

   It gives every category of a feature its own new column and is often used for non-tree-based models. It slows tree-based models down, so we use sparse matrices; most libraries, namely XGBoost and LightGBM, can work with sparse matrices directly.

An encoding sketch covering all three approaches follows at the end of this section.

### Feature generation

Interactions of categorical features can help linear models and KNN, for example by concatenating the string values of two features.
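To make the scaling, rank and log transforms above concrete, here is a minimal sketch; the column names and toy values are made up for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numeric features with very different scales and one outlier (made-up data).
df = pd.DataFrame({"age":  [22, 38, 26, 35, 80],
                   "fare": [7.25, 71.28, 7.92, 53.10, 512.33]})

# To [0, 1]: X = (X - X.min()) / (X.max() - X.min())
minmax = MinMaxScaler().fit_transform(df)

# To mean 0, std 1: X = (X - X.mean()) / X.std()
standard = StandardScaler().fit_transform(df)

# Rank transform: pulls the outlier fare of 512.33 next to the other objects.
fare_rank = rankdata(df["fare"])        # [1. 4. 2. 3. 5.]

# Log transform for heavy-tailed features.
fare_log = np.log1p(df["fare"])
```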
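And a corresponding sketch of the three encodings from the processing list, on made-up Titanic-like data (the `Embarked` column and its values are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

titanic = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", "S", "C"]})

# 1. Label encoding: alphabetical order (sklearn) vs. order of appearance (pandas).
titanic["emb_label"] = LabelEncoder().fit_transform(titanic["Embarked"])  # C=0, Q=1, S=2
titanic["emb_factorized"] = pd.factorize(titanic["Embarked"])[0]          # S=0, C=1, Q=2

# 2. Frequency encoding: share of each category in the data.
encoding = titanic.groupby("Embarked").size() / len(titanic)
titanic["emb_freq"] = titanic["Embarked"].map(encoding)                   # S=0.5, C=0.33, Q=0.17

# 3. One-hot encoding: one new column per category, useful for non-tree models.
dummies = pd.get_dummies(titanic["Embarked"], prefix="Embarked")
titanic = pd.concat([titanic, dummies], axis=1)
```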
## Datetime and coordinates

### Date and time

1. Periodicity, e.g. day of week, month, hour (a short datetime sketch appears in the worked examples at the end of the post).
2. Time since
   a. a row-independent moment, for example 00:00:00 UTC, 1 January 1970;
   b. a row-dependent important moment, for example the number of days left until the next holiday, or the time passed since the last holiday.
3. Difference between dates: we can add a date_diff feature which indicates the number of days between two events.

### Coordinates

1. Interesting places taken from the train/test data or from additional data: generate the distance from the instance to a flat, an old building, or anything else that is meaningful.
2. Aggregated statistics, for example the price of the surrounding buildings.
3. Rotation: sometimes rotating the coordinates helps the model separate the instances more precisely.

## Missing data

Hidden NaN, numeric: drawing a histogram of a feature can reveal a hidden NaN, for example a spike at -1 that has no meaning for this feature.

### Fillna approaches

1. -999, -1, etc. (a value outside the feature range). This is useful in that it gives trees the possibility to put missing values into a separate category. The downside is that the performance of linear models and neural networks can suffer.
2. Mean or median. This is usually beneficial for simple linear models and neural networks, but for trees it becomes harder to select the objects that had missing values in the first place.
3. Reconstruct the value:
   - add an isnull binary feature indicating missingness;
   - predict the missing value, for example by replacing it with the mean or median grouped by another feature. Sometimes this can get screwed up; the way to handle it is to ignore missing values while calculating the means for each category.
   - For values that do not appear in the train data, just generate a new feature indicating the number of occurrences in the data (frequency).
   - XGBoost can handle NaN directly.
4. Remove rows with missing values. This is possible, but it can lead to the loss of important samples and a quality decrease.

A fillna sketch on toy data is included in the worked examples at the end of the post.

## Text

### Bag of words

Text preprocessing:

1. Lowercase
2. Lemmatization and stemming
3. Stopwords, for example articles or prepositions and other very common words. In sklearn this corresponds to the max_df parameter of `sklearn.feature_extraction.text.CountVectorizer`:

   max_df : float in range [0.0, 1.0] or int, default=1.0
   When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.

### CountVectorizer

Counts the number of times a term occurs in a given document: `sklearn.feature_extraction.text.CountVectorizer`.

### TF-iDF

Re-weights the count features into floating-point values suitable for use by a classifier (a sketch appears in the worked examples at the end of the post).

Term frequency:

   tf = 1 / x.sum(axis=1)[:, None]
   x = x * tf

Inverse document frequency:

   idf = np.log(x.shape[0] / (x > 0).sum(0))
   x = x * idf

### N-grams

`sklearn.feature_extraction.text.CountVectorizer`: ngram_range, analyzer.

   ngram_range : tuple (min_n, max_n)
   The lower and upper boundary of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

### Embeddings (~word2vec)

Each word is converted to a vector in some sophisticated space, usually with several hundred dimensions.

a. Relatively small vectors.
b. Values in the vector can be interpreted only in some cases.
c. Words with similar meaning often have similar embeddings.
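A small sketch of the datetime features above (periodicity, time since a fixed moment, and date differences); the column names and the holiday date are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2024-12-20", "2024-12-26", "2025-01-03"]),
    "last_visit":    pd.to_datetime(["2024-12-01", "2024-12-24", "2024-12-31"]),
})

# Periodicity: calendar components of the timestamp.
df["weekday"] = df["purchase_date"].dt.dayofweek
df["month"] = df["purchase_date"].dt.month

# Time since a row-independent moment (the Unix epoch).
df["days_since_epoch"] = (df["purchase_date"] - pd.Timestamp("1970-01-01")).dt.days

# Time until a row-dependent important moment, e.g. days left until a holiday.
holiday = pd.Timestamp("2025-01-01")
df["days_to_holiday"] = (holiday - df["purchase_date"]).dt.days

# Difference between two date columns.
df["date_diff"] = (df["purchase_date"] - df["last_visit"]).dt.days
```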
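As a worked example for the fillna approaches, a minimal sketch on made-up data (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["A", "A", "B", "B", "B"],
                   "value": [1.0, np.nan, 10.0, 12.0, np.nan]})

# 1. Fill with a value outside the feature range (tree-friendly, hurts linear models/NNs).
df["value_filled_const"] = df["value"].fillna(-999)

# 2. Fill with the overall mean (friendlier for simple linear models and NNs).
df["value_filled_mean"] = df["value"].fillna(df["value"].mean())

# 3a. Binary isnull indicator so the model still knows which values were missing.
df["value_isnull"] = df["value"].isnull().astype(int)

# 3b. Fill with the group mean; NaNs are ignored when the group means are computed.
group_means = df.groupby("group")["value"].transform("mean")
df["value_filled_group"] = df["value"].fillna(group_means)
```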
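And for the text section, a minimal sketch of CountVectorizer, the TF-iDF re-weighting written above, and n-grams; the example sentences are made up:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

# Bag of words: term counts per document; max_df can drop corpus-specific stop words.
bow = CountVectorizer(max_df=1.0)
x = bow.fit_transform(docs).toarray().astype(float)

# Manual TF-iDF re-weighting, matching the formulas above.
tf = 1 / x.sum(axis=1)[:, None]
x = x * tf
idf = np.log(x.shape[0] / (x > 0).sum(0))
x = x * idf

# The same idea is available directly as TfidfVectorizer.
tfidf = TfidfVectorizer().fit_transform(docs)

# Uni- and bi-grams.
ngrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)
```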
Reposted from: https://www.cnblogs.com/bjwu/p/8970821.html
