

news 2025/11/15 7:16:32

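10.13 Update: a new state-of-the-art pretrained model has just been released. See 李入魔: 【NLP】Google BERT详解 (zhuanlan.zhihu.com).

1. Introduction

Word vectors have long been the main representation technique in NLP tasks. Following a series of breakthroughs in late 2017 and early 2018, research has shown that pretrained language representations, once fine-tuned, perform better on many NLP tasks. There are currently two approaches to pretraining:

- Feature-based: the trained representation is used as a feature for the task. Word vectors, sentence vectors, paragraph vectors, and document vectors all work this way. The newer ELMo also belongs to this class, but after transfer the representation of the input has to be recomputed.
- Fine-tuning: borrowed mainly from computer vision. Task-specific layers are added on top of a pretrained model, and the last few layers are then fine-tuned. The newer ULMFiT and OpenAI GPT belong to this class.

This post briefly introduces three pretrained language models: ELMo, ULMFiT, and OpenAI GPT.

2. ELMo

2.1 Model principle and architecture

Paper: Deep contextualized word representations

ELMo is an embedding extracted from a bidirectional language model (biLM). Training uses a BiLSTM. Given N tokens (t_1, t_2, ..., t_N), the objective is to maximize

$$\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \right)$$

For each token $t_k$, ELMo computes 2L+1 representations with an L-layer biLM:

$$R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \} = \{ h_{k,j}^{LM} \mid j = 0, \ldots, L \}$$

Here $h_{k,0}^{LM} = x_k^{LM}$ is the direct encoding of the token (in this model, its characters encoded by a CNN), and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ is the output of the j-th biLSTM layer.

In downstream use, the outputs of all layers in $R_k$ are collapsed into a single vector, $ELMo_k = E(R_k; \Theta_e)$. The simplest way is to take the top layer as the token representation: $E(R_k) = h_{k,L}^{LM}$. A more general approach combines the information from all layers with a few learned parameters:

$$ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}$$

where $s^{task}$ are softmax-normalized weights and $\gamma^{task}$ is a task-specific scale parameter that matters a great deal for optimization. Because the outputs of each biLM layer follow different distributions, the paper also notes that applying layer normalization to each layer before weighting can help in some cases.

The pretrained biLM in the paper is a modified version of CNN-BIG-LSTM from Jozefowicz et al. The final model has two biLSTM layers (4096 units, 512-dimension projections) with a residual connection between the first and second layers. A CNN followed by two Highway layers produces a character-level, context-independent encoding of each token. As a result, the model outputs three layers of vector representations per token.

2.2 Training notes

- Regularization
  1. Dropout
  2. Add a weight penalty $\lambda \lVert w \rVert_2^2$ to the loss (experiments show that ELMo works better with a small $\lambda$)
- Notes on the TF source code
  1. The model architecture lives mainly in the LanguageModel class of the training module and is built in two steps: first create the word or character embedding layer (CNN + Highway), then create the BiLSTM layers.
  2. Pretrained models are loaded through the BidirectionalLanguageModel class in the model module.

2.3 Using the model

- Concatenate the ELMo vector with a traditional word vector $x_k$ to form $[x_k; ELMo_k^{task}]$, and feed the result into the task-specific RNN.
- Place the ELMo vector at the output side of the model and concatenate it with the task RNN output $h_k$ to form $[h_k; ELMo_k^{task}]$.

Keras code example (TensorFlow 1.x API):

    import tensorflow as tf
    import tensorflow_hub as hub  # required for hub.Module; missing from the original snippet
    from keras import backend as K
    import keras.layers as layers
    from keras.models import Model

    # Initialize session
    sess = tf.Session()
    K.set_session(sess)

    # Instantiate the elmo model
    elmo_model = hub.Module("https://tfhub.dev/google/elmo/1", trainable=True)
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())

    # We create a function to integrate the tensorflow model with a Keras model
    # This requires explicitly casting the tensor to a string, because of a Keras quirk
    def ElmoEmbedding(x):
        # Squeeze only axis 1 so that a batch of size 1 keeps its batch dimension
        return elmo_model(tf.squeeze(tf.cast(x, tf.string), axis=1),
                          signature="default", as_dict=True)["default"]

    input_text = layers.Input(shape=(1,), dtype=tf.string)
    embedding = layers.Lambda(ElmoEmbedding, output_shape=(1024,))(input_text)
    dense = layers.Dense(256, activation="relu")(embedding)
    pred = layers.Dense(1, activation="sigmoid")(dense)

    model = Model(inputs=[input_text], outputs=pred)

    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.summary()

2.4 Strengths and weaknesses

Strengths:

- It works well and improves on traditional models in most tasks. Experiments confirm that ELMo captures syntactic and semantic information better than word vectors do.
- Traditional pretrained word vectors provide only a single layer of representation, and their vocabulary is limited. ELMo provides character-level representations, so it places no limit on the vocabulary.

Weaknesses:

- It is slow: the encoding of every token has to be computed by the language model.

2.5 Suitable tasks

- Question Answering
- Textual entailment
- Semantic role labeling
- Coreference resolution
- Named entity extraction
- Sentiment analysis

3. ULMFiT

3.1 Model principle and architecture

Paper: Universal Language Model Fine-tuning for Text Classification

ULMFiT is an effective transfer-learning method for NLP. Its core idea is to fine-tune a pretrained language model to solve other NLP tasks. The language model in the paper follows the AWD-LSTM of Merity et al. 2017a, a three-layer LSTM with no attention or shortcut connections.

ULMFiT has three stages:

1. General-domain LM pretraining

The language model is pretrained on Wikitext-103. The pretraining corpus should be large and capture general properties of language. Pretraining is especially effective for small datasets: afterwards, only a few examples are needed for the model to generalize.

2. Target task LM fine-tuning

The paper introduces two fine-tuning methods.

Discriminative fine-tuning: different layers of the network capture different types of information, so they should also be fine-tuned with different learning rates. The authors assign each layer its own learning rate. They found it works best to first choose the learning rate $\eta^L$ by fine-tuning only the last layer L, and then to derive the learning rate of each lower layer recursively:

$$\eta^{l-1} = \eta^{l} / 2.6$$

Slanted triangular learning rates (STLR): to adapt the parameters to a specific task, the model should ideally converge quickly to a suitable region of parameter space at the start of training and then refine from there. To achieve this, the authors propose STLR, which increases the learning rate briefly at the start of training and then decreases it (shown in the upper right of panel (b) of the figure). The schedule is:

$$cut = \lfloor T \cdot cut\_frac \rfloor$$

$$p = \begin{cases} t / cut, & \text{if } t < cut \\ 1 - \dfrac{t - cut}{cut \cdot (1 / cut\_frac - 1)}, & \text{otherwise} \end{cases}$$

$$\eta_t = \eta_{max} \cdot \frac{1 + p \cdot (ratio - 1)}{ratio}$$

- T: number of training iterations
- cut_frac: fraction of iterations during which the LR increases
- cut: the iteration at which the LR switches from increasing to decreasing
- p: the fraction of the number of iterations the LR has increased or will decrease, respectively
- ratio: how much smaller the lowest LR is than the max LR $\eta_{max}$
- $\eta_t$: the LR at iteration t

The authors use cut_frac = 0.1, ratio = 32, and $\eta_{max} = 0.01$.

3. Target task classifier fine-tuning

To fine-tune for classification, the authors add two linear blocks after the last layer. Each block has batch norm and dropout, the intermediate layer uses ReLU activation, and a final softmax outputs the class probability distribution. This last stage involves the following techniques.

Concat pooling: the input to the first linear layer is a pooling of the last hidden layer's states. The key information for text classification can appear anywhere in the text, so the output at the last time step alone is not enough. The authors concatenate the last time step with pooled representations over as many time steps as possible and use the result as input.

Gradual unfreezing: excessive fine-tuning makes the model forget what it learned during pretraining. The authors therefore unfreeze the network gradually. They start by unfreezing and fine-tuning the last layer, then work from back to front until all layers are unfrozen and fine-tuned.

BPTT for Text Classification (BPT3C): to fine-tune on large documents, the authors split each document into fixed-length batches of size b and keep track of the mean and max pooling while training on each batch. Gradients are backpropagated to the batches that contributed to the final prediction.

Bidirectional language model: in the authors' experiments, a forward LM and a backward LM are fine-tuned independently and their predictions are averaged. Combining the two improves results by 0.5 to 0.7.

3.2 Training notes

- Notes on the PyTorch source code (fastai lesson 10)

    # location: fastai/lm_rnn.py

    def get_language_model(n_tok, emb_sz, n_hid, n_layers, pad_token,
                           dropout=0.4, dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5,
                           tie_weights=True, qrnn=False, bias=False):
        """Returns a SequentialRNN model.

        A RNN_Encoder layer is instantiated using the parameters provided.
        This is followed by the creation of a LinearDecoder layer.
        Also by default (i.e. tie_weights = True), the embedding matrix used in the RNN_Encoder
        is used to instantiate the weights for the LinearDecoder layer.
        The SequentialRNN layer is the native torch's Sequential wrapper that puts the RNN_Encoder and
        LinearDecoder layers sequentially in the model.

        Args:
            n_tok (int): number of unique vocabulary words (or tokens) in the source dataset
            emb_sz (int): the embedding size to use to encode each token
            n_hid (int): number of hidden activation per LSTM layer
            n_layers (int): number of LSTM layers to use in the architecture
            pad_token (int): the int value used for padding text.
            dropouth (float): dropout to apply to the activations going from one LSTM layer to another
            dropouti (float): dropout to apply to the input layer.
            dropoute (float): dropout to apply to the embedding layer.
            wdrop (float): dropout used for a LSTM's internal (or hidden) recurrent weights.
            tie_weights (bool): decide if the weights of the embedding matrix in the RNN encoder should be tied to the
                weights of the LinearDecoder layer.
            qrnn (bool): decide if the model is composed of LSTMS (False) or QRNNs (True).
            bias (bool): decide if the decoder should have a bias layer or not.
        Returns:
            A SequentialRNN model
        """
        rnn_enc = RNN_Encoder(n_tok, emb_sz, n_hid=n_hid, n_layers=n_layers, pad_token=pad_token,
                              dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
        # Share the embedding matrix with the decoder when weights are tied
        enc = rnn_enc.encoder if tie_weights else None
        return SequentialRNN(rnn_enc, LinearDecoder(n_tok, emb_sz, dropout, tie_encoder=enc, bias=bias))

    def get_rnn_classifier(bptt, max_seq, n_class, n_tok, emb_sz, n_hid, n_layers, pad_token, layers, drops,
                           bidir=False, dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5, qrnn=False):
        # Encoder that processes long documents in bptt-sized chunks, followed by the pooling classifier head
        rnn_enc = MultiBatchRNN(bptt, max_seq, n_tok, emb_sz, n_hid, n_layers, pad_token=pad_token, bidir=bidir,
                                dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
        return SequentialRNN(rnn_enc, PoolingLinearClassifier(layers, drops))

3.3 Strengths and weaknesses

Strengths: compared with other transfer-learning methods (such as ELMo), ULMFiT is better suited to:

- non-English languages, where labeled training data is scarce
- new NLP tasks that have no state-of-the-art model
- tasks where only part of the data is labeled

Weaknesses: it transfers easily to classification and sequence labeling, but more complex tasks such as question answering need new fine-tuning methods.

3.4 Suitable tasks

- Classification
- Sequence labeling

4. OpenAI GPT

4.1 Model principle and architecture

Paper: Improving Language Understanding by Generative Pre-Training (unpublished)

OpenAI Transformer is a Transformer-based language model that can be transferred to many NLP tasks. Its basic idea is the same as ULMFiT's: apply a pretrained language model to a variety of tasks while changing the model structure as little as possible. The difference is that OpenAI Transformer advocates the Transformer architecture, whereas ULMFiT uses an RNN-based language model. The network structure is shown in a figure in the paper.

Training has two stages:

1. Unsupervised pre-training

The goal of the first stage is to pretrain the language model. Given a corpus of tokens $U = \{u_1, \ldots, u_n\}$, the objective is to maximize the likelihood

$$L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$

The model applies multi-headed self-attention followed by position-wise feed-forward layers, and finally outputs a distribution:

$$h_0 = U W_e + W_p$$

$$h_l = \text{transformer\_block}(h_{l-1}), \quad \forall l \in [1, n]$$

$$P(u) = \text{softmax}(h_n W_e^T)$$

2. Supervised fine-tuning

Once the language model is pretrained, take a labeled training set $C$. For an input sequence $x^1, \ldots, x^m$ with label $y$, the language model produces the final activation $h_l^m$, which goes through an output layer to predict $y$:

$$P(y \mid x^1, \ldots, x^m) = \text{softmax}(h_l^m W_y)$$

The objective is then

$$L_2(C) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)$$

and the objective for the whole task is

$$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$$

4.2 Training notes

- Notes on the TF source code

    # location: finetune-transformer-lm/train.py
    # Excerpt: relies on globals and helpers defined elsewhere in train.py
    # (n_vocab, n_special, n_ctx, n_embd, n_layer, clf_token, embed, block, clf, dropout, shape_list, ...)

    def model(X, M, Y, train=False, reuse=False):
        with tf.variable_scope('model', reuse=reuse):
            # n_special = 3 counts the special tokens added to the vocabulary (start, delimiter, classify)
            # n_ctx is presumably short for n_context, the context length
            we = tf.get_variable("we", [n_vocab+n_special+n_ctx, n_embd],
                                 initializer=tf.random_normal_initializer(stddev=0.02))
            we = dropout(we, embd_pdrop, train)

            X = tf.reshape(X, [-1, n_ctx, 2])
            M = tf.reshape(M, [-1, n_ctx])

            # 1. Embedding
            h = embed(X, we)

            # 2. Transformer blocks
            for layer in range(n_layer):
                h = block(h, 'h%d'%layer, train=train, scale=True)

            # 3. Language model loss
            lm_h = tf.reshape(h[:, :-1], [-1, n_embd])
            lm_logits = tf.matmul(lm_h, we, transpose_b=True)
            lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
                logits=lm_logits, labels=tf.reshape(X[:, 1:, 0], [-1]))
            lm_losses = tf.reshape(lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
            lm_losses = tf.reduce_sum(lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)

            # 4. Classifier loss
            clf_h = tf.reshape(h, [-1, n_embd])
            pool_idx = tf.cast(tf.argmax(tf.cast(tf.equal(X[:, :, 0], clf_token), tf.float32), 1), tf.int32)
            clf_h = tf.gather(clf_h, tf.range(shape_list(X)[0], dtype=tf.int32)*n_ctx+pool_idx)

            clf_h = tf.reshape(clf_h, [-1, 2, n_embd])
            if train and clf_pdrop > 0:
                shape = shape_list(clf_h)
                shape[1] = 1
                clf_h = tf.nn.dropout(clf_h, 1-clf_pdrop, shape)
            clf_h = tf.reshape(clf_h, [-1, n_embd])
            clf_logits = clf(clf_h, 1, train=train)
            clf_logits = tf.reshape(clf_logits, [-1, 2])

            clf_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=clf_logits, labels=Y)
            return clf_logits, clf_losses, lm_losses

4.3 Strengths and weaknesses

Strengths:

- Recurrent neural networks capture relatively little information, while the Transformer can capture longer-range information.
- It computes faster than recurrent neural networks and is easy to parallelize.
- Experiments show the Transformer outperforming ELMo and LSTM networks.

Weaknesses:

- For some types of tasks, the structure of the input data has to be adjusted.

4.4 Suitable tasks

- Natural Language Inference
- Question Answering and commonsense reasoning
- Classification
- Semantic Similarity

5. Summary

From word embeddings to OpenAI Transformer, transfer learning in NLP has moved from representing words as vectors with word2vec and GloVe, to ELMo, which shares the weights of the first few layers, and on to ULMFiT and OpenAI Transformer, which fine-tune the whole pretrained model. This has greatly improved results on basic NLP tasks. Several studies also show that a language model used as the pretrained model captures not only syntactic information but also semantic information, giving later network layers a high-level abstraction to work with. Transformer-based models have also outperformed RNN models in some respects.

Finally, for any specific task it is still worth trying several approaches: use the methods above to build a baseline model, then adjust the network structure to improve results.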
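As a supplement to section 2.1, the task-specific combination of ELMo layers (softmax-normalized layer weights times a scale factor) can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: vectors are Python lists, the weights are fixed rather than learned, and the names `softmax` and `combine_elmo_layers` are hypothetical helpers, not part of any ELMo library.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of floats."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def combine_elmo_layers(layer_vectors, raw_weights, gamma=1.0):
    """Collapse the L+1 layer vectors of one token into a single vector.

    layer_vectors: list of L+1 equal-length lists (token layer first).
    raw_weights:   unnormalized per-layer weights (learned in a real model).
    gamma:         task-specific scale factor.
    """
    if len(layer_vectors) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    s = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s_j * h[d] for s_j, h in zip(s, layer_vectors))
        for d in range(dim)
    ]


# Three layers (L = 2) of 2-dimensional toy vectors for a single token.
toy_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_mix = combine_elmo_layers(toy_layers, [0.0, 0.0, 0.0])
scaled_mix = combine_elmo_layers(toy_layers, [0.0, 0.0, 0.0], gamma=3.0)
```

With equal raw weights the softmax gives each layer a weight of 1/3, so the result is simply the average of the layers, scaled by `gamma`.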
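The two learning-rate techniques from section 3.1 can also be written out directly. The sketch below assumes the recursion and the STLR schedule as given in the ULMFiT paper, with the paper's defaults (cut_frac = 0.1, ratio = 32, maximum LR 0.01). The function names `discriminative_lrs` and `stlr` are illustrative and are not fastai APIs.

```python
import math


def discriminative_lrs(lr_last, n_layers, factor=2.6):
    """Per-layer learning rates, first layer first.

    The last layer gets lr_last; each lower layer gets the rate of the
    layer above it divided by `factor`.
    """
    return [lr_last / factor ** (n_layers - 1 - l) for l in range(n_layers)]


def stlr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Slanted triangular learning rate at iteration t out of T."""
    cut = math.floor(T * cut_frac)
    if cut < 1:
        raise ValueError("T * cut_frac must be at least 1")
    if t < cut:
        p = t / cut  # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio


layer_lrs = discriminative_lrs(0.01, n_layers=3)
schedule = [stlr(t, T=1000) for t in (0, 100, 1000)]
```

In this example the schedule starts at `lr_max / ratio`, peaks at `lr_max` when `t == cut`, and returns to `lr_max / ratio` at `t == T`.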
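Concat pooling from section 3.1 is easy to picture with a toy example. The sketch below follows the form used in the ULMFiT paper, where the last hidden state is concatenated with the max-pooled and mean-pooled hidden states over all time steps. It works on plain Python lists rather than tensors, and `concat_pool` is a hypothetical name chosen for illustration.

```python
def concat_pool(hidden_states):
    """Concatenate the last hidden state with max- and mean-pooled states.

    hidden_states: list of T equal-length lists, one per time step.
    Returns a single list of length 3 * dim.
    """
    if not hidden_states:
        raise ValueError("need at least one time step")
    dim = len(hidden_states[0])
    steps = len(hidden_states)
    last = list(hidden_states[-1])
    max_pool = [max(h[d] for h in hidden_states) for d in range(dim)]
    mean_pool = [sum(h[d] for h in hidden_states) / steps for d in range(dim)]
    return last + max_pool + mean_pool


# Three time steps of 2-dimensional hidden states.
toy_states = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]
pooled = concat_pool(toy_states)
```

The pooled vector is three times as wide as a single hidden state, which is why the first linear layer of the classifier takes an input of that size.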
