This post is mainly based on "A Neural Probabilistic Language Model" (Bengio et al.), an important language-modeling paper published in 2003. Its main contributions are as follows:

It proposed a neural-network-based language model, one of the earliest works to apply neural networks to language modeling and a milestone for the field. Using a neural network to predict the probability distribution of the next word has since become a standard approach to training neural language models. In the paper, the authors train a feed-forward neural network that jointly learns word feature representations and a probability function over word sequences, and it predicts the next word better than a trigram language model, demonstrating the effectiveness of neural language models. The ideas and model proposed in the paper laid the foundation for much of the subsequent research on neural network language models.
Model structure
The goal of the model is to learn a probability function that, given the preceding context, predicts the probability distribution of the next word.
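For reference, this is the parameterization from the paper, in the paper's notation: the previous n-1 words are mapped through a shared lookup table C to feature vectors, which are concatenated and fed through a tanh hidden layer to produce logits over the vocabulary (the direct input-to-output connection W is optional in the paper, and the implementation below omits it):

$$
x = \big(C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1})\big), \qquad y = b + Wx + U \tanh(d + Hx)
$$

$$
P(w_t = i \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_i}}{\sum_j e^{y_j}}
$$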
import os
import time
import pandas as pd
from dataclasses import dataclass
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
from torch.utils.tensorboard import SummaryWriter
Load the data
The dataset is waimai_10k, a Chinese food-delivery review dataset of roughly 10,000 reviews.
data = pd.read_csv("./dataset/waimai_10k.csv")
data.dropna(subset="review", inplace=True)
data["review_length"] = data.review.apply(lambda x: len(x))
data.sample(5)
      label  review                                                                    review_length
4545      0  从说马上送餐到收到餐时间特别久给的送餐电话打不通                                        28
9855      0  韩餐做得像川菜牛肉汤油得不能喝量也比实体少很多,送餐时间久得太久了1个半小时唉。        44
5664      0  太糟了。等了两个小时,牛肉我吃的快吐了,再也不可能第二次                              28
2323      1  很好吃,就是粥撒了点,等了一个多小时                                              18
8117      0  送餐员给我打电话比较粗鲁                                                      12
Statistics
data = data[data.review_length <= 50]  # keep reviews of at most 50 characters
words = data.review.tolist()
chars = sorted(list(set("".join(words))))  # the character vocabulary
max_word_length = max(len(w) for w in words)
print(f"number of examples: {len(words)}")
print(f"max word length: {max_word_length}")
print(f"size of vocabulary: {len(chars)}")
number of examples: 10796
max word length: 50
size of vocabulary: 2272
Split into training/test sets
test_set_size = min(1000, int(len(words) * 0.1))  # hold out 10%, capped at 1000 examples
rp = torch.randperm(len(words)).tolist()
train_words = [words[i] for i in rp[:-test_set_size]]
test_words = [words[i] for i in rp[-test_set_size:]]
print(f"split up the dataset into {len(train_words)} training examples and {len(test_words)} test examples")
split up the dataset into 9796 training examples and 1000 test examples
Build the character dataset [tensor]
BLANK : 0
token seq : [1, 2, 3, 4, 5, 6]
block_size : 3 (context length)
x : [[0, 0, 0], [0, 0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
y : [1, 2, 3, 4, 5, 6, 0]
class CharDataset(Dataset):
    def __init__(self, words, chars, max_word_length, block_size=1):
        self.words = words
        self.chars = chars
        self.max_word_length = max_word_length
        self.block_size = block_size
        self.char2i = {ch: i + 1 for i, ch in enumerate(chars)}  # index 0 is reserved for the BLANK token
        self.i2char = {i: s for s, i in self.char2i.items()}

    def __len__(self):
        return len(self.words)

    def contains(self, word):
        return word in self.words

    def get_vocab_size(self):
        return len(self.chars) + 1  # +1 for the BLANK token

    def get_output_length(self):
        return self.max_word_length + 1

    def encode(self, word):
        ix = torch.tensor([self.char2i[w] for w in word], dtype=torch.long)
        return ix

    def decode(self, ix):
        word = "".join(self.i2char[i] for i in ix)
        return word

    def __getitem__(self, idx):
        word = self.words[idx]
        ix = self.encode(word)
        x = torch.zeros(self.max_word_length + self.block_size, dtype=torch.long)
        y = torch.zeros(self.max_word_length, dtype=torch.long)
        x[self.block_size:len(ix) + self.block_size] = ix
        y[:len(ix)] = ix
        y[len(ix) + 1:] = -1  # -1 masks the loss at inactive positions
        if self.block_size > 1:
            # slide a block_size-wide context window over the padded sequence
            xs = []
            for i in range(x.shape[0] - self.block_size):
                xs.append(x[i:i + self.block_size].unsqueeze(0))
            return torch.cat(xs), y
        else:
            return x, y
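As a quick sanity check (not in the original post; the toy vocabulary here is hypothetical), the dataset reproduces the windowing example above:

# Hypothetical toy example mirroring the illustration above:
# a 6-token sequence with block_size=3 context windows.
toy = CharDataset(words=["abcdef"], chars=sorted("abcdef"), max_word_length=7, block_size=3)
x, y = toy[0]
print(x)  # rows: [0,0,0], [0,0,1], [0,1,2], [1,2,3], [2,3,4], [3,4,5], [4,5,6]
print(y)  # tensor([1, 2, 3, 4, 5, 6, 0]); the trailing 0 (BLANK) marks the end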
Data loader [DataLoader]
class InfiniteDataLoader:
    def __init__(self, dataset, **kwargs):
        train_sampler = torch.utils.data.RandomSampler(dataset, replacement=True, num_samples=int(1e10))
        self.train_loader = DataLoader(dataset, sampler=train_sampler, **kwargs)
        self.data_iter = iter(self.train_loader)

    def next(self):
        try:
            batch = next(self.data_iter)
        except StopIteration:  # technically only happens after 1e10 samples, i.e. essentially never
            self.data_iter = iter(self.train_loader)
            batch = next(self.data_iter)
        return batch
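A brief usage sketch (assuming the toy dataset from the sanity check above): batches can be drawn indefinitely with next(), which suits step-based rather than epoch-based training.

# Sketch: draw a batch without worrying about epoch boundaries.
loader = InfiniteDataLoader(toy, batch_size=1)
xb, yb = loader.next()
print(xb.shape, yb.shape)  # torch.Size([1, 7, 3]) torch.Size([1, 7])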
Build the model
context_tokens → embedding → concatenated feature vector → hidden layer → output layer
@dataclass
class ModelConfig:
    block_size: int = None
    vocab_size: int = None
    n_embed: int = None
    n_hidden: int = None
class MLP(nn.Module):
    """
    Takes the previous block_size tokens, encodes them with a lookup table,
    concatenates the vectors and predicts the next token with an MLP.

    Reference:
    Bengio et al. 2003 https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
    """

    def __init__(self, config):
        super().__init__()
        self.block_size = config.block_size
        self.vocab_size = config.vocab_size
        self.wte = nn.Embedding(config.vocab_size + 1, config.n_embed)  # token embeddings table
        self.mlp = nn.Sequential(
            nn.Linear(self.block_size * config.n_embed, config.n_hidden),
            nn.Tanh(),
            nn.Linear(config.n_hidden, self.vocab_size)
        )

    def get_block_size(self):
        return self.block_size

    def forward(self, idx, targets=None):
        # gather the embeddings of the block_size context tokens
        embs = []
        for k in range(self.block_size):
            tok_emb = self.wte(idx[:, :, k])  # (b, t, n_embed)
            embs.append(tok_emb)
        # concatenate the embeddings and pass them through the MLP
        x = torch.cat(embs, -1)  # (b, t, n_embed * block_size)
        logits = self.mlp(x)
        # if targets are given, also compute the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        return logits, loss
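A minimal shape check (hypothetical sizes, not from the original post) confirms the forward pass: each of the block_size token columns is embedded, the embeddings are concatenated, and the MLP produces one logit vector per context window.

# Sketch: verify output shapes with a tiny, made-up configuration.
cfg = ModelConfig(block_size=3, vocab_size=7, n_embed=8, n_hidden=16)
m = MLP(cfg)
idx = torch.zeros((2, 5, 3), dtype=torch.long)  # (batch, windows, block_size)
logits, loss = m(idx, targets=torch.zeros((2, 5), dtype=torch.long))
print(logits.shape)  # torch.Size([2, 5, 7]): one distribution over the vocab per window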
@torch.inference_mode()
def evaluate(model, dataset, batch_size=10, max_batches=None):
    model.eval()
    loader = DataLoader(dataset, shuffle=True, batch_size=batch_size, num_workers=0)
    losses = []
    for i, batch in enumerate(loader):
        batch = [t.to("cuda") for t in batch]
        X, Y = batch
        logits, loss = model(X, Y)
        losses.append(loss.item())
        if max_batches is not None and i >= max_batches:
            break
    mean_loss = torch.tensor(losses).mean().item()
    model.train()  # reset the model back to training mode
    return mean_loss
Train the model
Environment initialization
torch.manual_seed(12345)
torch.cuda.manual_seed_all(12345)
work_dir = "./Mlp_log"
os.makedirs(work_dir, exist_ok=True)
writer = SummaryWriter(log_dir=work_dir)
config = ModelConfig(vocab_size=len(chars) + 1,  # equal to train_dataset.get_vocab_size(); the datasets are built next
                     block_size=7, n_embed=64, n_hidden=128)
Format the data
train_dataset = CharDataset(train_words, chars, max_word_length, block_size=config.block_size)
test_dataset = CharDataset(test_words, chars, max_word_length, block_size=config.block_size)
train_dataset[0][0].shape, train_dataset[0][1].shape  # (sliding windows, block_size) and (targets,)
(torch.Size([50, 7]), torch.Size([50]))
Initialize the model
model = MLP(config)
model.to("cuda")
MLP(
  (wte): Embedding(2274, 64)
  (mlp): Sequential(
    (0): Linear(in_features=448, out_features=128, bias=True)
    (1): Tanh()
    (2): Linear(in_features=128, out_features=2273, bias=True)
  )
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01, betas=(0.9, 0.99), eps=1e-8)
batch_loader = InfiniteDataLoader(train_dataset, batch_size=64, pin_memory=True, num_workers=4)
best_loss = None
step = 0
train_losses, test_losses = [], []
while True:
    t0 = time.time()

    # get the next batch and ship it to the device
    batch = batch_loader.next()
    batch = [t.to("cuda") for t in batch]
    X, Y = batch

    # forward the model and compute the loss
    logits, loss = model(X, Y)

    # backpropagate and update the weights
    model.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()
    t1 = time.time()

    # logging
    if step % 1000 == 0:
        print(f"step {step} | loss {loss.item():.4f} | step time {(t1-t0)*1000:.2f}ms")

    # evaluate the model periodically and checkpoint on improvement
    if step > 0 and step % 100 == 0:
        train_loss = evaluate(model, train_dataset, batch_size=100, max_batches=10)
        test_loss = evaluate(model, test_dataset, batch_size=100, max_batches=10)
        train_losses.append(train_loss)
        test_losses.append(test_loss)
        if best_loss is None or test_loss < best_loss:
            out_path = os.path.join(work_dir, "model.pt")
            print(f"test loss {test_loss} is the best so far, saving model to {out_path}")
            torch.save(model.state_dict(), out_path)
            best_loss = test_loss

    step += 1
    # termination condition
    if step > 15100:
        break
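Note that the SummaryWriter created during initialization is never written to in the loop as shown. One way to record the evaluation curves (a sketch, assuming the writer and the loss lists populated above) is:

# Sketch: write the recorded losses to TensorBoard after training.
# Losses were appended every 100 steps in the loop above.
for i, (tr, te) in enumerate(zip(train_losses, test_losses)):
    writer.add_scalar("Loss/train", tr, (i + 1) * 100)
    writer.add_scalar("Loss/test", te, (i + 1) * 100)
writer.flush()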
step 0 | loss 7.7551 | step time 13.09ms
test loss 5.533482551574707 is the best so far, saving model to ./Mlp_log/model.pt
test loss 5.163593292236328 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.864410877227783 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.6439409255981445 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.482759475708008 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.350367069244385 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.250306129455566 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.16674280166626 is the best so far, saving model to ./Mlp_log/model.pt
test loss 4.0940842628479 is the best so far, saving model to ./Mlp_log/model.pt
.......................
step 6000 | loss 2.8038 | step time 6.44ms
step 7000 | loss 2.7815 | step time 11.88ms
step 8000 | loss 2.6511 | step time 5.93ms
step 9000 | loss 2.5898 | step time 5.00ms
step 10000 | loss 2.6600 | step time 6.12ms
step 11000 | loss 2.4634 | step time 5.94ms
step 12000 | loss 2.5373 | step time 7.75ms
step 13000 | loss 2.4050 | step time 6.29ms
step 14000 | loss 2.5434 | step time 7.77ms
step 15000 | loss 2.4084 | step time 7.10ms
Test the review generator
@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
    block_size = model.get_block_size()
    for _ in range(max_new_tokens):
        # if the context is growing too long, crop it to block_size
        idx_cond = idx if idx.size(2) <= block_size else idx[:, :, -block_size:]
        # forward the model to get the logits
        logits, _ = model(idx_cond)
        # pluck the logits at the final step and scale by the temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop the logits to only the top k options
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float("Inf")
        # apply softmax to convert logits to (normalized) probabilities
        probs = F.softmax(logits, dim=-1)
        # either sample from the distribution or take the most likely element
        if do_sample:
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            _, idx_next = torch.topk(probs, k=1, dim=-1)
        # append the sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next.unsqueeze(1)), dim=-1)
    return idx
def print_samples(num=13, block_size=3, top_k=None):
    # initialize the context with all zeros (BLANK tokens)
    X_init = torch.zeros((num, 1, block_size), dtype=torch.long).to("cuda")
    steps = train_dataset.get_output_length() - 1
    X_samp = generate(model, X_init, steps, top_k=top_k, do_sample=True).to("cuda")
    new_samples = []
    for i in range(X_samp.size(0)):
        # take the i-th row of sampled integers and cut it at the first 0 (BLANK) token
        row = X_samp[i, :, block_size:].tolist()[0]
        crop_index = row.index(0) if 0 in row else len(row)
        row = row[:crop_index]
        word_samp = train_dataset.decode(row)
        new_samples.append(word_samp)
    return new_samples
Generation quality with different context lengths
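The sample lists below were presumably produced by calls along these lines (the exact arguments are an assumption), with a model trained at the matching block_size in each case, since generate crops the context to the model's own block size:

# Hypothetical invocation: sample 13 reviews from a model trained with block_size=3.
samples = print_samples(num=13, block_size=3)
for s in samples:
    print(s)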
block_size = 3
送餐大叔叔风怎么第一次点的1迷就没有需减改进,送餐很快菜品一般送到都等到了都很在店里吃不出肥肉第我地佩也不好意思了。第一次最爱付了凉面味道不,很不好进吧。。。。。这点一次都是卫生骑题调菜油腻真不太满意,11点送到指定地形,不知道他由、奶茶类应盒子幸好咸。。。,味道一般小份速度太难吃了。,快递小哥很贴心也吃不习惯。,非常慢。,为什么,4个盒子反正订的有点干,送餐速度把面洒了不超值很快少菜分量不够吃了味道很少餐,骑士剁疼倒还没给糖的,怎么吃正好吃便宜
block_size = 5
[味道不错,送餐大哥工餐大哥应不错。,配送很不满意,土豆炒几次一小时才没吃幸太多,粥不好吃没有病311小菜送到吃完太差了,太咸了很感谢到对这次送餐员辛苦服务很不好,真的很香菇沙,卷哪丝口气无语了,菜不怎么夹生若梦粥小伙n丁也没有收到餐。。。,一点不脆1个多小时才送到。等了那个小时。,就是送的太慢。。。。一京酱肉丝卷太不点了了,大份小太爱真心不难吃最后我的平时面没有听说什么呢,就,慢能再提前的好,牛肉好吃而且感觉适合更能事味道倒卷送的也很快]
block_size = 7
[味道还不错但是酱也没给一点餐不差,都是肥肉,有差劲儿大的也太给了那么好给这么多后超难吃,少了一个半小时才吃到了,商务还菜很好的,慢慢了以后!点他家极支付30元分钟送过用了呢。,就是没送到就给送王一袋儿食吃起来掉了有点辣这油还这抄套,很好吃就是送餐师傅不错,包装好的牛肉卷糊弄错酱,重面太少了肉不新鲜就吃了,味道不错送得太慢...,非常好非常快递小哥态度极差一点也好菜和粥洒了一袋软以先订过哈哈哈哈]