Byte Pair Encoding (BPE) Algorithm
The BPE algorithm is the method used to build the vocabulary in the Transformer. It roughly consists of the following steps:

1. Split the text in the corpus into individual characters.
2. Count the co-occurrence frequency of bigrams (adjacent symbol pairs).
3. Merge the bigram with the highest co-occurrence frequency and add the merged symbol to the vocabulary.
4. Repeat steps 2 and 3 until the vocabulary reaches a preset size, or until no bigrams are left to merge.

A minimal sketch of this merge-learning loop is given below.
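The following is a small, self-contained sketch of the vocabulary-learning loop described above. It is not GPT-2's actual training code; the function name learn_bpe_merges and the toy corpus are chosen purely for illustration.

import collections

def learn_bpe_merges(corpus_words, num_merges):
    # represent each word as a tuple of symbols (individual characters to start with)
    vocab = collections.Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        # count how often each adjacent symbol pair occurs, weighted by word frequency
        pair_counts = collections.Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # merge every occurrence of the best pair into a single symbol
        new_vocab = collections.Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i+1]) == best:
                    merged.append(word[i] + word[i+1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# e.g. learn_bpe_merges(["low", "lower", "newest", "widest"] * 10, num_merges=10)

GPT-2 applies exactly this merging rule, but over byte-level symbols, with 50,000 merges learned from a large corpus.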
Taking the BPE-related code used for GPT-2 as an example, the code is organized and annotated below.
The complete code is as follows.

"""
BPE (byte pair encoding): converts an arbitrary UTF-8 string into a sequence of
integer indices, which is convenient for the neural network computation that follows.

bpe is short for Byte Pair Encoder. It translates arbitrary utf-8 strings into
sequences of integers, where each integer represents small chunks of commonly
occurring characters. This implementation is based on openai's gpt2 encoder.py:
https://github.com/openai/gpt-2/blob/master/src/encoder.py
but was mildly modified because the original implementation is a bit confusing.
I also tried to add as many comments as possible, my own understanding of what's
going on.
"""

import os
import json
import regex as re
import requests

import torch

# -----------------------------------------------------------------------------

def bytes_to_unicode():
    """
    Maps bytes (8 bits -> 2**8 -> 256 of them) to characters represented in unicode.
    Some bytes map to ugly-looking characters, e.g. chr(0) is '\x00', so OpenAI applies
    an extra shift for those.

    Every possible byte (really an integer 0..255) gets mapped by OpenAI to a unicode
    character that represents it visually. Some bytes have their appearance preserved
    because they don't cause any trouble. These are defined in list bs. For example:
    chr(33) returns "!", so in the returned dictionary we simply have d[33] -> "!".
    However, chr(0), for example, is '\x00', which looks ugly. So OpenAI maps these
    bytes into new characters in a range where chr() returns a single nice character.
    So in the final dictionary we have d[0] -> 'Ā' instead, which is just chr(0 + 2**8).
    In particular, the space character is 32, which we can see by ord(' '). Instead,
    this function will shift space (32) by 256 to 288, so d[32] -> 'Ġ'.
    So this is just a simple one-to-one mapping of bytes 0..255 into unicode characters
    that "look nice", either in their original form, or a funny shifted character
    like 'Ā', or 'Ġ', etc.
    """
    # the 188 integers that render fine in their original form and need no shifting
    bs = list(range(ord("!"), ord("~")+1)) + list(range(ord("¡"), ord("¬")+1)) + list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:] # all integers b in bs will simply map to chr(b) in the output dict
    # now get the representations of the other 68 integers that do need shifting
    # each will get mapped chr(256 + n), where n will grow from 0...67 in the loop
    n = 0
    for b in range(2**8):
        if b not in bs:
            # if this byte is "ugly" then map it to the next available "nice" character
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    cs = [chr(n) for n in cs]
    d = dict(zip(bs, cs))
    return d

def get_pairs(word):
    """
    Get all possible character bigrams within a word.
    Return all bigrams as a set of tuples, of consecutive elements in the iterable word.
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs

class Encoder:

    def __init__(self, encoder, bpe_merges):
        # byte encoder/decoder
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
        # bpe token encoder/decoder
        self.encoder = encoder # maps a token string to an integer index
        self.decoder = {v: k for k, v in self.encoder.items()} # maps an integer index back to a token string
        # bpe merge list that defines the bpe "tree", of tuples (a,b) that are to merge to token ab
        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
        # the splitting pattern used for pre-tokenization
        # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions <-- original openai comment
        """
        ok so what is this regex looking for, exactly?
        python re reference: https://docs.python.org/3/library/re.html
        - the vertical bar | is OR, so re.findall will chunkate text as the pieces match, from left to right
        - "'s" would split up things like Andrej's -> (Andrej, 's)
        - ' ?\p{L}+': optional space followed by 1+ unicode code points in the category "letter"
        - ' ?\p{N}+': optional space followed by 1+ unicode code points in the category "number"
        - ' ?[^\s\p{L}\p{N}]+': optional space, then 1+ things that are NOT a whitespace, letter or number
        - '\s+(?!\S)': 1+ whitespace characters (e.g. space or tab or etc) UNLESS they are followed by non-whitespace
          so this will consume whitespace characters in a sequence but exclude the last whitespace in
          that sequence. that last whitespace has the opportunity to then match the optional ' ?' in
          earlier patterns.
        - '\s+': 1+ whitespace characters, intended probably to catch a full trailing sequence of whitespaces at end of string
        So TLDR:
        - we are special casing a few common apostrophe constructs ('s, 't, 're, ...) and making those into separate tokens
        - we then separate out strings into consecutive chunks of 1) letters, 2) numbers, 3) non-letter-numbers, 4) whitespaces
        """
        # pre-tokenization: split the string up front with a regex, e.g. into consecutive runs of
        # letters, numbers, whitespace and other characters, plus a few English-specific contraction rules
        self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
        self.cache = {}

    def bpe(self, token):
        """
        Apply further BPE splitting to each pre-tokenized token; the splitting relies on the
        pre-computed bpe_ranks (bigram co-occurrence frequencies counted on a large corpus).

        this function uses self.bpe_ranks to iteratively merge all the possible bpe tokens
        up the tree. token is a string of one individual 'word' (after regex tokenization)
        and after byte encoding, e.g. 'Ġthere'.
        """
        # token is a string of one individual 'word', after byte encoding, e.g. 'Ġthere'

        # memoization, for efficiency: the cache speeds up the bpe algorithm
        if token in self.cache:
            return self.cache[token]

        word = tuple(token) # individual characters that make up the token, in a tuple
        pairs = get_pairs(word) # get all bigrams

        if not pairs:
            return token

        while True:

            # find the next lowest rank bigram that can be merged
            # (i.e. merge the bigram with the highest co-occurrence frequency first)
            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float('inf')))
            if bigram not in self.bpe_ranks: # the remaining bigrams are too rare to appear in the merge list
                break # no more bigrams are eligible to be merged
            first, second = bigram

            # we will now replace all occurrences of (first, second) in the list of current
            # words into one merged token first_second, in the output list new_word
            new_word = []
            i = 0
            while i < len(word): # merge the bigram (handling multiple occurrences)

                # find the next occurrence of first in the sequence of current words
                try:
                    j = word.index(first, i)
                    new_word.extend(word[i:j])
                    i = j
                except:
                    new_word.extend(word[i:])
                    break

                # if this occurrence is also followed by second, then merge them into one
                if word[i] == first and i < len(word)-1 and word[i+1] == second:
                    new_word.append(first+second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1

            # all occurrences of (first, second) have been merged to first_second
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)

        # concat all words into a string, and use ' ' as the separator. Note that
        # by now all characters have been byte encoded, guaranteeing that ' ' is
        # not used in the actual data and is a 'special' delimiter character
        word = ' '.join(word)

        # cache the result and return
        self.cache[token] = word
        return word

    def encode(self, text):
        """ string goes in, list of integer indices comes out """
        bpe_idx = []
        # pre-tokenize the input text into string tokens (words, roughly speaking) with the regex
        tokens = re.findall(self.pat, text)
        # process each token into BPE integers
        for token in tokens: # within each token, keep merging bigrams with bpe
            # encode the token as a bytes (b'') object
            token_bytes = token.encode('utf-8')
            # translate all bytes to their unicode string representation and flatten
            token_translated = ''.join(self.byte_encoder[b] for b in token_bytes)
            # perform all the applicable bpe merges according to self.bpe_ranks
            token_merged = self.bpe(token_translated).split(' ')
            # translate all bpe tokens to integers
            token_ix = [self.encoder[bpe_token] for bpe_token in token_merged]
            # extend our running list of all output integers
            bpe_idx.extend(token_ix)
        return bpe_idx

    def encode_and_show_work(self, text):
        """ debugging function, same as encode but returns all intermediate work """
        bpe_idx = []
        parts = []
        tokens = re.findall(self.pat, text)
        for token in tokens:
            token_bytes = token.encode('utf-8')
            token_translated = ''.join(self.byte_encoder[b] for b in token_bytes)
            token_merged = self.bpe(token_translated).split(' ')
            token_ix = [self.encoder[bpe_token] for bpe_token in token_merged]
            bpe_idx.extend(token_ix)
            parts.append({
                'token': token,
                'token_bytes': token_bytes,
                'token_translated': token_translated,
                'token_merged': token_merged,
                'token_ix': token_ix,
            })
        out = {
            'bpe_idx': bpe_idx, # the actual output sequence
            'tokens': tokens, # result of pre-tokenization
            'parts': parts, # intermediates for each token part
        }
        return out

    def decode(self, bpe_idx):
        """ recover the string from a sequence of integer indices: list of integers comes in, string comes out """
        # inverse map the integers to get the tokens
        tokens_merged = [self.decoder[token] for token in bpe_idx]
        # inverse the byte encoder, e.g. recovering 'Ġ' -> ' ', and get the bytes
        tokens_flat = ''.join(tokens_merged)
        tokens_bytes = bytearray([self.byte_decoder[c] for c in tokens_flat])
        # recover the full utf-8 string
        text = tokens_bytes.decode('utf-8', errors='replace')
        return text

def get_file(local_file, remote_file):
    """ downloads remote_file to local_file if necessary """
    if not os.path.isfile(local_file):
        print(f"downloading {remote_file} to {local_file}")
        response = requests.get(remote_file)
        open(local_file, "wb").write(response.content)

def get_encoder():
    """
    Initializes from the cached files of OpenAI's official GPT-2 tokenizer.
    Returns an instance of the GPT BPE Encoder/Decoder
    and handles caching of "database" files.
    """
    home_dir = os.path.expanduser('~')
    cache_dir = os.path.join(home_dir, '.cache', 'mingpt')
    os.makedirs(cache_dir, exist_ok=True)

    # load encoder.json that has the raw mappings from token -> bpe index
    encoder_local_file = os.path.join(cache_dir, 'encoder.json')
    encoder_remote_file = 'https://openaipublic.blob.core.windows.net/gpt-2/models/124M/encoder.json'
    get_file(encoder_local_file, encoder_remote_file)
    with open(encoder_local_file, 'r') as f:
        encoder = json.load(f)
    assert len(encoder) == 50257 # 256 individual byte tokens, 50,000 merged tokens, and 1 special <|endoftext|> token

    # load vocab.bpe that contains the bpe merges, i.e. the bpe tree structure
    # in the form tuples (a, b), that indicate that (a, b) is to be merged to one token ab
    vocab_local_file = os.path.join(cache_dir, 'vocab.bpe')
    vocab_remote_file = 'https://openaipublic.blob.core.windows.net/gpt-2/models/124M/vocab.bpe'
    get_file(vocab_local_file, vocab_remote_file)
    with open(vocab_local_file, 'r', encoding="utf-8") as f:
        bpe_data = f.read()
    # light postprocessing: strip the version on first line and the last line is a blank
    bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
    assert len(bpe_merges) == 50000 # 50,000 merged tokens

    # construct the Encoder object and return
    enc = Encoder(encoder, bpe_merges)
    return enc

# -----------------------------------------------------------------------------

class BPETokenizer:
    """ PyTorch-aware class that wraps the Encoder above """

    def __init__(self):
        self.encoder = get_encoder()

    def __call__(self, text, return_tensors='pt'):
        # PyTorch only; here because we want to match huggingface/transformers interface
        assert return_tensors == 'pt'
        # single string input for now, in the future potentially a list of strings
        assert isinstance(text, str)
        # encode and create a "batch dimension" of 1
        idx = [self.encoder.encode(text)]
        # wrap into PyTorch tensor
        out = torch.tensor(idx, dtype=torch.long)
        return out

    def decode(self, idx):
        # ensure a simple 1D tensor for now
        assert idx.ndim == 1
        # decode indices to text
        text = self.encoder.decode(idx.tolist())
        return text
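Before going further, here is a small usage sketch of the classes defined above. It assumes the code has been saved as a module named bpe.py (the file name is our choice), requires network access on first use because get_encoder downloads encoder.json and vocab.bpe, and the printed values are indicative rather than reproduced here.

# assuming the code above has been saved as bpe.py
from bpe import get_encoder, BPETokenizer

enc = get_encoder()                      # downloads/caches encoder.json and vocab.bpe on first call
idx = enc.encode("Hello!! I'm Andrej Karpathy.")
print(idx)                               # a list of integer token ids
print(enc.decode(idx))                   # round-trips back to the original string

# same thing, wrapped for PyTorch
tokenizer = BPETokenizer()
batch = tokenizer("Hello!! I'm Andrej Karpathy.")   # tensor of shape (1, n_tokens), dtype torch.long
print(tokenizer.decode(batch[0]))

# encode_and_show_work exposes the intermediate steps for inspection
work = enc.encode_and_show_work("Hello!!")
print(work['tokens'])                    # result of the regex pre-tokenization
print(work['parts'][0]['token_merged'])  # the BPE chunks of the first token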
Starting from the bpe method of the Encoder class, we can work through the entire BPE procedure. The code of the bpe method is as follows.

def bpe(self, token):
    # the cache speeds up the bpe algorithm
    if token in self.cache:
        return self.cache[token]

    word = tuple(token) # individual characters that make up the token, in a tuple
    pairs = get_pairs(word) # get all bigrams

    if not pairs:
        return token

    while True:
        # find the next lowest rank bigram that can be merged
        # (i.e. merge the bigram with the highest co-occurrence frequency first)
        bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float('inf')))
        if bigram not in self.bpe_ranks: # the remaining bigrams are too rare to appear in the merge list
            break # no more bigrams are eligible to be merged
        first, second = bigram

        # we will now replace all occurrences of (first, second) in the list of current
        # words into one merged token first_second, in the output list new_word
        new_word = []
        i = 0
        while i < len(word): # merge the bigram (handling multiple occurrences)
            # find the next occurrence of first in the sequence of current words
            try:
                j = word.index(first, i)
                new_word.extend(word[i:j])
                i = j
            except:
                new_word.extend(word[i:])
                break
            # if this occurrence is also followed by second, then merge them into one
            if word[i] == first and i < len(word)-1 and word[i+1] == second:
                new_word.append(first+second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1

        # all occurrences of (first, second) have been merged to first_second
        new_word = tuple(new_word)
        word = new_word
        if len(word) == 1:
            break
        else:
            pairs = get_pairs(word)

    # concat all words into a string, and use ' ' as the separator. Note that
    # by now all characters have been byte encoded, guaranteeing that ' ' is
    # not used in the actual data and is a 'special' delimiter character
    word = ' '.join(word)

    # cache the result and return
    self.cache[token] = word
    return word

Below, the bpe method is interpreted block by block.

A cache is initialized in the Encoder class. Every time bpe is applied to a token, we first check whether the result is already in the cache; if it is, the cached result is returned immediately.

# the cache speeds up the bpe algorithm
if token in self.cache:
    return self.cache[token]

The token passed into bpe is then split up. At this point the input token is a single "word" obtained from pre-tokenizing the text; tuple breaks it into its individual characters, producing a tuple containing all the characters of the token.

word = tuple(token) # individual characters that make up the token, in a tuple

The get_pairs function is then applied to this tuple of characters to collect all possible character bigrams.

pairs = get_pairs(word) # get all bigrams

The input word is the ordered tuple of the token's characters; starting from the first character, every pair of adjacent characters forms one bigram.

def get_pairs(word):
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs
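As a quick illustration of get_pairs (the example word is ours, not part of the GPT-2 code):

word = tuple("hello")        # ('h', 'e', 'l', 'l', 'o')
print(get_pairs(word))
# {('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')}  -- a set, so the order may differ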
Next, check whether the input token produced any bigrams at all; if none were produced, the token is returned as is.

if not pairs:
    return token

Among the generated bigrams, find the one with the highest co-occurrence frequency. bpe_ranks gives each bigram's frequency rank, and the bigram with the smallest rank, i.e. the highest frequency, is selected.

# find the next lowest rank bigram that can be merged
bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float('inf'))) # merge the most frequent bigram first
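To see what this min over ranks does, consider a hypothetical bpe_ranks (the ranks below are made up for illustration):

bpe_ranks = {('h', 'e'): 7, ('l', 'l'): 2, ('x', 'q'): 90}
pairs = {('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')}
best = min(pairs, key=lambda pair: bpe_ranks.get(pair, float('inf')))
print(best)   # ('l', 'l') -- the lowest-ranked (most frequent) pair present in bpe_ranks
# pairs that never occur in the merge list, such as ('e', 'l'), get rank inf and are never chosen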
The dictionary mapping each bigram to its co-occurrence-frequency rank is built in the constructor; bpe_merges holds the bigram frequency data read from a pre-computed file.

self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))

In the file that is read, each line is one bigram, and the line number serves as its rank: the smaller the line number, the higher the co-occurrence frequency.

vocab_local_file = os.path.join(cache_dir, 'vocab.bpe')
vocab_remote_file = 'https://openaipublic.blob.core.windows.net/gpt-2/models/124M/vocab.bpe'
get_file(vocab_local_file, vocab_remote_file)
with open(vocab_local_file, 'r', encoding="utf-8") as f:
    bpe_data = f.read()
# light postprocessing: strip the version on first line and the last line is a blank
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
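For intuition, the snippet below sketches the vocab.bpe format and how the parsing above turns it into ranked tuples. The merge lines shown are illustrative, not quoted from the actual GPT-2 file.

# an illustrative sketch of the vocab.bpe format (not the literal file contents)
bpe_data = "#version: 0.2\nĠ t\nh e\nĠ a\n"
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
print(bpe_merges)   # [('Ġ', 't'), ('h', 'e'), ('Ġ', 'a')]
bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
print(bpe_ranks)    # {('Ġ', 't'): 0, ('h', 'e'): 1, ('Ġ', 'a'): 2}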
Bigrams that do not appear in bpe_ranks at all, i.e. whose co-occurrence frequency is too low, are skipped and the loop ends. first is the first symbol of the selected bigram and second is its second symbol.

if bigram not in self.bpe_ranks: # the remaining bigrams are too rare to appear in the merge list
    break # no more bigrams are eligible to be merged
first, second = bigram

The next block rebuilds the token in the list new_word, copying over all of its symbols while replacing every occurrence of the highest-frequency bigram (first, second) with the single merged symbol first_second.

# we will now replace all occurrences of (first, second) in the list of current
# words into one merged token first_second, in the output list new_word
new_word = []
i = 0
while i < len(word): # merge the bigram (handling multiple occurrences)
    # find the next occurrence of first in the sequence of current words
    try:
        j = word.index(first, i)
        new_word.extend(word[i:j])
        i = j
    except:
        new_word.extend(word[i:])
        break
    # if this occurrence is also followed by second, then merge them into one
    if word[i] == first and i < len(word)-1 and word[i+1] == second:
        new_word.append(first+second)
        i += 2
    else:
        new_word.append(word[i])
        i += 1
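A small trace of one merge step may help (the values are chosen for illustration):

word = ('h', 'e', 'l', 'l', 'o')
first, second = 'l', 'l'
# walking through the loop above:
#   word.index('l', 0) -> 2, so new_word first receives ('h', 'e')
#   word[2] == 'l' and word[3] == 'l', so 'll' is appended and i jumps to 4
#   word.index('l', 4) raises ValueError, so the remaining ('o',) is copied over
# result:
#   new_word == ['h', 'e', 'll', 'o']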
If only a single symbol remains after merging, the loop exits directly; if there are several, the new set of bigrams is computed and the loop continues.

# all occurrences of (first, second) have been merged to first_second
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
    break
else:
    pairs = get_pairs(word)

Finally, the symbols are joined into a single string with spaces as the separator and the result is stored in the cache.

word = ' '.join(word)
# cache the result and return
self.cache[token] = word
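Putting it all together, bpe() can be exercised on a single byte-encoded token. The call below is illustrative; the exact output chunks depend on the downloaded GPT-2 merge list, so they are not reproduced here.

enc = get_encoder()
# 'Ġthere' is the byte-encoded form of ' there' (space mapped to 'Ġ'), the same example
# used in the docstring above; the result is a space-separated string of merged chunks
print(enc.bpe('Ġthere'))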
Note: this article takes the BPE code used for GPT-2 as an example, and mainly records reading notes on the code related to the bpe method of the Encoder class.