Two common data_collators in huggingface transformers: default_data_collator and DataCollatorWithPadding

Preface

We load the dataset with huggingface's `Dataset` and then encode the text with a tokenizer, but at that point the feature data is still not tensors and has to be converted into the tensor type required by the deep learning framework. The job of a data_collator is to turn the features into a tensor-typed batch.

This post records two commonly used data_collators in huggingface transformers: one is `default_data_collator`, the other is `DataCollatorWithPadding`. `BertTokenizer` is used as the base tokenizer throughout, as shown below:

```python
from transformers import BertTokenizer
from transformers import default_data_collator, DataCollatorWithPadding
from datasets import Dataset

tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm-ext')

def func(exam):
    return tokenizer(exam['text'])
```

default_data_collator

Under the PyTorch framework, `default_data_collator` in essence executes `torch_default_data_collator`. Note that the input is required to be of the form `List[Any]` and the output must satisfy `Dict[str, Any]`.

```python
def default_data_collator(features: List[InputDataClass], return_tensors="pt") -> Dict[str, Any]:
    """
    Very simple data collator that simply collates batches of dict-like objects and performs special handling for
    potential keys named:

        - `label`: handles a single value (int or float) per object
        - `label_ids`: handles a list of values per object

    Does not do any additional preprocessing: property names of the input object will be used as corresponding inputs
    to the model. See glue and ner for example of how it's useful.
    """

    # In this function we'll make the assumption that all features in the batch
    # have the same attributes.
    # So we will look at the first element as a proxy for what attributes exist
    # on the whole batch.

    if return_tensors == "pt":
        return torch_default_data_collator(features)
    elif return_tensors == "tf":
        return tf_default_data_collator(features)
    elif return_tensors == "np":
        return numpy_default_data_collator(features)
```

torch_default_data_collator

The source is below. It assumes that all features in the batch share the same attributes, so it uses the first example to drive its logic. It also gives special treatment to the `label` or `label_ids` key (corresponding to single-label and multi-label classification, respectively) and renames that key to `"labels"`, since `labels` is the keyword argument name defined in the `forward` method of most pretrained models.

```python
def torch_default_data_collator(features: List[InputDataClass]) -> Dict[str, Any]:
    import torch

    if not isinstance(features[0], Mapping):
        features = [vars(f) for f in features]
    first = features[0]
    batch = {}

    # Special handling for labels.
    # Ensure that tensor is created with the correct type
    # (it should be automatically the case, but let's make sure of it.)
    if "label" in first and first["label"] is not None:
        label = first["label"].item() if isinstance(first["label"], torch.Tensor) else first["label"]
        dtype = torch.long if isinstance(label, int) else torch.float
        batch["labels"] = torch.tensor([f["label"] for f in features], dtype=dtype)
    elif "label_ids" in first and first["label_ids"] is not None:
        if isinstance(first["label_ids"], torch.Tensor):
            batch["labels"] = torch.stack([f["label_ids"] for f in features])
        else:
            dtype = torch.long if type(first["label_ids"][0]) is int else torch.float
            batch["labels"] = torch.tensor([f["label_ids"] for f in features], dtype=dtype)

    # Handling of all other possible keys.
    # Again, we will use the first element to figure out which key/values are not None for this model.
    for k, v in first.items():
        if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
            if isinstance(v, torch.Tensor):
                batch[k] = torch.stack([f[k] for f in features])
            elif isinstance(v, np.ndarray):
                batch[k] = torch.tensor(np.stack([f[k] for f in features]))
            else:
                batch[k] = torch.tensor([f[k] for f in features])
    return batch
```

Example

```python
x = [{'text': '我爱中国。', 'label': 1}, {'text': '我爱中国。', 'label': 1}]
ds = Dataset.from_list(x)
features = ds.map(func, batched=False, remove_columns=['text'])
dataset = default_data_collator(features)
```
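One point is worth making explicit with a small sketch (the printed values assume the `hfl/chinese-bert-wwm-ext` tokenizer from above and are illustrative): `default_data_collator` performs no padding, so building the batch with `torch.tensor` only succeeds when every sequence already has the same length, which is why the example reuses an identical sentence twice.

```python
print(dataset.keys())              # dict_keys(['labels', 'input_ids', 'token_type_ids', 'attention_mask'])
print(dataset['input_ids'].shape)  # e.g. torch.Size([2, 7]): [CLS] + 5 characters + [SEP]
print(dataset['labels'])           # tensor([1, 1]) -- 'label' was renamed to 'labels'

# With sequences of unequal length, torch.tensor would receive a ragged list
# and raise a ValueError; dynamic padding is what DataCollatorWithPadding adds.
```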
DataCollatorWithPadding

Note that `DataCollatorWithPadding` is a class: it must be instantiated first, and only then used to turn features into a batch. Compared with `default_data_collator`, `DataCollatorWithPadding` pads the features it receives, i.e. each dimension is filled out to the same size. Its source is as follows:

```python
@dataclass
class DataCollatorWithPadding:
    """
    Data collator that will dynamically pad the inputs received.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a
              single sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
            >= 7.5 (Volta).
        return_tensors (`str`):
            The type of Tensor to return. Allowable values are "np", "pt" and "tf".
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        batch = self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch
```

When instantiating, pay attention to `pad_to_multiple_of`: it rounds `max_length` up to an integer multiple of the given value. For example, with `max_length=510` and `pad_to_multiple_of=8`, `max_length` is set to 512. See the source of `transformers.tokenization_utils_base.PreTrainedTokenizerBase._pad`:

```python
def _pad(
    self,
    encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
    max_length: Optional[int] = None,
    padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
    pad_to_multiple_of: Optional[int] = None,
    return_attention_mask: Optional[bool] = None,
) -> dict:
    """
    Pad encoded inputs (on left/right and up to predefined length or max length in the batch)

    Args:
        encoded_inputs:
            Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
        max_length: maximum length of the returned list and optionally padding length (see below).
            Will truncate by taking into account the special tokens.
        padding_strategy: PaddingStrategy to use for padding.

            - PaddingStrategy.LONGEST: Pad to the longest sequence in the batch
            - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
            - PaddingStrategy.DO_NOT_PAD: Do not pad
            The tokenizer padding sides are defined in self.padding_side:

                - 'left': pads on the left of the sequences
                - 'right': pads on the right of the sequences
        pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
            >= 7.5 (Volta).
        return_attention_mask:
            (optional) Set to False to avoid returning attention mask (default: set to model specifics)
    """
    ...
    if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
        max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
    ...
```
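To make the rounding rule concrete, here it is as a standalone sketch (`round_up_to_multiple` is a hypothetical helper name, not part of transformers):

```python
def round_up_to_multiple(max_length: int, pad_to_multiple_of: int) -> int:
    # Mirrors the adjustment inside PreTrainedTokenizerBase._pad shown above.
    if max_length % pad_to_multiple_of != 0:
        max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
    return max_length

print(round_up_to_multiple(510, 8))  # 512
print(round_up_to_multiple(512, 8))  # 512, already a multiple so it stays unchanged
```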
In `DataCollatorWithPadding`'s `__call__` method, `label` or `label_ids` is likewise renamed to `labels`, and the padding itself is carried out by `transformers.tokenization_utils_base.PreTrainedTokenizerBase.pad`:

```python
def pad(
    self,
    encoded_inputs: Union[
        BatchEncoding,
        List[BatchEncoding],
        Dict[str, EncodedInput],
        Dict[str, List[EncodedInput]],
        List[Dict[str, EncodedInput]],
    ],
    padding: Union[bool, str, PaddingStrategy] = True,
    max_length: Optional[int] = None,
    pad_to_multiple_of: Optional[int] = None,
    return_attention_mask: Optional[bool] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    verbose: bool = True,
) -> BatchEncoding:
    """
    Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
    in the batch.

    Padding side (left/right) / padding token ids are defined at the tokenizer level (with `self.padding_side`,
    `self.pad_token_id` and `self.pad_token_type_id`).

    Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
    text followed by a call to the `pad` method to get a padded encoding.

    <Tip>

    If the `encoded_inputs` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the
    result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
    PyTorch tensors, you will lose the specific device of your tensors however.

    </Tip>

    Args:
        encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `Dict[str, List[int]]`, `Dict[str, List[List[int]]]` or `List[Dict[str, List[int]]]`):
            Tokenized inputs. Can represent one input ([`BatchEncoding`] or `Dict[str, List[int]]`) or a batch of
            tokenized inputs (list of [`BatchEncoding`], *Dict[str, List[List[int]]]* or *List[Dict[str,
            List[int]]]*) so you can use this method during preprocessing as well as in a PyTorch Dataloader
            collate function.

            Instead of `List[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see
            the note above for the return type.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding
            index) among:

            - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
              lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
            >= 7.5 (Volta).
        return_attention_mask (`bool`, *optional*):
            Whether to return the attention mask. If left to the default, will return the attention mask according
            to the specific tokenizer's default, defined by the `return_outputs` attribute.

            [What are attention masks?](../glossary#attention-mask)
        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors instead of list of python integers. Acceptable values are:

            - `'tf'`: Return TensorFlow `tf.constant` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return Numpy `np.ndarray` objects.
        verbose (`bool`, *optional*, defaults to `True`):
            Whether or not to print more information and warnings.
    """
    ...
    # If we have a list of dicts, let's convert it in a dict of lists
    # We do this to allow using this method as a collate_fn function in PyTorch Dataloader
    if isinstance(encoded_inputs, (list, tuple)) and isinstance(encoded_inputs[0], Mapping):
        encoded_inputs = {key: [example[key] for example in encoded_inputs] for key in encoded_inputs[0].keys()}
    ...
```

First, note the requirements `pad` places on its input: `EncodedInput` is an alias for `List[int]`, and `BatchEncoding` can be regarded as a dict-like object of the form `Dict[str, Any]` whose data is stored in its `data` attribute. During instantiation, `BatchEncoding` calls its `convert_to_tensors` method, which converts the data held in `data` into tensors. If the features arrive as `List[Dict[str, Any]]`, `pad` converts them into `Dict[str, List]`, which is what allows this method to serve as a `collate_fn` in a PyTorch DataLoader. Also note that passing a `datasets.Dataset` object directly to `pad` raises an error: a `datasets.Dataset` instance has no `keys` attribute.

Example

```python
x = [{'text': '中国是一个伟大国家。', 'label': 1}]
ds = Dataset.from_list(x)
features = ds.map(func, batched=False, remove_columns=['text'])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
dataset = data_collator(features=features.to_list())  # convert Dataset into List
```
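Because `pad` accepts `List[Dict[str, List[int]]]`, an instantiated `DataCollatorWithPadding` can be passed directly as a DataLoader's `collate_fn`. A minimal sketch of that usage, reusing the `tokenizer` and `func` defined at the top (the two sentences are arbitrary examples of unequal length):

```python
from torch.utils.data import DataLoader

x = [{'text': '我爱中国。', 'label': 1}, {'text': '中国是一个伟大国家。', 'label': 1}]
features = Dataset.from_list(x).map(func, batched=False, remove_columns=['text'])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

loader = DataLoader(features.to_list(), batch_size=2, collate_fn=data_collator)
for batch in loader:
    # The shorter sequence is padded to the length of the longer one;
    # the padded positions get attention_mask == 0.
    print(batch['input_ids'].shape, batch['labels'])  # e.g. torch.Size([2, 12]) tensor([1, 1])
```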
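A quick look at what comes back (a sketch; the exact sequence length depends on the tokenizer, here 10 characters plus [CLS] and [SEP]):

```python
print(type(dataset))               # <class 'transformers.tokenization_utils_base.BatchEncoding'>
print(dataset['input_ids'].shape)  # e.g. torch.Size([1, 12])
print(dataset['labels'])           # tensor([1]) -- renamed from 'label' in __call__

# As noted above, passing the Dataset object itself would fail,
# since pad() ends up calling .keys() on it:
# data_collator(features=features)  # raises an error
```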