Two common data_collators in huggingface transformers: default_data_collator and DataCollatorWithPadding

Preface

We load the dataset with huggingface's `Dataset` and then encode the text with a tokenizer, but at that point the feature data is still not tensors and has to be converted into the tensor type required by the deep learning framework. The job of a data_collator is to turn the features into a tensor-typed batch.

This post records two commonly used data_collators in huggingface transformers: one is `default_data_collator`, the other is `DataCollatorWithPadding`. `BertTokenizer` is used as the base tokenizer throughout, as shown below:

```python
from transformers import BertTokenizer
from transformers import default_data_collator, DataCollatorWithPadding
from datasets import Dataset

tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm-ext')

def func(exam):
    return tokenizer(exam['text'])
```

default_data_collator

Under the PyTorch framework, `default_data_collator` in essence executes `torch_default_data_collator`. Note that the input is required to be of the form `List[Any]` and the output must satisfy `Dict[str, Any]`.

```python
def default_data_collator(features: List[InputDataClass], return_tensors="pt") -> Dict[str, Any]:
    """
    Very simple data collator that simply collates batches of dict-like objects and performs special handling for
    potential keys named:

        - `label`: handles a single value (int or float) per object
        - `label_ids`: handles a list of values per object

    Does not do any additional preprocessing: property names of the input object will be used as corresponding inputs
    to the model. See glue and ner for example of how it's useful.
    """

    # In this function we'll make the assumption that all features in the batch
    # have the same attributes.
    # So we will look at the first element as a proxy for what attributes exist
    # on the whole batch.

    if return_tensors == "pt":
        return torch_default_data_collator(features)
    elif return_tensors == "tf":
        return tf_default_data_collator(features)
    elif return_tensors == "np":
        return numpy_default_data_collator(features)
```

torch_default_data_collator

The source is below. It assumes that all features in the batch share the same attributes, so it uses the first example to drive its logic. It also gives special treatment to the `label` or `label_ids` key (corresponding to single-label and multi-label classification, respectively) and renames that key to `"labels"`, since `labels` is the keyword argument name defined in the `forward` method of most pretrained models.

```python
def torch_default_data_collator(features: List[InputDataClass]) -> Dict[str, Any]:
    import torch

    if not isinstance(features[0], Mapping):
        features = [vars(f) for f in features]
    first = features[0]
    batch = {}

    # Special handling for labels.
    # Ensure that tensor is created with the correct type
    # (it should be automatically the case, but let's make sure of it.)
    if "label" in first and first["label"] is not None:
        label = first["label"].item() if isinstance(first["label"], torch.Tensor) else first["label"]
        dtype = torch.long if isinstance(label, int) else torch.float
        batch["labels"] = torch.tensor([f["label"] for f in features], dtype=dtype)
    elif "label_ids" in first and first["label_ids"] is not None:
        if isinstance(first["label_ids"], torch.Tensor):
            batch["labels"] = torch.stack([f["label_ids"] for f in features])
        else:
            dtype = torch.long if type(first["label_ids"][0]) is int else torch.float
            batch["labels"] = torch.tensor([f["label_ids"] for f in features], dtype=dtype)

    # Handling of all other possible keys.
    # Again, we will use the first element to figure out which key/values are not None for this model.
    for k, v in first.items():
        if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
            if isinstance(v, torch.Tensor):
                batch[k] = torch.stack([f[k] for f in features])
            elif isinstance(v, np.ndarray):
                batch[k] = torch.tensor(np.stack([f[k] for f in features]))
            else:
                batch[k] = torch.tensor([f[k] for f in features])
    return batch
```

Example

```python
x = [{'text': '我爱中国。', 'label': 1}, {'text': '我爱中国。', 'label': 1}]
ds = Dataset.from_list(x)
features = ds.map(func, batched=False, remove_columns=['text'])
dataset = default_data_collator(features)
```
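One point is worth making explicit with a small sketch (the printed values assume the `hfl/chinese-bert-wwm-ext` tokenizer from above and are illustrative): `default_data_collator` performs no padding, so building the batch with `torch.tensor` only succeeds when every sequence already has the same length, which is why the example reuses an identical sentence twice.

```python
print(dataset.keys())              # dict_keys(['labels', 'input_ids', 'token_type_ids', 'attention_mask'])
print(dataset['input_ids'].shape)  # e.g. torch.Size([2, 7]): [CLS] + 5 characters + [SEP]
print(dataset['labels'])           # tensor([1, 1]) -- 'label' was renamed to 'labels'

# With sequences of unequal length, torch.tensor would receive a ragged list
# and raise a ValueError; dynamic padding is what DataCollatorWithPadding adds.
```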
DataCollatorWithPadding

Note that `DataCollatorWithPadding` is a class: it must be instantiated first, and only then used to turn features into a batch. Compared with `default_data_collator`, `DataCollatorWithPadding` pads the features it receives, i.e. each dimension is filled out to the same size. Its source is as follows:

```python
@dataclass
class DataCollatorWithPadding:
    """
    Data collator that will dynamically pad the inputs received.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a
              single sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
            >= 7.5 (Volta).
        return_tensors (`str`):
            The type of Tensor to return. Allowable values are "np", "pt" and "tf".
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        batch = self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch
```

When instantiating, pay attention to `pad_to_multiple_of`: it rounds `max_length` up to an integer multiple of the given value. For example, with `max_length=510` and `pad_to_multiple_of=8`, `max_length` is set to 512. See the source of `transformers.tokenization_utils_base.PreTrainedTokenizerBase._pad`:

```python
def _pad(
    self,
    encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
    max_length: Optional[int] = None,
    padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
    pad_to_multiple_of: Optional[int] = None,
    return_attention_mask: Optional[bool] = None,
) -> dict:
    """
    Pad encoded inputs (on left/right and up to predefined length or max length in the batch)

    Args:
        encoded_inputs:
            Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
        max_length: maximum length of the returned list and optionally padding length (see below).
            Will truncate by taking into account the special tokens.
        padding_strategy: PaddingStrategy to use for padding.

            - PaddingStrategy.LONGEST: Pad to the longest sequence in the batch
            - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
            - PaddingStrategy.DO_NOT_PAD: Do not pad
            The tokenizer padding sides are defined in self.padding_side:

                - 'left': pads on the left of the sequences
                - 'right': pads on the right of the sequences
        pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
            >= 7.5 (Volta).
        return_attention_mask:
            (optional) Set to False to avoid returning attention mask (default: set to model specifics)
    """
    ...
    if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
        max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
    ...
```
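To make the rounding rule concrete, here it is as a standalone sketch (`round_up_to_multiple` is a hypothetical helper name, not part of transformers):

```python
def round_up_to_multiple(max_length: int, pad_to_multiple_of: int) -> int:
    # Mirrors the adjustment inside PreTrainedTokenizerBase._pad shown above.
    if max_length % pad_to_multiple_of != 0:
        max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
    return max_length

print(round_up_to_multiple(510, 8))  # 512
print(round_up_to_multiple(512, 8))  # 512, already a multiple so it stays unchanged
```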
In `DataCollatorWithPadding`'s `__call__` method, `label` or `label_ids` is likewise renamed to `labels`, and the padding itself is carried out by `transformers.tokenization_utils_base.PreTrainedTokenizerBase.pad`:

```python
def pad(
    self,
    encoded_inputs: Union[
        BatchEncoding,
        List[BatchEncoding],
        Dict[str, EncodedInput],
        Dict[str, List[EncodedInput]],
        List[Dict[str, EncodedInput]],
    ],
    padding: Union[bool, str, PaddingStrategy] = True,
    max_length: Optional[int] = None,
    pad_to_multiple_of: Optional[int] = None,
    return_attention_mask: Optional[bool] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    verbose: bool = True,
) -> BatchEncoding:
    """
    Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
    in the batch.

    Padding side (left/right) / padding token ids are defined at the tokenizer level (with `self.padding_side`,
    `self.pad_token_id` and `self.pad_token_type_id`).

    Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
    text followed by a call to the `pad` method to get a padded encoding.

    <Tip>

    If the `encoded_inputs` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the
    result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
    PyTorch tensors, you will lose the specific device of your tensors however.

    </Tip>

    Args:
        encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `Dict[str, List[int]]`, `Dict[str, List[List[int]]]` or `List[Dict[str, List[int]]]`):
            Tokenized inputs. Can represent one input ([`BatchEncoding`] or `Dict[str, List[int]]`) or a batch of
            tokenized inputs (list of [`BatchEncoding`], *Dict[str, List[List[int]]]* or *List[Dict[str,
            List[int]]]*) so you can use this method during preprocessing as well as in a PyTorch Dataloader
            collate function.

            Instead of `List[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see
            the note above for the return type.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding
            index) among:

            - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
              lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
            >= 7.5 (Volta).
        return_attention_mask (`bool`, *optional*):
            Whether to return the attention mask. If left to the default, will return the attention mask according
            to the specific tokenizer's default, defined by the `return_outputs` attribute.

            [What are attention masks?](../glossary#attention-mask)
        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors instead of list of python integers. Acceptable values are:

            - `'tf'`: Return TensorFlow `tf.constant` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return Numpy `np.ndarray` objects.
        verbose (`bool`, *optional*, defaults to `True`):
            Whether or not to print more information and warnings.
    """
    ...
    # If we have a list of dicts, let's convert it in a dict of lists
    # We do this to allow using this method as a collate_fn function in PyTorch Dataloader
    if isinstance(encoded_inputs, (list, tuple)) and isinstance(encoded_inputs[0], Mapping):
        encoded_inputs = {key: [example[key] for example in encoded_inputs] for key in encoded_inputs[0].keys()}
    ...
```

First, note the requirements `pad` places on its input: `EncodedInput` is an alias for `List[int]`, and `BatchEncoding` can be regarded as a dict-like object of the form `Dict[str, Any]` whose data is stored in its `data` attribute. During instantiation, `BatchEncoding` calls its `convert_to_tensors` method, which converts the data held in `data` into tensors. If the features arrive as `List[Dict[str, Any]]`, `pad` converts them into `Dict[str, List]`, which is what allows this method to serve as a `collate_fn` in a PyTorch DataLoader. Also note that passing a `datasets.Dataset` object directly to `pad` raises an error: a `datasets.Dataset` instance has no `keys` attribute.

Example

```python
x = [{'text': '中国是一个伟大国家。', 'label': 1}]
ds = Dataset.from_list(x)
features = ds.map(func, batched=False, remove_columns=['text'])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
dataset = data_collator(features=features.to_list())  # convert Dataset into List
```
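Because `pad` accepts `List[Dict[str, List[int]]]`, an instantiated `DataCollatorWithPadding` can be passed directly as a DataLoader's `collate_fn`. A minimal sketch of that usage, reusing the `tokenizer` and `func` defined at the top (the two sentences are arbitrary examples of unequal length):

```python
from torch.utils.data import DataLoader

x = [{'text': '我爱中国。', 'label': 1}, {'text': '中国是一个伟大国家。', 'label': 1}]
features = Dataset.from_list(x).map(func, batched=False, remove_columns=['text'])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

loader = DataLoader(features.to_list(), batch_size=2, collate_fn=data_collator)
for batch in loader:
    # The shorter sequence is padded to the length of the longer one;
    # the padded positions get attention_mask == 0.
    print(batch['input_ids'].shape, batch['labels'])  # e.g. torch.Size([2, 12]) tensor([1, 1])
```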
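A quick look at what comes back (a sketch; the exact sequence length depends on the tokenizer, here 10 characters plus [CLS] and [SEP]):

```python
print(type(dataset))               # <class 'transformers.tokenization_utils_base.BatchEncoding'>
print(dataset['input_ids'].shape)  # e.g. torch.Size([1, 12])
print(dataset['labels'])           # tensor([1]) -- renamed from 'label' in __call__

# As noted above, passing the Dataset object itself would fail,
# since pad() ends up calling .keys() on it:
# data_collator(features=features)  # raises an error
```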